Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board

David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, University of Dortmund, Germany
Madhu Sudan, Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Moshe Y. Vardi, Rice University, Houston, TX, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany
4843
Yasushi Yagi Sing Bing Kang In So Kweon Hongbin Zha (Eds.)
Computer Vision – ACCV 2007 8th Asian Conference on Computer Vision Tokyo, Japan, November 18-22, 2007 Proceedings, Part I
Volume Editors

Yasushi Yagi
Osaka University
The Institute of Scientific and Industrial Research
8-1 Mihogaoka, Ibaraki, Osaka, 567-0047, Japan
E-mail: [email protected]

Sing Bing Kang
Microsoft Corporation
1 Microsoft Way, Redmond, WA 98052, USA
E-mail: [email protected]

In So Kweon
KAIST
School of Electrical Engineering and Computer Science
335 Gwahag-Ro Yusung-Gu, Daejeon, Korea
E-mail: [email protected]

Hongbin Zha
Peking University
Department of Machine Intelligence
Beijing, 100871, China
E-mail: [email protected]
Library of Congress Control Number: 2007938408
CR Subject Classification (1998): I.4, I.5, I.2.10, I.2.6, I.3.5, F.2.2
LNCS Sublibrary: SL 6 – Image Processing, Computer Vision, Pattern Recognition, and Graphics
ISSN 0302-9743
ISBN-10 3-540-76385-6 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-76385-7 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. Springer is a part of Springer Science+Business Media springer.com © Springer-Verlag Berlin Heidelberg 2007 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12183654 06/3180 543210
Preface
It is our great pleasure to welcome you to the Proceedings of the Eighth Asian Conference on Computer Vision (ACCV07), which was held November 18–22, 2007 in Tokyo, Japan. ACCV07 was sponsored by the Asian Federation of Computer Vision.

We received 640 abstracts by the abstract submission deadline, 551 of which became full submissions. This is the largest number of submissions in the history of ACCV. Out of these 551 full submissions, 46 were selected for oral presentation and 130 as posters, yielding an acceptance rate of 31.9%. Following the tradition of previous ACCVs, the reviewing process was double blind. Each of the 31 Area Chairs (ACs) handled about 17 papers and nominated five reviewers for each submission (from 204 Program Committee members). The final selection of three reviewers per submission was done in such a way as to avoid conflicts of interest and to evenly balance the load among the reviewers. Once the reviews were done, each AC wrote summary reports based on the reviews and their own assessments of the submissions. For conflicting scores, ACs consulted with reviewers, and at times had us contact authors for clarification.

The AC meeting was held in Osaka on July 27 and 28. We divided the 31 ACs into 8 groups, with each group having 3 or 4 ACs. The ACs could confer within their respective groups, and were permitted to consult pre-approved "consulting" ACs outside their groups if needed. The ACs were encouraged to rely on their own assessment of each paper in light of the reviewer comments, rather than strictly on numerical scores alone. This year, we introduced the category "conditional accept"; this category is targeted at papers with good technical content but whose writing requires significant improvement.

Please keep in mind that no reviewing process is perfect. As with any major conference, reviewer quality and timeliness of reviews varied. To minimize the impact of variation in these factors, we chose highly qualified and dependable people as ACs to shepherd the review process. We all did the best we could given the large number of submissions and the limited time we had. Interestingly, we did not have to instruct the ACs to revise their decisions at the end of the AC meeting; all the ACs did a great job in ensuring the high quality of accepted papers. That being said, it is possible there were good papers that fell through the cracks, and we hope such papers will quickly end up being published at other good venues.

It has been a pleasure for us to serve as ACCV07 Program Chairs, and we can honestly say that this has been a memorable and rewarding experience. We would like to thank the ACCV07 ACs and members of the Technical Program Committee for their time and effort spent reviewing the submissions. The ACCV Osaka team (Ryusuke Sagawa, Yasushi Makihara, Tomohiro Mashita, Kazuaki Kondo, and Hidetoshi Mannami), as well as our conference secretaries (Noriko
Yasui, Masako Kamura, and Sachiko Kondo) did a terrific job organizing the conference. We hope that all of the attendees found the conference informative and thought-provoking.

November 2007
Yasushi Yagi Sing Bing Kang In So Kweon Hongbin Zha
Organization
General Chair: Katsushi Ikeuchi (University of Tokyo, Japan)
General Co-chairs: Naokazu Yokoya (NAIST, Japan), Rin-ichiro Taniguchi (Kyushu University, Japan)
Program Chair: Yasushi Yagi (Osaka University, Japan)
Program Co-chairs: In So Kweon (KAIST, Korea), Sing Bing Kang (Microsoft Research, USA), Hongbin Zha (Peking University, China)
Workshop/Tutorial Chair: Kazuhiko Sumi (Mitsubishi Electric, Japan)
Finance Chair: Keiji Yamada (NEC, Japan)
Local Arrangements Chair: Yoshinari Kameda (University of Tsukuba, Japan)
Publication Chairs: Hideo Saito (Keio University, Japan), Daisaku Arita (ISIT, Japan)
Technical Support Staff: Atsuhiko Banno (University of Tokyo, Japan), Daisuke Miyazaki (University of Tokyo, Japan), Ryusuke Sagawa (Osaka University, Japan), Yasushi Makihara (Osaka University, Japan)

Area Chairs

Tat-Jen Cham (Nanyang Tech. University, Singapore)
Koichiro Deguchi (Tohoku University, Japan)
Frank Dellaert (Georgia Inst. of Tech., USA)
Martial Hebert (CMU, USA)
Ki Sang Hong (Pohang University of Sci. and Tech., Korea)
Yi-ping Hung (National Taiwan University, Taiwan)
Reinhard Klette (University of Auckland, New Zealand)
Chil-Woo Lee (Chonnam National University, Korea)
Kyoung Mu Lee (Seoul National University, Korea)
Sang Wook Lee (Sogang University, Korea)
Stan Z. Li (CASIA, China)
Yuncai Liu (Shanghai Jiaotong University, China)
Yasuyuki Matsushita (Microsoft Research Asia, China)
Yoshito Mekada (Chukyo University, Japan)
Yasuhiro Mukaigawa (Osaka University, Japan)
P.J. Narayanan (IIIT, India)
Masatoshi Okutomi (Tokyo Inst. of Tech., Japan)
Tomas Pajdla (Czech Technical University, Czech)
Shmuel Peleg (The Hebrew University of Jerusalem, Israel)
Jean Ponce (Ecole Normale Superieure, France)
Long Quan (Hong Kong University of Sci. and Tech., China)
Ramesh Raskar (MERL, USA)
Jim Rehg (Georgia Inst. of Tech., USA)
Jun Sato (Nagoya Inst. of Tech., Japan)
Shinichi Sato (NII, Japan)
Yoichi Sato (University of Tokyo, Japan)
Cordelia Schmid (INRIA, France)
Christoph Schnoerr (University of Mannheim, Germany)
David Suter (Monash University, Australia)
Xiaoou Tang (Microsoft Research Asia, China)
Guangyou Xu (Tsinghua University, China)
Program Committee Adrian Barbu Akash Kushal Akihiko Torii Akihiro Sugimoto Alexander Shekhovtsov Amit Agrawal Anders Heyden Andreas Koschan Andres Bruhn Andrew Hicks Anton van den Hengel Atsuto Maki Baozong Yuan Bernt Schiele Bodo Rosenhahn Branislav Micusik C.V. Jawahar Chieh-Chih Wang Chin Seng Chua Chiou-Shann Fuh Chu-song Chen
Cornelia Fermuller Cristian Sminchisescu Dahua Lin Daisuke Miyazaki Daniel Cremers David Forsyth Duy-Dinh Le Fanhuai Shi Fay Huang Florent Segonne Frank Dellaert Frederic Jurie Gang Zeng Gerald Sommer Guoyan Zheng Hajime Nagahara Hanzi Wang Hassan Foroosh Hideaki Goto Hidekata Hontani Hideo Saito
Hiroshi Ishikawa Hiroshi Kawasaki Hong Zhang Hongya Tuo Hynek Bakstein Hyun Ki Hong Ikuko Shimizu Il Dong Yun Itaru Kitahara Ivan Laptev Jacky Baltes Jakob Verbeek James Crowley Jan-Michael Frahm Jan-Olof Eklundh Javier Civera Jean Martinet Jean-Sebastien Franco Jeffrey Ho Jian Sun Jiang yu Zheng
Jianxin Wu Jianzhuang Liu Jiebo Luo Jingdong Wang Jinshi Cui Jiri Matas John Barron John Rugis Jong Soo Choi Joo-Hwee Lim Joon Hee Han Joost Weijer Jun Sato Jun Takamatsu Junqiu Wang Juwei Lu Kap Luk Chan Karteek Alahari Kazuhiro Hotta Kazuhiro Otsuka Keiji Yanai Kenichi Kanatani Kenton McHenry Ki Sang Hong Kim Steenstrup Pedersen Ko Nishino Koichi Hashomoto Larry Davis Lisheng Wang Manabu Hashimoto Marcel Worring Marshall Tappen Masanobu Yamamoto Mathias Kolsch Michael Brown Michael Cree Michael Isard Ming Tang Ming-Hsuan Yang Mingyan Jiang Mohan Kankanhalli Moshe Ben-Ezra Naoya Ohta Navneet Dalal Nick Barnes
Nicu Sebe Noboru Babaguchi Nobutaka Shimada Ondrej Drbohlav Osamu Hasegawa Pascal Vasseur Patrice Delmas Pei Chen Peter Sturm Philippos Mordohai Pierre Jannin Ping Tan Prabir Kumar Biswas Prem Kalra Qiang Wang Qiao Yu Qingshan Liu QiuQi Ruan Radim Sara Rae-Hong Park Ralf Reulke Ralph Gross Reinhard Koch Rene Vidal Robert Pless Rogerio Feris Ron Kimmel Ruigang Yang Ryad Benosman Ryusuke Sagawa S.H. Srinivasan S. Kevin Zhou Seungjin Choi Sharat Chandran Sheng-Wen Shih Shihong Lao Shingo Kagami Shin'ichi Satoh Shinsaku Hiura Shiguang Shan Shmuel Peleg Shoji Tominaga Shuicheng Yan Stan Birchfield Stefan Gehrig
Stephen Lin Stephen Maybank Subhashis Banerjee Subrata Rakshit Sumantra Dutta Roy Svetlana Lazebnik Takayuki Okatani Takekazu Kato Tat-Jen Cham Terence Sim Tetsuji Haga Theo Gevers Thomas Brox Thomas Leung Tian Fang Til Aach Tomas Svoboda Tomokazu Sato Toshio Sato Toshio Ueshiba Tyng-Luh Liu Vincent Lepetit Vivek Kwatra Vladimir Pavlovic Wee-Kheng Leow Wei Liu Weiming Hu Wen-Nung Lie Xianghua Ying Xianling Li Xiaogang Wang Xiaojuan Wu Yacoob Yaser Yaron Caspi Yasushi Sumi Yasutaka Furukawa Yasuyuki Sugaya Yeong-Ho Ha Yi-ping Hung Yong-Sheng Chen Yoshinori Kuno Yoshio Iwai Yoshitsugu Manabe Young Shik Moon Yunde Jia
Zen Chen Zhifeng Li Zhigang Zhu
Zhouchen Lin Zhuowen Tu Zuzana Kukelova
Additional Reviewers Afshin Sepehri Alvina Goh Anthony Dick Avinash Ravichandran Baidya Saha Brian Clipp Cédric Demonceaux Christian Beder Christian Schmaltz Christian Wojek Chunhua Shen Chun-Wei Chen Claude Pégard D.H. Ye D.J. Kwon Daniel Hein David Fofi David Gallup De-Zheng Liu Dhruv K. Mahajan Dipti Mukherjee Edgar Seemann Edgardo Molina El Mustapha Mouaddib Emmanuel Prados Frank R. Schmidt Frederik Meysel Gao Yan Guy Rosman Gyuri Dorko H.J. Shim Hang Yu Hao Du Hao Tang Hao Zhang Hiroshi Ohno Huang Wei Hynek Bakstein
Ilya Levner Imran Junejo Jan Woetzel Jian Chen Jianzhao Qin Jimmy Jiang Liu Jing Wu John Bastian Juergen Gall K.J. Lee Kalin Kolev Karel Zimmermann Ketut Fundana Koichi Kise Kongwah Wan Konrad Schindler Kooksang Moon Levi Valgaerts Li Guan Li Shen Liang Wang Lin Liang Lingyu Duan Maojun Yuan Mario Fritz Martin Bujnak Martin Matousek Martin Sunkel Martin Welk Micha Andriluka Michael Stark Minh-Son Dao Naoko Nitta Neeraj Kanhere Niels Overgaard Nikhil Rane Nikodem Majer Nilanjan Ray Nils Hasler
Nipun kwatra Olivier Morel Omar El Ganaoui Pankaj Kumar Parag Chaudhuri Paul Schnitzspan Pavel Kuksa Petr Doubek Philippos Mordohai Reiner Schnabel Rhys Hill Rizwan Chaudhry Rui Huang S.M. Shahed Nejhum S.H. Lee Sascha Bauer Shao-Wen Yang Shengshu Wang Shiro Kumano Shiv Vitaladevuni Shrinivas Pundlik Sio-Hoi Ieng Somnath Sengupta Sudipta Mukhopadhyay Takahiko Horiuchi Tao Wang Tat-Jun Chin Thomas Corpetti Thomas Schoenemann Thorsten Thormaehlen Weihong Li Weiwei Zhang Xiaoyi Yu Xinguo Yu Xinyu Huang Xuan Song Yi Feng Yichen Wei Yiqun Li
Yong MA Yoshihiko Kawai
Zhichao Chen Zhijie Wang
Sponsors

Sponsor: Asian Federation of Computer Vision
Technical Co-sponsors: IPSJ SIG-CVIM, IEICE TG-PRMU
Table of Contents – Part I
Plenary and Invited Talks Less Is More: Coded Computational Photography . . . . . . . . . . . . . . . . . . . . Ramesh Raskar
1
Optimal Algorithms in Multiview Geometry . . . . . . . . . . . . . . . . . . . . . . . . . Richard Hartley and Fredrik Kahl
13
Machine Vision in Early Days: Japan’s Pioneering Contributions . . . . . . Masakazu Ejiri
35
Shape and Texture Coarse-to-Fine Statistical Shape Model by Bayesian Inference . . . . . . . . . Ran He, Stan Li, Zhen Lei, and ShengCai Liao
54
Efficient Texture Representation Using Multi-scale Regions . . . . . . . . . . . . Horst Wildenauer, Branislav Mičušík, and Markus Vincze
65
Fitting Comparing Timoshenko Beam to Energy Beam for Fitting Noisy Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ilić Slobodan A Family of Quadratic Snakes for Road Extraction . . . . . . . . . . . . . . . . . . . Ramesh Marikhu, Matthew N. Dailey, Stanislav Makhanov, and Kiyoshi Honda
75 85
Poster Session 1: Calibration Multiperspective Distortion Correction Using Collineations . . . . . . . . . . . . Yuanyuan Ding and Jingyi Yu
95
Camera Calibration from Silhouettes Under Incomplete Circular Motion with a Constant Interval Angle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Po-Hao Huang and Shang-Hong Lai
106
Mirror Localization for Catadioptric Imaging System by Observing Parallel Light Pairs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ryusuke Sagawa, Nobuya Aoki, and Yasushi Yagi
116
Calibrating Pan-Tilt Cameras with Telephoto Lenses . . . . . . . . . . . . . . . . . Xinyu Huang, Jizhou Gao, and Ruigang Yang
127
Camera Calibration Using Principal-Axes Aligned Conics . . . . . . . . . . . . . Xianghua Ying and Hongbin Zha
138
Poster Session 1: Detection 3D Intrusion Detection System with Uncalibrated Multiple Cameras . . . . Satoshi Kawabata, Shinsaku Hiura, and Kosuke Sato Non-parametric Background and Shadow Modeling for Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tatsuya Tanaka, Atsushi Shimada, Daisaku Arita, and Rin-ichiro Taniguchi Road Sign Detection Using Eigen Color . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Luo-Wei Tsai, Yun-Jung Tseng, Jun-Wei Hsieh, Kuo-Chin Fan, and Jiun-Jie Li
149
159
169
Localized Content-Based Image Retrieval Using Semi-supervised Multiple Instance Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dan Zhang, Zhenwei Shi, Yangqiu Song, and Changshui Zhang
180
Object Detection Combining Recognition and Segmentation . . . . . . . . . . . Liming Wang, Jianbo Shi, Gang Song, and I-fan Shen
189
An Efficient Method for Text Detection in Video Based on Stroke Width Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Viet Cuong Dinh, Seong Soo Chun, Seungwook Cha, Hanjin Ryu, and Sanghoon Sull
200
Multiview Pedestrian Detection Based on Vector Boosting . . . . . . . . . . . . Cong Hou, Haizhou Ai, and Shihong Lao
210
Pedestrian Detection Using Global-Local Motion Patterns . . . . . . . . . . . . . Dhiraj Goel and Tsuhan Chen
220
Poster Session 1: Image and Video Processing Qualitative and Quantitative Behaviour of Geometrical PDEs in Image Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Arjan Kuijper
230
Automated Billboard Insertion in Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hitesh Shah and Subhasis Chaudhuri
240
Improved Background Mixture Models for Video Surveillance Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chris Poppe, Gaëtan Martens, Peter Lambert, and Rik Van de Walle
251
High Dynamic Range Scene Realization Using Two Complementary Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ming-Chian Sung, Te-Hsun Wang, and Jenn-Jier James Lien
261
Automated Removal of Partial Occlusion Blur . . . . . . . . . . . . . . . . . . . . . . . Scott McCloskey, Michael Langer, and Kaleem Siddiqi
271
Poster Session 1: Applications High Capacity Watermarking in Nonedge Texture Under Statistical Distortion Constraint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fan Zhang, Wenyu Liu, and Chunxiao Liu Attention Monitoring for Music Contents Based on Analysis of Signal-Behavior Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Masatoshi Ohara, Akira Utsumi, Hirotake Yamazoe, Shinji Abe, and Noriaki Katayama View Planning for Cityscape Archiving and Visualization . . . . . . . . . . . . . Jiang Yu Zheng and Xiaolong Wang
282
292
303
Face and Gesture Synthesis of Exaggerative Caricature with Inter and Intra Correlations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chien-Chung Tseng and Jenn-Jier James Lien Pose-Invariant Facial Expression Recognition Using Variable-Intensity Templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shiro Kumano, Kazuhiro Otsuka, Junji Yamato, Eisaku Maeda, and Yoichi Sato Gesture Recognition Under Small Sample Size . . . . . . . . . . . . . . . . . . . . . . . Tae-Kyun Kim and Roberto Cipolla
314
324
335
Tracking Motion Observability Analysis of the Simplified Color Correlogram for Visual Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Qi Zhao and Hai Tao
345
On-Line Ensemble SVM for Robust Object Tracking . . . . . . . . . . . . . . . . . Min Tian, Weiwei Zhang, and Fuqiang Liu
355
Multi-camera People Tracking by Collaborative Particle Filters and Principal Axis-Based Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wei Du and Justus Piater
365
Poster Session 2: Camera Networks Finding Camera Overlap in Large Surveillance Networks . . . . . . . . . . . . . . Anton van den Hengel, Anthony Dick, Henry Detmold, Alex Cichowski, and Rhys Hill
375
Information Fusion for Multi-camera and Multi-body Structure and Motion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Alexander Andreopoulos and John K. Tsotsos
385
Task Scheduling in Large Camera Networks . . . . . . . . . . . . . . . . . . . . . . . . . Ser-Nam Lim, Larry Davis, and Anurag Mittal
397
Poster Session 2: Face/Gesture/Action Detection and Recognition Constrained Optimization for Human Pose Estimation from Depth Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Youding Zhu and Kikuo Fujimura
408
Generative Estimation of 3D Human Pose Using Shape Contexts Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xu Zhao and Yuncai Liu
419
An Active Multi-camera Motion Capture for Face, Fingers and Whole Body . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Eng Hui Loke and Masanobu Yamamoto
430
Tracking and Classifying Human Motions with Gaussian Process Annealed Particle Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Leonid Raskin, Michael Rudzsky, and Ehud Rivlin
442
Gait Identification Based on Multi-view Observations Using Omnidirectional Camera . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kazushige Sugiura, Yasushi Makihara, and Yasushi Yagi
452
Gender Classification Based on Fusion of Multi-view Gait Sequences . . . . Guochang Huang and Yunhong Wang
462
Poster Session 2: Learning MAPACo-Training: A Novel Online Learning Algorithm of Behavior Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Heping Li, Zhanyi Hu, Yihong Wu, and Fuchao Wu
472
Optimal Learning High-Order Markov Random Fields Priors of Colour Image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ke Zhang, Huidong Jin, Zhouyu Fu, and Nianjun Liu
482
Hierarchical Learning of Dominant Constellations for Object Class Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nathan Mekuz and John K. Tsotsos
492
Multistrategical Approach in Visual Learning . . . . . . . . . . . . . . . . . . . . . . . . Hiroki Nomiya and Kuniaki Uehara
502
Poster Session 2: Motion and Tracking Cardiac Motion Estimation from Tagged MRI Using 3D-HARP and NURBS Volumetric Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jia Liang, Yuanquan Wang, and Yunde Jia
512
Fragments Based Parametric Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Prakash C., Balamanohar Paluri, Nalin Pradeep S., and Hitesh Shah
522
Spatiotemporal Oriented Energy Features for Visual Tracking . . . . . . . . . Kevin Cannons and Richard Wildes
532
Synchronized Ego-Motion Recovery of Two Face-to-Face Cameras . . . . . . Jinshi Cui, Yasushi Yagi, Hongbin Zha, Yasuhiro Mukaigawa, and Kazuaki Kondo
544
Optical Flow – Driven Motion Model with Automatic Variance Adjustment for Adaptive Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kazuhiko Kawamoto
555
A Noise-Insensitive Object Tracking Algorithm . . . . . . . . . . . . . . . . . . . . . . Chunsheng Hua, Qian Chen, Haiyuan Wu, and Toshikazu Wada
565
Discriminative Mean Shift Tracking with Auxiliary Particles . . . . . . . . . . . Junqiu Wang and Yasushi Yagi
576
Poster Session 2: Retrival and Search Efficient Search in Document Image Collections . . . . . . . . . . . . . . . . . . . . . . Anand Kumar, C.V. Jawahar, and R. Manmatha
586
Human Pose Estimation Hand Posture Estimation in Complex Backgrounds by Considering Mis-match of Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Akihiro Imai, Nobutaka Shimada, and Yoshiaki Shirai
596
Learning Generative Models for Monocular Body Pose Estimation . . . . . Tobias Jaeggli, Esther Koller-Meier, and Luc Van Gool
608
Human Pose Estimation from Volume Data and Topological Graph Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hidenori Tanaka, Atsushi Nakazawa, and Haruo Takemura
618
Matching Logical DP Matching for Detecting Similar Subsequence . . . . . . . . . . . . . . Seiichi Uchida, Akihiro Mori, Ryo Kurazume, Rin-ichiro Taniguchi, and Tsutomu Hasegawa
628
Efficient Normalized Cross Correlation Based on Adaptive Multilevel Successive Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shou-Der Wei and Shang-Hong Lai
638
Exploiting Inter-frame Correlation for Fast Video to Reference Image Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Arif Mahmood and Sohaib Khan
647
Poster Session 3: Face/Gesture/Action Detection and Recognition Flea, Do You Remember Me? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Michael Grabner, Helmut Grabner, Joachim Pehserl, Petra Korica-Pehserl, and Horst Bischof
657
Multi-view Gymnastic Activity Recognition with Fused HMM . . . . . . . . . Ying Wang, Kaiqi Huang, and Tieniu Tan
667
Real-Time and Marker-Free 3D Motion Capture for Home Entertainment Oriented Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Brice Michoud, Erwan Guillou, Hector Briceño, and Saïda Bouakaz
678
Tracking Iris Contour with a 3D Eye-Model for Gaze Estimation . . . . . . . Haiyuan Wu, Yosuke Kitagawa, Toshikazu Wada, Takekazu Kato, and Qian Chen
688
Eye Correction Using Correlation Information . . . . . . . . . . . . . . . . . . . . . . . Inho Choi and Daijin Kim
698
Eye-Gaze Detection from Monocular Camera Image Using Parametric Template Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ryo Ohtera, Takahiko Horiuchi, and Shoji Tominaga
708
An FPGA-Based Smart Camera for Gesture Recognition in HCI Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yu Shi and Timothy Tsui
718
Poster Session 3: Low Level Vision and Phtometory Color Constancy Via Convex Kernel Optimization . . . . . . . . . . . . . . . . . . . Xiaotong Yuan, Stan Z. Li, and Ran He
728
User-Guided Shape from Shading to Reconstruct Fine Details from a Single Photograph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Alexandre Meyer, Hector M. Briceño, and Saïda Bouakaz
738
A Theoretical Approach to Construct Highly Discriminative Features with Application in AdaBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yuxin Jin, Linmi Tao, Guangyou Xu, and Yuxin Peng
748
Robust Foreground Extraction Technique Using Gaussian Family Model and Multiple Thresholds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hansung Kim, Ryuuki Sakamoto, Itaru Kitahara, Tomoji Toriyama, and Kiyoshi Kogure Feature Management for Efficient Camera Tracking . . . . . . . . . . . . . . . . . . Harald Wuest, Alain Pagani, and Didier Stricker Measurement of Reflection Properties in Ancient Japanese Drawing Ukiyo-e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xin Yin, Kangying Cai, Yuki Takeda, Ryo Akama, and Hiromi T. Tanaka
758
769
779
Texture-Independent Feature-Point Matching (TIFM) from Motion Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ping Li, Dirk Farin, Rene Klein Gunnewiek, and Peter H.N. de With
789
Where’s the Weet-Bix? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yuhang Zhang, Lei Wang, Richard Hartley, and Hongdong Li
800
How Marginal Likelihood Inference Unifies Entropy, Correlation and SNR-Based Stopping in Nonlinear Diffusion Scale-Spaces . . . . . . . . . . . . . . Ramūnas Girdziušas and Jorma Laaksonen
811
Poster Session 3: Motion and Tracking Kernel-Bayesian Framework for Object Tracking . . . . . . . . . . . . . . . . . . . . . Xiaoqin Zhang, Weiming Hu, Guan Luo, and Steve Maybank
821
Markov Random Field Modeled Level Sets Method for Object Tracking with Moving Cameras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xue Zhou, Weiming Hu, Ying Chen, and Wei Hu
832
Continuously Tracking Objects Across Multiple Widely Separated Cameras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yinghao Cai, Wei Chen, Kaiqi Huang, and Tieniu Tan
843
Adaptive Multiple Object Tracking Using Colour and Segmentation Cues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pankaj Kumar, Michael J. Brooks, and Anthony Dick
853
Image Assimilation for Motion Estimation of Atmospheric Layers with Shallow-Water Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nicolas Papadakis, Patrick Héas, and Étienne Mémin
864
Probability Hypothesis Density Approach for Multi-camera Multi-object Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nam Trung Pham, Weimin Huang, and S.H. Ong
875
Human Detection AdaBoost Learning for Human Detection Based on Histograms of Oriented Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chi-Chen Raxle Wang and Jenn-Jier James Lien
885
Multi-posture Human Detection in Video Frames by Motion Contour Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Qixiang Ye, Jianbin Jiao, and Hua Yu
896
A Cascade of Feed-Forward Classifiers for Fast Pedestrian Detection . . . . Yu-Ting Chen and Chu-Song Chen
905
Combined Object Detection and Segmentation by Using Space-Time Patches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yasuhiro Murai, Hironobu Fujiyoshi, and Takeo Kanade
915
Segmentation Embedding a Region Merging Prior in Level Set Vector-Valued Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ismail Ben Ayed and Amar Mitiche
925
A Basin Morphology Approach to Colour Image Segmentation by Region Merging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Erchan Aptoula and Sébastien Lefèvre
935
Detecting and Segmenting Un-occluded Items by Actively Casting Shadows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tze K. Koh, Amit Agrawal, Ramesh Raskar, Steve Morgan, Nicholas Miles, and Barrie Hayes-Gill
945
A Local Probabilistic Prior-Based Active Contour Model for Brain MR Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jundong Liu, Charles Smith, and Hima Chebrolu
956
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
965
Less Is More: Coded Computational Photography Ramesh Raskar Mitsubishi Electric Research Labs (MERL), Cambridge, MA, USA
Abstract. Computational photography combines plentiful computing, digital sensors, modern optics, actuators, and smart lights to escape the limitations of traditional cameras, enables novel imaging applications and simplifies many computer vision tasks. However, a majority of current Computational Photography methods involve taking multiple sequential photos by changing scene parameters and fusing the photos to create a richer representation. The goal of Coded Computational Photography is to modify the optics, illumination or sensors at the time of capture so that the scene properties are encoded in a single (or a few) photographs. We describe several applications of coding exposure, aperture, illumination and sensing and describe emerging techniques to recover scene parameters from coded photographs.
1 Introduction Computational photography combines plentiful computing, digital sensors, modern optics, actuators, and smart lights to escape the limitations of traditional cameras, enable novel imaging applications, and simplify many computer vision tasks. Unbounded dynamic range, variable focus, resolution, and depth of field, hints about shape, reflectance, and lighting, and new interactive forms of photos that are partly snapshots and partly videos are just some of the new applications found in Computational Photography. In this paper, we discuss Coded Photography, which involves encoding of the photographic signal and post-capture decoding for improved scene analysis. With film-like photography, the captured image is a 2D projection of the scene. Due to limited capabilities of the camera, the recorded image is a partial representation of the view. Nevertheless, the captured image is ready for human consumption: what you see is what you almost get in the photo. In Coded Photography, the goal is to achieve a potentially richer representation of the scene during the encoding process. In some cases, Computational Photography reduces to 'Epsilon Photography', where the scene is recorded via multiple images, each captured by epsilon variation of the camera parameters. For example, successive images (or neighboring pixels) may have a different exposure, focus, aperture, view, illumination, or instant of capture. Each setting allows recording of partial information about the scene and the final image is reconstructed from these multiple observations. In Coded Computational Photography, the recorded image may appear distorted or random to a human observer. But the corresponding decoding recovers valuable information about the scene. 'Less is more' in Coded Photography. By blocking light over time or space, we can preserve more details about the scene in the recorded single photograph. In this paper we look at four specific examples.
(a) Coded Exposure: By blocking light in time, by fluttering the shutter open and closed in a carefully chosen binary sequence, we can preserve high spatial frequencies of fast-moving objects to support high quality motion deblurring. (b) Coded Aperture Optical Heterodyning: By blocking light near the sensor with a sinusoidal grating mask, we can record the 4D light field on a 2D sensor. And by blocking light with a mask at the aperture, we can extend the depth of field and achieve full resolution digital refocussing. (c) Coded Illumination: By observing blocked light at silhouettes, a multi-flash camera can locate depth discontinuities in challenging scenes without depth recovery. (d) Coded Sensing: By sensing intensities with lateral inhibition, a gradient sensing camera can record large as well as subtle changes in intensity to recover a high-dynamic-range image. We describe several applications of coding exposure, aperture, illumination and sensing and describe emerging techniques to recover scene parameters from coded photographs. 1.1 Film-Like Photography Photography is the process of making pictures by, literally, 'drawing with light' or recording the visually meaningful changes in the light leaving a scene. This goal was established for film photography about 150 years ago. Currently, 'digital photography' is electronically implemented film photography, refined and polished to achieve the goals of the classic film camera, which were governed by chemistry, optics, and mechanical shutters. Film-like photography presumes (and often requires) artful human judgment, intervention, and interpretation at every stage to choose viewpoint, framing, timing, lenses, film properties, lighting, developing, printing, display, search, index, and labelling. In this article we plan to explore a progression away from film and film-like methods to something more comprehensive that exploits plentiful low-cost computing and memory with sensors, optics, probes, smart lighting and communication. 1.2 What Is Computational Photography? Computational Photography (CP) is an emerging field, just getting started. We don't know where it will end up, and we can't yet set its precise, complete definition, nor make a reliably comprehensive classification. But here is the scope of what researchers are currently exploring in this field. – Computational photography attempts to record a richer visual experience, captures information beyond just a simple set of pixels and makes the recorded scene representation far more machine readable. – It exploits computing, memory, interaction and communications to overcome long-standing limitations of photographic film and camera mechanics that have persisted in film-style digital photography, such as constraints on dynamic
range, depth of field, field of view, resolution and the extent of scene motion during exposure. – It enables new classes of recording the visual signal such as the 'moment' [Cohen 2005], shape boundaries for non-photorealistic depiction [Raskar et al 2004], foreground versus background mattes, estimates of 3D structure, 'relightable' photos and interactive displays that permit users to change lighting, viewpoint, focus, and more, capturing some useful, meaningful fraction of the 'light field' of a scene, a 4-D set of viewing rays. – It enables synthesis of impossible photos that could not have been captured at a single instant with a single camera, such as wrap-around views ('multiple-center-of-projection' images [Rademacher and Bishop 1998]), fusion of time-lapsed events [Raskar et al 2004], the motion-microscope (motion magnification [Liu et al 2005]), video textures and panoramas [Agarwala et al 2005]. They also support seemingly impossible camera movements such as the 'bullet time' (Matrix) sequence recorded with multiple cameras with staggered exposure times. – It encompasses previously exotic forms of scientific imaging and data-gathering techniques, e.g., from astronomy, microscopy, and tomography. 1.3 Elements of Computational Photography Traditional film-like photography involves (a) a lens, (b) a 2D planar sensor and (c) a processor that converts sensed values into an image. In addition, the photography may involve (d) external illumination from point sources (e.g. flash units) and area sources (e.g. studio lights).
Fig. 1. Elements of Computational Photography: the scene acts as an 8D ray modulator; novel illumination (light sources and modulators) provides 4D incident lighting; novel cameras combine generalized optics (4D ray benders), generalized sensors (up-to-4D ray samplers) and processing (ray reconstruction); a 4D light field display recreates the 4D light field.
Computational Photography generalizes these four elements. (a) Generalized Optics: Each optical element is treated as a 4D ray-bender that modifies a light field. The incident 4D light field for a given wavelength is transformed into a new 4D light field. The optics may involve more than one optical axis [Georgiev et al 2006]. In some cases the perspective foreshortening of objects based on distance may be modified using wavefront-coded optics [Dowski and Cathey 1995]. In recent lensless imaging methods [Zomet and Nayar 2006] and in coded-aperture imaging [Zand 1996] used for gamma-ray and X-ray astronomy, the traditional lens is missing entirely. In some cases optical elements such as mirrors [Nayar et al 2004] outside the camera adjust the linear combinations of ray bundles that reach the sensor pixel to adapt the sensor to the viewed scene. (b) Generalized Sensors: All light sensors measure some combined fraction of the 4D light field impinging on them, but traditional sensors capture only a 2D projection of this light field. Computational photography attempts to capture more: a 3D or 4D ray representation using planar, non-planar or even volumetric sensor assemblies. For example, a traditional out-of-focus 2D image is the result of a capture-time decision: each detector pixel gathers light from its own bundle of rays that do not converge on the focused object. But a Plenoptic Camera [Adelson and Wang 1992, Ren et al 2005] subdivides these bundles into separate measurements. Computing a weighted sum of rays that converge on the objects in the scene creates a digitally refocused image, and even permits multiple focusing distances within a single computed image. Generalizing sensors can extend their dynamic range [Tumblin et al 2005] and wavelength selectivity as well. While traditional sensors trade spatial resolution for color measurement (wavelengths) using a Bayer grid of red, green and blue filters on individual pixels, some modern sensor designs determine photon wavelength by sensor penetration, permitting several spectral estimates at a single pixel location [Foveon 2004]. (c) Generalized Reconstruction: Conversion of raw sensor outputs into picture values can be much more sophisticated. While existing digital cameras perform 'demosaicking' (interpolating the Bayer grid), remove fixed-pattern noise, and hide 'dead' pixel sensors, recent work in computational photography can do more. Reconstruction might combine disparate measurements in novel ways by considering the camera intrinsic parameters used during capture. For example, the processing might construct a high dynamic range scene from multiple photographs from coaxial lenses or from sensed gradients [Tumblin et al 2005], or compute sharp images of a fast-moving object from a single image taken by a camera with a 'fluttering' shutter [Raskar et al 2006]. Closed-loop control during photography itself can also be extended, exploiting traditional cameras' exposure control, image stabilizing, and focus, as new opportunities for modulating the scene's optical signal for later decoding. (d) Computational Illumination: Photographic lighting has changed very little since the 1950s: with digital video projectors, servos, and device-to-device communication, we have new opportunities to control the sources of light with as much sophistication as we use to control our digital sensors. What sorts of spatio-temporal modulations of light might better reveal the visually important contents
of a scene? Harold Edgerton showed that high-speed strobes offer tremendous new appearance-capturing capabilities; how many new advantages can we realize by replacing the 'dumb' flash units, static spot lights and reflectors with actively controlled spatio-temporal modulators and optics? Already we can capture occluding edges with multiple flashes [Raskar 2004], exchange cameras and projectors by Helmholtz reciprocity [Sen et al 2005], gather relightable actors' performances with light stages [Wagner et al 2005] and see through muddy water with coded-mask illumination [Levoy et al 2004]. In every case, better lighting control during capture builds richer representations of photographed scenes.
2 Sampling Dimensions of Imaging 2.1 Epsilon Photography for Optimizing Film-Like Cameras Think of film cameras at their best as defining a 'box' in the multi-dimensional space of imaging parameters. The first, most obvious thing we can do to improve digital cameras is to expand this box in every conceivable dimension. This effort reduces Computational Photography to 'Epsilon Photography', where the scene is recorded via multiple images, each captured by epsilon variation of the camera parameters. For example, successive images (or neighboring pixels) may have different settings for parameters such as exposure, focus, aperture, view, illumination, or the instant of capture. Each setting allows recording of partial information about the scene, and the final image is reconstructed from these multiple observations. Epsilon photography is thus a concatenation of many such boxes in parameter space; multiple film-style photos are computationally merged to make a more complete photo or scene description. While the merged photo is superior, each of the individual photos is still useful and comprehensible on its own, without any of the others. The merged photo contains the best features from all of them. (a) Field of view: A wide field of view panorama is achieved by stitching and mosaicking pictures taken by panning a camera around a common center of projection or by translating a camera over a near-planar scene. (b) Dynamic range: A high dynamic range image is captured by merging photos at a series of exposure values [Debevec and Malik 1997, Kang et al 2003]; a minimal merging sketch follows this list. (c) Depth of field: An all-in-focus image is reconstructed from images taken by successively changing the plane of focus [Agrawala et al 2005]. (d) Spatial resolution: Higher resolution is achieved by tiling multiple cameras (and mosaicing individual images) [Wilburn et al 2005] or by jittering a single camera [Landolt et al 2001]. (e) Wavelength resolution: Traditional cameras sample only three basis colors, but multi-spectral (multiple colors in the visible spectrum) or hyper-spectral (wavelengths beyond the visible spectrum) imaging is accomplished by taking pictures while successively changing color filters in front of the camera, using tunable wavelength filters, or using diffraction gratings. (f) Temporal resolution: High-speed imaging is achieved by staggering the exposure times of multiple low-framerate cameras. The exposure durations of individual cameras can be non-overlapping [Wilburn et al 2005] or overlapping [Shechtman et al 2002].
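To make the dynamic-range dimension in (b) concrete, the sketch below shows one simple way a bracketed exposure stack can be merged into a single radiance estimate, assuming a linear sensor response and known exposure times. It is only an illustrative approximation, not the procedure of [Debevec and Malik 1997], which additionally recovers the camera response curve; the function name and weighting scheme are our own.

```python
import numpy as np

def merge_exposures(images, exposure_times):
    """Merge a bracketed stack of linear images into one HDR radiance map.

    images: list of float arrays in [0, 1], all the same shape.
    exposure_times: matching list of exposure durations (seconds).
    Assumes a linear sensor response; a real pipeline would first
    calibrate and invert the camera response curve.
    """
    numerator = np.zeros_like(images[0], dtype=np.float64)
    denominator = np.zeros_like(images[0], dtype=np.float64)
    for img, t in zip(images, exposure_times):
        # Hat-shaped weight: trust mid-range pixels, distrust pixels that
        # are nearly black (noisy) or nearly saturated.
        w = 1.0 - np.abs(2.0 * img - 1.0)
        numerator += w * img / t          # per-image radiance estimate
        denominator += w
    return numerator / np.maximum(denominator, 1e-6)
```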
Taking multiple images under varying camera parameters can be achieved in several ways. The images can be taken with a single camera over time. The images can be captured simultaneously using 'assorted pixels', where each pixel is tuned to a different value for a given parameter [Nayar and Narasimhan 2002]. Simultaneous capture of multiple samples can also be recorded using multiple cameras, each camera having different values for a given parameter. Two designs are currently being used for multi-camera solutions: a camera array [Wilburn et al 2005] and single-axis multiple parameter (co-axial) cameras [Mcguire et al 2005].

Fig. 2. Blocking light to achieve Coded Photography. (Left) Coded exposure: using a temporal 1-D broadband code to block and unblock light over time, a coded exposure photo can reversibly encode motion blur (Raskar et al 2006). (Right) Coded aperture: using a spatial 2-D broadband code to block parts of the light via a masked aperture, a coded aperture photo can reversibly encode defocus blur (Veeraraghavan et al 2007).
2.2 Coded Photography But there is much more beyond the ‘best possible film camera’. We can virtualize the notion of the camera itself if we consider it as a device that collects bundles of rays, each ray with its own wavelength spectrum and exposure duration. Coded Photography is a notion of an ‘out-of-the-box’ photographic method, in which individual (ray) samples or data sets may or may not be comprehensible as ‘images’ without further decoding, re-binning or reconstruction. Coded aperture techniques, inspired by work in astronomical imaging, try to preserve high spatial frequencies so that out of focus blurred images can be digitally re-focused [Veeraraghavan07]. By coding illumination, it is possible to decompose radiance in a scene into direct and global components [Nayar06]. Using a coded exposure technique, one can rapidly flutter open and close the shutter of a camera in a carefully chosen binary sequence, to capture a single photo. The fluttered shutter encoded the motion in the scene in the observed blur in a reversible way. Other examples include confocal images and techniques to recover glare in the images [Talvala07].
We may be converging on a new, much more capable 'box' of parameters in computational photography that we don't yet recognize; there is still quite a bit of innovation to come! In the rest of the article, we survey recent techniques that exploit exposure, focus, active illumination and sensors.

Fig. 3. An overview of projects. Coding in time (exposure [Raskar et al 2006]) or space (aperture; mask-based optical heterodyning [Veeraraghavan et al 07]), coding the incident active illumination (inter-view [Raskar et al 2004]; intra-view [Nayar et al 2006]), and coding the sensing pattern (gradient sensor, i.e., differential encoding [Tumblin et al 2005]).
3 Coded Exposure In a conventional single-exposure photograph, moving objects or moving cameras cause motion blur. The exposure time defines a temporal box filter that smears the moving object across the image by convolution. This box filter destroys important high-frequency spatial details, so that deblurring via deconvolution becomes an ill-posed problem. We have proposed to flutter the camera's shutter open and closed during the chosen exposure time with a binary pseudo-random sequence, instead of leaving it open as in a traditional camera [Raskar et al 2006]. The flutter changes the box filter to a broad-band filter that preserves high-frequency spatial details in the blurred image, and the corresponding deconvolution becomes a well-posed problem. Results on several challenging cases of motion-blur removal, including outdoor scenes, extremely large motions, textured backgrounds and partial occluders, were presented. However, the method assumes that the PSF is given or is obtained by simple user interaction. Since changing the integration time of conventional CCD cameras is not feasible, an external ferro-electric shutter is placed in front of the lens to code the exposure. The shutter is driven opaque and transparent according to the binary signals generated by a PIC microcontroller using the pseudo-random binary sequence.
Fig. 4. The flutter shutter camera. The coded exposure is achieved by fluttering the shutter open and closed. Instead of a mechanical movement of the shutter, we used a ferro-electric LCD in front of the lens. It is driven opaque and transparent according to the desired binary sequence.
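To illustrate why the broadband code makes deconvolution well posed, the following one-dimensional sketch simulates motion blur as convolution with a normalized shutter code and inverts it with a regularized frequency-domain division. The code sequence here is a hypothetical pseudo-random one (not the published sequence), and the simple Wiener-style inverse stands in for the linear solve used in the actual system.

```python
import numpy as np

def motion_blur(signal, code):
    """Blur a 1-D signal by convolving it with the (normalized) shutter code."""
    kernel = np.asarray(code, dtype=np.float64)
    kernel /= kernel.sum()
    return np.convolve(signal, kernel, mode="full")

def deblur(blurred, code, eps=1e-3):
    """Invert the blur with a regularized division in the frequency domain."""
    kernel = np.asarray(code, dtype=np.float64)
    kernel /= kernel.sum()
    n = len(blurred)
    K = np.fft.rfft(kernel, n)
    B = np.fft.rfft(blurred, n)
    # A broadband (fluttered) code keeps |K| away from zero at all
    # frequencies, so this division is stable; a box filter (shutter left
    # open) has near-zero frequencies and the same step amplifies noise.
    X = B * np.conj(K) / (np.abs(K) ** 2 + eps)
    return np.fft.irfft(X, n)[: n - len(kernel) + 1]

# Hypothetical 32-chop pseudo-random shutter code (for illustration only).
rng = np.random.default_rng(0)
code = rng.integers(0, 2, size=32)
code[0] = 1                                   # shutter opens at least once
sharp = np.zeros(200); sharp[80:120] = 1.0    # toy 1-D "object"
recovered = deblur(motion_blur(sharp, code), code)
```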
4 Coded Aperture and Optical Heterodyning Can we capture additional information about a scene by inserting a patterned mask inside a conventional camera? We use a patterned attenuating mask to encode the light field entering the camera. Depending on where we put the mask, we can effect desired frequency domain modulation of the light field. If we put the mask near the lens aperture, we can achieve full resolution digital refocussing. If we put the mask near the sensor, we can recover a 4D light field without any additional lenslet array.
Fig. 5. The Encoded Blur Camera, i.e. with a mask in the aperture, can preserve high spatial frequencies in the defocus blur. Notice the glint in the eye. In the misfocused photo, on the left, the bright spot appears blurred with the bokeh of the chosen aperture (shown in the inset). In the deblurred result, on the right, the details on the eye are correctly recovered.
Ren et al. have developed a camera that can capture the 4D light field incident on the image sensor in a single photographic exposure [Ren et al. 2005]. This is achieved by inserting a microlens array between the sensor and main lens, creating a plenoptic camera. Each microlens measures not just the total amount of light deposited at that location, but how much light arrives along each ray. By re-sorting the measured rays of light to where they would have terminated in slightly different, synthetic cameras, one can compute sharp photographs focused at different depths. A linear increase in the resolution of images under each microlens results in a linear increase in the sharpness of the refocused photographs. This property allows one to extend the depth of field of the camera without reducing the aperture, enabling shorter exposures and lower image noise. Our group has shown that it is also possible to create a plenoptic camera using a patterned mask instead of a lenslet array; the geometric configuration remains nearly identical [Veeraraghavan2007]. The method is known as 'spatial optical heterodyning'. Instead of remapping rays in 4D using a microlens array so that they can be captured on a 2D sensor, spatial optical heterodyning remaps frequency components of the 4D light field so that the frequency components can be recovered from the Fourier transform of the captured 2D image. In the microlens-array-based design, each pixel effectively records light along a single ray bundle. With patterned masks, each pixel records a linear combination of multiple ray bundles. By carefully coding the linear combination, the coded heterodyning method can reconstruct the values of individual ray bundles. This is a reversible modulation of the 4D light field by inserting a patterned planar mask in the optical path of a lens-based camera. We can reconstruct the 4D light field from a 2D camera image. The patterned mask attenuates light rays inside the camera instead of bending them, and the attenuation recoverably encodes the rays on the 2D sensor. Our mask-equipped camera focuses just as a traditional camera might to capture conventional 2D photos at full sensor resolution, but the raw pixel values also hold a modulated 4D light field. The light field can be recovered by rearranging the tiles of the 2D Fourier transform of sensor values into 4D planes, and computing the inverse Fourier transform.

Fig. 6. Coding the light field entering a camera via a mask: a mask at the aperture gives a coded aperture for full resolution digital refocusing, while a mask near the sensor forms the heterodyne light field camera.
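The decoding step can be sketched in a reduced setting with one spatial and one angular dimension: the sensor signal's spectrum contains a row of modulated replicas, and cutting that spectrum into tiles, stacking the tiles along a new (angular) axis and applying an inverse 2D transform yields a space-angle light field. The routine below is only schematic; it assumes an ideal mask producing exactly n_ang replicas and omits the calibration, windowing and frequency bookkeeping needed for real sensor data.

```python
import numpy as np

def decode_light_field_1d(sensor_row, n_ang):
    """Schematic heterodyne decoding for a 1-D coded sensor row.

    sensor_row: 1-D array whose spectrum holds n_ang modulated replicas
        of the light field spectrum (one per angular sample).
    Returns an (n_ang x n_space) space-angle light field estimate.
    """
    n = len(sensor_row)
    assert n % n_ang == 0, "row length must be a multiple of n_ang"
    tile = n // n_ang
    spectrum = np.fft.fftshift(np.fft.fft(sensor_row))
    # Rearranging: cut the 1-D spectrum into n_ang contiguous tiles and
    # stack them, so each tile becomes one slice of a 2-D spectrum.
    lf_spectrum = np.fft.ifftshift(spectrum.reshape(n_ang, tile))
    light_field = np.fft.ifft2(lf_spectrum)
    return np.real(light_field)
```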
5 Coded Illumination By observing blocked light at silhouettes, a multi-flash camera can locate depth discontinuities in challenging scenes without depth recovery. We used a multi-flash camera to find the silhouettes in a scene [Raskar et al 2004]. We take four photos of an object with four different light positions (above, below, left and right of the lens). We detect the shadows cast along depth discontinuities and use them to locate the depth discontinuities in the scene. The detected silhouettes are then used for stylizing the photograph and highlighting important features. We also demonstrate silhouette detection in a video using a repeated fast sequence of flashes.

Fig. 7. Multi-flash Camera for Depth Edge Detection. (Left) A camera with four flashes (top, bottom, left and right of the lens). (Right) Photos due to individual flashes, ratio images showing the shadows, and epipolar traversal to compute the single-pixel depth edges, yielding a shadow-free image and depth edge maps.
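In code, the core of the depth-edge computation is short: build a maximum composite of the flash images, form per-flash ratio images in which cast shadows stand out, and mark pixels where a bright ratio value is followed by a shadowed one along the direction away from that flash. The sketch below simplifies the method of [Raskar et al 2004]; the fixed step directions and the single threshold stand in for the epipolar traversal and edge tests of the full algorithm.

```python
import numpy as np

def depth_edges(flash_images, directions, thresh=0.75):
    """Simplified multi-flash depth edge detector.

    flash_images: dict name -> 2-D grayscale float image of the same scene,
        each lit by one flash (e.g. 'left', 'right', 'top', 'bottom').
    directions: dict name -> (dy, dx) unit step pointing away from that
        flash in the image, e.g. {'left': (0, 1), 'top': (1, 0)}.
    Returns a boolean map of detected depth discontinuities.
    """
    max_img = np.maximum.reduce(list(flash_images.values())) + 1e-6
    edges = np.zeros(max_img.shape, dtype=bool)
    for name, img in flash_images.items():
        ratio = img / max_img            # cast shadows drop well below 1
        dy, dx = directions[name]
        ahead = np.roll(ratio, shift=(-dy, -dx), axis=(0, 1))
        # ahead[y, x] is the ratio one step further from the flash; a lit
        # pixel next to a shadowed one marks a depth edge.
        edges |= (ratio > thresh) & (ahead < thresh)
    return edges
```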
6 High Dynamic Range Using a Gradient Camera A camera sensor is limited in the range of highest and lowest intensities it can measure. To capture the high dynamic range, one can adaptively expose the sensor so that the signal-to-noise ratio is high over the entire image, including in the dark and brightly lit regions. One approach for faithfully recording the intensities in a high dynamic range scene is to capture multiple images using different exposures, and then to merge these images. The basic idea is that when longer exposures are used, dark regions are well exposed but bright regions are saturated. On the other hand, when short exposures are used, dark regions are too dark but bright regions are well imaged. If exposure varies and multiple pictures are taken of the same scene, the value of a pixel can be taken from those images where it is neither too dark nor saturated. This type of approach is often referred to as exposure bracketing. At the sensor level, various approaches have also been proposed for high dynamic range imaging. One type of approach is to use multiple sensing elements with different sensitivities within each cell [Street 1998, Handy 1986, Wen 1989, Hamazaki 1996]. Multiple measurements are made from the sensing elements, and they are combined
on-chip before a high dynamic range image is read out from the chip. Spatial sampling rate is lowered in these sensing devices, and spatial resolution is sacrificed. Another type of approach is to adjust the well capacity of the sensing elements during photocurrent integration [Knight 1983, Sayag 1990, Decker 1998], but this gives higher noise. By sensing intensities with lateral inhibition, a gradient sensing camera can record large as well as subtle changes in intensity to recover a high-dynamic range image. By sensing differences between neighboring pixels instead of actual intensities, our group has shown that a 'Gradient Camera' can record large global variations in intensity [Tumblin et al 2005]. Rather than measure absolute intensity values at each pixel, this proposed sensor measures only forward differences between them, which remain small even for extremely high-dynamic range scenes, and reconstructs the sensed image from these differences using Poisson solver methods. This approach offers several advantages: the sensor is nearly impossible to over- or under-expose, yet offers extremely fine quantization, even with very modest A/D convertors (e.g. 8 bits). The thermal and quantization noise occurs in the gradient domain, and appears as low-frequency 'cloudy' noise in the reconstruction, rather than uncorrelated high-frequency noise that might obscure the exact position of scene edges.
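Reconstructing the image from the sensed forward differences is a Poisson problem: find the image whose gradients best match the measurements. The sketch below solves it with an FFT under an assumed periodic boundary condition, which keeps the code short; a real gradient-camera pipeline would use boundary handling and a noise model appropriate to the sensor.

```python
import numpy as np

def integrate_gradients(gx, gy):
    """Reconstruct an image (up to a constant) from forward differences
    gx (horizontal) and gy (vertical) by solving a Poisson equation with
    periodic boundaries via the FFT."""
    h, w = gx.shape
    # Divergence of the gradient field (backward differences) equals the
    # discrete Laplacian of the unknown image.
    div = (gx - np.roll(gx, 1, axis=1)) + (gy - np.roll(gy, 1, axis=0))
    fx = np.fft.fftfreq(w)
    fy = np.fft.fftfreq(h)
    # Eigenvalues of the periodic 5-point Laplacian.
    denom = (2 * np.cos(2 * np.pi * fx)[None, :] +
             2 * np.cos(2 * np.pi * fy)[:, None] - 4)
    denom[0, 0] = 1.0                 # avoid dividing by zero at DC
    img_hat = np.fft.fft2(div) / denom
    img_hat[0, 0] = 0.0               # the mean intensity is unrecoverable
    return np.real(np.fft.ifft2(img_hat))
```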
7 Conclusion

As these examples indicate, we have scarcely begun to explore the possibilities offered by combining computation, 4D modeling of light transport, and novel optical systems. Nor have such explorations been limited to photography, computer graphics, or computer vision. Microscopy, tomography, astronomy, and other optically driven fields already contain some ready-to-use solutions to borrow and extend. If the goal of photography is to capture, reproduce, and manipulate a meaningful visual experience, then the camera is not sufficient to capture even the most rudimentary birthday party. The human experience and our personal viewpoint are missing. Computational Photography can supply us with visual experiences, but cannot decide which ones matter most to humans. Beyond coding the first-order parameters such as exposure, focus, illumination, and sensing, perhaps the ultimate goal of Computational Photography is to encode the human experience in a single captured photo.
Acknowledgements

We wish to thank Jack Tumblin and Amit Agrawal for contributing several ideas for this paper. We also thank our co-authors and collaborators Ashok Veeraraghavan, Ankit Mohan, Yuanzen Li, Karhan Tan, Rogerio Feris, Jingyi Yu, and Matthew Turk. We thank Shree Nayar and Marc Levoy for useful comments and discussions.
References

Raskar, R., Tan, K., Feris, R., Yu, J., Turk, M.: Non-photorealistic Camera: Depth Edge Detection and Stylized Rendering Using a Multi-Flash Camera. SIGGRAPH 2004 (2004)
Tumblin, J., Agrawal, A., Raskar, R.: Why I want a Gradient Camera. In: CVPR 2005, IEEE, Los Alamitos (2005)
Raskar, R., Agrawal, A., Tumblin, J.: Coded exposure photography: motion deblurring using fluttered shutter. ACM Trans. Graph. 25(3), 795–804 (2006)
Veeraraghavan, A., Raskar, R., Agrawal, A., Mohan, A., Tumblin, J.: Dappled Photography: Mask-Enhanced Cameras for Heterodyned Light Fields and Coded Aperture Refocusing. ACM Siggraph (2007)
Nayar, S.K., Narasimhan, S.G.: Assorted Pixels: Multi-Sampled Imaging With Structural Models. In: ECCV. European Conference on Computer Vision, vol. IV, pp. 636–652 (2002)
Debevec, Malik: Recovering high dynamic range radiance maps from photographs. In: Proc. SIGGRAPH (1997)
Mann, Picard: Being 'undigital' with digital cameras: Extending dynamic range by combining differently exposed pictures. In: Proc. IS&T 46th ann. conference (1995)
McGuire, M., Matusik, Pfister, Hughes, Durand: Defocus Video Matting. ACM Transactions on Graphics. Proceedings of ACM SIGGRAPH 2005 24(3) (2005)
Adelson, E.H., Wang, J.Y.A.: Single Lens Stereo with a Plenoptic Camera. IEEE Transactions on Pattern Analysis and Machine Intelligence 14(2) (1992)
Ng, R.: Fourier Slice Photography. SIGGRAPH (2005)
Morimura: Imaging method for a wide dynamic range and an imaging device for a wide dynamic range. U.S. Patent 5455621 (October 1993)
Levoy, M., Hanrahan, P.: Light field rendering. In: SIGGRAPH, pp. 31–42 (1996)
Dowski Jr., E.R., Cathey, W.T.: Extended depth of field through wave-front coding. Applied Optics 34(11), 1859–1866 (1995)
Georgiev, T., Zheng, C., Nayar, S., Salesin, D., Curless, B., Intwala, C.: Spatio-angular Resolution Trade-Offs in Integral Photography. In: Proceedings, EGSR 2006 (2006)
Optimal Algorithms in Multiview Geometry

Richard Hartley¹ and Fredrik Kahl²

¹ Research School of Information Sciences and Engineering, The Australian National University; National ICT Australia (NICTA)
² Centre for Mathematical Sciences, Lund University, Sweden
Abstract. This is a survey paper summarizing recent research aimed at finding guaranteed optimal algorithms for solving problems in Multiview Geometry. Many of the traditional problems in Multiview Geometry now have optimal solutions in terms of minimizing residual image-plane error. Success has been achieved in minimizing L2 (least-squares) or L∞ (smallest maximum error) norm. The main methods involve Second Order Cone Programming, or quasi-convex optimization, and branch-and-bound. The paper gives an overview of the subject while avoiding as far as possible the mathematical details, which can be found in the original papers.

J.E. Littlewood: The first test of potential in mathematics is whether you can get anything out of geometry.

G.H. Hardy: The sort of mathematics that is useful to a superior engineer, or a moderate physicist has no esthetic value and is of no interest to the real mathematician.
1 Introduction
In this paper, we describe recent work in geometric Computer Vision aimed at finding provably optimal solutions to some of the main problems. This is a subject which the two authors of this paper have been involved in for the last few years, and we offer our personal view of the subject. We cite most of the relevant papers that we are aware of and apologize for any omissions. There remain still several open problems, and it is our hope that more researchers will be encouraged to work in this area. Research in Structure from Motion since the start of the 1990s resulted in the emergence of a dominant accepted technique – bundle adjustment [46]. In this method, a geometric problem is formulated as a (usually) non-linear optimization problem, which is then solved using an iterative optimization algorithm.
NICTA is funded by the Australian Government’s Backing Australia’s Ability initiative, in part through the Australian Research Council.
Generally, the bundle adjustment problem is formulated as follows. One defines a cost function (also called an objective function) in terms of a set of parameters. Solving the problem involves finding the set of parameters that minimize the cost. Generally, the parameters are associated with the geometry that we wish to discover. Often they involve parameters of a set of cameras, as well as parameters (such as 3D point coordinates) describing the geometry of the scene. The cost function usually involves image measurements, and measures how closely a given geometric configuration (for instance the scene geometry) explains the image measurements. Bundle adjustment has many advantages, which account for its success and popularity.

1. It is quite general, and can be applied to a large range of problems.
2. It is very easy to augment the cost function to include other constraints that the problem must satisfy.
3. We can "robustify" the cost function by minimizing a robust cost function, such as the Huber cost function.
4. Typically, the estimation problem is sparse, so sparse techniques can be used to achieve quite fast run times.

One of the issues with bundle adjustment is that it requires a relatively accurate initial estimate of the geometry in order to converge to a correct minimum. This requirement led to one of the other main themes of research in Multiview Geometry through the 1990s: finding reliable initial solutions, usually through so-called algebraic techniques. Most well known among such techniques is the 8-point algorithm [28] for estimation of the essential or fundamental matrix, which solves the two-view relative orientation problem. Generally, in such methods, one defines an algebraic condition that must hold in the case of a noise-free solution, and an algebraic cost function that expresses how far this condition is from being met. Unfortunately, the cost function often is not closely connected with the geometry, and may be meaningless in any geometric or statistical sense.

Multiple minima. One of the drawbacks of bundle adjustment is the possibility of converging to a local, rather than a global, minimum of the cost function. The cost functions that arise in multiview optimization problems commonly do have multiple local minima. As an example, in Fig 1 we show graphs of the cost functions associated with the two-view triangulation problem (described later) and the autocalibration problem. The triangulation problem with more than two views is substantially more complex. It has been shown in [43] that the triangulation problem with three views involves solving a polynomial of degree 47, and hence the cost function potentially may have up to 24 minima. For higher numbers of views, the degree of the polynomial grows cubically. Stewénius and Nistér have calculated the sequence of degrees of the polynomial to be 6, 47, 148, 336, 638, 1081 for 2 to 7 view triangulation, which implies a large number of potential local minima of the cost function.
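To fix ideas, here is a minimal sketch of this recipe applied to its simplest instance, two-view triangulation of one point: reprojection residuals are minimized with an off-the-shelf iterative least-squares solver (SciPy's least_squares, standing in for a Levenberg-Marquardt style bundle adjuster; the cameras, point, and noise are hypothetical). Like bundle adjustment in general, the solver converges to the minimum nearest the initial estimate.

```python
import numpy as np
from scipy.optimize import least_squares

def project(P, X):
    """Pinhole projection of a 3D point X by a 3x4 camera matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def residuals(X, cameras, observations):
    """Stacked 2D reprojection residuals; the cost is their sum of squares."""
    return np.concatenate([project(P, X) - x for P, x in zip(cameras, observations)])

# Hypothetical cameras, point, and slightly perturbed image observations.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.3, -0.2, 4.0])
obs = [project(P1, X_true) + 0.001, project(P2, X_true) - 0.001]

X0 = np.array([0.0, 0.0, 3.0])                      # initial estimate
sol = least_squares(residuals, X0, args=([P1, P2], obs))
print(sol.x)                                        # close to X_true
```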
Fig. 1. Top: Two-view triangulation cost function, showing up to three minima. The independent variable (x-axis) parametrizes the epipolar plane. On the left, an example with three local minima (two of them with equal cost). On the right, an example with global solution with zero cost (perfect noise-free triangulation), yet having a further local minimum. The graphs are taken from [12]. Bottom: 2-dimensional cross-section of a cost associated with an autocalibration problem. The graphs are taken from [9]. Left to right: top view, side-view, contour plot. Note the great complexity of the cost function, and the expected difficulties in finding a global minimum.
As for autocalibration, progress has been made on this problem in [5], which finds optimal methods of carrying out the projective-affine-Euclidean upgrade, the "stratified" approach to autocalibration. However, this is not quite the same as an optimal solution to autocalibration, which remains an open (very difficult) problem.

Optimal methods. The difficulties and uncertainties associated with bundle adjustment and algebraic methods have led to a theme of research in the last few years that aims at finding guaranteed, provably optimal methods for solving these problems. Although such methods have not been successful for all geometric problems, the number of problems that can be solved using such optimal methods continues to grow. It is the purpose of this paper to survey progress in this area.
2 What Is Meant by an Optimal Solution?
In this section, we will argue that what is meant by an optimal solution to a geometric problem is not clearly defined. We consider a problem in which a set of measurements is made and we need to fit a parametrized model of some kind to these measurements. Optimization involves finding the set of parameters of the model that best fits the purpose. The optimization problem is defined in terms of a specified cost function which must be minimized over the set of
meaningful parameters. However, what particular cost functions merit being called "optimal" will be considered in the rest of this section.

To set up the problem, consider a set of measurements x_i to which we wish to fit a parametrized model. The model gives rise to some model values x̂_i which are defined in some way in terms of the parametrization. The set of residuals δ_i are defined as δ_i = ‖x_i − x̂_i‖, where ‖·‖ represents a suitable norm in the measurement space. For image measurements, this is most reasonably the distance in the image between x_i and x̂_i. Finally, denote by Δ the vector with components δ_i. This may be called the vector of residuals.

2.1 L2 Error and ML Estimation
Often, it is argued that the optimal solution is the least-squares solution, which minimizes the cost function

    ‖Δ‖₂² = Σ_i ‖x_i − x̂_i‖²,

namely the L2-norm of the vector of residuals.¹ The argument for the optimality of this solution is as follows. If we assume that the measurements are derived from actual values, corrupted with Gaussian noise with variance σ², then the probability of the set of measurements x_i given true values x̂_i is given by

    P({x_i} | {x̂_i}) = K exp( −Σ_i ‖x_i − x̂_i‖² / (2σ²) ),

where K is a normalizing constant. Taking logarithms and minimizing, we see that the modelled data that maximizes the probability of the measurements also minimizes Σ_i ‖x_i − x̂_i‖². Thus, the least-squares solution is the maximum likelihood (ML) estimate, under an assumption of Gaussian noise in the measurements. This is the argument for optimality of the least-squares solution. Although useful, this argument does rely on two assumptions that are open to question.

1. It makes an assumption of Gaussian noise. This assumption is not really justified at all. Measurement errors in a digital image are not likely to satisfy a Gaussian noise model.
2. Maximum likelihood is not necessarily the same as optimal, a term that we have not defined. One might define optimal to mean maximum likelihood, but this is a circular argument.

¹ In future the symbol ‖·‖ represents the 2-norm of a vector.

2.2 L∞ Error
An alternative noise model, perhaps equally justified for image point measurements, is that of uniform bounded noise. Thus, we assume that all measurements
less than a given threshold distance from the true value are equally likely, but measurements beyond this threshold have probability zero. In the case of a discrete image, measurements more accurate than one pixel from the true value are difficult to obtain, though in fact they may be achieved in carefully controlled situations. If we assume that the measurement error probability model is

    P(x | x̂) = K exp( −(‖x − x̂‖/σ)^p ),        (1)

where K is a normalizing factor, then as before, the ML estimate is the one that minimizes Σ_i ‖x_i − x̂_i‖^p. Letting p increase to infinity, the probability distribution (1) converges uniformly (except at σ) to a uniform distribution for x within distance σ of x̂. Now, taking the p-th root, we see that minimizing this sum is equivalent to minimizing the p-norm of the vector Δ of residuals. Furthermore, as p increases to infinity, ‖Δ‖_p converges to the L∞ norm ‖Δ‖_∞. In this way, minimizing the L∞ error corresponds to an assumption of a uniform distribution of measurement errors.

Looked at another way, under the L∞ norm, all sets of measurements that are within a distance σ of the modelled values are equally likely, whereas a set of measurements where one of the values exceeds the threshold σ has probability zero. Then L∞ optimization finds the smallest noise threshold for which the set of measurements is possible, and determines the ML estimate for this minimum threshold. Note that the L∞ norm of a vector is simply the largest component of the vector, in absolute value. Thus,

    min ‖Δ‖_∞ = min max_i ‖x_i − x̂_i‖,

where the minimization is taken over the parameters of the model. For this reason, L∞ minimization is sometimes referred to as minimax optimization.

2.3 Other Criteria for Optimality
It is not clear that the maximum likelihood estimate has a reasonable claim to being the optimal estimate. It is pointed out in [13] that the maximum likelihood estimate may be biased, and in fact have infinite bias even for quite simple estimation problems. Thus, as the noise level of measurements increases, the average (or expected) estimate drifts away from the true value. This is of course undesirable. In addition, a different way of thinking of the estimation problem is in terms of the risk of making a wrong estimate. For example consider the triangulation problem (discussed in more detail later) in which several observers estimate a bearing (direction vector) to a target from known observation points. If the bearing directions are noisy, then where is the target? In many cases, there is a cost associated with wrongly estimating the position of the target. (For instance, if the target is an incoming ballistic missile, the
cost of a wrong estimate can be quite high.) A reasonable procedure would be to choose an estimate that minimizes the expected cost. As an example, if the cost of an estimate is equal to the square of the distance between the estimate and the true value, then the estimate that minimizes the expected cost is the mean of the posterior probability distribution of the parameters, P(θ | {x_i}).² More discussion of these matters is contained in [13], Appendix 3. We are not, however, aware of any literature in multiview geometry finding estimates of this kind.

² This assumes that the parameters θ are in a Euclidean space, which may not always be the case. In addition, estimation of the posterior probability distribution may require the definition of a prior P(θ).

What we mean by optimality. In this survey, we will consider the estimates that minimize the L2 or L∞ norm of the residual error vector, with respect to a parametrized model, as being optimal. This is reasonable in that it is related in either case to a specific geometric noise model, tied directly to the statistics of the measurements.
3 Polynomial Methods
One approach for obtaining optimal solutions to multiview problems is to compute all stationary points of the cost function and then check which of these is the global minimum. From a theoretical point of view, any structure and motion problem can be solved in this manner as long as the cost function can be expressed as a rational polynomial function in the parameters. This will be the case for most cost functions encountered (though not for L∞ cost functions, which are not differentiable). The method is as follows. A minimum of the cost function must occur at a point where the derivatives of the cost with respect to the parameters vanish. If the cost function is a rational polynomial function, then the derivatives are rational as well. Setting the derivatives to zero leads to a system of polynomial equations, and the solutions of this set of equations define the stationary points of the cost function. These can be checked one-by-one to find the minimum. This method may also be applied when the parameters satisfy certain constraints, such as a constraint of zero determinant for the fundamental matrix, or a constraint that a matrix represents a rotation. Such problems can be solved using Lagrange multipliers. Although this method is theoretically generally applicable, in practice it is only tractable for small problems, for example the triangulation problem. A solution to the two-view triangulation problem was given in [12], involving the solution of a polynomial of degree 6. The three-view problem was addressed in [43]; the solution involves solving a polynomial of degree 47. Further work (unpublished) by Stewénius and Nistér has shown that the triangulation problem for 4 to 7 views can be solved by finding the roots of a polynomial of degree 148, 336, 638, 1081 respectively, and in general, the degree grows cubically. Since
solving large sets of polynomials is numerically difficult, the issue of accuracy has been addressed in [4].

The polynomial method may also be applied successfully in many minimal-configuration problems. We do not attempt here to enumerate all such problems considered in the literature. One notable example, however, is the relative orientation (two-view reconstruction) problem with 5 point correspondences. This has long been known to have 10 solutions [7,16].³ The second of these references gives a very pleasing derivation of this result. Recent simple algorithms for solving this problem by finding the roots of a polynomial have been given in [31,26]. Methods for computing a polynomial solution need not result in a polynomial of the smallest possible degree. However, recent work using Galois theory [32] gives a way to address this question, showing that the 2-view triangulation and the relative orientation problem essentially require solution of polynomials of degree 6 and 10 respectively.

³ These papers find 20 solutions, not 10, since they are solving for the number of possible rotations. There are 10 possible essential matrices, each of which gives two possible rotations, related via the twisted pair ambiguity (see [13]). Only one of these two rotations is cheirally correct, corresponding to a possible realizable solution.
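For a one-parameter rational cost, the stationary-point recipe is easy to sketch numerically; the toy cost below is an illustrative assumption, not the degree-6 two-view triangulation polynomial of [12].

```python
import numpy as np

# Toy rational cost C(t) = p(t) / q(t) with q > 0 and C -> +inf as |t| -> inf,
# so the global minimum is attained at a stationary point.
p = np.poly1d([1.0, -2.0, 0.0, 3.0, 0.5])   # numerator (degree 4)
q = np.poly1d([1.0, 0.0, 1.0])              # denominator t^2 + 1, positive everywhere

# Stationary points: d/dt [p/q] = (p' q - p q') / q^2 = 0  =>  p' q - p q' = 0.
num = p.deriv() * q - p * q.deriv()
roots = num.roots
crit = roots[np.abs(roots.imag) < 1e-9].real   # keep the real stationary points

# Evaluate the cost at every stationary point and keep the smallest value.
vals = p(crit) / q(crit)
t_opt = crit[np.argmin(vals)]
print(t_opt, vals.min())
```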
4 L∞ Estimation and SOCP
In this section, we will consider L∞ optimization, and discuss its advantages vis-a-vis L2 optimization. We show that there are many problems that can be formulated in the L∞ norm and yield a single solution. This is the main advantage, and contrasts with L2 optimization, which may have many local minima, as was shown in Fig 1.

4.1 Convex Optimization
We start by considering convex optimization problems. First, a few definitions.

Convex set. A subset S of Rⁿ is said to be convex if the line segment joining any two points in S is contained in S. Formally, if x₀, x₁ ∈ S, then (1 − α)x₀ + αx₁ ∈ S for all α with 0 ≤ α ≤ 1.

Convex function. A function f : Rⁿ → R is convex if its domain is a convex set and for all x₀, x₁ ∈ domain(f), and α with 0 ≤ α ≤ 1, we have f((1 − α)x₀ + αx₁) ≤ (1 − α)f(x₀) + αf(x₁). Another, less formal, way of defining a convex function is to say that a line joining two points on the graph of the function will always lie above the graph. This is illustrated in Fig 2.
Fig. 2. Left. Examples of convex and non-convex sets. Middle. The definition of a convex function; the line joining two points lies above the graph of the function. Right. Convex optimization.
Convex optimization. A convex optimization problem is as follows:

– Given a convex function f : D → R, defined on a convex domain D ⊂ Rⁿ, find the minimum of f on D.

A convex function is always continuous, and given reasonable conditions that ensure a minimum of the function (for instance, D is compact), such a convex optimization problem is solvable by known algorithms.⁴ A further desirable property of a convex problem is that it has no local minima apart from the global minimum. The global minimum value is attained at a single point, or at least on a convex set in Rⁿ where the function takes the same minimum value at all points. For further details, we refer the reader to the book [3].

⁴ The most efficient algorithm to use will depend on the form of the function f, and the way the domain D is specified.

Quasi-convex functions. Unfortunately, although convex problems are agreeable, they do not come up often in multiview geometry. Interestingly enough, however, certain other problems do: quasi-convex problems. Quasi-convex functions are defined in terms of sublevel sets as follows.

Definition 1. A function f : D → R is called quasi-convex if its α-sublevel set, S_α = {x ∈ D | f(x) ≤ α}, is convex for all α.

Examples of quasi-convex and non-quasi-convex functions are shown in Fig 3. Quasi-convex functions have two important properties.

1. A quasi-convex function has no local minima apart from the global minimum. It will attain its global minimum at a single point or else on a convex set where it assumes a constant value.
Fig. 3. Quasi-convex functions. The left two functions are quasi-convex. All the sublevel sets are convex. The function on the right is not quasi-convex. The indentation in the function graph (on the left) means that the sublevel-sets are not convex. All convex functions are quasi-convex, but the example on the left shows that the converse is not true.
2. The pointwise maximum of a set of quasi-convex functions is quasi-convex. This is illustrated in Fig 4 for the case of functions of a single variable. The general case follows directly from the following observation concerning sublevel sets:

    S_δ(max_i f_i) = ⋂_i S_δ(f_i),

which is convex, since each S_δ(f_i) is convex.
Fig. 4. The pointwise maximum of a set of quasi-convex functions is quasi-convex
A quasi-convex optimization problem is defined in the same way as a convex optimization problem, except that the function to be minimized is quasi-convex. Nevertheless, quasi-convex optimization problems share many of the pleasant properties of convex optimization. Why consider quasi-convex functions? The primary reason for considering such functions is that the residual of a measured image point x with respect to the projection of a point X in space is a quasi-convex function. In other words, f (X) = d(x, PX) is a quasi-convex function of X. Here, PX is the projection of a point X into an image, and d(·, ·) represents distance in the image.
Fig. 5. The triangulation problem: Assuming that the maximum reprojection error is less than some value δ, the sought point X must lie in the intersection of a set of cones. If δ is set too small, then the cones do not have a common intersection (left). If δ is set too large, then the cones intersect in a convex region in space, and the desired solution X must lie in this region (right). The optimal value of δ lies between these two extremes, and can be found by a binary search (bisection) testing successive values of δ. For more details, refer to the text.
Specifically, the sublevel set S_δ(f(X)) is a convex set, namely a cone with vertex at the centre of projection, as will be discussed in more detail shortly (see Fig 5). As the reader may easily verify by example, however, the sum of quasi-convex functions is not in general quasi-convex. If we take several image measurements, then the sum of squares of the residuals will not in general be a quasi-convex function. In other words, an L2 cost function of the form Σ_{i=1}^{N} f_i(X)² will not in general be a quasi-convex function of X, nor have a single minimum. On the other hand, as remarked above, the maximum of a set of quasi-convex functions is quasi-convex, and hence will have a single minimum. Specifically, max_i f_i(X) will be quasi-convex, and have a single minimum with respect to X. For this reason, it is typically easier to solve the minimax problem min_X max_i f_i(X) than the corresponding least-squares (L2) problem.

Example: The triangulation problem. The triangulation problem is the simplest problem in multiview geometry. Nevertheless, in the L2 formulation, it still suffers from the problem of local minima, as shown in Fig 1. In this problem, we have a set of known camera centres O_i and a set of direction vectors v_i which give the direction of a target point X from each of the camera centres. Thus, nominally, v_i = (X − O_i)/‖X − O_i‖. The problem is to find the position of the point X. We choose to solve this problem in the L∞ norm; in other words, we seek the point X that minimizes the maximum error (over all i) between v_i and the directions given by X − O_i.

Consider Fig 5. Some simple notation is required. Define ∠(X − O_i, v_i) to be the angle between the vectors X − O_i and v_i. Given a value δ > 0, the set of points C_δ(O_i, v_i) = {X | ∠(X − O_i, v_i) ≤ δ} forms a cone in R³ with vertex O_i, axis v_i, and angle determined by δ.
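The bisection strategy of Fig 5, formalized in the observations below, can be sketched as follows. The sketch assumes the cvxpy modelling package as a stand-in for the SOCP solvers used in the cited work (e.g. SeDuMi [44]); for δ < π/2, membership in the cone C_δ(O_i, v_i) is the second-order cone constraint ‖(I − v_i v_iᵀ)(X − O_i)‖ ≤ tan δ · v_iᵀ(X − O_i).

```python
import numpy as np
import cvxpy as cp

def cones_feasible(O, v, delta):
    """SOCP feasibility test: do the cones C_delta(O_i, v_i) have a common point?
    O: (n, 3) camera centres, v: (n, 3) unit bearing directions, 0 < delta < pi/2."""
    X = cp.Variable(3)
    cons = []
    for Oi, vi in zip(O, v):
        d = X - Oi
        P = np.eye(3) - np.outer(vi, vi)       # projector orthogonal to the cone axis
        cons.append(cp.norm(P @ d) <= np.tan(delta) * (vi @ d))
    prob = cp.Problem(cp.Minimize(0), cons)
    prob.solve()
    return prob.status in ("optimal", "optimal_inaccurate"), X.value

def linf_triangulate(O, v, tol=1e-6):
    """Bisection on the maximum angular error delta (the L-infinity cost)."""
    lo, hi = 0.0, 0.5 * np.pi - 1e-3
    X_best = None
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        ok, X = cones_feasible(O, v, mid)
        if ok:
            hi, X_best = mid, X                # cones intersect: shrink delta
        else:
            lo = mid                           # infeasible: grow delta
    return X_best, hi
```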
We begin by hypothesizing that there exists a solution X to the triangulation problem for which the maximum error is δ. In this case, the point X must lie inside cones in R³ with vertex O_i, axis v_i, and angle determined by δ. If the cones are too narrow, they do not have a common intersection and there can be no solution with maximum error less than δ. On the other hand, if δ is sufficiently large, then the cones intersect, and the desired solution X must lie in the intersection of the cones. The optimal value of δ is found by a binary search over values of δ to find the smallest value such that the cones C_δ(O_i, v_i) intersect in at least one point. The intersection will be a single point, or in special configurations, a segment of a line. The problem of determining whether a set of cones has non-empty intersection is solved by a convex optimization technique called Second Order Cone Programming (SOCP), for which open source libraries exist [44]. We make certain observations about this problem:

1. Each cone C_δ(O_i, v_i) is a convex set, and hence their intersection is convex.
2. If we define a cost function

       Cost_∞(X) = max_i ∠(X − O_i, v_i),

   then the sublevel set S_δ(Cost_∞) is simply the intersection of the cones C_δ(O_i, v_i), which is convex for all δ. This by definition says that Cost_∞(X) is a quasi-convex function of X.
3. Finding the optimum

       min_X Cost_∞(X) = min_X max_i ∠(X − O_i, v_i)

   is accomplished by a binary search over possible values of δ, where for each value of δ we solve an SOCP feasibility problem (determining whether a set of cones has a common intersection). Such a problem is known as a minimax or L∞ optimization problem.

Generally speaking, this procedure generalizes to arbitrary quasi-convex optimization problems; they may be solved by binary search involving a convex feasibility problem at each step. If we have a set of individual cost functions f_i(X), perhaps each associated with a single measurement, and each of them quasi-convex, then the maximum of these cost functions max_i f_i(X) is also quasi-convex, as illustrated in Fig 4. In this case, the minimax problem of finding min_X max_i f_i(X) is solvable by binary search.

Reconstruction with known rotations. Another problem that may be solved by very similar means to the triangulation problem is that of Structure and Motion with known rotations, which is illustrated in Fig 6.

The role of cheirality. It is important in solving problems of this kind to take into account the concept of "cheirality", which means the requirement that
Fig. 6. Structure and motion with known rotations. If the orientations of several cameras are all known, then image points correspond to direction vectors in a common coordinate frame. Here, blue (or grey) circles represent the positions of cameras, and black circles the positions of points. The arrows represent direction vectors (their length is not known) from cameras to points. The positions of all the points and cameras may be computed (up to scale and translation) using SOCP. Abstractly, this is the embedding problem for a bi-partite graph in 3D, where the orientation of the edges of the graph is known. The analogous problem for an arbitrary (not bi-partite) graph was applied in [41] to solve for motion of the cameras without computing the point positions.
points visible in an image must lie in front of the camera, not behind. If we subdivide space by planes separating front and back of the camera, then there will be at least one local minimum of the cost function (whether L∞ or L2) in each region of space. Since the number of regions separated by n planes grows cubically, so does the number of local minima, unless we constrain the solution so that the points lie in front of the cameras.

Algorithm acceleration. Although the bisection algorithm using SOCP has been the standard approach to L∞ geometric optimization problems, there has been recent work on speeding up the computations [39,40,34]. However, it has been shown that the general structure and motion problem (with missing data) is NP-hard no matter what criterion of optimality of reprojection error is used [33].

4.2 Problems Solved in L∞ Norm
The list of problems that can be solved globally with L∞ estimation continues to grow, and by now it is quite long; see Table 1. In [21], an L∞ estimation algorithm serves as the basis for solving the least-median-of-squares problem for various geometric problems. However, the extension to such problems is essentially based on heuristics, and it has no guarantee of finding the global optimum.
Table 1. Geometric reconstruction problems that can be solved globally with the L∞ or L2 norm (references follow each entry)

L∞-norm:
− Multiview triangulation [11,18,20,6]
− Camera resectioning (uncalibrated case) [18,20]
− Camera pose (calibrated case) [10,47]
− Homography estimation [18,20]
− Structure and motion recovery with known camera orientation [11,18,20]
− Reconstruction by using a reference plane [18]
− Camera motion recovery [41]
− Outlier detection [42,20,25,47]
− Reconstruction with covariance-based uncertainty [41,21]
− Two-view relative orientation [10]
− 1D retinal vision [2]

L2-norm:
− Affine reconstruction from affine cameras [23,45]
− Multiview triangulation [1,29]
− Camera resectioning (uncalibrated case) [1]
− Homography estimation [1]
− 3D – 3D registration and matching [14]
− 3D – 3D registration and matching (unknown pairing) [27]

5 Branch-and-Bound Theory
The method based on L∞ optimization is not applicable to all problems. In this section, we will describe in general terms a different method that has been used with success in obtaining globally optimal solutions. Branch and bound algorithms are non-heuristic methods for global optimization in non-convex problems. They maintain a provable upper and/or lower bound on the (globally) optimal objective value and terminate with a certificate proving that the solution is within ε of the global optimum, for arbitrarily small ε.

Consider a non-convex, scalar-valued objective function Φ(x), for which we seek a global minimum over a domain Q₀. For a subdomain Q ⊆ Q₀, let Φ_min(Q) denote the minimum value of the function Φ over Q. Also, let Φ_lb(Q) be a function that computes a lower bound for Φ_min(Q), that is, Φ_lb(Q) ≤ Φ_min(Q). An intuitive technique to determine the solution would be to divide the whole search region Q₀ into a grid with cells of sides δ and compute the minimum of a lower bounding function Φ_lb defined over each grid cell, with the presumption that each Φ_lb(Q) is easier to compute than the corresponding Φ_min(Q). However, the number of such grid cells increases rapidly as δ → 0, so a clever procedure must be deployed to create as few cells as possible and "prune" away as many of these grid cells as possible (without having to compute the lower bounding function for these cells). Branch and bound algorithms iteratively subdivide the domain into subregions (which we refer to as rectangles) and employ clever strategies to prune away as many rectangles as possible to restrict the search region.
Fig. 7. This figure illustrates the operation of a branch and bound algorithm on a one-dimensional non-convex minimization problem. Figure (a) shows the function Φ(x) and the interval l ≤ x ≤ u in which it is to be minimized. Figure (b) shows the convex relaxation of Φ(x) (indicated in yellow/dashed), its domain (indicated in blue/shaded) and the point for which it attains a minimum value. q₁* is the corresponding value of the function Φ. This value is the current best estimate of the minimum of Φ(x), and is used to reject the left subinterval in Figure (c) because the minimum value of the convex relaxation is higher than q₁*. Figure (d) shows the lower bounding operation on the right sub-interval, in which a new estimate q₂* of the minimum value of Φ(x) is found.
A graphical illustration of the algorithm is presented in Fig 7. Computation of the lower bounding functions is referred to as bounding, while the procedure that chooses a domain and subdivides it is called branching. The choice of the domain picked for refinement in the branching step and the actual subdivision itself are essentially heuristic. Although guaranteed to find the global optimum (or a point arbitrarily close to it), the worst case complexity of a branch and bound algorithm is exponential. However, in many cases the properties offered by multiview problems lead to fast convergence rates in practice.
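A toy one-dimensional version of this branch-and-bound loop is sketched below. The objective, its assumed Lipschitz constant and the resulting lower bound are illustrative stand-ins for the convex relaxations used in the cited papers.

```python
import numpy as np

def branch_and_bound(phi, phi_lb, Q0, eps=1e-6, min_width=1e-9):
    """Generic 1-D branch and bound: phi(x) is the objective, phi_lb(l, u) a lower
    bound for min phi over [l, u], Q0 the initial interval."""
    l0, u0 = Q0
    best_x = 0.5 * (l0 + u0)
    best_val = phi(best_x)
    queue = [(l0, u0)]
    while queue:
        l, u = queue.pop()
        if u - l < min_width or phi_lb(l, u) > best_val - eps:
            continue                      # bound: this interval cannot improve best_val
        m = 0.5 * (l + u)                 # branch: split at the midpoint
        if phi(m) < best_val:
            best_x, best_val = m, phi(m)
        queue += [(l, m), (m, u)]
    return best_x, best_val

# Example: a wiggly objective with a Lipschitz-style lower bound (L is an assumed bound).
L = 12.0
phi = lambda x: np.sin(3.0 * x) + 0.1 * (x - 1.0) ** 2
phi_lb = lambda l, u: phi(0.5 * (l + u)) - 0.5 * L * (u - l)
print(branch_and_bound(phi, phi_lb, (-4.0, 4.0)))
```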
6 Branch-and-Bound for L2 Minimization
The branch-and-bound method can be applied to find L2 norm solutions to certain simple problems. This is done by a direct application of branch-and-bound over the parameter space. Up to now, methods used for branching have been quite simple, consisting of simple subdivision of rectangles in half, or in half along all dimensions.

In order to converge as quickly as possible to the solution, it is useful to confine the region of parameter space that needs to be searched. This can be conveniently done if the cost function is a sum of quasi-convex functions. For instance, suppose the cost is C₂(X) = Σ_i f_i(X)², and the optimal point is denoted by X_opt. If a good initial estimate X₀ is available, with C₂(X₀) = δ², then

    C₂(X_opt) = Σ_i f_i(X_opt)² ≤ C₂(X₀) = δ².

This implies that each f_i(X_opt) ≤ δ, and so X_opt ∈ ⋂_i S_δ(f_i), which is a convex set enclosing both X₀ and X_opt. One can find a rectangle in parameter space that
encloses this convex region, and begin the branch-and-bound algorithm starting with this rectangle. This general method was used in [1] to solve the L2 multiview triangulation problem and the uncalibrated camera resection (pose) problem. In that paper fractional programming (described later in section 8.2) was used to define the convex sub-envelope of the cost function, and hence provide a cost lower bound on each rectangle. The same branch-and-bound idea, but with a different bounding method, was described in [29] to provide a simpler solution to the triangulation problem. The triangulation problem for other geometric features, more specifically lines and conics, was solved in [17] using the same approach.
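Returning to the initial confinement of the search region described above: here is a sketch of that bounding step, reusing the cone constraints from the triangulation example of Section 4 and again assuming the cvxpy package. Each face of the enclosing rectangle is found by minimizing or maximizing one coordinate over the convex set ⋂_i S_δ(f_i).

```python
import numpy as np
import cvxpy as cp

def bounding_rectangle(O, v, delta):
    """Axis-aligned box enclosing the convex set {X : angle(X - O_i, v_i) <= delta
    for all i}.  Assumes delta is small enough that the intersection is bounded."""
    X = cp.Variable(3)
    cons = []
    for Oi, vi in zip(O, v):
        d = X - Oi
        P = np.eye(3) - np.outer(vi, vi)
        cons.append(cp.norm(P @ d) <= np.tan(delta) * (vi @ d))
    lo, hi = np.zeros(3), np.zeros(3)
    for k in range(3):
        lo[k] = cp.Problem(cp.Minimize(X[k]), cons).solve()
        hi[k] = cp.Problem(cp.Maximize(X[k]), cons).solve()
    return lo, hi   # branch and bound is then started on this rectangle
```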
7 Branch-and-Bound in Rotation Space

7.1 Essential Matrix
All the problems that we have so far considered, solvable using quasi-convex optimization or SOCP, involved no rotations. If there are rotations, then the optimization problem is no longer quasi-convex. An example of this type of problem is estimation of the essential matrix from a set of matching points x_i ↔ x′_i. A linear solution to this problem was given in 1981 by Longuet-Higgins [28]. From the essential matrix E, which satisfies the defining equation x′_iᵀ E x_i = 0 for all i, we can extract the relative orientation (rotation and translation) of the two cameras.

To understand why this is not a quasi-convex optimization problem, we look at the minimal problem involving 5 point correspondences. It is well known that with 5 points, there are 10 solutions for the relative orientation. (Recent algorithms for solving the 5-point orientation problem are given in [31,26].) However, if there are many possible discrete solutions, then the problem cannot be quasi-convex or convex, since such problems have a unique solution. This is so whether we are seeking an L∞ or L2 solution, since the 5-point problem has an exact solution, and hence the cost is zero in either norm. Many algorithms for estimating the essential matrix have been given, without however any claim to optimality. Recently, an optimal solution, at least in L∞ norm, was given in [10].

To solve the essential matrix problem optimally (at least in L∞ norm), we make the following observation. If the rotation of the two cameras were somehow known, then the problem would reduce to the one discussed in Fig 6, where the translation of the cameras can be estimated optimally (in L∞ norm) given the rotations. The residual cost of this optimal solution may be found, as a function of the assumed rotation. To solve for the relative pose (rotation and translation) of the two cameras, we merely have to consider all possible rotations, and select the one that yields the smallest residual. The trick is to do this without having to look at an infinite number of rotations. Fortunately, branch-and-bound provides a means of carrying out this search. A key to the success of branch-and-bound is that the optimal cost (residual) estimated for one value of the rotation constrains the optimal cost for "nearby"
rotations. This allows us to put a lower bound on the optimal cost associated with all rotations in a region of rotation space. The branch-and-bound algorithm carries out a search of rotation space (a 3-dimensional space). The translation is not included in the branch-and-bound search. Instead, for a given rotation, the optimal translation may be computed using SOCP, and hence factored out of the parameter-space search.

A similar method of search over rotation space is used in [10] to solve the calibrated camera pose problem. This is the problem of finding the position and orientation of a camera given known 3D points and their corresponding image points. An earlier iterative algorithm that addresses this problem using L∞ norm is given in [47]. The algorithm appears to converge to the global optimum, but the author states that this is unproven.

7.2 General Structure and Motion
The method outlined in section 7.1 for solving the structure and motion problem for two views could in principle be extended to give an optimal solution (albeit in L∞ norm) to the complete structure and motion problem for any number of views. This would involve a search over the combined space of all rotations. For the two-camera problem there is only one relative rotation, and hence the branch-and-bound algorithm involves a search over a 3-dimensional parameter space. In the case of n cameras, however, the parameter space is 3(n − 1)-dimensional. Had we but world enough and time, a branch-and-bound search over the combined rotation parameter space would yield an optimal L∞ solution to structure and motion with any number of cameras and points. Unfortunately, in terms of space and time requirements, this algorithm would be vaster than empires and more slow [30].

7.3 1D Retinal Vision
One-dimensional cameras have proven useful in several different applications, most prominently for autonomous guided vehicles (see Fig 8), but also in ordinary vision for analysing planar motion and the projection of lines. Previous results on one-dimensional vision are limited to classifying and solving minimal cases, bundle adjustment for finding local minima of the structure and motion problem, and linear algorithms based on algebraic cost functions. A method for finding the global minimum of the structure and motion problem using the max norm of reprojection errors is given in [2]. In contrast to the 2D case, which uses SOCP, the optimal solution can be computed efficiently using simple linear programming techniques.

It is assumed that neither the positions of the objects nor the positions and orientations of the cameras are known. However, it is assumed that the correspondence problem is solved, i.e., it is known which measured bearings correspond to the same object. The problem can formally be stated as follows. Given n bearings from m different positions, find the camera positions and 2D points
Fig. 8. Left: A laser-guided vehicle. Middle: A laser scanner or angle meter. Right: Calculated structure and motion for the ice-hockey experiment.
in the plane, such that the reprojected solution has minimal residual errors. The norm for measuring the errors will be the L∞ norm. The basic idea of the optimization scheme is to first consider optimization with fixed camera orientations (which is a quasi-convex problem) and then use branch-and-bound over the space of possible orientations, similar to that of section 7.1.

Hockey rink data. By combining optimal structure and motion with optimal resection and intersection, it is possible to solve for arbitrarily many cameras and views. We illustrate this with the data from a real set of measurements performed at an ice-hockey rink. The set contains 70 images of 14 points. The result is shown in the right of Fig 8.

7.4 3D – 3D Alignment
A similar method of doing branch-and-bound in rotation space was used in [27] to find an optimal solution for the problem of aligning two sets of points in 3D with unknown pairing. The solution consists of a specified pairing of the two point sets, along with a rotation and translation to align the paired points. The algorithm relies on the fact that if the rotation is known, then the optimal pairing can be computed directly using the Hungarian algorithm [36]. This enables the problem to be addressed by a branch-and-bound search over rotations. The problem is solved for the L2 norm in [27] using a bounding method based on Lipschitz bounds. Though the L∞ problem is not specifically addressed in the paper, it would probably also yield to a similar approach.
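The fixed-rotation inner step is easy to sketch: for a full permutation the optimal translation is just the difference of centroids, and the optimal pairing is then a linear assignment problem (SciPy's linear_sum_assignment below). This is a minimal illustration inspired by [27], not the authors' implementation, and the points are hypothetical; the outer search over rotations would be branch-and-bound as in Section 7.1.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def pairing_cost(R, A, B):
    """L2 alignment cost of point sets A, B (both (n, 3)) for a fixed rotation R:
    optimal translation = centroid difference, optimal pairing = Hungarian algorithm."""
    t = B.mean(axis=0) - A.mean(axis=0) @ R.T
    A_t = A @ R.T + t
    C = ((A_t[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)  # pairwise squared distances
    rows, cols = linear_sum_assignment(C)
    return C[rows, cols].sum(), cols

# Toy check: a transformed and shuffled copy of A gives (near) zero cost at the true R.
rng = np.random.default_rng(0)
A = rng.normal(size=(8, 3))
c, s = np.cos(0.3), np.sin(0.3)
R_true = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
B = (A @ R_true.T + np.array([0.5, -0.2, 1.0]))[rng.permutation(8)]
print(pairing_cost(R_true, A, B)[0])   # ~0
```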
7.5 An Open Problem: Optimal Essential Matrix Estimation in L2 Norm
The question naturally arises of whether we can use similar techniques to section 7.1 to obtain the optimal L2 solution for the essential matrix. At present, we have no solution to this problem. Two essential steps are missing.
1. If the relative rotation R between the two cameras is given, can we estimate the optimal translation t? This is simple in L∞ norm using SOCP, or in fact linear programming. In L2 norm, a solution has been proposed in [15], but it is iterative, and it is not clear that it is guaranteed to converge. For n points, this seems to be a harder problem than optimal n-view L2 triangulation (for which solutions have recently been given [1,29]).
2. If we can find the optimal residual for a given rotation, how does this constrain the solution for nearby rotations? Loose bounds may be given, but they may not be sufficiently tight to allow for efficient convergence of the branch-and-bound algorithm.
8 Other Methods for Minimizing L2 Norm

8.1 Convex Relaxations and Semidefinite Programming
Another general approach for solving L2 problems was introduced in [19] based on convex relaxations (underestimators) and semidefinite programming. More specifically, the approach is based on a hierarchy of convex relaxations to solve non-convex optimization problems. Linear matrix inequalities (LMIs) are used to construct the convex relaxations. These relaxations generate a monotone sequence of lower bounds of the minimal value of the objective function and it is shown how one can detect whether the global optimum is attained at a given relaxation. Standard semidefinite programming software (like SeDuMi [44]) is extensively used for computing the bounds. The technique is applied to a number of classical vision problems: triangulation, camera pose, homography estimation and epipolar geometry estimation. Although good results are obtained, there is no guarantee of achieving the optimal solution and the sizes of the problem instances are small.

8.2 Fractional Programming
Yet another method was introduced in [1,35]. It was the first method to solve the n-view L2 triangulation problem with a guarantee of optimality. Other problem applications include camera resectioning (that is, uncalibrated camera pose), camera pose estimation and homography estimation. In its most general form, fractional programming seeks to minimize/maximize a sum of fractions subject to convex constraints. Our interest from the point of view of multiview geometry, however, is specific to the minimization problem

    min_x Σ_{i=1}^{p} f_i(x)/g_i(x)   subject to x ∈ D,

where f_i : Rⁿ → R and g_i : Rⁿ → R are convex and concave functions, respectively, and the domain D ⊂ Rⁿ is a convex compact set. This is because the residual of the projection of a 3D point into an image may be written in this form. Further, it is assumed that both f_i and g_i are positive with lower
and upper bounds over D. Even with these restrictions, the above problem is NP-complete [8], but practical and reliable estimation of the global optimum is still possible for many multiview problems through an iterative algorithm that solves an appropriate convex optimization problem at each step. The procedure is based on branch and bound. Perhaps the most important observation made in [1] is that many multiview geometry problems can be formulated as a sum of fractions where each fraction consists of a convex function over a concave function. This has inspired new, more efficient ways of computing the L2-minimum for n-view triangulation; see [29].
9 Applications
There have been various application papers that have involved this type of optimization methodology, though they cannot be said to have found an optimal solution to the respective problems. In [38], SOCP has been used to solve the problem of tracking and modelling a deforming surface (such as a sheet of paper) from a single view. Results are shown in Fig 9.
Fig. 9. Modelling a deforming surface from a single view. Left: the input image, with automatically overlaid grid. Right: the computed surface model viewed from a new viewpoint. Image features provide cone constraints that constrain the corresponding 3D points to lie on or near the corresponding line-of-sight, namely a ray through the camera centre. Additional convex constraints on the geometry of the surface allow the shape to be determined unambiguously using SOCP.
In another application-inspired problem, SOCP has been applied (see [22]) to the odometry problem for a vehicle with several rigidly mounted cameras with almost non-overlapping fields of view. Although the algorithm in [22] is tested on laboratory data, it is motivated by its potential use with vehicles such as the one shown in Fig 10. Such vehicles are used for urban mapping. Computation of individual essential matrices for each of the cameras reduces the computation of the translation of the vehicle to a multiple-view triangulation problem, which is solved using SOCP.
Fig. 10. Camera and car mount used for urban mapping. Images from non-overlapping cameras on both sides of the car can be used to do odometry of the vehicle. An algorithm based on triangulation and SOCP is proposed in [22]. (The image is used with permission of the UNC Vision group).
10 Concluding Remarks
The application of new optimization methods to the problems of Multiview Geometry has led to the development of reliable and provably optimal solutions under different geometrically meaningful cost functions. At present these algorithms are not as fast as standard methods, such as bundle adjustment. Nevertheless, the run times are not wildly impractical. Recent work on speeding up the optimization process is yielding much faster run times, and further progress is likely. Such optimization techniques are also being investigated in other areas of Computer Vision, such as discrete optimization. A representative paper is [24]. For 15 years or more, geometric computer vision has relied on a small repertoire of optimization methods, with Levenberg-Marquardt [37] being the most popular. The benefit of using new methods such as SOCP and other convex and quasi-convex optimization methods is being realised.
References

1. Agarwal, S., Chandraker, M.K., Kahl, F., Kriegman, D.J., Belongie, S.: Practical global optimization for multiview geometry. In: European Conf. Computer Vision, Graz, Austria, pp. 592–605 (2006)
2. Åström, K., Enqvist, O., Olsson, C., Kahl, F., Hartley, R.: An L∞ approach to structure and motion problems in 1d-vision. In: Int. Conf. Computer Vision, Rio de Janeiro, Brazil (2007)
3. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
4. Byröd, M., Josephson, K., Åström, K.: Improving numerical accuracy in Gröbner basis polynomial equation solvers. In: Int. Conf. Computer Vision, Rio de Janeiro, Brazil (2007)
5. Chandraker, M.K., Agarwal, S., Kriegman, D.J., Belongie, S.: Globally convergent algorithms for affine and metric upgrades in stratified autocalibration. In: Int. Conf. Computer Vision, Rio de Janeiro, Brazil (2007)
6. Farenzena, M., Fusiello, A., Dovier, A.: Reconstruction with interval constraints propagation. In: Proc. Conf. Computer Vision and Pattern Recognition, New York City, USA, pp. 1185–1190 (2006)
7. Faugeras, O.D., Maybank, S.J.: Motion from point matches: Multiplicity of solutions. Int. Journal Computer Vision 4, 225–246 (1990)
8. Freund, R.W., Jarre, F.: Solving the sum-of-ratios problem by an interior-point method. J. Glob. Opt. 19(1), 83–102 (2001)
9. Hartley, R., de Agapito, L., Hayman, E., Reid, I.: Camera calibration and the search for infinity. In: Proc. 7th International Conference on Computer Vision, Kerkyra, Greece, September 1999, pp. 510–517 (1999)
10. Hartley, R., Kahl, F.: Global optimization through searching rotation space and optimal estimation of the essential matrix. Int. Conf. Computer Vision (2007)
11. Hartley, R., Schaffalitzky, F.: L∞ minimization in geometric reconstruction problems. In: Conf. Computer Vision and Pattern Recognition, Washington DC, USA, vol. I, pp. 504–509 (2004)
12. Hartley, R., Sturm, P.: Triangulation. Computer Vision and Image Understanding 68(2), 146–157 (1997)
13. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2004)
14. Horn, B.K.P.: Closed form solution of absolute orientation using unit quaternions. J. Opt. Soc. America 4(4), 629–642 (1987)
15. Horn, B.K.P.: Relative orientation. Int. Journal Computer Vision 4, 59–78 (1990)
16. Horn, B.K.P.: Relative orientation revisited. J. Opt. Soc. America 8(10), 1630–1638 (1991)
17. Josephson, K., Kahl, F.: Triangulation of points, lines and conics. In: Scandinavian Conf. on Image Analysis, Aalborg, Denmark (2007)
18. Kahl, F.: Multiple view geometry and the L∞-norm. In: Int. Conf. Computer Vision, Beijing, China, pp. 1002–1009 (2005)
19. Kahl, F., Henrion, D.: Globally optimal estimates for geometric reconstruction problems. Int. Journal Computer Vision 74(1), 3–15 (2007)
20. Ke, Q., Kanade, T.: Quasiconvex optimization for robust geometric reconstruction. In: Int. Conf. Computer Vision, Beijing, China, pp. 986–993 (2005)
21. Ke, Q., Kanade, T.: Uncertainty models in quasiconvex optimization for geometric reconstruction. In: Conf. Computer Vision and Pattern Recognition, New York City, USA, pp. 1199–1205 (2006)
22. Kim, J.H., Hartley, R., Frahm, J.M., Pollefeys, M.: Visual odometry for non-overlapping views using second-order cone programming. In: Asian Conf. Computer Vision (November 2007)
23. Koenderink, J.J., van Doorn, A.J.: Affine structure from motion. J. Opt. Soc. America 8(2), 377–385 (1991)
24. Kumar, P., Torr, P.H.S., Zisserman, A.: Solving Markov random fields using second order cone programming relaxations. In: Conf. Computer Vision and Pattern Recognition, pp. 1045–1052 (2006)
25. Li, H.: A practical algorithm for L-infinity triangulation with outliers. In: CVPR, vol. 1, pp. 1–8. IEEE Computer Society, Los Alamitos (2007)
26. Li, H., Hartley, R.: Five-point motion estimation made easy. In: Int. Conf. Pattern Recognition, pp. 630–633 (August 2006)
27. Li, H., Hartley, R.: The 3D – 3D registration problem revisited. In: Int. Conf. Computer Vision (October 2007)
28. Longuet-Higgins, H.C.: A computer algorithm for reconstructing a scene from two projections. Nature 293, 133–135 (1981)
29. Lu, F., Hartley, R.: A fast optimal algorithm for L2 triangulation. In: Asian Conf. Computer Vision (November 2007)
30. Marvell, A.: To his coy mistress. circa (1650)
31. Nistér, D.: An efficient solution to the five-point relative pose problem. IEEE Trans. Pattern Analysis and Machine Intelligence 26(6), 756–770 (2004)
32. Nistér, D., Hartley, R., Stewénius, H.: Using Galois theory to prove that structure from motion algorithms are optimal. In: Conf. Computer Vision and Pattern Recognition (June 2007)
33. Nistér, D., Kahl, F., Stewénius, H.: Structure from motion with missing data is NP-hard. In: Int. Conf. Computer Vision, Rio de Janeiro, Brazil (2007)
34. Olsson, C., Eriksson, A., Kahl, F.: Efficient optimization of L∞-problems using pseudoconvexity. In: Int. Conf. Computer Vision, Rio de Janeiro, Brazil (2007)
35. Olsson, C., Kahl, F., Oskarsson, M.: Optimal estimation of perspective camera pose. In: Int. Conf. Pattern Recognition, Hong Kong, China, vol. II, pp. 5–8 (2006)
36. Papadimitriou, C., Steiglitz, K.: Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall, Englewood Cliffs (1982)
37. Press, W., Flannery, B., Teukolsky, S., Vetterling, W.: Numerical Recipes in C. Cambridge University Press, Cambridge (1988)
38. Salzman, M., Hartley, R., Fua, P.: Convex optimization for deformable surface 3D tracking. In: Int. Conf. Computer Vision (October 2007)
39. Seo, Y., Hartley, R.: A fast method to minimize L∞ error norm for geometric vision problems. In: Int. Conf. Computer Vision (October 2007)
40. Seo, Y., Hartley, R.: Sequential L∞ norm minimization for triangulation. In: Asian Conf. Computer Vision (November 2007)
41. Sim, K., Hartley, R.: Recovering camera motion using the L∞-norm. In: Conf. Computer Vision and Pattern Recognition, New York City, USA, pp. 1230–1237 (2006)
42. Sim, K., Hartley, R.: Removing outliers using the L∞-norm. In: Conf. Computer Vision and Pattern Recognition, New York City, USA, pp. 485–492 (2006)
43. Stewénius, H., Schaffalitzky, F., Nistér, D.: How hard is three-view triangulation really? In: Int. Conf. Computer Vision, Beijing, China, pp. 686–693 (2005)
44. Sturm, J.F.: Using SeDuMi 1.02, a Matlab toolbox for optimization over symmetric cones. Optimization Methods and Software 11(12), 625–653 (1999)
45. Tomasi, C., Kanade, T.: Shape and motion from image streams under orthography: A factorization approach. Int. Journal Computer Vision 9(2), 137–154 (1992)
46. Triggs, W., McLauchlan, P.F., Hartley, R.I., Fitzgibbon, A.: Bundle adjustment for structure from motion. In: Vision Algorithms: Theory and Practice, pp. 298–372. Springer, Heidelberg (2000)
47. Zhang, X.: Pose estimation using L∞. In: Image and Vision Computing New Zealand (2005)
Machine Vision in Early Days: Japan’s Pioneering Contributions Masakazu Ejiri R & D Consultant in Industrial Science, formerly at Central Research Laboratory, Hitachi, Ltd.
Abstract. The history of machine vision began in the mid-1960s through the efforts of Japanese industry researchers. A variety of prominent vision-based systems was made possible by creating and evolving real-time image processing techniques, and these systems were applied to factory automation, office automation, and even social automation during the 1970-2000 period. In this article, these historical attempts are briefly explained to promote understanding of the pioneering efforts that opened the door to, and formed the basis of, today's computer vision research. Keywords: Factory automation, office automation, social automation, real-time image processing, video image analysis, robotics, assembly, inspection.
1
Introduction
There is an old saying, “knowing the old brings you a new wisdom for tomorrow,” that originated with Confucius (a Chinese philosopher, 551 BC-479 BC). This is the basic idea underlying this article, and its purpose is to enlighten young researchers on old technologies rather than new ones. In the 1960s, one of the main concerns of researchers in the field of information science was the realization of intelligence by using a conventional computer, which had been used mainly for numerical computing. At that time, a hand-eye system was thought to be an excellent research tool to visualize intelligence and demonstrate its behavior. The hand-eye system was, by itself, soon recognized as an important research target, and it became known as the “intelligent robot.” One of the core technologies of the intelligent robot was, of course, vision, and people started to call this vision research area “computer vision.” However, the academic research on computer vision was apt to be stagnant. Its achievements stayed at the level of simulated tasks and could not surpass our expectations because of its intrinsic difficulty and the limitation of computing power in those days. On the other hand, practical vision technology was eagerly anticipated in industry, particularly in Japan, as one of the core technologies towards attaining flexible factory automation. Research was initiated in the mid-1960s, notably at our group at Hitachi’s Central Research Laboratory. In contrast to the word “computer vision,” we used the word “machine vision” for representing a more
pragmatic approach towards “useful” vision systems, because the use of computers was less essential in the pragmatic approach. Most of the leading companies followed this lead, and they all played an important role in the incubation and development of machine vision technology. Currently, computer vision is regarded as a fundamental and scientific approach to investigate the principles that underlie vision and how artificial vision can be best achieved, in contrast to the more pragmatic, needs-oriented machine vision approach. We have to note, however, that there is no difference in the ultimate goals of these two approaches. Though the road to machine vision was not smooth, Japanese companies fortunately achieved some key successes. In this article, we briefly introduce these pioneering vision applications and discuss the history of machine vision.
2
Prehistoric Machine Vision
Our first attempt at vision application, in 1964, was to automate the assembly process (i.e., wire-bonding process) of transistors. In this attempt, we used a very primitive optical sensor by combining a microscope and a rotating-drum type scanner with two slits on its surface. By detecting the reflection from the transistor surface with photo-multipliers, the position and orientation of transistor chips were determined with about a 95% success rate. However, this percentage was still too low to enable us to replace human workers; thus, our first attempt failed and was eventually abandoned after a two-year struggle. What we learned from this experience was the need for reliable artificial vision comparable to a human’s pattern recognition capability, which quickly captures the image first, and then reduces the information quantity drastically until the positional information is firmly determined. Our slit-type optical-scanning method inherently lacked the right quantity of captured information; thus, the recognition result was apt to be easily affected by reflective noises. In those days, microprocessors had not yet been developed and the available computers were still too expensive, bulky, and slow, particularly for image processing and pattern recognition. Moreover, memory chips were extremely costly, so the use of full-frame image memories was prohibitive. Though there was no indication that these processors would soon improve, we started seminal research on flexible machines in 1968. A generic intelligent machine conceived at that time was the one that consisted of three basic functions: understanding of a human’s instruction or intention to clarify the goal of the task, understanding of objects or environment to clarify the start of the task, and decision-making to find the optimum route between the start and the goal. Based on this conception, a prototype intelligent robot was developed in 1970. The configuration of this intelligent robot is shown in Fig.1. In this robot, the image of a trihedral plan drawing was captured by one of the cameras and was analyzed to clarify the goal assemblage as well as the components of the assemblage. Another camera looked at the real blocks scattered on a table, and found their positions and postures. From these blocks, the
Fig. 1. Intelligent robot (1970)
computer recognized the component blocks needed to complete this assembly task, and made a plan to assemble these blocks. For this assembly planning, backward reasoning was used to find the route from the goal to the start, not from the start to goal. That is, the route was found by analyzing the disassembly task from the goal assemblage to each component. The assembly sequence was determined as the reverse of this disassembly sequence. Thus, the robot could assemble blocks into various forms by responding to the objectives presented macroscopically by an assembly drawing [2]. This research formed part of Japan’s national project on PIPS (Pattern Information Processing System), initiated in the following year.
3
Factory Applications
Our project on the prototype intelligent robot in 1970 revealed many basic problems underlying “flexible machines” and gave us useful insights into future applications of robotics. One significant problem we were confronted with was the robot’s extremely slow image-processing speed in understanding objects. Our next effort was therefore focused on developing high-speed dedicated hardware for image processing with the minimum use of memory, instead of using rather slow and expensive computers. One of the core ideas was to adaptively threshold the image signal into a binary form by responding to the signal behavior and to input it into a shift-register-type local memory that dynamically stored the latest pixel data of several horizontal scan-lines. This local-parallel-type configuration enabled us to simultaneously extract plural pixel data from a 2-D local area in synchronization with image scanning. By designing the logic circuit connected to this 2-D local area according to the envisaged purpose, the processing hardware could be adapted to many applications. One useful yet simple method using this local-parallel-type image processing was windowing. This method involved setting up a few particular window areas
Fig. 2. Bolting robot for piles/poles (1972)
in the image plane, and the pixels in the windows were selectively counted to find the background area and the object area occupying the windows. In 1972 a bolting robot applying windowing was developed in order to automate the molding process of concrete piles and poles [3]. It became the first application of machine vision to moving objects. Note that this paper [3] was published only later, as publication was not the first priority for industry researchers. Another effective method based on the local-parallel architecture was erosion/dilation of patterns, which was executed by simple AND/OR logic on a 2-D local area. This method could detect defects in printed circuit boards (PCBs), and formed one of the bases of today’s morphological processing. This defect-detection machine in 1972 also became the first application of machine vision to the automation of visual inspection [4]. These two pioneering applications are illustrated in Figs. 2 and 3. Encouraged by the effectiveness of these machine-vision systems in actual production lines, we again started to develop a new assembly machine for transistors that was, this time, based fully on image processing. A multiple local pattern matching method was extensively studied for this purpose. In this method, each local pattern position in a transistor image was found by matching to a standard pattern. The distance and the angle between a pair of detected local pattern positions were sequentially checked to see if these local patterns were correctly detected. The electrode positions for wiring were then calculated from the coordinates of the first detected correct pair. By basing a local-parallel-type image processor on this matching, we finally developed fully automatic transistor assembly machines in 1973 [5]. This successful development was the result of a ten-year effort since our first failed attempt. The developed assembly machines were recognized as the world’s first image-based machines for fully automatic assembly of semiconductor devices. These machines and their configuration are shown in Fig. 4.
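The erosion/dilation principle behind the 1972 PCB inspection machine can be illustrated with a small software sketch. This is only a modern analogue of the idea, not the original AND/OR hardware logic; the function name, the structuring-element size, and the use of NumPy/SciPy are assumptions made for illustration.

```python
import numpy as np
from scipy import ndimage

def find_defects(pattern, size=3):
    """Flag pin-holes and protrusions in a binary circuit pattern (1 = conductor)."""
    p = pattern.astype(bool)
    se = np.ones((size, size), dtype=bool)                 # the 2-D local area
    opened = ndimage.binary_dilation(ndimage.binary_erosion(p, se), se)
    closed = ndimage.binary_erosion(ndimage.binary_dilation(p, se), se)
    protrusions = p & ~opened      # thin whiskers vanish under erosion-then-dilation
    pinholes = closed & ~p         # small holes are filled by dilation-then-erosion
    return protrusions | pinholes
```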
Fig. 3. PCB inspection machine (1972)
Fig. 4. Transistor assembly machine (1973)
After this development, our efforts were then focused on expanding machine-vision applications from transistors to other semiconductor devices, such as integrated circuits (ICs), hybrid ICs, and large-scale integrated circuits (LSIs). Consequently, the automatic assembly of all types of semiconductor devices was completed by 1977. With the export of this automatic assembly technology to a US manufacturer as a start, the technology gained widespread attention from semiconductor manufacturers worldwide and expanded quickly into industry. As a result, the semiconductor industry as a whole prospered by virtue of higher speed production of higher quality products with more uniform performance than had been achieved previously. Encouraged by the success of semiconductor assembly, our efforts were further broadened to other industrial applications in the mid-1970s to early 1980s. Examples of such applications during this period are a hose-connecting robot
Fig. 5. Wafer inspection machines (1984-1987)
for pressure testing in pump production lines, a reading machine for 2-D object codes for an intra-factory physical distribution system, and a quality inspection machine for marks and characters printed on electronic parts [6][7][8]. Machines for inspecting photo-masks in semiconductor fabrication and CRT black-matrix fabrication were other examples [9] (by Toshiba) and [10] (by Hitachi). Machines for classifying medical tablets and capsules [11] (by Fuji Electric) and machines for classifying agricultural products and fish [12][13][14] (by Mitsubishi Electric) were also unique and epoch-making achievements in those days. These examples show that the key concept representing those years seemed to be the realization of a “productive society” through factory automation, and the objectives of machine vision were mainly position detection for assembly, shape detection for classification, and defect detection for inspection. In 1980, the PIPS project finished after a 10-year effort by Japanese industry. A variety of recognition systems were successfully prototyped for hand-written Kanji characters, graphics, drawings, documents, color pictures, speech, and three-dimensional objects. One particular outcome among others was the development of high-speed, general-purpose image processors [15], which in turn served as the basis of subsequent research and development in industry. The most difficult but rewarding development in the mid-1980s was an inspection machine for detecting defects in semiconductor wafers [16]. It was estimated that even the world’s largest super-computer available at that time would require at least one month of computing to finish the defect detection of a single 8-inch wafer. We therefore had to develop special hardware to lower the processing time to less than 1 hour/wafer. The resulting hardware was a network of local-parallel-type image processors that used a “design pattern referring method,” shown in Fig. 5. In this machine, hardware-based knowledge processing, in which each processor was regarded as a combination of IF-part and THEN-part logical circuits, was first attempted [17].
Meanwhile, the processing speed of microprocessors improved considerably since their appearance in the mid-1970s, and the memory capacity drastically increased without excessively increasing costs. These improvements facilitated the use of gray-scale images instead of binary ones, and dedicated LSI chips for image processing were developed in the mid-1980s [18]. These developments all contributed to achieving more reliable, microprocessor-based general-purpose machine vision systems with full-scale buffers for gray-level images. As a result, applications of machine vision soon expanded from circuit components, such as semiconductors and PCBs, to home-use electronic equipment, such as VCRs and color TVs. Currently, machine vision systems are found in various areas such as electronics, machinery, medicine, and food industries.
4
Office Applications
Besides the above-described machine-vision systems for factory applications, there was extensive research on character recognition in the area of office automation. For example, in the mid-1960s, a FORTRAN program reader was developed to replace key-punching tasks. Mail-sorting machines for post offices were developed in the late 1960s to automatically read handwritten postal codes (by Toshiba et al.). Another topical developmental effort started in 1974 for automatic classification of fingerprint patterns, and in 1982 the system was first put in use at a Japanese police office with great success, and later at US police offices (by NEC). Our first effort to apply machine-vision technology to areas other than factory automation was the automatic recognition of monetary bills in 1976. This recognition system was extremely successful in spurring the development of automated teller machines (ATMs) for banks (see Fig. 6). Due to the processing time limitation, the entire image of a bill was not captured, but by combining several partial images obtained from optical sensors with those from magnetic sensors, so-called sensor fusion was first attempted, resulting in high-accuracy bill recognition with a theoretical error rate of less than 1/10^15. Early ATM models for domestic use employed vertical safes, but in the later models, horizontal safes were extensively used for increasing spatial efficiency and for facilitating use in Asian countries having a larger number of bill types. Our next attempt, in the early 1980s, was the efficient handling of a large amount of graphic data in the office [19]. The automatic digitization of paper-based engineering drawings and maps was first studied. The recognition of these drawings and maps was based on a vector representation technique, such as that shown in Fig. 7. The recognition was usually executed by spatially-parallel-type image processors, in which each processor was designated to a specific image area. Currently, geographic information systems (GIS) based on these digital maps have gained popularity and are being used by many service companies and local governments to manage their electric power supply, gas supply, water supply facilities, and sewage service facilities (see Fig. 8). The use of digital maps was then extended to car navigation systems and more recently to various
Fig. 6. ATM: automated teller machines (1976-1995)
Fig. 7. Automatic digitizer for maps and engineering drawings (1982)
other information service systems via the Internet. Machine-vision technology contributed, mainly in the early developmental stage of these systems, to the digitization of original paper-based maps into electronic form until these digital maps began to be produced directly from measured data through computer-aided map production. Spatially divided parallel processing was also useful for large-scale images such as those from satellite data. One of our early attempts in this area was the recognition of wind vectors, back in 1972, by comparing two simulated satellite images with a 30-minute interval. This system formed a basis of weather forecasting using Japan’s first meteorological geo-stationary satellite “Himawari,” launched a few years later. Also, an algorithm for deriving a sea temperature contour map from infra-red satellite images was built for environmental study and fisheries.
Fig. 8. GIS: geographic information systems(1986)
Research on document understanding also originated as part of machine-vision research in the mid-1980s [20]. During those years, electronic document editing and filing became popular owing to the progress in word-processing technology for over 4000 Kanji and Kana characters. The introduction of an electronic patent-application system in Japan in 1990 was an important stimulus for further research on office automation. We developed dedicated workstations and a parallel-disk-type distributed filing system for the use of patent examiners. This system enabled examiners to efficiently retrieve and display the images of past documents for comparison. The recognition of handwritten postal addresses was one of the most challenging topics in machine-vision applications. In 1992, a decision was made by a government committee (to which the author served as a member) to adopt a new 7-digit postal code system in Japan beginning in 1998. To this end, three companies (Hitachi, Toshiba and NEC) developed new automatic mail-sorting machines for post offices in 1997. An example of the new sorting machines is shown in Fig. 9. In those machines, hand-written/printed addresses in Kanji characters are fully read together with the 7-digit postal codes; both results are then matched for consistency; and the recognized full address is printed on each letter as a transparent barcode consisting of 20-digit data. The letters are then dispatched to other post offices for delivery. In a subsequent process, only these barcodes are read, and prior to home delivery the letters are arranged by the new sorting machine in such a way that the order of the letters corresponds to the house order on the delivery route. In these postal applications, the recognition of all types of printed fonts and hand-written Kanji characters was made possible by using a multi-microprocessor type image processing system. A mail image is sent to one of the unoccupied processors, and this designated processor analyzes the image. The address recognition by a designated single processor usually requires 1.0 to 2.5 seconds, depending on the complexity of the address image. As up to 32 microprocessors are used
Fig. 9. New mail sorting machine (1997)
in parallel for successively flowing letters, the equivalent recognition time of the whole system is less than 0.1 seconds/letter, producing a maximum processing speed of 50,000 letters per hour. The office applications of vision technology described above show that the key concept representing those years seemed to be the realization of an “efficient society” through office automation, and the objectives of machine vision were mainly efficient handling of large-scale data and also high-precision, high-speed recognition and handling of paper-based information. Recent progress in network technology has also increased the importance of office automation. To secure the reliability of information and communication systems, a variety of advanced image processing technologies will be required. These will include more effective and reliable compression, encryption, scrambling, and watermarking technologies for image data.
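As a rough consistency check of the throughput figures above (the mean per-letter recognition time is an assumed value inside the stated 1.0-2.5 s range):

```python
processors = 32
mean_seconds_per_letter = 2.3                     # assumed average within the 1.0-2.5 s range
letters_per_hour = processors / mean_seconds_per_letter * 3600
equivalent_seconds = mean_seconds_per_letter / processors
print(round(letters_per_hour), round(equivalent_seconds, 3))   # ~50000 letters/hour, ~0.072 s/letter
```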
5
Social Applications
In recent years, applications to social automation have become increasingly important. Social automation here means “making the systems designed for social use more intelligent,” and it includes systems for traffic and for environmental use. The technologies used in these systems are, for example, surveillance, monitoring, flow control and security assurance. The earliest attempt at social automation was probably our elevator-eye project in 1977, in which we tried to implement machine vision in an elevator system in order to control the human traffic in large-scale buildings. The elevator hall on each floor was equipped with a camera to observe the hall, and a vision system to which these cameras were connected surveyed all floors in a time-sharing manner and estimated the number of persons waiting for an elevator. The vision system then designated an elevator cage to quickly serve the crowded floor [21]. The configuration of this system is shown in Fig. 10.
Fig. 10. Elevator and other traffic applications (1977-1986)
In this elevator system, a robust change-finding algorithm based on edge vectors was used in order to cope with the change in the brightness of the surroundings. In this algorithm, the image plane was divided into several blocks, and the edge-vector distribution in each block was compared with that of the background image, which was updated automatically by new image data when no motion was observed and thus nobody was in the elevator hall (a software sketch of this block-wise comparison is given at the end of this section). This system could minimize the average waiting time for the elevator. Though a few systems were put into use in the Tokyo area in the early 1980s, there has not been enough market demand to continue to develop the system further. More promising applications of image recognition seemed to be for monitoring road traffic, where license plates, traffic jams, and illegally parked cars were identified so that traffic could be controlled smoothly and parking lots could be automatically allocated [22]. Charging tolls automatically at toll gates without stopping cars, by means of a wireless system with an IC card, is now popular on highways as a result of the ITS (Intelligent Transport System) project. The system will be further improved if machine vision can be effectively combined with it to quickly recognize other important information such as license plate numbers and even drivers’ faces and other identities. A water-purity monitoring system using fish behavior [23] was in operation for at least 10 years at a river control center in a local city in Japan, after the river water had been accidentally polluted by toxicants. A schematic diagram of the system is shown in Fig. 11. The automatic observation of algae in water in sewage works was also studied. Volcanic lava flow was continuously monitored at the base of Mt. Fugendake in Nagasaki, Japan, during the eruption period in 1993. To optically send images from unmanned remote observation posts to the central control station, laser communication routes were planned by using 3-D undulation data derived from GIS digital contour maps. A GIS was also constructed to assist in restoration after the “Hanshin-Awaji” earthquake in Kobe, Japan, in 1995. Aerial photographs after the earthquake were analyzed by
Fig. 11. Environmental use (1990-1995)
matching them with digital 3-D urban maps containing additional information on the height of buildings. Buildings with damaged walls and roofs could thus be quickly detected and given top priority for restoration [24]. Intruder detection is also becoming important in the prevention of crimes and in dangerous areas such as those around high-voltage electric equipment. Railroad crossings can also be monitored intensively by comparing the vertical line data in an image with that in a background image updated automatically [25]. Arranging the image differences in this vertical window gives a spatiotemporal image of objects intruding onto the crossing. In almost all of these social applications, real-time color-image processing is becoming increasingly important for reliable detection and recognition. As mentioned before, the application of image processing to communications is increasingly promising as multimedia and network technologies improve. Human-machine interfaces will be greatly improved if the machine is capable of recognizing every medium used by humans. Human-to-human communication assisted by intelligent machines and networks is also expected. Machine vision will contribute to this communication in such fields as motion capturing, face recognition, facial expression understanding, gesture recognition, sign language understanding, and behavior understanding. In addition, applications of machine vision to the fields of human welfare, medicine, and environmental improvement will become increasingly important in the future. Examples of these applications are rehabilitation equipment, medical surgery assistance, and water purification in lakes. Thus, the key concept representing the future seems to be the realization of a calm society, in which all uneasiness will be relieved through networked social automation, and the important objectives of machine vision will typically be the realization of the following two functions: 24-hour/day “abnormality monitoring” via networks and reliable “personal identification” via networks.
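A minimal sketch of the block-wise, edge-vector-based change detection with automatic background updating described above (used in the elevator halls and, in a vertical-window form, at railroad crossings). The block size, gradient threshold, histogram bins, and decision threshold are illustrative assumptions, not the values used in the original systems.

```python
import numpy as np

def edge_histograms(gray, block=32, bins=8):
    """Per-block histogram of edge directions (a simple edge-vector distribution)."""
    gy, gx = np.gradient(gray.astype(float))
    mag, ang = np.hypot(gx, gy), np.arctan2(gy, gx)
    h, w = gray.shape
    feats = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            a = ang[y:y+block, x:x+block][mag[y:y+block, x:x+block] > 10.0]
            hist, _ = np.histogram(a, bins=bins, range=(-np.pi, np.pi))
            feats.append(hist / max(hist.sum(), 1))
    return np.array(feats)

def detect_change(frame, background, thresh=0.3):
    """Compare each block against the background; refresh the background when nothing moves."""
    diff = np.abs(edge_histograms(frame) - edge_histograms(background)).sum(axis=1)
    changed = diff > thresh          # blocks whose edge-vector distribution changed
    if not changed.any():            # no motion observed: update the background model
        background[:] = frame
    return changed
```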
6
Key Technologies
In most of the future applications, dynamic image processing will be a key to success. There are various approaches already for analyzing incoming video images in real-time by using smaller-scale personal computers. One typical example is the “Mediachef” system, which automatically cuts video images into a set of scenes by finding significant changes between consecutive image frames [26]. The principle of the system is shown in Fig. 12. This is one of the essential technologies for video indexing and video-digest editing. To date, this technology has been put into use in the video inspection process in a broadcasting company so that subliminal advertising can be detected before the video is on the air.
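The scene-cutting principle can be sketched as follows: a cut is declared when the color histograms of consecutive frames differ significantly. This is only an illustration of the frame-difference idea, not the actual “Mediachef” implementation; the bin count and threshold are assumptions.

```python
import numpy as np

def cut_points(frames, bins=16, thresh=0.4):
    """frames: iterable of RGB images (H, W, 3). Returns indices where a new scene starts."""
    cuts, prev_hist = [], None
    for i, f in enumerate(frames):
        hist = np.histogram(f, bins=bins, range=(0, 256))[0].astype(float)
        hist /= hist.sum()
        if prev_hist is not None and np.abs(hist - prev_hist).sum() > thresh:
            cuts.append(i)           # significant change between consecutive frames
        prev_hist = hist
    return cuts
```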
Fig. 12. Key technologies: “Mediachef” for video indexing and editing (1990)
For the purpose of searching scenes, we developed a real-time video coding technique that uses an average color in each frame and represents its sequence by a “run” between frames. This method can compress 24-hour video signals into a memory capacity of only 2 MB. This video-coding technology can be applied to automatically detect the time of broadcast of a specific TV commercial by continuously monitoring TV signals by means of a compact personal computer. It therefore allows manufacturers to monitor their commercials being broadcast by an advertising company and, thus, provides evidence of a broadcast. The technology called “Cyber BUNRAKU,” in which human facial expressions are monitored by small infrared-sensitive reflectors put on a performer’s face, is also noteworthy. By combining the facial expressions thus obtained with the limb motions of a 19-jointed “Bunraku doll” (used in traditional Japanese theatrical performance), a 3-D character model in a computer can be animated in real-time to create video images [27], as shown in Fig. 13. This technology can create TV animation programs much faster than through traditional hand-drawing methods.
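The average-color run coding described at the start of this passage can be sketched as follows; the quantization step and data layout are assumptions, not the actual product format. Because one run entry covers many similar consecutive frames, a 24-hour signal can indeed be reduced to a few megabytes.

```python
import numpy as np

def color_signature(frames, quant=16):
    """Encode a video as runs of quantized per-frame average color: (color, run_length) pairs."""
    runs = []
    for f in frames:
        c = tuple((f.reshape(-1, 3).mean(axis=0) // quant).astype(int))  # quantized mean RGB
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1          # extend the current run while the average color is unchanged
        else:
            runs.append([c, 1])       # start a new run at a color change
    return runs
```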
Fig. 13. Key technologies: Cyber BUNRAKU(1996)
Fig. 14. Key technologies: Tour into the picture (1997)
three-dimensional data by manually fitting vanishing lines on the displayed picture. The picture can then be looked at from different angles and distances [28]. A motion video can thus be generated from a single picture and viewers can feel as if they were taking a walk in an ancient city when an old picture of the city is available. A real-time creation of panoramic pictures is also an important application of video-image processing [29]. A time series of each image frame from a video camera during panning and tilting is spatially connected in real-time into a single still picture (i.e. image mosaicing), as shown in Fig. 15. Similarly, by connecting all the image frames obtained during the zooming process, a high-resolution picture (having higher resolution in the inner areas) can be obtained as shown in Fig. 16.
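The mosaicing step relies on estimating the shift between consecutive frames; reference [29] does this with luminance-projection correlation. The sketch below illustrates that idea under simplifying assumptions (integer shifts only, mean-removed correlation); it is not the algorithm of [29].

```python
import numpy as np

def projection_shift(prev, curr, max_shift=32):
    """Estimate the integer (dy, dx) between two gray frames from 1-D luminance projections."""
    def best_shift(p, q):
        p, q = p - p.mean(), q - q.mean()
        scores = []
        for s in range(-max_shift, max_shift + 1):
            a = p[max(s, 0): len(p) + min(s, 0)]
            b = q[max(-s, 0): len(q) + min(-s, 0)]
            scores.append(np.dot(a, b) / len(a))   # mean-removed, length-normalized correlation
        return int(np.argmax(scores)) - max_shift
    dy = best_shift(prev.mean(axis=1), curr.mean(axis=1))   # row projections -> vertical shift
    dx = best_shift(prev.mean(axis=0), curr.mean(axis=0))   # column projections -> horizontal shift
    return dy, dx
```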
Fig. 15. Key technologies: Panoramic view by panning and tilting(1998)
Fig. 16. Key technologies: Panoramic view by zooming (1999)
As mentioned already, one important application of machine vision is personal identification in social use. Along these lines, there have been a few promising developments. These include personal identification systems by means of fingerprint patterns [30] (by NEC, 1996), iris patterns (by Oki Electric, 1997), finger vein patterns (by Hitachi, 2000, see Fig. 17) and palm vein patterns (by Fujitsu, 2002). These are now finding wide use in security systems, including application to automated teller machines (ATMs). We have given a few examples of real-time image processing technologies, which will be key technologies applicable to a wide variety of systems in the future. The most difficult technical problem that social automation is likely to face, however, is how to make robust machine-vision systems that can be used day or night in all types of weather conditions. To cope with the wide changes in illumination, the development of a variable-sensitivity imaging device with a wide dynamic range is still a stimulating challenge. Artificial retina chips [31] (by Mitsubishi, 1998) and high-speed vision chips (by Fujitsu and the University of Tokyo, 1999) are expected to play an important role along these lines.
Fig. 17. Key technologies: Biometrics based on finger vein patterns (2000)
7
Summaries
The history of machine vision and its applications in Japan was briefly reviewed by focusing on the efforts in industry, and is roughly summarized in chronological form in Fig. 18.
Fig. 18. History of machine vision research
Details of the topics of industrial activities are listed on a year-to-year basis in Table 1, together with various topics in other related fields for easier understanding of each period. The history is also summarized in Table 2 in a list form characterizing each developmental stage. As indicated, we can see that, in addition to factory automation, office automation and social automation have been greatly advanced in those years by the evolution of machine-vision technology, owing to the progress of processor and
Table 1. History of machine vision research (1961-2000)
memory technologies. However, it is also a fact that one of the most prominent contributions of machine vision technology was in the production of semiconductors. The semiconductor industry, and thus our human life, would not have been able to enjoy prosperity without machine vision technology. This article was prepared from the viewpoint of the old saying, “knowing the old brings you a new wisdom for tomorrow,” by Confucius. The author will be extremely pleased if this article is read widely by young researchers, as it would give them some insight into this field, and would encourage them to get into, and play a great role in, this seemingly simple but actually difficult research field.
Table 2. History of machine vision research
The closing message from the author to young researchers is as follows: Lift your sights, raise your spirits, and get out into the world!
References 1. Ejiri, M., Miyatake, T., Sako, H., Nagasaka, A., Nagaya, S.: Evolution of realtime image processing in practical applications. In: Proc. IAPR MVA, Tokyo, pp. 177–186 (2000) 2. Ejiri, M., Uno, T., Yoda, H., Goto, T., Takeyasu, K.: A prototype intelligent robot that assembles objects from plan drawings. IEEE Trans. Comput. C-21(2), 161–170 (1972) 3. Uno, T., Ejiri, M., Tokunaga, T.: A method of real-time recognition of moving objects and its application. Pattern Recognition 8, 201–208 (1976) 4. Ejiri, M., Uno, T., Mese, M., Ikeda, S.: A process for detecting defects in complicated patterns, Comp. Graphics & Image Processing 2, 326–339 (1973) 5. Kashioka, S., Ejiri, M., Sakamoto, Y.: A transistor wire-bonding system utilizing multiple local pattern matching techniques. IEEE Trans. Syst. Man & Cybern. SMC-6(8), 562–570 (1976) 6. Ejiri, M.: Machine vision: A practical technology for advanced image processing. Gordon & Breach Sci. Pub, New York (1989) 7. Ejiri, M.: Recent image processing applications in industry. In: Proc. 9th SCIA, Uppsala, pp. 1–13 (1995) 8. Ejiri, M.: A key technology for flexible automation. In: Proc. of Japan-U.S.A. Symposium on Flexible Automation, Otsu, Japan, pp. 437–442 (1998) 9. Goto, N.: Toshiba Review 33, 6 (1978) (in Japanese) 10. Hara, Y., et al.: Automatic visual inspection of LSI photomasks. In: Proc. 5th ICPR (1980) 11. Haga, K., Nakamura, K., Sano, Y., Miyamori, N., Komuro, A.: Fuji Jiho 52(5), pp.294–298 (1979) (in Japanese) 12. Nakahara, S., Maeda, A., Nomura, Y.: Denshi Tokyo. IEEE Tokyo Section 18, 46–48 (1979)
13. Nomura, Y., Ito, S., Naemura, M.: Mitsubishi Denki Giho 53(12), 899–903 (1979) (in Japanese) 14. Maeda, A., Shibayama, J.: Pattern measurement, ITEJ Technical Report, 3, 32 (in Japanese) (1980) 15. Mori, K., Kidode, M., Shinoda, H., Asada, H.: Design of local parallel pattern processor for image processing. In: AFIP Conf. Proc., vol. 47, pp. 1025–1031 (1978) 16. Yoda, H., Ohuchi, Y., Taniguchi, Y., Ejiri, M.: An automatic wafer inspection system using pipelined image processing techniques, IEEE Trans. Pattern Analysis & Machine Intelligence. PAMI-10 1 (1988) 17. Ejiri, M., Yoda, H., Sakou, H.: Knowledge-directed inspection for complex multilayered patterns. Machine Vision and Applications 2, 155–166 (1989) 18. Fukushima, T., Kobayashi, Y., Hirasawa, K., Bandoh, T., Ejiri, M.: Architecture of image signal processor, Trans. IEICE, J-66C 12, 959–966 (1983) 19. Ejiri, M., Kakumoto, S., Miyatake, T., Shimada, S., Matsushima, H.: Automatic recognition of engineering drawings and maps. In: Proc. Int. Conf. on Pattern Recognition, Montreal, Canada, pp. 1296–1305 (1984) 20. Ejiri, M.: Knowledge-based approaches to practical image processing. In: Proc. MIV-89, Inst. Ind. Sci, Univ. of Tokyo, pp. 1–8. Tokyo (1989) 21. Yoda, H., Motoike, J., Ejiri, M., Yuminaka, T.: A measurement method of the number of passengers using real-time TV image processing techniques, Trans. IEICE, J-69D 11, 1679–1686 (1986) 22. Takahashi, K., Kitamura, T., Takatoo, M., Kobayashi, Y., Satoh, Y.: Traffic flow measuring system by image processing. In: Proc. IAPR MVA, Tokyo, pp. 245–248 (1996) 23. Yahagi, H., Baba, K., Kosaka, H., Hara, N.: Fish image monitoring system for detecting acute toxicants in water. In: Proc. 5th IAWPRC, pp. 609–616 (1990) 24. Ogawa, Y., Kakumoto, S., Iwamura, K.: Extracting regional features from aerial images based on 3-D map matching, Trans. IEICE, D-II 6, 1242–1250 (1998) 25. Nagaya, S., Miyatake, T., Fujita, T., Itoh, W., Ueda, H.: Moving object detection by time-correlation-based background judgment. In: Li, S., Teoh, E.K., Mital, D., Wang, H. (eds.) Recent Developments in Computer Vision. LNCS, vol. 1035, pp. 717–721. Springer, Heidelberg (1996) 26. Nagasaka, A., Miyatake, T., Ueda, H.: Video retrieval method using a sequence of representative images in a scene. In: Proc. IAPR MVA, Kawasaki, pp. 79–82 (1994) 27. Arai, K., Sakamoto, H.: Real-time animation of the upper half of the body using a facial expression tracker and an articulated input device, Research Report 96CG-83, Information Processing Society of Japan (in Japanese), 96, 125, pp. 1–6 (1996) 28. Horry, Y., Anjyo, K., Arai, K.: Tour into the picture: Using a spidery mesh interface to make animation from a single image. In: Proc. ACM SIGGRAPH 1997, pp. 225– 232 (1997) 29. Nagasaka, A., Miyatake, T.: A real-time video mosaics using luminance-projection correlation, Trans. IEICE, J82-D-II 10, 1572–1580 (1999) 30. Kamei, T., Shinbata, H., Uchida, K., Sato, A., Mizoguchi, M., Temma, T.: Automated fingerprint classification, IEICE Technical Report, Pattern Recognition and Understanding, 95(470), 17–24 (in Japanese) (1996) 31. Ui, H., Arima, Y., Murao, F., Komori, S., Kyuma, K.: An artificial retina chip with pixel-wise self-adjusting intensity response, ITE Technical Report, 23(30), pp.29–33 (in Japanese) (1999)
Coarse-to-Fine Statistical Shape Model by Bayesian Inference Ran He, Stan Li, Zhen Lei, and ShengCai Liao Institute of Automation, Chinese Academy of Sciences, Beijing, China
[email protected]
Abstract. In this paper, we take a predefined geometric shape as a constraint for accurate shape alignment. A shape model is divided into two parts: a fixed shape and an active shape. The fixed shape is a user-predefined simple shape with only a few landmarks which can be easily and accurately located by machine or human. The active one is composed of many landmarks with a complex contour. When searching for an active shape, the pose parameters are calculated from the fixed shape. Bayesian inference is introduced to make the whole shape more robust to local noise generated by the active shape, which leads to a compensation factor and a smooth factor for a coarse-to-fine shape search. This method provides a simple and stable means for online and offline shape analysis. Experiments on cheek and face contours demonstrate the effectiveness of our proposed approach. Keywords: Active shape model, Bayesian inference, statistical image analysis, segmentation.
1 Introduction Shape analysis is an important area in computer vision. A common task of shape analysis is to recover both the pose parameters and a low-dimensional representation of the underlying shape from an observed image. Applications of shape analysis range from medical image processing and face recognition to object tracking. After the pioneering work on the active shape model (ASM) put forward by Cootes and Taylor [1,2], various shape models have been developed for shape analysis, which mainly focus on two parts: (1) a statistical framework to estimate the shape and pose parameters and (2) optimal features to accurately model the appearance around landmarks. For parameter estimation, Zhou, Gu, and Zhang [3] propose a Bayesian tangent shape model to estimate parameters more accurately by Bayesian inference. Liang et al. [4] adopt a Markov network to find an optimal shape which is regularized by the PCA-based shape prior through a constrained regularization algorithm. Li and Ito [5] use AdaBoosted histogram classifiers to model local appearances and optimize shape parameters. Thomas Brox et al. [6] integrate 3D shape knowledge into a variational model for pose estimation and image segmentation. For optimal features, van Ginneken et al. [7] propose a non-linear ASM with Optimal Features (OF-ASM), which allows distributions of multi-modal intensities and uses a k-nearest-neighbors classifier for local texture classification. Federico Sukno et al. [8] further develop
this non-linear appearance model, incorporating a reduced set of differential invariant features as local image descriptors. A cascade structure containing multiple ASMs is introduced in [9] to make the location of landmarks more accurate and robust. However, these methods lose their effectiveness when dealing with complicated shape geometries or large texture variations. Can we utilize some accurate information to simplify the ASM algorithm and make shape parameter estimation more robust? For example, we can use a face detection algorithm to detect the coordinates of the eyes and mouth, or manually label these coordinates, when we want to find a facial contour for further analysis. In this paper, the problem of shape analysis is addressed from three aspects. Firstly, we present a geometry-constrained active shape model (GCASM) and divide it into two parts: a fixed shape and an active shape. The fixed shape is a user-predefined shape with only a few points and lines. These points can be easily and accurately located by machine or human. The active one is a user's desired shape and is composed of many landmarks with a complex contour. It will be located automatically with the help of the fixed shape. Secondly, Bayesian inference is introduced to make parameter estimation more robust to local noise generated by the active shape, which leads to a compensation factor and a smooth factor to perform a coarse-to-fine shape search. Thirdly, optimal features are selected as local image descriptors. Since the pose parameters can be calculated from the fixed shape, classifiers are trained for each landmark without sacrificing performance. The rest of the paper is organized as follows: In Section 2, we begin with a brief review of ASM. Section 3 describes our proposed algorithm and Bayesian inference. Experimental results are provided in Section 4. Finally, we draw the conclusions in Section 5.
2 Active Shape Models This section briefly reviews the ASM segmentation scheme. We follow the description and notation of [2]. An object is described by points, referred to as landmark points. The landmark points are (manually) determined in a set of N training images. From these collections of landmark points, a point distribution model (PDM) [10] is constructed as follows. The landmark points (x1, y1, … , xn, yn) are stacked in shape vectors.
x = ( x1 , y1 ,..., xn , yn )T .
(1)
Principal component analysis (PCA) is applied to the shape vectors x by computing the mean shape, covariance and eigensystem of the covariance matrix.
$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i \qquad \text{and} \qquad S = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})(x_i - \bar{x})^T$ .  (2)
The eigenvectors corresponding to the $k$ largest eigenvalues $\lambda_j$ are retained in a matrix $\Phi = (\phi_1 \,|\, \phi_2 \,|\, \cdots \,|\, \phi_k)$. A shape can now be approximated by
$x \approx \bar{x} + \Phi b$ .  (3)
Where $b$ is a vector of $k$ elements containing the shape parameters, computed by
$b = \Phi^T (x - \bar{x})$ .  (4)
When fitting the model to a set of points, the values of $b$ are constrained to lie within a range
$|b_j| \le c\sqrt{\lambda_j}$ ,  (5)
where c usually has a value between two and three. Before PCA is applied, the shapes can be aligned by translating, rotating and scaling so as to minimize the sum of squared distances between the landmark points. We can express the initial estimate x of a shape as a scaled, rotated and translated version of original shape
x = M ( s,θ )[ x] + t .
(6)
Where M ( s,θ ) and t are pose parameters (See [1] for details). Procrustes analysis [11] and EM algorithm [3] are often used to estimate the pose parameters and align the shapes. This transformation and its inverse are applied both before and after projection of the shape model. The alignment procedure makes the shape model independent of the size, position, and orientation of the objects.
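Under the notation of (1)-(5), the training and fitting steps of the basic PDM can be sketched as follows. This is a simplified NumPy illustration that ignores the pose alignment of (6); the function names and the retained-variance threshold are assumptions.

```python
import numpy as np

def train_pdm(shapes, var_kept=0.95):
    """shapes: (N, 2n) array of aligned shape vectors. Returns mean, eigenvectors, eigenvalues."""
    x_bar = shapes.mean(axis=0)
    S = np.cov(shapes, rowvar=False)                        # (2n, 2n) covariance matrix
    lam, phi = np.linalg.eigh(S)
    order = np.argsort(lam)[::-1]                           # sort eigenpairs by decreasing eigenvalue
    lam, phi = lam[order], phi[:, order]
    k = int(np.searchsorted(np.cumsum(lam) / lam.sum(), var_kept)) + 1
    return x_bar, phi[:, :k], lam[:k]

def fit_shape(x, x_bar, phi, lam, c=3.0):
    """Project a shape onto the model, clamp b as in (5), and reconstruct as in (3)."""
    b = phi.T @ (x - x_bar)                                 # formula (4)
    b = np.clip(b, -c * np.sqrt(lam), c * np.sqrt(lam))     # |b_j| <= c*sqrt(lambda_j), formula (5)
    return x_bar + phi @ b                                  # formula (3)
```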
3 Coarse-to-Fine Statistical Shape Model 3.1 Geometry Constrained Statistical Shape Model
To make use of the user-predefined information, we extend the PDM to two parts: an active shape and a fixed shape. The active shape is a collection of landmarks to describe an object, as in the basic PDM. It is composed of many points with a complex contour. The fixed shape is a predefined simple shape accurately marked by a user or a machine. It is composed of several connected lines between points which can be easily and accurately marked by machine or human. Considering that a line contains a huge number of points, we represent a line by several equidistant points. Thus the extended PDM is constructed as follows. The landmarks (x1, y1, … , xm, ym) are stacked in active shape vectors, and the landmarks (xm+1, ym+1, … , xn, yn) are stacked in fixed shape vectors.
$x = (x_1, y_1, \ldots, x_m, y_m, x_{m+1}, y_{m+1}, \ldots, x_n, y_n)^T$ .  (7)
As in PDM, a shape can now be approximated by
$x \approx \bar{x} + \Phi b$ .  (8)
When aligning shapes during training, the pose parameters of a shape (scaling, rotation and translation) are estimated from the fixed shape. An obvious reason is that the fixed shape is simpler and more accurate than the active one. Taking the cheek contour as an example, the active shape is composed of the landmarks on a cheek contour and the fixed shape is composed of 13 landmarks derived from three manually labeled points: the left eye center, the right eye center and the mouth center. Five
landmarks are added equidistantly between the two eye centers to represent the horizontal connecting line, and five landmarks are inserted equidistantly on the vertical line passing through the mouth center and perpendicular to the horizontal line (see the left graph of Fig. 1 for details). During training, two shapes are aligned according to the points between the two eyes only. Each item of b reflects a specific variation along the corresponding principal component (PC) axis. Shape variation along the first three PCs is shown in the right graph of Fig. 1. The interpretation of these PCs is straightforward. The first PC describes left-right head rotations. The second PC accounts for face variation in the vertical direction: long or short. And the third one explains whether a face is fat or thin.
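Under one reading of the construction above, the 13-point fixed shape can be built from the three labeled points as follows (the choice of running the vertical segment from the foot of the perpendicular on the eye line down to the mouth is an assumption, as are the function and variable names):

```python
import numpy as np

def fixed_t_shape(left_eye, right_eye, mouth):
    """Build the 13-landmark 'T'-shaped fixed shape from three labeled points."""
    le, re, m = (np.asarray(p, dtype=float) for p in (left_eye, right_eye, mouth))
    horiz = [le + t * (re - le) for t in np.linspace(0.0, 1.0, 7)]    # 2 eye centers + 5 in between
    d = (re - le) / np.linalg.norm(re - le)
    foot = le + np.dot(m - le, d) * d         # foot of the perpendicular from the mouth onto the eye line
    vert = [foot + t * (m - foot) for t in np.linspace(1/6, 1.0, 6)]  # 5 inserted points + mouth center
    return np.array(horiz + vert)             # 7 + 6 = 13 landmarks, stacked as (x, y) rows
```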
Fig. 1. The fixed shape and shapes reconstructed by the first three PCs. The thirteen white circles in the left image are the points of the fixed shape. In the right image, the middle shape in each row is the mean shape.
3.2 Bayesian Inference
When directly calculating the shape parameter $b$ by formula (4), there is an offset between the reconstructed fixed shape and the given fixed shape. But the fixed shape is supposed to be accurate; this noise comes from the reconstruction error of the active shape. Inspired by paper [3], we associate PCA with a probabilistic explanation. An isotropic Gaussian noise term is added to both the fixed and the active shape; thereby we can compute the posterior of the model parameters. The model can be written as:
$y = \bar{x} + \Phi b + \varepsilon$ ,  (9)
$y - \bar{x} - \Phi b = \varepsilon$ .  (10)
Where the shape parameter $b$ is an $n$-dimensional vector distributed as a multivariate Gaussian $N(0, \Lambda)$ with $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_k)$. $\varepsilon$ denotes an isotropic noise on the whole shape. It is an $n$-dimensional random vector which is independent of $b$ and distributed as
$p(\varepsilon) \sim \exp\{-\|\varepsilon\|^2 / 2\rho^2\}$ .  (11)
$\rho = \sum_{i=1}^{n} \alpha_i \, \|y_i^{\mathrm{old}} - y_i\|^2$ .  (12)
Where $y^{\mathrm{old}}$ is the shape estimated in the last iteration and $y$ is the observed shape in the current iteration. $\alpha_i$ is the classification confidence related to the classifier used in locating a
landmark. When $\alpha_i$ is 0, the classifier can perfectly predict the shape's boundary; when $\alpha_i$ is 1, the classifier fails to predict the boundary. Combining (10) and (11), we obtain the posterior of the model parameters:
$P(b \mid y) = \mathrm{const} \cdot P(y \mid b)\, P(b) \sim \exp\!\Big(-\tfrac{1}{2}\big[(y - \bar{x} - \Phi b)^T (y - \bar{x} - \Phi b)/\rho + b^T \Lambda^{-1} b\big]\Big)$  (13)
Let $\partial(\ln P(b \mid y)) / \partial b = 0$; we get
$b_j = \big(\lambda_j / (\lambda_j + \rho)\big)\, \phi_j^T (y - \bar{x})$ .  (14)
Combining (4), we obtain:
$b_j = \big(\lambda_j / (\lambda_j + \rho)\big)\, b_j$ .  (15)
It is obvious that the value of $b_j$ becomes smaller after the update of (15) (since $\rho \ge 0$). This slows down the search. Hence, a compensation factor $p_1$ is introduced to make shape variation along the eigenvectors corresponding to large eigenvalues more aggressive (see formula (18)). If $p_1$ is equal to $(\lambda_{\max} + \rho)/\lambda_{\max}$, we get
$b_j = \big((\lambda_{\max} + \rho)/\lambda_{\max}\big) \big(\lambda_j/(\lambda_j + \rho)\big)\, b_j$ .  (16)
Formula (16) shows that a parameter $b_j$ corresponding to a larger eigenvalue receives only a small penalty, while a parameter $b_j$ corresponding to a small eigenvalue becomes smaller after updating. Moreover, we expect a smooth shape contour and neglect details in the first several iterations. A smooth factor $p_2$ (see formula (18)) is introduced to further penalize the parameter $b_j$. Note that $\rho$ is smaller than the largest eigenvalue and will keep decreasing. The factor $p_2$ regularizes the parameters by enlarging the penalty. As in Fig. 2, the shape contour reconstructed by Bayesian inference is smoother than the one reconstructed by PCA in the regions pointed to by the black arrows. Although the PCA reconstruction can remove some noise, the reconstructed shape is still unstable when the image is noisy. Formula (18) makes the parameter estimation more robust to local noise.
Fig. 2. Shapes reconstructed from PCA and Bayesian Inference. Left shape is mean shape after desired movements; middle shape is reconstructed by PCA; right shape is reconstructed by Bayesian Inference. The black arrows highlight the regions to be compared.
3.3 Optimal Features
Recently, optimal features have been applied in ASMs and have drawn more and more attention [5,6,7]. Experimental results show that optimal features can make shape segmentation more accurate. But a main drawback of optimal-feature methods is that the ASM needs more time to find the desired landmarks, because optimal features must be extracted in each iteration. An efficient speed-up strategy is to select one subset of the available features for all landmarks [6,7]. It is clear that textures around different landmarks are different, and it is impossible for a single subset of optimal features to describe the various textures around all the landmarks. In GCASM, the pose parameters of scale, rotation and translation can be calculated from the fixed shape. All landmarks can be categorized into several groups, for each of which we select the same discriminative features. When searching a shape, the image is divided into several areas according to these categories. For each area, the same optimal features are extracted to determine the movement. The optimal features are those reported in both papers [6] and [7]. Fig. 3 shows the classification results for each landmark. The mean classification accuracy is 76.67%. We can see that landmarks near the jaw and the two ears have low classification accuracy, and landmarks near the cheek have high classification accuracy. Considering this classification error, we introduce Bayesian inference and the $\alpha_i$ of formula (12) to make the shape estimation more robust.
Fig. 3. Classification results for each cheek landmark. Classification accuracy stands for a classifier's ability to classify whether a point near the landmark is inside or outside of the shape. The points around indices 4 and 22 are close to the ears, and the points around index 13 are close to the jaw.
3.4 Coarse-to-Fine Shape Search
During image search, the main differences between GCASM and ASM are twofold. One is that, since the pose parameters of GCASM have been calculated from the fixed shape, we need not consider pose variation during the iterative updating procedure. The other is that the fixed shape is predefined accurately in GCASM. After reconstruction from the shape parameters, noise will make the reconstructed fixed shape drift away from the given fixed shape. Because the fixed shape is supposed to be accurate, it should be realigned to the initial points. The iterative updating procedures of GCASM and ASM are shown in Fig. 4. We use formula (17) to calculate the shape parameter $b = [b_1, \ldots, b_k]^T$ and normalize $b$ by formula (18).
$b_j = \phi_j^T (y - \bar{x})$ .  (17)
$b_j = \big(p_1 \lambda_j / (\lambda_j + p_2 \rho)\big)\, b_j$ .  (18)
Where $1 \le p_1 \le (\lambda_{\max} + p_2\rho)/\lambda_{\max}$ and $p_2 \ge 1$. We call the parameter $p_1$ the compensation factor, which makes shape variation more aggressive. The parameter $p_2$ is a smooth factor, which penalizes the shape parameters when the shape has a large variation. The compensation factor and the smooth factor put more emphasis on shape parameters corresponding to large eigenvalues. This adjusts a shape along the major PCs and neglects the shape's local detail in the initial iterations. When the algorithm converges ($\rho \to 0$), $p_1\lambda_j/(\lambda_j + p_2\rho)$ is equal to 1. Hence, the compensation factor and the smooth factor lead to a coarse-to-fine shape search. Here, we simply set $p_1 = (\lambda_{\max} + p_2\rho)/\lambda_{\max}$, $\alpha_i = 1$, and $p_2 = 4$. Obviously, formula (18) can also be used in ASMs to normalize the shape parameters.
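One parameter-update step of the coarse-to-fine search, following (12), (17) and (18), can be sketched as follows. This is a simplified illustration; the function name is an assumption, and $\alpha_i = 1$ and $p_2 = 4$ are taken from the settings above.

```python
import numpy as np

def update_parameters(y, y_old, x_bar, phi, lam, p2=4.0, alpha=None):
    """One coarse-to-fine parameter update, following formulas (12), (17) and (18)."""
    if alpha is None:
        alpha = np.ones(len(y) // 2)                       # alpha_i = 1, as set above
    d2 = ((y - y_old).reshape(-1, 2) ** 2).sum(axis=1)     # squared per-landmark displacements
    rho = float(np.dot(alpha, d2))                         # formula (12)
    p1 = (lam.max() + p2 * rho) / lam.max()                # compensation factor at its upper bound
    b = phi.T @ (y - x_bar)                                # formula (17)
    b = (p1 * lam / (lam + p2 * rho)) * b                  # formula (18): coarse-to-fine shrinkage
    return x_bar + phi @ b, rho
```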
Fig. 4. Updating rules of ASM and GCASM. The left block diagram is the basic ASM’s updating rule and the right block diagram is GCASM updating rule.
4 Experiments In this section, our proposed method is tested in two experiments: cheek contour search and facial contour search. A total of 100 face images are randomly taken from the XM2VTS face database [12]. Each image is aligned by the coordinates of the two eyes. The average distance between the two eyes is 80 pixels. The three points of the fixed shape, including the two eye centers and the mouth center, are manually labeled. The fixed shape takes the shape of the letter 'T'. Hamarneh's ASM source code [13] is taken as the standard ASM without modification. Optimal features are collected from the features reported in both papers [7] and [8]. The number of optimal features is reduced by sequential feature
selection [14]. In this work, all the points near the landmarks are classified by linear regression to predict whether they lie inside or outside of a shape. 4.1 Experiments on Cheek Contour
A designed task to directly search a cheek contour without eyes, brows, mouth, and nose is presented to validate our method. A total of 25 cheek landmarks are labeled manually on each image. The PCA thresholds are set to 99% for all ASMs. The fixed shape is composed of the points between the two eyes and the mouth. As seen in Fig. 3, it is difficult to locate landmarks near the ears and the jaw. When a contour shape is simple and the textures around landmarks are complex, the whole shape will be dragged away from the right position if there are several inaccurate points. It is clear that the cheek shape can be accurately located with the help of the fixed shape.
Fig. 5. Comparison of different algorithms’ cheek searching results: Shapes in first column are results of ASM searching; Shapes in second column are results of simple OF-ASM; Shapes in third column are results of the basic GCASM; Shapes in fourth column are results of GCASM with optimal features; Shapes in fifth column are results of GCASM with optimal features and Bayesian inference
As in Fig. 5, the first two columns are the search results of ASM and OF-ASM. It is clear that the search results miss the desired position because of local noise; several inaccurate landmarks drag the shape away from the desired position. It also illustrates that optimal features can model the contour appearance more accurately. As illustrated in the last three columns of Fig. 5, the search results are well trapped in a local area when the fixed shape is introduced. Because the fixed shape is accurate and without noise, the reconstructed shape falls into a local area around the fixed shape even if some landmarks are inaccurate. Every landmark finds a locally best matched point instead of a global one. Comparing the third and fourth columns, we can see that optimal features can locate landmarks more accurately. But optimal
features couldn’t keep local contour detail very well. There is still some noise in searching results. Looking at the fifth column of Fig.5, it is clear that borders of the shapes become smoother. The Bayesian inference can further improve the accuracy. 4.2 Experiments on Facial Contour
A total of 96 face landmarks are labeled manually on each image. The PCA thresholds are set to 95% for all ASMs. Three landmarks are inserted between the two eyes to represent the horizontal connecting line, and three landmarks are inserted between the mouth and the horizontal line to represent the vertical line. For the sake of simplicity, optimal features are not used in this subsection. The results are shown in Table 1. Table 1. Comparison results of traditional ASM and our method without optimal features
                ASM    Our algorithm   Improvement
Face            7.74   4.68            39.5%
F.S.O.          6.45   4.41            31.6%
Cheek Contour   11.4   5.47            52.0%
Here F.S.O. means the five sense organs. Location error is measured in pixels. It is clear that our algorithm is much more accurate than ASM.
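For reference, the location error in Table 1 can be read as a mean point-to-point distance in pixels. The following minimal sketch shows how such an error could be computed with NumPy; the exact error definition is our assumption, not the authors' code.

```python
import numpy as np

def mean_location_error(fitted, ground_truth):
    # fitted, ground_truth: (N, 2) arrays of (x, y) landmark coordinates.
    # Returns the mean Euclidean distance between corresponding landmarks, in pixels.
    return float(np.mean(np.linalg.norm(fitted - ground_truth, axis=1)))

# Hypothetical usage: average the per-image error over the test set.
# errors = [mean_location_error(f, g) for f, g in zip(results, annotations)]
# print(np.mean(errors))
```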
Fig. 6. Comparison of ASM and GCASM with Bayesian inference. The first row shows ASM results, and the second row shows our results.
Fig. 6 shows a set of search results of the basic ASM and of GCASM with Bayesian inference. In these cases, there are wrinkles and shading on the facial contour or other facial sub-parts. It is clear that our method can recover the shape from local noise. A direct reason is that the shape variation is restricted to a local area when accurate information is combined into the ASM. The Bayesian inference constrains the whole shape and smooths the shape border.
5 Conclusion
This work focuses on an interesting topic: how to combine accurate information given by a user or a machine to further improve shape alignment accuracy. The PDM is extended by adding a fixed shape generated from the given information. After PCA reconstruction, local noise in the active shape makes the whole shape unsmooth. Hence, Bayesian inference is proposed to further normalize the parameters of the extended PDM. Both the compensation factor and the smoothing factor lead to a coarse-to-fine shape adjustment. Comparisons between our algorithm and the ASM algorithms demonstrate its effectiveness and efficiency.
Acknowledgements This work was supported by the following funding resources: National Natural Science Foundation Project #60518002, National Science and Technology Supporting Platform Project #2006BAK08B06, National 863 Program Projects #2006AA01Z192 and #2006AA01Z193, Chinese Academy of Sciences 100 people project, and the Authen-Metric Collaboration Foundation.
References
1. Cootes, T.F., Taylor, C.J., Cooper, D., Graham, J.: Active shape models - their training and application. Comput. Vis. Image Understanding 61(1), 38–59 (1995)
2. Cootes, T.F., Taylor, C.J.: Statistical models of appearance for computer vision. Wolfson Image Anal. Unit, Univ. Manchester, Manchester, U.K., Tech. Rep. (1999)
3. Zhou, Y., Gu, L., Zhang, H.-J.: Bayesian tangent shape model: Estimating shape and pose parameters via Bayesian inference. In: IEEE Conf. on Computer Vision and Pattern Recognition, Madison, WI (June 2003)
4. Liang, L., Wen, F., Xu, Y.Q., Tang, X., Shum, H.Y.: Accurate face alignment using shape constrained Markov network. In: Proc. CVPR (2006)
5. Li, Y.Z., Ito, W.: Shape parameter optimization for Adaboosted active shape model. In: ICCV, pp. 259–265 (2005)
6. Brox, T., Rosenhahn, B., Weickert, J.: Three-dimensional shape knowledge for joint image segmentation and pose estimation. In: Kropatsch, W.G., Sablatnig, R., Hanbury, A. (eds.) Pattern Recognition. LNCS, vol. 3663, pp. 109–116. Springer, Heidelberg (2005)
7. van Ginneken, B., Frangi, A.F., Staal, J.J., ter Haar Romeny, B.M., Viergever, M.A.: Active shape model segmentation with optimal features. IEEE Transactions on Medical Imaging 21(8), 924–933 (2002)
8. Sukno, F., Ordas, S., Butakoff, C., Cruz, S., Frangi, A.F.: Active shape models with invariant optimal features: IOF-ASMs. In: Kanade, T., Jain, A., Ratha, N.K. (eds.) AVBPA 2005. LNCS, vol. 3546, pp. 365–375. Springer, Heidelberg (2005)
9. Zhang, S., Wu, L.F., Wang, Y.: Cascade MR-ASM for locating facial feature points. In: The 2nd International Conference on Biometrics (2007)
10. Dryden, I., Mardia, K.V.: The Statistical Analysis of Shape. Wiley, London, U.K. (1998)
11. Goodall, C.: Procrustes methods in the statistical analysis of shapes. J. Roy. Statist. Soc. B 53(2), 285–339 (1991)
12. Messer, K., Matas, J., Kittler, J., Luettin, J., Maitre, G.: XM2VTSDB: The extended M2VTS database. In: Proc. AVBPA, pp. 72–77 (1999)
13. Hamarneh, G.: Active Shape Models with Multi-resolution, http://www.cs.sfu.ca/~hamarneh/software/asm/index.html
14. Kudo, M., Sklansky, J.: Comparison of algorithms that select features for pattern classifiers. Pattern Recognition 33(1), 25–41 (2000)
Efficient Texture Representation Using Multi-scale Regions
Horst Wildenauer1, Branislav Mičušík1,2, and Markus Vincze1
1 Automation and Control Institute, Vienna University of Technology, Austria
2 Institute of Computer Aided Automation, PRIP Group, Vienna University of Technology, Austria
Abstract. This paper introduces an efficient way of representing textures using connected regions which are formed by coherent multi-scale over-segmentations. We show that the recently introduced covariance-based similarity measure, initially applied on rectangular windows, can be used with our newly devised, irregular structure-coherent patches; increasing the discriminative power and consistency of the texture representation. Furthermore, by treating texture in multiple scales, we allow for an implicit encoding of the spatial and statistical texture properties which are persistent across scale. The meaningfulness and efficiency of the covariance-based texture representation is verified utilizing a simple binary segmentation method based on min-cut. Our experiments show that the proposed method, despite the low dimensional representation in use, is able to effectively discriminate textures and that its performance compares favorably with the state of the art.
1 Introduction
Textures and structured patterns are important cues towards image understanding, pattern classification and object recognition. The analysis of texture properties and their mathematical and statistical representation has been attracting the interest of researchers for many years, with the primary goal of finding low-dimensional and expressive representations that allow for reliable handling and classification of texture patterns. Texture representations, which have been successfully applied to image segmentation tasks, include steerable filter responses [1], color changes in a pixel's neighborhood [2], covariance matrices of gradients, color, and pixel coordinates [3], Gaussian Mixture Models (GMM) computed from color channels [4,5], color histograms [6], or multi-scale densities [7,8]. Since textures "live" at several scales, a scale-dependent discriminative treatment should be aimed for. In this paper, we explore the possibility of refining coarse texture segmentation by matching textures between adjacent scales, taking
The research has been supported by the Austrian Science Foundation (FWF) under the grant S9101, and the European Union projects MOVEMENT (IST-2003-511670), Robots@home (IST-045350), and MUSCLE (FP6-507752).
Fig. 1. Segmentation results using the min-cut algorithm [12] with different texture representations. (a) Input image with user specified foreground and background markers. (b) The proposed multi-scale texture representation. (c) Color Histograms [6]. (d) GrabCut [4] using GMMs. (e) Color changes in the pixel neighbourhoods [2].
advantage of spatial and statistical properties which persist across scale. We show that texture segments can be efficiently treated in a multi-scale hierarchy similarly to [8], but building on superpixels. In our approach, textures are represented by covariance matrices, for which an effective similarity measure based on the symmetric generalized eigenproblem was introduced in [9]. In contrast to the rectangular windows used in [3], covariance matrices are computed from irregular structure-coherent patches, found at different scales. In order to allow for an efficient image partitioning into scale-coherent regions we devised a novel superpixel method, utilizing watershed segmentations imposed by extrema of an image's mean curvature. However, the suggested framework of multi-scale texture representation is generally applicable to other superpixel methods, such as [10,11], depending on the accuracy and time complexity constraints imposed by the application domain. We verify the feasibility and meaningfulness of the multi-scale covariance-based texture representation by a binary segmentation method based on the min-cut algorithm [12]. Figure 1 shows an example of how different types of texture descriptors influence the min-cut segmentation of a particularly challenging image, consisting of textured regions with highly similar color characteristics. The remainder of the paper is organized as follows. We present the details of the proposed method in Section 2. Section 3 reports experimental results and compares them to the results obtained using state-of-the-art methods. The paper is concluded with a discussion in Section 4.
2 Our Approach
2.1 Superpixels
Probably one of the most commonly used blob detectors is based on the properties of the Laplacian of Gaussians (LoG) or its approximation, the Difference of Gaussians (DoG) [13]. Given a Scale-Space representation L(t), obtained by repeatedly convolving an input image with Gaussians of increasing size t, the shape of the intensity surface around a point x at scale t can be described using the Hessian matrix
H(x, t) = \begin{pmatrix} L_{xx}(x,t) & L_{xy}(x,t) \\ L_{xy}(x,t) & L_{yy}(x,t) \end{pmatrix}.    (1)
The LoG corresponds to the trace of the Hessian:
\nabla^2 L(x,t) = L_{xx}(x,t) + L_{yy}(x,t),    (2)
and equals the mean intensity curvature multiplied by two. The LoG computation results in strong negative or positive responses for bright and dark blob-like structures of size √t, respectively. Using this, the position and characteristic scale of blobs can be found by detecting Scale-Space extrema of scale-normalized LoG responses [14]. In our approach, we do not directly search for blob positions and scales, but rather use spatial response extrema as starting points for a watershed-based over-segmentation of an image's mean curvature surface. Specifically, we proceed as follows (a minimal code sketch of this procedure is given at the end of this subsection):
1. Computation of LoG responses at scales √t = 2^{m/3}, with m = 1 . . . M, where M denotes a predefined number of scales; i.e., we calculate 3 scale levels per Scale-Space octave.
2. Watershed segmentation:
   (a) Detection of spatial response extrema at all scales. Extrema with low contrast, i.e., those with a minimum absolute difference to adjacent pixels smaller than a predefined threshold, are discarded.
   (b) At each scale, segment the image into regions assigned to positive or negative mean curvature. This is achieved by applying the watershed to the negative absolute Laplacian −|∇²L(x, t)| using the seeds from (a).
The majority of the watersheds thus obtained follow the zero-crossings of the Laplacian, i.e., the edges where the mean curvature of the intensity surface changes its sign. However, for irregularly shaped blobs, which exhibit significant variations in mean curvature, several seed points are usually detected. This results in an over-segmentation of regions with otherwise consistent curvature signs. Figure 2 shows a direct comparison of the superpixels produced by our method at a single scale and the normalized-cut based superpixels suggested in [10]. Another method for image over-segmentation, which is partially favoured for its speed, utilizes the Minimum Spanning Tree [11]. However, for larger superpixels, which are needed to stably compute the covariance-based descriptor
Fig. 2. Left: Superpixels obtained by the proposed method. Right: Superpixels obtained by the method of Ren et al. [10].
on, the regions obtained by this method are highly irregular and often do not align well with object boundaries. Figure 3 shows the effect of using different superpixels in conjunction with the method proposed in this paper. As one can see, our method gives acceptable results compared to the normalized-cut based approach, which needs more than 100 times longer to compute the segmentation. The outlined approach is similar in spirit to the watershed segmentation of principal curvature images proposed by Deng et al. [15]. In their approach, the first principal curvature image (i.e., the image of the larger eigenvalue of the Hessian matrix) is thresholded near zero and either the positive or the negative remainder is flooded starting from the resulting zero-valued basins. Hence, as opposed to our method, the watersheds follow the ridges of the image's principal curvature surface. In experiments we found that this approach was not suitable for our purposes since it tends to under-segment images, aggressively merging regions with same-signed principal curvature.
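The following minimal sketch, assuming SciPy and scikit-image, illustrates one way the per-scale seeded watershed described above might be implemented; the seed-selection details (minimum distance, contrast threshold) are our assumptions rather than the authors' implementation.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

def multiscale_superpixels(image, num_scales=6, contrast_thresh=0.02):
    # Returns one superpixel label image per scale.
    labels_per_scale = []
    for m in range(1, num_scales + 1):
        sigma = 2.0 ** (m / 3.0)                 # sqrt(t) = 2^{m/3}
        # Scale-normalized LoG response (trace of the Hessian).
        log = sigma ** 2 * ndi.gaussian_laplace(image.astype(float), sigma)
        # Seeds: spatial extrema of the LoG response with sufficient contrast.
        maxima = peak_local_max(np.abs(log), min_distance=3,
                                threshold_abs=contrast_thresh)
        seeds = np.zeros(image.shape, dtype=int)
        seeds[tuple(maxima.T)] = np.arange(1, len(maxima) + 1)
        # Flood the negative absolute Laplacian from the seeds, so the
        # watershed lines tend to follow the zero-crossings of the LoG.
        labels_per_scale.append(watershed(-np.abs(log), markers=seeds))
    return labels_per_scale
```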
2.2 Covariance-Based Texture Similarity
Recently, Tuzel et al. [3] have introduced region covariance matrices as potent, low-dimensional image patch descriptors, suitable for object recognition and texture classification. One of the authors' main contributions was the introduction of an integral-image like preprocessing stage, allowing for the computation of covariances from image features of an arbitrarily sized rectangular window in constant time. However, since in the presented work covariances are directly obtained from irregularly shaped superpixels, the aforementioned fast covariance computation is not applicable. We proceed to give a brief description of the covariance-based texture descriptor in use. The sample covariance matrix of the feature vectors collected inside a superpixel is given by
M = \frac{1}{N-1} \sum_{n=1}^{N} (z_n - \mu)(z_n - \mu)^\top,    (3)
Fig. 3. Effect of superpixels on the final image segmentation. From left to right: Superpixels through a color-based Minimum Spanning Tree [11]. The proposed approach. Superpixels based on normalized-cuts using combined color and textured gradient [10].
where μ denotes the sample mean, and {z_n}_{n=1...N} are the d-dimensional feature vectors extracted at N pixel positions. In our approach, these feature vectors are composed of the values of the RGB color channels R, G, and B and the absolute values of the first derivatives of the intensity I at the n-th pixel:
z_n = \Big[ R_n,\; G_n,\; B_n,\; \Big|\frac{\partial I}{\partial x}\Big|,\; \Big|\frac{\partial I}{\partial y}\Big| \Big]^\top.    (4)
The resulting 5×5 covariance matrix gives a very compact texture representation with the additional advantage of exhibiting a certain insensitivity to illumination changes and, as will be shown experimentally, offers sufficient discriminative power for the segmentation task described in the remainder of the paper. To measure the similarity ρ(M_i, M_j) of two covariance matrices M_i and M_j we utilize the distance metric initially proposed by Förstner [9]:
\rho(M_i, M_j) = \sqrt{\sum_{k=1}^{d} \ln^2 \lambda_k(M_i, M_j)},    (5)
where the {λ_k}_{k=1...d} are the eigenvalues obtained by solving the generalized eigenvalue problem
M_i e_k = \lambda_k M_j e_k, \quad k = 1 \ldots d,    (6)
with e_k ≠ 0 denoting the generalized eigenvectors. The cost of computing ρ is on the order of O(d³) flops which, due to the low dimensionality of the representation, leads to speed advantages compared to histogram matching methods. For a detailed discussion of the topic, other choices of feature combinations, as well as the useful properties of region covariances, see [3].
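As a concrete illustration of Eqs. (3)-(5), the following sketch (assuming NumPy and SciPy) computes the 5×5 region covariance of a superpixel and the Förstner distance between two such matrices; the choice of gradient filter and of the RGB mean as the intensity I are our assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.linalg import eigh

def region_covariance(rgb, mask):
    # Eq. (3)/(4): covariance of [R, G, B, |dI/dx|, |dI/dy|] over a superpixel.
    # rgb: (H, W, 3) float image; mask: boolean superpixel mask.
    intensity = rgb.mean(axis=2)               # assumption: I = mean of R, G, B
    gy, gx = np.gradient(intensity)            # simple finite differences
    feats = np.stack([rgb[..., 0], rgb[..., 1], rgb[..., 2],
                      np.abs(gx), np.abs(gy)], axis=-1)
    z = feats[mask]                            # (N, 5) feature vectors z_n
    return np.cov(z, rowvar=False)             # unbiased 1/(N-1) normalization

def covariance_distance(Mi, Mj):
    # Eq. (5): Foerstner metric from the generalized eigenvalues M_i e = lambda M_j e.
    lam = eigh(Mi, Mj, eigvals_only=True)
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))
```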
2.3 Foreground and Background Codebooks
From the covariance-based descriptors proposed in Subsection 2.2 we compute representative codebooks for foreground and background regions. These are used later on to drive the image segmentation.
Fig. 4. Two-layer MRF with superpixels detected at two different scales. To avoid clutter, not all superpixels are connected to the sink/source nodes.
For the foreground and background codebooks, we require user-specified markers, as shown in Figure 1(a), computing the covariance matrices M_i^F and M_i^B from all points under the marker regions. Usually, the background contains more textures and cluttered areas, requiring more seeds to be established. Moreover, in applications like object detection or recognition, the background can vary significantly across images while objects of interest usually remain quite consistent in appearance. To somewhat alleviate the burden of manually selecting many seeds, we propose to avoid the need for background markers by following a simple strategy: We take a rim at the boundary of the image and feed all superpixels under the rim into a hierarchical clustering method with a predefined stopping distance threshold, with the distance between superpixels given by Equation 5. After clustering we take the K most occupied clusters and compute the mean covariance matrix for each cluster out of all covariance matrices belonging to the cluster. For efficiency reasons, we do not calculate the mean covariance matrix by pooling over all participating feature vectors, but use the method described in [16,3], which has its roots in formulations of symmetric positive definite matrices lying on connected Riemannian manifolds. Using this procedure, we arrive at the background codebook matrices M_i^B. Of course, the applicability of this ad-hoc technique is limited in cases where the object of interest touches the boundary, or when the rim is not representative enough. However, in most cases the approach led to background codebooks with sufficient explanatory power for a successful segmentation.
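To make the rim-based codebook construction concrete, here is a rough sketch using SciPy's hierarchical clustering; the stopping threshold, K, and the Euclidean averaging of cluster members (a crude stand-in for the Riemannian mean of [16,3]) are our assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.linalg import eigh
from scipy.spatial.distance import squareform

def covariance_distance(Mi, Mj):               # Eq. (5), as in the sketch above
    lam = eigh(Mi, Mj, eigvals_only=True)
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))

def background_codebook(rim_covs, stop_dist=0.5, K=4):
    # rim_covs: covariance matrices of the superpixels under the boundary rim.
    n = len(rim_covs)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = covariance_distance(rim_covs[i], rim_covs[j])
    labels = fcluster(linkage(squareform(dist), method='average'),
                      t=stop_dist, criterion='distance')
    ids, counts = np.unique(labels, return_counts=True)
    codebook = []
    for cid in ids[np.argsort(-counts)][:K]:   # the K most occupied clusters
        members = [rim_covs[i] for i in range(n) if labels[i] == cid]
        codebook.append(np.mean(members, axis=0))
    return codebook
```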
2.4 Multi-scale Graph-Cut
In order to verify the validity of the covariance-based texture representation, taking into account the superpixel behaviour across different scales, we adopted a binary segmentation method based on the min-cut algorithm [12]. Suppose that the image at a certain scale t is represented by a graph G_t = ⟨V_t, E_t⟩, where V_t is the set of all vertices representing superpixels, and E_t is the set of all intrascale edges connecting spatially adjacent vertices. To capture the Scale-Space behaviour we connect the graphs by interscale edges forming a set of edges S. We form the entire graph G = ⟨V, E, S⟩ consisting of the union of
all vertices V_t, and all intrascale E_t and interscale edges S. For more clarity, the resulting graph structure is depicted in Figure 4. The binary segmentation of the graph G is achieved by finding the minimum cut [12], minimizing the Gibbs energy
E(x) = \sum_{i \in V} E_{data}(x_i, M_i) + \lambda \sum_{(i,j) \in E} \delta(x_i, x_j)\, E_{sm\_im}(M_i, M_j) + \gamma \sum_{(i,j) \in S} \delta(x_i, x_j)\, E_{sm\_sc}(M_i, M_j),    (7)
where x = [x_0, x_1, ...] corresponds to a vector with a label x_i for each vertex. We concentrate on a bi-layer segmentation where the label x_i is either 0 (background) or 1 (foreground). M_i corresponds to the measurement at the i-th graph vertex, i.e., to the covariance matrix of a given superpixel. The weight constants λ, γ control the influence of the image (intrascale) and interscale smoothness terms, respectively; δ denotes the Kronecker delta.
The data term describes how likely the superpixel is foreground or background. The data term for the foreground is defined as
E_{data}(x_i = 1, M_i) = \frac{l(M_i, F)}{l(M_i, F) + l(M_i, B)},    (8)
where l(M_i, F) = \exp\big(-\min_{k=1...|F|} \rho(M_i, M_k^F)/(2\sigma_1^2)\big) stands for the foreground likelihood of superpixel i. M_k^F denotes the k-th covariance matrix from the foreground codebook set F, and σ_1 is an experimentally determined parameter. As the derivation of the background terms and likelihoods follows analogously, we omit its description.
The smoothness term describes how strongly neighboring pixels are bound together. There are two types of smoothness terms, see Equation (7): one for intrascale neighborhoods, E_{sm\_im}, and one for interscale neighborhoods, E_{sm\_sc}. The intrascale smoothness term, using α-blending, is defined as
E_{sm\_im}(M_i, M_j) = \alpha \exp\big(-\rho(M_i, M_j)/(2\sigma_2^2)\big) + (1-\alpha) \exp\big(-(l(M_i, F) - l(M_j, F))^2/(2\sigma_3^2)\big),    (9)
where σ_2 and σ_3 are pre-defined parameters. The interscale smoothness term is only defined for edges between two vertices from neighboring scales when the corresponding superpixels share at least one image pixel. The weight on the edge between superpixels i and j from consecutive scales is set to
E_{sm\_sc}(M_i, M_j) = \beta \frac{\mathrm{area}(i \cap j)}{\mathrm{area}(i)} + (1-\beta) \exp\big(-(l(M_i, F) - l(M_j, F))^2/(2\sigma_3^2)\big).    (10)
Fig. 5. Importance of inter-scale graph edges. From left to right: Only one lower scale used. Only one higher scale used. Three consecutive scales used.
The second term in both Equations (9) and (10) increases the dependency of the smoothness terms on the foreground likelihood, making them more robust, as originally suggested in [8]. However, we rely on this term only partially, through the interpolation parameters α, β, since a full dependency on the likelihood often resulted in compact, but otherwise incomplete segmentations. Figure 5 shows how the use of multiple scales and inter-scale edges improves the segmentation compared to segmentation performed separately for given scales.
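For concreteness, the data and smoothness terms of Eqs. (8)-(10) could be evaluated per vertex and edge roughly as follows; the σ, α, and β values are illustrative only, not those used in our experiments.

```python
import numpy as np
from scipy.linalg import eigh

def covariance_distance(Mi, Mj):               # Eq. (5)
    lam = eigh(Mi, Mj, eigvals_only=True)
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))

def likelihood(M, codebook, sigma1=1.0):       # l(M, F) or l(M, B) in Eq. (8)
    d = min(covariance_distance(M, Mk) for Mk in codebook)
    return np.exp(-d / (2.0 * sigma1 ** 2))

def data_term_fg(M, fg, bg):                   # E_data(x_i = 1, M_i), Eq. (8)
    lf, lb = likelihood(M, fg), likelihood(M, bg)
    return lf / (lf + lb)

def smooth_intrascale(Mi, Mj, fg, alpha=0.7, sigma2=1.0, sigma3=0.2):
    # E_sm_im of Eq. (9): appearance similarity blended with likelihood similarity.
    app = np.exp(-covariance_distance(Mi, Mj) / (2.0 * sigma2 ** 2))
    lik = np.exp(-(likelihood(Mi, fg) - likelihood(Mj, fg)) ** 2 / (2.0 * sigma3 ** 2))
    return alpha * app + (1.0 - alpha) * lik

def smooth_interscale(Mi, Mj, overlap, area_i, fg, beta=0.5, sigma3=0.2):
    # E_sm_sc of Eq. (10): overlap ratio of the two superpixels plus likelihood term.
    lik = np.exp(-(likelihood(Mi, fg) - likelihood(Mj, fg)) ** 2 / (2.0 * sigma3 ** 2))
    return beta * overlap / float(area_i) + (1.0 - beta) * lik
```

These weights would then be handed to a standard min-cut/max-flow solver [12] over the two-layer graph of Figure 4.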
3 Experimental Results
We performed segmentation tests on images from the Berkeley dataset (http://www.cs.berkeley.edu/projects/vision/grouping/segbench). We compare the results to the recent approach proposed by Mičušík & Pajdla [2]. Their method looks at color changes in the pixel neighbourhood, yielding superior results on textured images compared to other methods. For both methods the same manually established foreground and background markers were used. To guarantee a fair comparison, the automatic background codebook creation proposed in Section 2.3 was omitted. We present some results where our proposed method performs better than or comparably to [2]. These images typically contain textures with similar colors and are, as stated in [2], the most crucial for their texture descriptor. One must realize that covariance-based texture description cannot cope reliably with homogeneous color regions; see the missing roof of the hut in Figure 6. This should be kept in mind, and such a descriptor should be used complementarily with some color features. Overall, as the experiments show, the newly proposed technique performs very well on textures. The advantage over methods such as [6,4,2] is computational efficiency. Moreover, using more accurate superpixels, e.g. [10], improves the accuracy of the result at the price of higher time consumption.
4 Summary and Conclusion
We present an efficient way of representing textures using connected regions, formed by coherent multi-scale over-segmentations. We show the favourable
Fig. 6. Segmentation comparison. (a) Input image with user marked seeds. (b) The method from [2]. (c) Our approach.
performance on the segmentation of textured images. However, our primary goal is not to segment images accurately, but to demonstrate the feasibility of the covariance-matrix-based descriptor used in a multi-scale hierarchy built on superpixels. The method is aimed at further use in recognition and image understanding systems where highly accurate segmentation is not required.
References
1. Malik, J., Belongie, S., Leung, T., Shi, J.: Contour and texture analysis for image segmentation. IJCV 43(1), 7–27 (2001)
2. Mičušík, B., Pajdla, T.: Multi-label image segmentation via max-sum solver. In: Proc. CVPR (2007)
3. Tuzel, O., Porikli, F., Meer, P.: Region covariance: A fast descriptor for detection and classification. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 589–600. Springer, Heidelberg (2006)
4. Rother, C., Kolmogorov, V., Blake, A.: "GrabCut": interactive foreground extraction using iterated graph cuts. In: Proc. ACM SIGGRAPH, pp. 309–314. ACM Press, New York (2004)
5. Kolmogorov, V., Criminisi, A., Blake, A., Cross, G., Rother, C.: Probabilistic fusion of stereo with color and contrast for bi-layer segmentation. PAMI 28(9), 1480–1492 (2006)
6. Boykov, Y., Jolly, M.P.: Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images. In: Proc. ICCV, pp. 105–112 (2001)
7. Hadjidemetriou, E., Grossberg, M., Nayar, S.K.: Multiresolution histograms and their use for recognition. PAMI 26(7), 831–847 (2004)
8. Turek, W., Freedman, D.: Multiscale modeling and constraints for max-flow/min-cut problems in computer vision. In: Proc. CVPR Workshop, vol. 180 (2006)
9. Förstner, W., Moonen, B.: A metric for covariance matrices. Technical report, Dept. of Geodesy and Geoinformatics, Stuttgart University (1999)
10. Ren, X., Malik, J.: Learning a classification model for segmentation. In: Proc. ICCV (2003)
11. Felzenszwalb, P., Huttenlocher, D.: Efficient graph-based image segmentation. IJCV 59(2), 167–181 (2004)
12. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. PAMI 26(9), 1124–1137 (2004)
13. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV 60(2), 91–110 (2004)
14. Lindeberg, T.: Scale-Space Theory in Computer Vision. Kluwer Academic Publishers, Dordrecht (1994)
15. Deng, H., Zhang, W., Dietterich, T., Shapiro, L.: Principal curvature-based region detector for object recognition. In: Proc. CVPR (2007)
16. Pennec, X., Fillard, P., Ayache, N.: A Riemannian framework for tensor computing. International Journal of Computer Vision 66(1), 41–66 (2006)
Comparing Timoshenko Beam to Energy Beam for Fitting Noisy Data
Ilić Slobodan
Deutsche Telekom Laboratories, Berlin University of Technology, Ernst-Reuter Platz 7, 14199 Berlin, Germany
[email protected]
Abstract. In this paper we develop a highly flexible Timoshenko beam model for tracking large deformations in noisy data. We demonstrate that by neglecting some physical properties of the Timoshenko beam, a classical energy beam can be derived. A comparison of these two models in terms of their robustness and precision against noisy data is given. We demonstrate that the Timoshenko beam model is more robust and precise for tracking large deformations in the presence of clutter and partial occlusions. Experiments using both synthetic and real image data are performed. In synthetic images we fit both models to noisy data and use Monte Carlo simulation to analyze their performance. In real images we track deformations of a pole vault, rat whiskers, and a car antenna.
1 Introduction
In this paper, we develop a true physical 2D Timoshenko beam model and use it for tracking large deformations in noisy image data. The Timoshenko beam relies on shear deformation to account for non-linearities. We derive from it a physically based energy beam by neglecting shear deformation. The models which closely approximate real physics we call true physical models (in this case the Timoshenko beam), while the models which are designed to retain only some physical properties we call physically based models (in this case the energy beam). Physically based models, introduced almost twenty years ago [1,2,3,4], have demonstrated their effectiveness on Computer Vision problems. However, they typically rely on simplifying assumptions to yield easy-to-minimize energy functions and ignore the complex non-linearities that are inherent to the large deformations present in highly flexible structures. To justify the use of complex true physical models over simplified physically based models, we compare the Timoshenko beam to the energy beam. Both models were fitted to noisy synthetic data and real images in the presence of clutter and partial occlusions. We demonstrate that using the fully non-linear Timoshenko beam model, which approximates the physical behavior more closely, yields more robust and precise fitting to noisy data and tracking of large deformations in the presence of clutter and partial occlusions.
The rat whiskers images shown in this paper were obtained at Harvard University’s School of Engineering and Applied Sciences by Robert A. Jenks.
The fitting algorithm used for both beam representations is guided by the image forces. Since the image forces are proportional to the square distance of the model to the image data, they are usually not sufficient to deform the Timoshenko beam model of known material properties so that it fits the data immediately; the image forces only move the model toward the image observations. To fit the model to the data we repeat the Gauss-Newton minimization in several steps. We stop when the distance of the model to the data between two consecutive Gauss-Newton minimizations is smaller than some given precision. We use a Levenberg-Marquardt optimizer to fit the quadratic energy function of the energy beam. In this case the image forces are sufficient to deform the beam in a single minimization because the model energy, being only an approximation of the real model strain energy, does not impose realistic physical restrictions on the beam deformations. In the remainder of the paper we give a brief overview of the known physically based techniques, then introduce the non-linear planar Timoshenko beam model, derive the energy beam from it, describe our fitting algorithm, and finally present the results.
2 Related Work
Recovering model deformations from images requires deformable models to constrain the search space and make the problem tractable. In the last decade deformable models have been exploited in Computer Graphics [5], Medical Imaging [6] and Computer Vision [7]. There are several important categories of physically based models: mass-spring models [8], finite element models (FEM) [9,10], snake-like models [1,3,4] and models obtained from FEM by modal analysis to reduce the number of dofs [11,2,12,13]. In this paper we are particularly interested in physical models, especially those based on FEM. FEM are known to be precise and to produce physically valid deformations. However, because of their mathematical complexity, FEM were mainly developed for small linear deformations [14] where the model stiffness matrix is constant. In the case of large deformations the stiffness matrix and the applied forces become functions of the displacement. Such non-linear FEM were used by [15] to recover material parameters from images of highly elastic shell-like models. The model deformation was measured from a 3D model scan and then given to Finite Element Analysis (FEA) software. By contrast, we develop non-linear beam equations and recover the model deformations automatically through optimization. In computer vision, physically based models based on a continuous energy function have been used extensively. The original ones [1] were 2D and have been shown to be effective for 2D deformable surface registration [16]. They were soon adapted for 3D surface modeling purposes by using deformable superquadrics [3,4], triangulated surfaces [2], or thin-plate splines [6]. In this framework, modeling generic structures often requires many degrees of freedom that are coupled by regularization terms. In practice, this coupling implicitly reduces the number of degrees of freedom, which makes these models robust to noise and is one of the reasons for their immense popularity. In this paper we compare the complex true physical 2D Timoshenko beam model to the 2D elastic energy beam in terms of their robustness against noisy data. We reveal, in spite of their complexity, the real benefits of true physical models.
3 Plane Timoshenko Beam Model
The beam represents the most common structural component in civil and mechanical engineering. A beam is a rod-like structural element (one dimension is considerably larger than the other two) that resists transversal loading applied between its supports. The Timoshenko beam we develop assumes geometrically large deformations under small strains with linearly elastic materials. Timoshenko beam theory [17] accounts for nonlinear effects, such as shear, by assuming that the cross-section does not remain normal to the deformed longitudinal axis. The beam is divided into a number of finite elements. The beam deformation is defined by 3 dofs per node: the axial displacement u_X(X), the transverse displacement u_Y(X), and the cross-section rotation θ(X), where X is the longitudinal coordinate in the reference configuration, as shown in Fig. 1. The undeformed initial configuration is referred to as the reference configuration and the deformed one as the current configuration. The parameters describing the beam geometry and the material properties are the cross-sectional area A_0, the element length in the reference configuration L_0, the element length in the deformed configuration L, the second moment of inertia I_0, the Young modulus of elasticity E, and the shear modulus G. The material remains linearly elastic. The beam rotation is defined by the angle ψ, also equal to the rotation of the cross-section. The angle γ̄ is the shear angle by which the cross section deviates from its normal position defined by the angle ψ. The total rotation of a beam cross section becomes θ = ψ − γ̄, which is exactly one of the dofs defined above. To describe the beam kinematics we consider the motion of the point P_0(X, Y) in the reference configuration to the point P(x, y) in the current configuration. We keep the assumption that the cross section dimensions do not change and that the shear rotation
Fig. 1. (a) Plane Timoshenko beam kinematics notation. (b) Synthetic example of fitting the plane beam, initially aligned along the x-axis, to the synthetic image data shown as magenta dots. The intermediate steps shown in blue are the output of a number of repeated Gauss-Newton optimizations driven by the image forces.
is small, γ̄ ≪ 1, so that cos γ̄ ≈ 1. The Lagrangian representation of the motion relating the points P_0(X, Y) and P(x, y) is then given by x = X + u_X − Y sin θ, y = u_Y + Y cos θ. The displacement of any point on the beam element can then be represented by a vector w = [u_X(X), u_Y(X), θ(X)]^T. In the FEM formulation for a 2-node C^0 element it is natural to express the displacement and rotation functions of w as linear combinations of the node displacements,
u_X = \sum_{i=1}^{2} N_i u_{Xi}, \quad u_Y = \sum_{i=1}^{2} N_i u_{Yi}, \quad \theta = \sum_{i=1}^{2} N_i \theta_i,
or in matrix form w = N u, where N_1 = \frac{1}{2}(1 - \xi) and N_2 = \frac{1}{2}(1 + \xi) are the linear element shape functions. The strain is a measure of the change of the object shape, in this case the length, before and after the deformation caused by some applied load. The stress is the internal distribution of force per unit area that balances and reacts to the external loads applied to a body. We have three different strain components per beam element: e, the axial strain measuring the beam's relative extension; γ, the shear strain measuring the relative angular change between any two lines in a body before and after the deformation; and κ, measuring the curvature change. They can be computed from the deformation gradient of the motion
F = \begin{pmatrix} \partial x/\partial X & \partial x/\partial Y \\ \partial y/\partial X & \partial y/\partial Y \end{pmatrix}.
The Green-Lagrange (GL) strain tensor describing the model strain becomes e = \frac{1}{2}(F^T F - I). After the derivation, the only nonzero elements are the axial strain e_{XX} and the shear strain 2e_{XY} = e_{XY} + e_{YX}. Under the small strain assumption we can finally express the strain vector as
e = \begin{pmatrix} e_1 \\ e_2 \end{pmatrix} = \begin{pmatrix} e_{XX} \\ 2e_{XY} \end{pmatrix} = \begin{pmatrix} e - Y\kappa \\ \gamma \end{pmatrix},    (1)
where the three strain quantities introduced above are e, the axial strain; γ, the shear strain; and κ, the curvature. These can be collected in the generalized strain vector h^T = [e, γ, κ]. Because of the assumed linear variations in X of u_X(X), u_Y(X) and θ(X), e and γ depend on θ, and κ is constant over the element, depending only on the rotation angles at the element end nodes. e and γ can be expressed in a geometrically invariant form:
e = \frac{L}{L_0}\cos\bar\gamma - 1 = \frac{L}{L_0}\cos(\theta - \psi) - 1, \quad \gamma = \frac{L}{L_0}\sin\bar\gamma = \frac{L}{L_0}\sin(\theta - \psi).    (2)
These geometrically invariant strain quantities can be used for the beam in an arbitrary reference configuration. The variations δe, δγ and δκ with respect to the nodal displacement variations are required for the derivation of the strain-displacement relation δh = B δu. To form the strain-displacement matrix B we take the partial derivatives of e, γ and κ with respect to the node displacements and collect them into the matrix
B = \frac{1}{L_0}\begin{pmatrix} -c_\omega & -s_\omega & L_0 N_1 \gamma & c_\omega & s_\omega & L_0 N_2 \gamma \\ s_\omega & -c_\omega & -L_0 N_1 (1+e) & -s_\omega & c_\omega & -L_0 N_2 (1+e) \\ 0 & 0 & -1 & 0 & 0 & 1 \end{pmatrix},    (3)
where ω = θ + ψ, c_ω = cos ω and s_ω = sin ω. We introduce the pre-stress resultants N^0, V^0 and M^0, which define the axial forces, transverse shear forces and bending moments, respectively, in the reference configuration. We also define the stress resultants in the current configuration using the linear elastic equations as N = N^0 + EA_0 e, V = V^0 + GA_0 γ and M = M^0 + EI_0 κ, and collect them into the stress-resultant vector z = [N, V, M]^T.
The internal model strain energy along the beam, under zero pre-stress resultants N^0 = V^0 = M^0 = 0, can be expressed as the length integral
U = \frac{1}{2}\int_{L_0} z^T h \, dX = \frac{1}{2}\int_{L_0} EA_0 e^2 \, dX + \frac{1}{2}\int_{L_0} GA_0 \gamma^2 \, dX + \frac{1}{2}\int_{L_0} EI_0 \kappa^2 \, dX,    (4)
where L_0 is the beam length in the reference configuration. The internal force vector can be obtained by taking the first variation of the strain energy with respect to the nodal displacements:
p = \frac{\partial U}{\partial u} = \int_{L_0} B^T(u)\, z \, dX.    (5)
We evaluate this expression by reduced Gauss integration in order to eliminate shear locking, which overstiffens the model deformation and makes the shear energy dominate. In addition, we use the residual bending flexibility (RBF) correction and replace GA_0 in the shear energy of Eq. 4 by 12EI_0/L_0^2. Finally, the first variation of the internal force defines the tangent stiffness matrix:
K_T = \frac{\partial p}{\partial u} = \int_{L_0} \Big( B^T \frac{\partial z}{\partial u} + \frac{\partial B^T}{\partial u} z \Big) dX = K_M + K_G,    (6)
where K_M is the material stiffness and K_G is the geometric stiffness. The material stiffness is constant and identical to the linear stiffness matrix of the C^1 Euler-Bernoulli beam. The geometric stiffness comes from the variation of B while the stress resultants are kept fixed, and carries the beam nonlinearity responsible for large geometric deformations. A sketch of the element-level computation is given below.
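The following sketch, assuming NumPy, illustrates how the element strains of Eq. (2), the stress resultants, the B matrix of Eq. (3) and the internal force of Eq. (5) might be evaluated with one-point (reduced) Gauss integration. The choice of ψ as the chord rotation and the default material values are our assumptions, not the paper's implementation.

```python
import numpy as np

def element_internal_force(X1, X2, u, E=1e5, A0=1.0, I0=1.0):
    # X1, X2: node coordinates (2,) in the reference configuration.
    # u: nodal displacements [uX1, uY1, th1, uX2, uY2, th2].
    uX1, uY1, th1, uX2, uY2, th2 = u
    L0 = np.linalg.norm(X2 - X1)
    x1 = X1 + np.array([uX1, uY1])
    x2 = X2 + np.array([uX2, uY2])
    L = np.linalg.norm(x2 - x1)
    # Assumption: psi is the rotation of the element chord from reference to current.
    psi = np.arctan2(x2[1] - x1[1], x2[0] - x1[0]) - np.arctan2(X2[1] - X1[1], X2[0] - X1[0])
    theta = 0.5 * (th1 + th2)                   # cross-section rotation at xi = 0
    e = L / L0 * np.cos(theta - psi) - 1.0      # Eq. (2)
    gam = L / L0 * np.sin(theta - psi)
    kap = (th2 - th1) / L0
    GA0 = 12.0 * E * I0 / L0 ** 2               # residual bending flexibility correction
    z = np.array([E * A0 * e, GA0 * gam, E * I0 * kap])   # stress resultants [N, V, M]
    om = theta + psi
    c, s = np.cos(om), np.sin(om)
    N1 = N2 = 0.5                               # shape functions at the Gauss point xi = 0
    B = np.array([[-c, -s,  L0 * N1 * gam,       c,  s,  L0 * N2 * gam],
                  [ s, -c, -L0 * N1 * (1 + e),  -s,  c, -L0 * N2 * (1 + e)],
                  [ 0,  0, -1,                   0,  0,  1]]) / L0
    return L0 * (B.T @ z)                       # Eq. (5) with one Gauss point of weight L0
```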
4 Energy Beam Model
The model energy of the energy beam can be derived directly from the Timoshenko beam strain energy of Eq. 4. Let us neglect the shear deformation by setting the shear angle γ̄ to zero. The strain quantities of Eq. 2 become e = (L/L_0) cos γ̄ − 1 ≈ L/L_0 − 1 and γ = (L/L_0) sin γ̄ ≈ 0. In this way the shear energy is eliminated and only the axial energy and the bending energy are left. Also, since shear is eliminated, the rotational dof θ(X) disappears, and only the displacements in the X and Y directions are taken to form the new displacement vector w = [u_X, u_Y]. Since we deal with a discrete beam, its energy can be expressed as
U = \frac{1}{2} w_s \sum_{(i,j)\in 1..n} \Big( \frac{\|v_i - v_j\|}{L_0} - 1 \Big)^2 + \frac{1}{2} w_b \sum_{(i,j,k)\in 1..n} \| 2 v_j - v_i - v_k \|^2,    (7)
where i, j are pairs of element nodes, and i, j, k are triplets of element nodes necessary to define the curvature at the j-th beam node. The derived energy can be considered physically based since it comes directly from the realistic physical beam model. The weight coefficients w_s and w_b can be considered proportional to the Young modulus of elasticity E. However, we will show that, in practice, they significantly change the behavior of the fitting algorithm. A sketch of this discrete energy is given below.
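A minimal sketch of the discrete energy of Eq. (7), assuming NumPy; the node positions v_i form the beam polyline, and the weights are illustrative values taken from the range explored in Fig. 2(a).

```python
import numpy as np

def energy_beam_energy(v, L0, ws=1e4, wb=1e2):
    # v: (n, 2) array of node positions; L0: rest length of one segment.
    stretch = np.linalg.norm(np.diff(v, axis=0), axis=1) / L0 - 1.0   # (||v_i - v_j|| / L0 - 1)
    bend = 2.0 * v[1:-1] - v[:-2] - v[2:]                              # 2 v_j - v_i - v_k
    return 0.5 * ws * np.sum(stretch ** 2) + 0.5 * wb * np.sum(np.sum(bend ** 2, axis=1))
```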
5 Model Fitting
The general approach in mechanical simulations is that some external load f is applied to the model and the displacement u is computed. This can be done through energy minimization, where the total potential energy Π of the system is computed as the sum of the system's internal strain energy U and the energy caused by the external load, P. The minimum of the energy with respect to the displacement u can be found by differentiation:
\frac{\partial \Pi}{\partial u} = \frac{\partial U}{\partial u} + \frac{\partial P}{\partial u} \;\Rightarrow\; r(u) = p(u) + f = 0.    (8)
Here r(u) is the force residual, p(u) is the internal force of Eq. 5 and f is the external load. This is a nonlinear system of equations and is usually solved using the Newton-Raphson method. It is an incremental method; at each iteration we solve for the displacement update du by solving the linear system K_T du = −(f + p(u)). In classical mechanical simulation the external forces f are given a priori. In our case we do not know them. To compute the model displacements we solve Eq. 8 using image forces. We create a vector of image observations F(u) = [d_1(u), d_2(u), ..., d_N(u)]^T, where d_i(u) are the distances of the image observations from the beam segments. We use the edges in the image, obtained using the Canny edge detector, as our observations. In practice we sample every beam segment and then search for multiple observations in the direction of the beam normal. The external image energy becomes P_I = ½ F^T(u) F(u). The image forces are obtained as the derivative of this energy with respect to the displacement, f_I = ∇F^T(u) F(u). The force residual of Eq. 8 becomes r(u) = p(u) + f_I(u) = 0. We derive the displacement increment by developing the residual in a Taylor series around the current displacement u as follows:
(9)
we obtain Gauss-Newton optimization step. We neglect the second order term ∇2 FT F of Eq. 9. To make it more robust we use Tukey robust estimator ρ of [18]. This is simply done by weighting the residuals di (u) of the image observation vector F(u) at each Newton iteration: Each di (u) is replaced by di (u) = wi di (u) such that: (di )2 (u) = (wi )2 d2i (u), therefore the weight is chosen to be: wi = ρ(di (u))1/2 /di (u). We then create a weighting matrix W = diag(. . . , wi , . . .). We then solve in each step: (KT + ∇FT W∇F)du = −(∇FT WF + p(u))
(10)
By solving the Eq. 9 we compute the displacement caused by the image forces. Since the image forces are proportional to the square distance of the model to the image edge points, they are not sufficient to deform the model so that it fits the data. They only move the model toward the image observations. To obtain the exact fit of the model to the data we repeatedly fit the model to the data performing Gauss-Newton method in several steps. We stop when the distance of the model to the data of two consecutive Gauss-Newton minimizations becomes smaller then some given precision. The
optimization algorithm is illustrated on the synthetic example of Fig. 1(b). We obtain the total displacement u_T as the sum of all intermediate displacements. For fitting the energy beam we slightly modify Eq. 10 by adding λI to its left side, so that we obtain a Levenberg-Marquardt optimizer. In practice this turned out to be more suitable for optimizing the energy function of Eq. 7. The computational complexity of our method is quadratic and corresponds to the complexity of the Gauss-Newton minimization. A sketch of the outer fitting loop is given below.
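The overall fitting procedure might be sketched as follows, assuming NumPy. The callbacks residuals, jacobian, internal_force and tangent_stiffness stand in for the beam model and the Canny-edge distance search described above, the Tukey tuning constant c is a standard value not given in the paper, and damping > 0 corresponds to the λI modification used for the energy beam.

```python
import numpy as np

def tukey_weight(d, c=4.685):
    # w_i = rho(d_i)^(1/2) / d_i for the Tukey biweight rho.
    d = np.atleast_1d(np.abs(d))
    rho = np.where(d <= c, (c ** 2 / 6.0) * (1 - (1 - (d / c) ** 2) ** 3), c ** 2 / 6.0)
    return np.where(d > 1e-12, np.sqrt(rho) / d, 1.0)

def fit_beam(u0, residuals, jacobian, internal_force, tangent_stiffness,
             n_iter=50, tol=1e-4, damping=0.0):
    # Repeated robust Gauss-Newton / Levenberg-Marquardt steps of Eq. (10).
    u = u0.copy()
    prev = np.inf
    for _ in range(n_iter):
        F, J = residuals(u), jacobian(u)
        W = np.diag(tukey_weight(F))
        A = tangent_stiffness(u) + J.T @ W @ J + damping * np.eye(len(u))
        b = -(J.T @ W @ F + internal_force(u))
        u = u + np.linalg.solve(A, b)
        dist = float(np.mean(np.abs(residuals(u))))
        if abs(prev - dist) < tol:      # stop when the data distance no longer changes
            break
        prev = dist
    return u
```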
6 Results
We fit the Timoshenko beam and the energy beam to synthetic data and compare their performance in the presence of different amounts of added noise. We then run our experiments on real images in three different cases: the deformation of a car antenna, a pole vault, and the deformation of rat whiskers.
6.1 Fitting Synthetic Data
We generate synthetic data clouds around two given ground truth positions of the deformed beams depicted in Fig. 1(b) by adding a certain amount of Gaussian noise around them. The amount of noise is controlled by the standard deviation σ ∈ {0.01, 0.1, 0.5, 1.0, 2.0}. We perform a Monte Carlo simulation such that for each value of σ we fit, in a number of trials, the Timoshenko and the energy beam to randomly generated data clouds. The number of trials is 100 for every value of σ. We measure the mean square error of the fitting result with respect to the ground truth position of the beam.
Fig. 2. Mean square error from the ground truth, measured over a number of fittings using Monte Carlo simulation, with respect to different amounts of noise. (a) Energy beam fitting errors for different values of the energy weights. (b) Timoshenko beam fitting errors for different values of the Young modulus of elasticity. (c) Comparison of fitting errors for the energy beam (red) and the Timoshenko beam (blue).
Fig. 3. Failure examples using the energy beam with different energy weighting coefficients. (a) For w_b = 10^4, w_s = 10^2 the beam stays smooth but changes its length, producing a failure in the 5th frame. (b) For w_b = 10^2, w_s = 10^4 it tries to retain its length but not the smoothness, producing a failure in the 3rd frame. (c,d) The rat whisker fails in the 10th frame because of the occluded ear; the energy coefficients are w_b = 10^3, w_s = 10 and w_s = 10^2, w_b = 10, respectively.
Fig. 4. Tracking the car antenna using the Timoshenko beam. Selected frames from the tracking sequence with the recovered model shown in white.
Fig. 5. Tracking the pole in a pole vault using Timoshenko beam. Because of the moving camera the image frames are warped to one reference frame using robustly estimated homography. Selected frames from the tracking sequence with the recovered model shown in yellow.
Initially we perform fittings for different values of the energy weights w_s and w_b for the energy beam and different values of the Young modulus E for the Timoshenko beam, as shown in Fig. 2(a,b) respectively. The error differs for the different values of the energy weights. We take those values for which the error is minimal and refit the beams to the noisy data with different values of σ. Usually a good balance between the stretching and bending energies is required for reasonable fitting performance of the energy beam. For the Timoshenko
Fig. 6. Tracking the deformation of the rat whisker using Timoshenko beam. Selected frames from the tracking sequence with the recovered model are shown in white.
beam, small values of the Young modulus ranging from 10^2 to 10^3 are unrealistic for materials with small strains, to which the Timoshenko theory applies. This means that small values of the Young modulus are suitable for elastic materials with large strains, while large values are suitable for elastic materials with small strains, i.e., ones that tend to retain their length but can undergo large rotations. For that reason we obtained the best fitting performance for values of E of 10^5 and 10^6. The errors with respect to the ground truth for both beams in the two synthetic examples are shown in Fig. 2(c). The Timoshenko beam retains the same error measure as the amount of noise increases, while the error of the energy beam increases with the amount of added noise. This indicates that the Timoshenko beam is more robust when fitted to noisy data. The same is shown below during tracking in real images. A sketch of the Monte Carlo protocol used for Fig. 2 is given at the end of this section.
6.2 Real Images
In real images we chose to track highly flexible structures: a car antenna, a pole vault, and the deformation of rat whiskers. The car antenna example of Fig. 4 has a simple background, and both the Timoshenko and the energy beam track it with no problems, as can be seen in the supplementary videos. The more complex pole vault and rat whiskers were successfully tracked using the Timoshenko beam, while tracking failed when the energy beam was used. The failure examples are depicted in Fig. 3, and selected frames from successful tracking using the Timoshenko beam are depicted in Fig. 5 and Fig. 6. In all examples the initialization was done manually in the first frame, followed by frame-to-frame fitting. The energy beam has a tendency to attach to strong edges regardless of the combination of the energy weights, as depicted in Fig. 3, while the Timoshenko beam overcomes this problem because of the naturally imposed physical constraints implicitly contained in the model description.
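For reference, the Monte Carlo protocol behind Fig. 2 can be reconstructed roughly as follows (our sketch, assuming NumPy; the error definition as a per-node mean squared distance is an assumption).

```python
import numpy as np

def monte_carlo_mse(ground_truth, fit_fn, sigmas=(0.01, 0.1, 0.5, 1.0, 2.0),
                    trials=100, seed=0):
    # ground_truth: (n, 2) node positions of a deformed beam.
    # fit_fn(data) -> (n, 2) fitted node positions, e.g. a wrapper around fit_beam above.
    rng = np.random.default_rng(seed)
    results = {}
    for sigma in sigmas:
        errors = []
        for _ in range(trials):
            data = ground_truth + rng.normal(0.0, sigma, ground_truth.shape)
            fitted = fit_fn(data)
            errors.append(np.mean(np.sum((fitted - ground_truth) ** 2, axis=1)))
        results[sigma] = float(np.mean(errors))
    return results
```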
7 Conclusion
In this paper we investigated the true physical Timoshenko beam model for tracking large non-linear deformations in images. We compared it to the physically based energy beam
approach, which uses simplifying physical assumptions to create the model energy, similarly to most physically based models used in computer vision. These approaches ignore the complex non-linearities that are inherent to large deformations. We found that using the Timoshenko beam, which approximates the physical behavior more closely, contributed to robust fitting to noisy data and efficient tracking of large deformations in the presence of clutter and partial occlusions.
References
1. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. International Journal of Computer Vision 1(4), 321–331 (1988)
2. Cohen, L., Cohen, I.: Deformable models for 3-d medical images using finite elements and balloons. In: Conference on Computer Vision and Pattern Recognition, pp. 592–598 (1992)
3. Terzopoulos, D., Metaxas, D.: Dynamic 3D models with local and global deformations: Deformable superquadrics. IEEE Transactions on Pattern Analysis and Machine Intelligence 13, 703–714 (1991)
4. Metaxas, D., Terzopoulos, D.: Constrained deformable superquadrics and nonrigid motion tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(6), 580–591 (1993)
5. Gibson, S., Mirtich, B.: A survey of deformable modeling in computer graphics. Technical report, Mitsubishi Electric Research Lab, Cambridge, MA (1997)
6. McInerney, T., Terzopoulos, D.: Deformable models in medical image analysis: a survey. Medical Image Analysis 1(2), 91–108 (1996)
7. Metaxas, D.: Physics-Based Deformable Models: Applications to Computer Vision, Graphics, and Medical Imaging. Kluwer Academic Publishers, Dordrecht (1996)
8. Lee, Y., Terzopoulos, D., Walters, K.: Realistic modeling for facial animation. In: SIGGRAPH 1995. Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques, pp. 55–62. ACM Press, New York (1995)
9. Essa, I., Sclaroff, S., Pentland, A.: Physically-based modeling for graphics and vision. In: Martin, R. (ed.) Directions in Geometric Computing. Information Geometers, U.K. (1993)
10. Sclaroff, S., Pentland, A.P.: Physically-based combinations of views: Representing rigid and nonrigid motion. Technical Report 1994-016 (1994)
11. Pentland, A.: Automatic extraction of deformable part models. International Journal of Computer Vision 4(2), 107–126 (1990)
12. Delingette, H., Hebert, M., Ikeuchi, K.: Deformable surfaces: A free-form shape representation. SPIE Geometric Methods in Computer Vision 1570, 21–30 (1991)
13. Nastar, C., Ayache, N.: Frequency-based nonrigid motion analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 18(11) (1996)
14. O'Brien, J.F., Cook, P.R., Essl, G.: Synthesizing sounds from physically based motion. In: Fiume, E. (ed.) SIGGRAPH 2001. Computer Graphics Proceedings, pp. 529–536 (2001)
15. Tsap, L., Goldgof, D., Sarkar, S.: Fusion of physically-based registration and deformation modeling for nonrigid motion analysis (2001)
16. Bartoli, A., Zisserman, A.: Direct estimation of non-rigid registration. In: British Machine Vision Conference, Kingston, UK (2004)
17. Timoshenko, S., MacCullough, G.: Elements of Strength of Materials, 3rd edn. Van Nostrand, New York (1949)
18. Lepetit, V., Fua, P.: Monocular model-based 3d tracking of rigid objects: A survey. Foundations and Trends in Computer Graphics and Vision 1(1), 1–89 (2005)
A Family of Quadratic Snakes for Road Extraction
Ramesh Marikhu1, Matthew N. Dailey2, Stanislav Makhanov3, and Kiyoshi Honda4
1 Information and Communication Technologies, Asian Institute of Technology
2 Computer Science and Information Management, Asian Institute of Technology
3 Sirindhorn International Institute of Technology, Thammasat University
4 Remote Sensing and GIS, Asian Institute of Technology
Abstract. The geographic information system industry would benefit from flexible automated systems capable of extracting linear structures from satellite imagery. Quadratic snakes allow global interactions between points along a contour, and are well suited to segmentation of linear structures such as roads. However, a single quadratic snake is unable to extract disconnected road networks and enclosed regions. We propose to use a family of cooperating snakes, which are able to split, merge, and disappear as necessary. We also propose a preprocessing method based on oriented filtering, thresholding, Canny edge detection, and Gradient Vector Flow (GVF) energy. We evaluate the performance of the method in terms of precision and recall in comparison to ground truth data. The family of cooperating snakes consistently outperforms a single snake in a variety of road extraction tasks, and our method for obtaining the GVF is more suitable for road extraction tasks than standard methods.
1 Introduction
The geographic information system industry would benefit from flexible automated systems capable of extracting linear structures and regions of interest from satellite imagery. In particular, automated road extraction would boost the productivity of technicians enormously. This is because road networks are among the most important landmarks for mapping, and manual marking and extraction of road networks is an extremely slow and laborious process. Despite years of research and significant progress in the computer vision and image processing communities (see, for example, [1,2] and Fortier et al.'s survey [3]), the methods available thus far have still not attained the speed and accuracy necessary for practical application in GIS tools. Among the most promising techniques for extraction of complex objects like roads are active contours or snakes, originally introduced by Kass et al. [4]. Since the seminal work of Kass and colleagues, techniques based on active contours have been applied to many object extraction tasks [5] including road extraction [6]. Rochery et al. have recently proposed higher-order active contours, in particular quadratic snakes, which hold a great deal of promise for extraction of linear
structures like roads [7]. The idea is to use a quadratic formulation of the contour’s geometric energy to encourage anti-parallel tangents on opposite sides of a road and parallel tangents along the same side of a road. These priors increase the final contour’s robustness to partial occlusions and decrease the likelihood of false detections in regions not shaped like roads. In this paper, we propose two heuristic modifications to Rochery et al.’s quadratic snakes, to address limitations of a single quadratic snake and to accelerate convergence to a solution. First, we introduce the use of a family of quadratic snakes that are able to split, merge, and disappear as necessary. Second, we introduce an improved formulation of the image energy combining Rochery et al.’s oriented filtering technique [7] with thresholding, Canny edge detection, and Xu and Prince’s Gradient Vector Flow (GVF) [8]. The modified GVF field created using the proposed method is very effective at encouraging the quadratic snake to snap to the boundaries of linear structures. We demonstrate the effectiveness of the family of snakes and the modified GVF field in a series of experiments with real satellite images, and we provide precision and recall measurements in comparison with ground truth data. The results are an encouraging step towards the ultimate goal of robust, fully automated road extraction from satellite imagery. As a last contribution, we have developed a complete GUI environment for satellite image manipulation and quadratic snake evolution, based on the Matlab platform. The system is freely available as open source software [9].
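As an illustration of the GVF component of the proposed preprocessing, the following sketch (assuming NumPy/SciPy) implements the standard iterative Gradient Vector Flow update of Xu and Prince [8] on an edge map. The oriented filtering, thresholding, and Canny stages are not shown, and mu, n_iter, and dt are illustrative values rather than the parameters used in this paper.

```python
import numpy as np
from scipy import ndimage as ndi

def gradient_vector_flow(edge_map, mu=0.2, n_iter=200, dt=0.5):
    # edge_map: 2D float array, e.g. a Canny edge map after oriented filtering
    # and thresholding. Returns the (u, v) external force field.
    f = edge_map.astype(float)
    f = (f - f.min()) / (np.ptp(f) + 1e-12)
    fy, fx = np.gradient(f)
    mag2 = fx ** 2 + fy ** 2
    u, v = fx.copy(), fy.copy()
    for _ in range(n_iter):
        # u_t = mu * Laplacian(u) - (u - f_x) * |grad f|^2, and likewise for v.
        u += dt * (mu * ndi.laplace(u) - mag2 * (u - fx))
        v += dt * (mu * ndi.laplace(v) - mag2 * (v - fy))
    return u, v
```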
2 Experimental Methods

2.1 Quadratic Snake Model
Here we provide a brief overview of the quadratic snake proposed by Rochery et al. [7]. An active contour or snake is parametrically defined as

    γ(p) = [x(p)  y(p)]^T,    (1)

where p is the curvilinear abscissa of the contour and the vector [x(p)  y(p)]^T defines the Cartesian coordinates of the point γ(p). We assume the image domain Ω to be a bounded subset of R^2. The energy functional for Rochery et al.'s quadratic snake is given by

    Es(γ) = Eg(γ) + λ Ei(γ),    (2)

where Eg(γ) is the geometric energy and Ei(γ) is the image energy of the contour γ. λ is a free parameter determining the relative importance of the two terms. The geometric energy functional is defined as

    Eg(γ) = L(γ) + α A(γ) − (β/2) ∫∫ tγ(p) · tγ(p′) Ψ(γ(p) − γ(p′)) dp dp′,    (3)
A Family of Quadratic Snakes for Road Extraction
87
where L(γ) is the length of γ in the Euclidean metric over Ω, A(γ) is the area enclosed by γ, tγ(p) is the unit-length tangent to γ at point p, and Ψ(z), given the distance z between two points on the contour, is used to weight the interaction between those two points (see below). α and β are constants weighting the relative importance of each term. Clearly, for positive β, Eg(γ) is minimized by contours with short length and parallel tangents. If α is positive, contours with small enclosed area are favored; if it is negative, contours with large enclosed area are favored. The interaction function Ψ(z) is a smooth function expressing the radius of the region in which parallel tangents should be encouraged and anti-parallel tangents should be discouraged. Ψ(z) incorporates two constants: d, the expected road width, and ε, the expected variability in road width. During snake evolution, weighting by Ψ(z) in Equation 3 discourages two points with anti-parallel tangents (the opposite sides of a putative road) from coming closer than distance d from each other. The image energy functional Ei(γ) is defined as

    Ei(γ) = ∫ nγ(p) · ∇I(γ(p)) dp − ∫∫ tγ(p) · tγ(p′) [∇I(γ(p)) · ∇I(γ(p′))] Ψ(γ(p) − γ(p′)) dp dp′,    (4)
where I : Ω → [0, 255] is the image and ∇I(γ(p)) denotes the 2D gradient of I evaluated at γ(p). The first, linear term favors anti-parallel normal and gradient vectors, encouraging counterclockwise snakes to shrink around, or clockwise snakes to expand to enclose, dark regions surrounded by light roads.¹ The quadratic term favors nearby point pairs with two different configurations, one with parallel tangents and parallel gradients and the other with anti-parallel tangents and anti-parallel gradients. After solving the Euler-Lagrange equations for minimizing the energy functional Es(γ) (Equation 2), Rochery et al. obtain the update equation

    nγ(p) · ∂Es/∂γ (p) = −κγ(p) − α − λ ‖∇I(γ(p))‖² + G(γ(p))
        + β ∫ r(γ(p), γ(p′)) · nγ(p′) Ψ(γ(p) − γ(p′)) dp′
        + 2λ ∫ r(γ(p), γ(p′)) · nγ(p′) (∇I(γ(p)) · ∇I(γ(p′))) Ψ(γ(p) − γ(p′)) dp′
        + 2λ ∫ ∇I(γ(p′)) · (∇∇I(γ(p)) × nγ(p′)) Ψ(γ(p) − γ(p′)) dp′,    (5)

¹ For dark roads on a light background, we negate all the terms involving the image, including G(γ(p)) in Equation 5. In the rest of the paper, we assume light roads on a dark background.
where κγ(p) is the curvature of γ at γ(p) and G(γ(p)) is the "specific energy" evaluated at point γ(p) (Section 2.2). r(γ(p), γ(p′)) = (γ(p′) − γ(p)) / ‖γ(p′) − γ(p)‖ is the unit vector pointing from γ(p) towards γ(p′). ∇∇I(γ(p)) is the Hessian of I evaluated at γ(p). α, β, and λ are free parameters that need to be determined experimentally. d and ε are specified a priori according to the desired road width. Following Rochery et al., we normally initialize our quadratic snakes with a rounded rectangle covering the entire image.

2.2 Oriented Filtering
We use Rochery's oriented filtering method [10] to enhance linear edges in our satellite imagery. The input image is first convolved with oriented derivative-of-Gaussian filters at various orientations. Then the minimum (most negative) filter response over the orientations is run through a ramp function equal to 1 for low filter values and −1 for high filter values. The thresholds are user-specified. An example is shown in Fig. 1(b).
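For concreteness, the following Python sketch shows one plausible implementation of the oriented filtering and ramp thresholding described above. The filter shape (sig_along, sig_across), the number of orientations, and the ramp thresholds t_low and t_high are illustrative assumptions, not values taken from the paper, and Rochery's actual filter bank may differ in its details.

    import numpy as np
    from scipy.ndimage import gaussian_filter, rotate

    def oriented_specific_energy(I, sig_along=6.0, sig_across=1.5,
                                 n_orient=8, t_low=-10.0, t_high=-1.0):
        # Sketch of the "specific energy" G(x): oriented derivative-of-Gaussian
        # filtering followed by the ramp described in Section 2.2.  All
        # parameter values here are illustrative assumptions.
        I = np.asarray(I, dtype=float)
        responses = []
        for ang in np.linspace(0.0, 180.0, n_orient, endpoint=False):
            Ir = rotate(I, ang, reshape=False, order=1, mode='nearest')
            # Smooth along the (rotated) road direction, differentiate across it.
            Dr = gaussian_filter(Ir, sigma=(sig_along, sig_across), order=(0, 1))
            responses.append(rotate(Dr, -ang, reshape=False, order=1, mode='nearest'))
        R = np.min(np.stack(responses), axis=0)      # most negative filter response
        # Ramp: +1 for strongly negative responses, -1 for weak or positive ones.
        return np.interp(R, [t_low, t_high], [1.0, -1.0])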
2.3 GVF Energy
Rather than using the oriented filtering specific image energy G(x) from Section 2.2 for snake evolution directly, we propose to combine the oriented filtering approach with Xu and Prince's Gradient Vector Flow (GVF) method [8]. The GVF is a vector field V^GVF(x) = [u(x)  v(x)]^T minimizing the energy functional

    E(V^GVF) = ∫_Ω [ μ (ux(x)² + uy(x)² + vx(x)² + vy(x)²) + ‖∇Ĩ(x)‖² ‖V(x) − ∇Ĩ(x)‖² ] dx,    (6)

where ux = ∂u/∂x, uy = ∂u/∂y, vx = ∂v/∂x, vy = ∂v/∂y, and Ĩ is a preprocessed version of image I, typically an edge image of some kind. The first term inside the integral encourages a smooth vector field, whereas the second term encourages fidelity to ∇Ĩ. μ is a free parameter controlling the relative importance of the two terms. Xu and Prince [8] experimented with several different methods for obtaining ∇Ĩ. We propose to perform Canny edge detection on G (the result of oriented filtering and thresholding, introduced in Section 2.2) to obtain a binary image Ĩ for GVF, then to use the resulting GVF V^GVF as an additional image energy for quadratic snake evolution. The binary Canny image is ideal because it only includes information about road-like edges that have survived sharpening by oriented filters. The GVF field is ideal because during quadratic snake evolution, it points toward road-like edges, pushing the snake in the right direction from a long distance away. This speeds evolution and makes it easier to find suitable parameters to obtain fast convergence. Fig. 1 compares our method to alternative GVF formulations based on oriented filtering or Canny edge detection alone.
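As a hedged illustration, the sketch below computes a GVF field by gradient descent on Equation (6), following Xu and Prince's iterative scheme; the values of mu, the step size, and the iteration count are illustrative assumptions rather than settings reported in the paper.

    import numpy as np
    from scipy.ndimage import laplace, sobel

    def gradient_vector_flow(edge_map, mu=0.2, iters=200, dt=0.2):
        # Minimal GVF sketch: gradient descent on Equation (6).  mu, dt and the
        # number of iterations are illustrative assumptions.
        f = np.asarray(edge_map, dtype=float)
        fx = sobel(f, axis=1) / 8.0          # approximate df/dx
        fy = sobel(f, axis=0) / 8.0          # approximate df/dy
        mag2 = fx ** 2 + fy ** 2
        u, v = fx.copy(), fy.copy()
        for _ in range(iters):
            u += dt * (mu * laplace(u) - mag2 * (u - fx))
            v += dt * (mu * laplace(v) - mag2 * (v - fy))
        return u, v

    # Typical use in the proposed pipeline (Sections 2.2-2.3), where `canny`
    # could be, e.g., skimage.feature.canny applied to the thresholded G:
    #   G = oriented_specific_energy(I)
    #   edges = canny(G > 0)
    #   u, v = gradient_vector_flow(edges.astype(float))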
Fig. 1. Comparison of GVF methods. (a) Input image. (b) G(x) obtained from oriented filtering on I(x). (c) Image obtained from G(x) using threshold 0. (d) Canny edge detection on (c), used as Ĩ for GVF. (e-f) Zoomed views of GVFs in the region delineated in (d). (e) Result of using the magnitude of the gradient ∇(Gσ ∗ I) to obtain Ĩ. (f) Result of using Canny edge detection alone to obtain Ĩ. (g) GVF energy obtained using our proposed edge image. This field pushes most consistently toward the true road boundaries.
2.4 Family of Quadratic Snakes
A single quadratic snake is unable to extract enclosed regions and multiple disconnected networks in an image. We address this limitation by introducing a family of cooperating snakes that are able to split, merge, and disappear as necessary. In our formulation, due to the curvature term κγ(p) and the area constant α in Equation 5, specifying the points on γ in a counterclockwise direction creates a shrinking snake and specifying the points on γ in a clockwise direction creates a growing snake. An enclosed region (a loop or a grid cell) can be extracted effectively by initializing two snakes, one shrinking snake covering the whole road network and another growing snake inside the enclosed region. Our method is heuristic and depends on somewhat intelligent user initialization, but it is much simpler than level set methods for the same problem [7], and, assuming a constant number of splits and merges per iteration, it does not increase the asymptotic complexity of the quadratic snake's evolution. Splitting a Snake. We split a snake into two snakes whenever two of its arms are squeezed too close together, i.e., when the distance between two snake points is less than d_split and those two points are at least k snake points from each other in both directions of traversal around the contour. d_split should be less than 2η, where η is the maximum step size.
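A minimal sketch of the splitting test, assuming the snake is stored as an ordered array of 2D points on a closed contour; the helper names are ours, not the paper's.

    import numpy as np

    def find_split_pair(points, d_split, k):
        # Return indices (i, j) of two snake points closer than d_split that
        # are at least k points apart along the contour in both directions of
        # traversal, or None if no such pair exists.
        n = len(points)
        for i in range(n):
            for j in range(i + 1, n):
                sep = min(j - i, n - (j - i))     # separation along the contour
                if sep < k:
                    continue
                if np.linalg.norm(points[i] - points[j]) < d_split:
                    return i, j
        return None

    def split_snake(points, i, j):
        # Split the closed contour at (i, j) into two closed contours,
        # preserving the direction of traversal.
        return points[i:j + 1], np.concatenate([points[j:], points[:i + 1]])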
Merging Two Snakes. Two snakes are merged when they have high-curvature points within a distance d_merge of each other, the two snakes' order of traversal (clockwise or counterclockwise) is the same, and the tangents at the two high-curvature points are nearly antiparallel. High-curvature points are those with κγ(p) > 0.6 κγ^max, where κγ^max is the maximum curvature over all points on γ. High-curvature points are used to ensure that merging only occurs when two snakes have the semi-circular tips of their arms facing each other. Filtering out the low-curvature points means the angle between the tangents at two points needs to be computed only for the high-curvature points. When these conditions are fulfilled, the two snakes are merged by deleting the high-curvature points and joining the snakes into a single snake while preserving the direction of traversal for the combined snake.

Deleting a Snake. A snake γ is deleted if it has low compactness (4πA(γ)/L(γ)²) and a perimeter less than L_delete.
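The merging test described above can be sketched as follows, again for snakes stored as ordered point arrays; the finite-difference curvature estimate and the helper names are our own choices, and the same-traversal-order check is assumed to be performed separately.

    import numpy as np

    def curvature_and_tangents(points):
        # Finite-difference tangents and unsigned curvature for a closed contour.
        d1 = (np.roll(points, -1, axis=0) - np.roll(points, 1, axis=0)) / 2.0
        d2 = np.roll(points, -1, axis=0) - 2.0 * points + np.roll(points, 1, axis=0)
        num = d1[:, 0] * d2[:, 1] - d1[:, 1] * d2[:, 0]
        den = (d1[:, 0] ** 2 + d1[:, 1] ** 2) ** 1.5 + 1e-12
        tangents = d1 / (np.linalg.norm(d1, axis=1, keepdims=True) + 1e-12)
        return np.abs(num / den), tangents

    def find_merge_pair(pa, pb, d_merge, angle_thresh=130.0 * np.pi / 180.0):
        # Look for a pair of high-curvature tips, one on each snake, that are
        # within d_merge of each other and have nearly antiparallel tangents.
        ka, ta = curvature_and_tangents(pa)
        kb, tb = curvature_and_tangents(pb)
        for i in np.where(ka > 0.6 * ka.max())[0]:
            for j in np.where(kb > 0.6 * kb.max())[0]:
                if np.linalg.norm(pa[i] - pb[j]) > d_merge:
                    continue
                cos_ang = np.clip(np.dot(ta[i], tb[j]), -1.0, 1.0)
                if np.arccos(cos_ang) > angle_thresh:   # nearly antiparallel
                    return i, j
        return None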
2.5 Experimental Design
We analyze extraction results on different types of road networks using the single quadratic snake proposed by Rochery et al. [7] and the proposed family of cooperating snakes. The default convergence criterion is that the minimum Es(γ) has not improved for some number of iterations. Experiments have been performed to analyze the extraction of tree-structured road networks and of networks with loops, grids, and disconnected networks. We then analyze the effectiveness of the GVF energy obtained from the proposed edge image in Experiment 4. For all the experiments, we digitize the images manually to obtain the ground truth data necessary to compute precision and recall.
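For reference, pixel-level precision and recall against a binary ground-truth road mask can be computed as below; the paper does not state whether a tolerance buffer around the ground truth is used, so this is only the simplest plausible definition.

    import numpy as np

    def precision_recall(extracted_mask, ground_truth_mask):
        # extracted_mask and ground_truth_mask are binary images of equal size.
        ex = np.asarray(extracted_mask, dtype=bool)
        gt = np.asarray(ground_truth_mask, dtype=bool)
        tp = np.logical_and(ex, gt).sum()
        precision = tp / max(ex.sum(), 1)
        recall = tp / max(gt.sum(), 1)
        return precision, recall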
3 Results
We have obtained several parameters empirically. For splitting a snake, d_split should be less than d. k is chosen depending on how far apart the two splitting points should be, ensuring that the snakes formed after splitting have at least k points each. To ensure that merging of snakes takes place only between arms with semi-circular tips facing each other, the tangents at the high-curvature points are checked against an antiparallelism threshold of 130π/180 radians. The compactness should be greater than 0.2 to ensure that linear-structured contours are not deleted.

3.1 Experiment 1: Simple (Tree-Structured) Road Networks
A single quadratic snake is well suited for tree-structured road networks, as the snake does not need to change its topology during evolution (Figure 2). A family of snakes enables faster and better road extraction, as non-road regions are eliminated through snake splitting and deletion.
Fig. 2. Evolution of quadratic snake on roads with tree structure. Each column displays an image with initial contour in red and the extracted road network below it.
Fig. 3. Evolution of quadratic snake on roads with loops and disconnected networks. Each column displays an image with initial contour in red and the extracted road network below it.
3.2 Experiment 2: Road Networks with Single Loop and Multiple Disconnected Networks
The family of quadratic snakes is able to extract disconnected networks with high accuracy (Figure 3), but it cannot extract enclosed regions automatically, since the snakes cannot develop holes inside themselves in the form of growing snakes.

3.3 Experiment 3: Complex Road Networks
A road network is considered complex if it has multiple disconnected networks, enclosed regions, and a large number of branches. With appropriate user initialization (Figure 4), the snakes are able to extract the road networks with high accuracy and in less time.

3.4 Experiment 4: GVF Energy to Enable Faster Evolution
The Gradient Vector Flow field [8] accelerates the evolution process, as can be seen from the number of iterations required for each evolution in Experiment 4 with and without the GVF energy. From the evolution in the fifth column, we see that the snake was able to extract the network in greater detail. From the evolution in the last column, we also see that the quadratic image energy is necessary for robust extraction, and thus the GVF weight and λ need to be balanced appropriately.
Fig. 4. Evolution of quadratic snake on roads with enclosed regions. Each column displays an image with initial contour in green and the extracted road network below it.
Fig. 5. Evolution of quadratic snake on roads with enclosed regions. Each column displays an image with initial contour in green and the extracted road network below it.
4 Discussion and Conclusion
In Experiment 1, we found that our modified quadratic snake is able to move into concavities to extract entire tree-structured road networks with very high accuracy. Experiment 2 showed that the family of quadratic snakes is effective at handling changes in topology during evolution, enabling better extraction of road networks. Currently, loops cannot be extracted automatically. We demonstrated the difficulty of extracting complex road networks with multiple loops and grids in Experiment 3. However, user initialization of a family of contours enables extraction of multiple closed regions and helps the snakes avoid road-like regions. The level set framework could be used to handle changes in topology, enabling effective extraction of enclosed regions. Rochery et al. [10] evolved the contour using the level set methods introduced by Osher and Sethian. However, our method is faster, conceptually simpler, and a direct extension of Kass et al.'s computational approach. In Experiment 4, we found that faster and more robust extraction is achieved using oriented filtering and GVF energy along with the image energy of the quadratic snakes. Our proposed edge image obtained from oriented filtering is effective for computing the GVF energy and enhances the extraction process. We also found that our method for obtaining the GVF outperforms standard methods. Finally, we have developed a complete GUI environment for satellite image manipulation and quadratic snake evolution, based on the Matlab platform. The system is freely available as open source software [9].
Future work will focus on automating the extraction of enclosed regions. Digital elevation models could also be integrated with the image energy for increased accuracy.
Acknowledgments This research was supported by Thailand Research Fund grant MRG4780209 to MND. RM was supported by a graduate fellowship from the Nepal High Level Commission for Information Technology.
References
1. Fischler, M., Tenenbaum, J., Wolf, H.: Detection of roads and linear structures in low-resolution aerial imagery using a multisource knowledge integration technique. Computer Graphics and Image Processing 15, 201–223 (1981)
2. Geman, D., Jedynak, B.: An active testing model for tracking roads in satellite images. IEEE Transactions on Pattern Analysis and Machine Intelligence 18(1), 1–14 (1996)
3. Fortier, A., Ziou, D., Armenakis, C., Wang, S.: Survey of work on road extraction in aerial and satellite images. Technical Report 241, Université de Sherbrooke, Quebec, Canada (1999)
4. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. International Journal of Computer Vision 1(4), 321–331 (1987)
5. Cohen, L.D., Cohen, I.: Finite-element methods for active contour models and balloons for 2-D and 3-D images. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(11), 131–147 (1993)
6. Laptev, I., Mayer, H., Lindeberg, T., Eckstein, W., Steger, C., Baumgartner, A.: Automatic extraction of roads from aerial images based on scale space and snakes. Machine Vision and Applications 12(1), 23–31 (2000)
7. Rochery, M., Jermyn, I.H., Zerubia, J.: Higher order active contours. International Journal of Computer Vision 69(1), 27–42 (2006)
8. Xu, C., Prince, J.L.: Gradient Vector Flow: A new external force for snakes. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 66–71 (1997)
9. Marikhu, R.: A GUI environment for road extraction with quadratic snakes, Matlab software (2007), available at http://www.cs.ait.ac.th/~mdailey/snakes
10. Rochery, M.: Contours actifs d'ordre supérieur et leur application à la détection de linéiques dans des images de télédétection. PhD thesis, Université de Nice - Sophia Antipolis, UFR Sciences (2005)
Multiperspective Distortion Correction Using Collineations

Yuanyuan Ding and Jingyi Yu
Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716, USA
{ding,yu}@eecis.udel.edu
Abstract. We present a new framework for correcting multiperspective distortions using collineations. A collineation describes the transformation between the images of a camera due to changes in sampling and image plane selection. We show that image distortions in many previous models of cameras can be effectively reduced via proper collineations. To correct distortions in a specific multiperspective camera, we develop an interactive system that allows users to select feature rays from the camera and position them at the desirable pixels. Our system then computes the optimal collineation to match the projections of these rays with the corresponding pixels. Experiments demonstrate that our system robustly corrects complex distortions without acquiring the scene geometry, and the resulting images appear nearly undistorted.
1 Introduction
A perspective image represents the spatial relationships of objects in a scene as they would appear from a single viewpoint. Recent developments have suggested that alternative multiperspective camera models [5,16] can combine what is seen from several viewpoints into a single image. These cameras provide potentially advantageous imaging systems for understanding the structure of observed scenes. However, they also exhibit multiperspective distortions such as the curving of lines, apparent stretching and shrinking, and duplicated projections of a single point [12,14]. In this paper, we present a new framework for correcting multiperspective distortions using collineations. A collineation describes the transformation between the images of a camera due to changes in sampling and image plane selection. We show that image distortions in many previous cameras can be effectively reduced via proper collineations. To correct distortions in a specific multiperspective camera, we develop an interactive system that allows users to select feature rays from the camera and position them at the desirable pixels. Our system then computes the optimal collineation to match the projections of these rays with the corresponding pixels. Compared with classical distortion correction methods [12,2,11], our approach does not require prior knowledge on scene geometry and it can handle highly
complex distortions. We demonstrate the effectiveness of our technique on various synthetic and real multiperspective images, including the General Linear Cameras [14], catadioptric mirrors, and reflected images from arbitrary mirror surfaces. Experiments show that our method is robust and reliable, thus the resulting images appear nearly undistorted.
2 Previous Work
In recent years, there has been a growing interest in designing multiperspective cameras which capture rays from different viewpoints in space. These multiperspective cameras include pushbroom cameras [5], which collect rays along parallel planes from points swept along a linear trajectory, the cross-slit cameras [8,16], which collect all rays passing through two lines, and the oblique cameras [7], in which each pair of rays is oblique. The recently proposed General Linear Cameras (GLC) uniformly model these multiperspective cameras as 2D linear manifolds of rays (Fig. 1). GLCs produce easily interpretable images, which are also amenable to stereo analysis [9]. However, these images exhibit multiperspective distortions [14]. In computer vision, image-warping has been commonly used to reduce distortions. Image-warping computes an explicit pixel-to-pixel mapping to warp the original image onto a nearly perspective image. For cameras that roughly maintain a single viewpoint [6], simple parametric functions are sufficient to eliminate perspective, radial, and tangential distortions [2,3]. However, for complex imaging systems, especially those exhibiting severe caustic distortions [12], the warping function is difficult to model and may not have a closed-form solution. Image-based rendering algorithms have also been proposed to reduce image distortions [10,4]. There, the focus has been to estimate the scene structure from a single or multiple images. Swaminathan and Nayar [13] have shown that simple geometry proxies, such as the plane, sphere, and cylinder, are often sufficient to reduce caustic distortions on catadioptric mirrors, provided that the prior on scene structure is known. We present a third approach based on multiperspective collineations. A collineation describes the transformation between the images of a camera due to changes in sampling and image plane selection. For many multiperspective cameras such as the pushbroom [5] and the cross-slit [8], collineations can be uniformly modeled using the recently proposed General Linear Cameras (GLC) [15].

2.1 GLC Collineation
In the GLC framework, every ray is parameterized by its intersections with the two parallel planes, where [u, v] is the intersection with the first and [s, t] the second, as shown in Fig. 1(a). This parametrization is often called a two-plane parametrization (2PP) [4,15]. We can reparameterize each ray by substituting σ = s − u and τ = t − v. In this paper, we will use this [σ, τ, u, v] parametrization to simplify our analysis. We also assume the default uv plane is at z = 0 and st plane at z = 1. Thus [σ, τ, 1] represents the direction of the ray.
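As a small illustration of this parameterization, under the stated convention that the uv plane is at z = 0 and the st plane at z = 1 (the function name is ours):

    import numpy as np

    def ray_from_2pp(u, v, s, t):
        # Reparameterize a two-plane ray by sigma = s - u, tau = t - v.
        # The ray passes through [u, v, 0] with direction [sigma, tau, 1].
        sigma, tau = s - u, t - v
        origin = np.array([u, v, 0.0])
        direction = np.array([sigma, tau, 1.0])
        return np.array([sigma, tau, u, v]), origin, direction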
Fig. 1. General Linear Camera Models. (a) A GLC collects radiance along all possible affine combination of three rays. The rays are parameterized by their intersections with two parallel planes. The GLC model unifies many previous cameras, including the pinhole (b), the orthographic (c), the pushbroom (d), and the cross-slit (e).
A GLC is defined as the affine combination of three rays parameterized under 2PP:

    r = α[σ1, τ1, u1, v1] + β[σ2, τ2, u2, v2] + (1 − α − β)[σ3, τ3, u3, v3],  ∀α, β    (1)

Many well-known multiperspective cameras, such as pushbroom, cross-slit, and linear oblique cameras, are GLCs, as shown in Fig. 1. If we assume uv is the image plane, we can further choose three special rays with [u, v] coordinates [0, 0], [1, 0], and [0, 1] to form a canonical GLC as:

    r[σ, τ, u, v] = (1 − α − β) · [σ1, τ1, 0, 0] + α · [σ2, τ2, 1, 0] + β · [σ3, τ3, 0, 1]    (2)
It is easy to see that α = u, β = v, and σ and τ are linear functions in u and v. Therefore, under the canonical form, every pixel [u, v] maps to a ray r(u, v) in the GLC. A GLC collineation maps every ray r(u, v) to a pixel [i, j] on the image plane Π[ṗ, d1, d2], where ṗ specifies the origin and d1 and d2 specify the two spanning directions of Π. For every ray r[σ, τ, u, v], we can intersect r with Π to compute [i, j]:

    [u, v, 0] + λ[σ, τ, 1] = ṗ + i d1 + j d2    (3)
Solving for i, j, and λ gives:

    i = [(τ d2^z − d2^y)(u − px) + (d2^x − σ d2^z)(v − py) − (σ d2^y − τ d2^x) pz] / γ
    j = [(d1^y − τ d1^z)(u − px) + (σ d1^z − d1^x)(v − py) − (τ d1^x − σ d1^y) pz] / γ    (4)

where

    γ = | d1^x  d2^x  −σ |
        | d1^y  d2^y  −τ |
        | d1^z  d2^z  −1 |    (5)
For a canonical GLC, since σ and τ are both linear functions in u and v, γ must be linear in u and v. Therefore, we can rewrite i and j as:
    i = (a1 u² + b1 uv + c1 v² + d1 u + e1 v + f1) / (a3 u + b3 v + c3)
    j = (a2 u² + b2 uv + c2 v² + d2 u + e2 v + f2) / (a3 u + b3 v + c3)    (6)
Thus, the collineation Col̃_Π(u, v) of a GLC from the uv image plane to a new image plane Π is a quadratic rational function. Fig. 2 shows the images of a GLC under different collineations. It implies that image distortions may be reduced using a proper collineation.
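Numerically, the collineation of a single ray can be evaluated either through the closed forms (4)-(5) or, equivalently, by solving the small linear system of Equation (3) directly, as in the following sketch (function and variable names are ours):

    import numpy as np

    def glc_collineation(sigma, tau, u, v, p, d1, d2):
        # Intersect the ray [u, v, 0] + lambda*[sigma, tau, 1] with the image
        # plane Pi[p, d1, d2] by solving Equation (3) for (i, j, lambda).
        A = np.column_stack([d1, d2, -np.array([sigma, tau, 1.0])])
        b = np.array([u, v, 0.0]) - np.asarray(p, dtype=float)
        i, j, lam = np.linalg.solve(A, b)
        return i, j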
Fig. 2. The image of a cross-slit GLC (d) under collineation (c) appears much less distorted than the image (b) of the same camera under collineation (a).
3 Correcting Distortions in GLCs
Given a specific GLC, our goal is to find the optimal collineation to minimize its distortions. Similar to previous approaches [12,11], we assume the rays captured by the camera are known. We have developed an interactive system to allow users to design their ideal undistorted images. Our system supports two modes. In the first mode, the user can select feature rays from the camera and position them at desirable pixels in the target images. In the second mode, the user can simply provide a reference perspective image. Our system then automatically matches the feature points. Finally, the optimal collineation is estimated to fit the projections of the feature rays with the target pixels.

3.1 Interactive Distortion Correction
Given a canonical GLC, the user can first select n feature rays (blue crosses in Fig. 3(a)) from the source camera and then position them at desirable pixels (red crosses in Fig. 3(b)) on the target image. Denote [uk, vk] as the uv coordinate of each selected ray rk in the camera and [ik, jk] as the desired pixel coordinate of rk on the target image; we want to find the collineation Π[ṗ, d1, d2] that maps [u, v] as close to [i, j] as possible. We formalize it as a least squares fitting problem:

    min_Π Σ_{k=1}^{n} ‖Col̃_Π(uk, vk) − [ik, jk]‖²    (7)
Since each collineation Π[ṗ, d1, d2] has 9 variables, we need a minimal number of five ray-pixel pairs. This is not surprising because four pairs uniquely
determine a projective transformation, a degenerate collineation in the case of perspective cameras. Recall that the GLC collineations are quadratic rational functions. Thus, finding the optimal Π in Equation (7) requires using non-linear optimizations. To solve this problem, we use the Levenberg-Marquardt method. A common issue with the Levenberg-Marquardt method, however, is that the resulting optimum depends on the initial condition. To avoid getting trapped in a local minimum, we choose a near optimal initial condition by sampling different spanning directions of Π. We rewrite the spanning directions as:

    d_i = η_i · [cos(φ_i)cos(θ_i), cos(φ_i)sin(θ_i), sin(φ_i)],  i = 1, 2    (8)
We sample several θ1, θ2, φ1, and φ2 and find the corresponding ṗ, η1, and η2 as the initial conditions. Finally, we choose the one with the minimum error. This preconditioned optimization robustly approximates a near optimal collineation that significantly reduces distortions as shown in Fig. 3(b).
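A rough sketch of this preconditioned fit using SciPy's Levenberg-Marquardt solver is given below; the packing of the nine plane parameters, the fixed initial guesses for ṗ, η1, η2, and the helper `project` (which could wrap the glc_collineation sketch above) are our own assumptions, whereas the paper derives ṗ, η1, η2 from the sampled angles.

    import numpy as np
    from itertools import product
    from scipy.optimize import least_squares

    def fit_collineation(rays, targets, project, theta_samples, phi_samples):
        # rays: (n, 4) array of [sigma, tau, u, v]; targets: (n, 2) pixel coords.
        # Parameters packed as [p (3), eta1, theta1, phi1, eta2, theta2, phi2].
        def spanning(eta, theta, phi):
            return eta * np.array([np.cos(phi) * np.cos(theta),
                                   np.cos(phi) * np.sin(theta),
                                   np.sin(phi)])

        def residuals(x):
            p, d1, d2 = x[:3], spanning(*x[3:6]), spanning(*x[6:9])
            proj = np.array([project(s, t, u, v, p, d1, d2) for s, t, u, v in rays])
            return (proj - targets).ravel()

        best = None
        for th1, ph1, th2, ph2 in product(theta_samples, phi_samples,
                                          theta_samples, phi_samples):
            x0 = np.array([0.0, 0.0, 1.0, 1.0, th1, ph1, 1.0, th2, ph2])
            res = least_squares(residuals, x0, method='lm')
            if best is None or res.cost < best.cost:
                best = res
        return best.x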
Fig. 3. Interactive Distortion Correction. (a) The user selects feature rays (blue crosses) and positions them at desirable pixels (red crosses). (b) shows the new image under the optimal collineation. The distortions are significantly reduced. The green crosses illustrate the final projections of the feature rays.
3.2 Automatic Distortion Correction
We also present a simple algorithm to automatically reduce distortions. Our method consists of two steps. First, the user provides a target perspective image that captures the same scene. Next, we automatically select the matched features between the source camera and the target image and compute the optimal collineation by minimizing Equation (7). Recall that a GLC captures rays from different viewpoints in space and hence, its image may appear very different from a perspective image. To match the feature points, we use Scale Invariant Feature Transform (SIFT) to preprocess the two images. SIFT robustly handles image distortion and generates transformation-invariant features. We then perform global matching to find the potential matching pairs. Finally, we prune the outliers by using RANSAC with the homography model. To tolerate parallax, we use a loose inlier threshold of 20 pixels. In Fig. 4, we show our automatic distortion correction results on various GLCs including the pushbroom, the cross-slit, and the pencil cameras. The user inputs
Fig. 4. Automatic Distortion Correction. (a) Perspective reference image; (b), (c), and (d) are distorted images captured from a pushbroom camera, a cross-slit camera, and a pencil camera. (e), (f), and (g) are the distortion-corrected results of (b), (c), and (d) using the automatic algorithm.
a perspective image (Fig. 4(a)) and the corrected GLC images appear nearly undistorted using the optimal collineations (bottom row of Fig. 4).
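A minimal OpenCV sketch of this SIFT-plus-RANSAC matching step, assuming grayscale uint8 images and OpenCV 4.x; the brute-force matcher with cross-checking is our choice, since the paper does not specify the matching strategy beyond "global matching":

    import cv2
    import numpy as np

    def match_features(source_img, reference_img, ransac_thresh=20.0):
        # SIFT features matched between the multiperspective image and the
        # reference perspective image; outliers pruned by RANSAC under a
        # homography model with a loose 20-pixel inlier threshold.
        sift = cv2.SIFT_create()
        k1, d1 = sift.detectAndCompute(source_img, None)
        k2, d2 = sift.detectAndCompute(reference_img, None)
        matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(d1, d2)
        src = np.float32([k1[m.queryIdx].pt for m in matches])
        dst = np.float32([k2[m.trainIdx].pt for m in matches])
        H, mask = cv2.findHomography(src, dst, cv2.RANSAC, ransac_thresh)
        inliers = mask.ravel().astype(bool)
        return src[inliers], dst[inliers]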
4 Correcting Distortions on Catadioptric Mirrors
Next, we show how to correct multiperspective distortions on catadioptric mirrors. Conventional catadioptric mirrors place a pinhole camera at the focus of a hyperbolic or parabolic surface to synthesize a different pinhole camera with a wider field of view [6]. When the camera moves off the focus, the reflection images exhibit complex caustic distortions that are generally difficult to correct [12]. We apply a similar algorithm using multiperspective collineations. Our method is based on the observation that, given any arbitrary multiperspective imaging system that captures a smoothly varying set of rays, we can map the rays onto a 2D ray manifold in the 4D ray space. The characteristics of this imaging system, such as its projection, collineation, and image distortions, can be analyzed by the 2D tangent ray planes, i.e., the GLCs [14]. This implies that a patch on an arbitrary multiperspective image can be locally approximated as a GLC. We first generalize the GLC collineation to arbitrary multiperspective imaging systems. Notice that not all rays in these systems can be parameterized as [σ, τ, u, v] (e.g., some rays may lie parallel to the parametrization plane). Thus, we use the origin ȯ and the direction l to represent each ray r. The collineation Π[ṗ, d1, d2] maps r[ȯ, l] to a pixel [i, j] as:

    [ox, oy, oz] + λ[lx, ly, lz] = ṗ + i d1 + j d2    (9)
Solving for i, j in Equation (9) gives:

    i = [(ly d2^z − lz d2^y)(ox − px) + (lz d2^x − lx d2^z)(oy − py) + (lx d2^y − ly d2^x)(oz − pz)] / γ*
    j = [(lz d1^y − ly d1^z)(ox − px) + (lx d1^z − lz d1^x)(oy − py) + (ly d1^x − lx d1^y)(oz − pz)] / γ*    (10)
Fig. 5. Selecting different feature rays ((a) and (c)) produces different distortion correction results ((b) and (d)). (f) shows the automatic feature matching between a region (blue rectangle) on the spherical mirror and a perspective image. (g) is the final distortion corrected image. The holes are caused by the under-sampling of rays.
where

    γ* = | d1^x  d2^x  −lx |
         | d1^y  d2^y  −ly |
         | d1^z  d2^z  −lz |    (11)

We abbreviate Equation (10) as [i, j] = Col̃_Π(ȯ, l). The user then selects n feature rays from the catadioptric mirror and positions them at target pixels [ik, jk], k = 1 . . . n. Alternatively, they can provide a target perspective image (Fig. 5(f)) and our system will automatically establish feature correspondences using the SIFT-RANSAC algorithm. We then use the Levenberg-Marquardt method (Equation (7)) with sampled initial conditions to find the optimal collineation Col̃_Π.
5 Results
We have experimented with our system on various multiperspective images. We modify the PovRay [18] ray tracer to generate both GLC images and reflected images on catadioptric mirrors. Fig. 3 shows an image of a cross-slit camera in
Fig. 6. Correcting distortions on a spherical mirror. The user selects separate regions on the sphere (a) to get (b) and (d). (c) and (e) are the resulting images by matching the selected features (blue) and target pixels (red) in (b) and (d) using collineations.
which the two slits form an acute angle. The user then selects feature rays (blue) from the GLC image and positions them at desirable pixels (red). Our system estimates the optimal collineation and re-renders the image under this collineation as shown in Fig. 3(b). The distortions in the resulting image are significantly reduced. Next, we apply our algorithm to correct reflection distortions on a spherical mirror shown in Fig. 6. It has been shown [14] that more severe distortions occur near the boundary of the mirror than at the center. Our algorithm robustly corrects both distortions in the center region and near the boundary. In particular, our method is able to correct the highly curved silhouettes of the refrigerator (Fig. 6(d)). The resulting images are rendered by intersecting the rays inside the patch with the collineation plane, and thus contain holes due to the undersampling of rays. Our algorithm can further correct highly complex distortions on arbitrary mirror surfaces. In Fig. 7, we render a reflective horse model of 48,000 triangles at two different poses. Our system robustly corrects various distortions such as stretching, shrinking, and duplicated projections of scene points in the reflected image, and the resulting images appear nearly undistorted. We have also experimented with our automatic correction algorithm on both the GLC models and catadioptric mirrors. In Fig. 4, the user inputs a target perspective image (Fig. 4(a)) and our system automatically matches the feature points between the GLC and the target image. Even though the ray structures in the GLCs are significantly different from those of a pinhole camera, the corrected GLC images appear close to perspective. In Fig. 5(f), a perspective image of a kitchen scene is used to automatically correct distortions on a spherical mirror. This
Fig. 7. Correcting complex distortions on a horse model. We render a reflective horse model under two different poses (a) and (d) and then select regions (b) and (e). (c) and (f) are the resulting images by matching the selected features (blue) and target pixels (red) in (b) and (e) using collineations.
Fig. 8. Correcting reflection distortions. (a) and (c) are two captured reflected images on a mirror sphere. Our algorithm not only reduces multiperspective distortions but also synthesizes strong perspective effects (b) and (d).
implies that our collineation framework has the potential to benefit automatic catadioptric calibration. Finally, we have applied our algorithm to real reflected images of a mirror sphere in a deep scene. We position the viewing camera far away from the sphere so that it can be approximated as an orthographic camera. We then calculate the corresponding reflected ray for each pixel and use our collineation algorithm to correct the distortions. Our system not only reduces multiperspective distortions but also synthesizes strong perspective effects, as shown in Fig. 8.
6 Discussions and Conclusion
We have presented a new framework for correcting multiperspective distortions using collineations. We have shown that image distortions in many previous cameras can be effectively reduced via proper collineations. To find the optimal collineation for a specific multiperspective camera, we have developed an interactive system that allows users to select feature rays from the camera and position them at the desirable pixels. Our system then computes the optimal collineation to match the projections of these rays with the corresponding pixels. Experiments demonstrate that our system robustly corrects complex distortions without acquiring the scene geometry, and the resulting images appear nearly undistorted.
Fig. 9. Comparing collineations with the projective transformation. The user selects feature rays (blue) and target pixels (red). (c) is the result using the optimal collineation. (d) is the result using the optimal projective transformation.
It is important to note that a collineation computes the mapping from a ray to a pixel whereas image warping computes the mapping from a pixel to a pixel. One limitation of using collineations is that we cannot compute the inverse mapping from pixels to rays. Therefore, if the rays in the source camera are undersampled, e.g., in the case of a fixed-resolution image of the catadioptric mirrors, the collineation algorithm produces images with holes. As for future work, we plan to explore using image-based rendering algorithms such as the push-pull method [4] to fill in the holes in the ray space. We have also compared our collineation method with the classical projective transformations. In Fig. 9, we select the same set of feature points (rays) from a reflected image on the horse model. Fig. 9(c) computes the optimal projective transformation and Fig. 9(d) computes the optimal collineation, both using the Levenberg-Marquardt method for fitting the feature points. The optimal collineation result is much less distorted and is highly consistent with the pinhole image while the projective transformation result remains distorted. This is because multiperspective collineation describes a much broader class of warping functions than the projective transformation.
Acknowledgement This work has been supported by the National Science Foundation under grant NSF-MSPA-MCS-0625931.
References
1. Chahl, J., Srinivasan, M.: Reflective surfaces for panoramic imaging. Applied Optics 37(8), 8275–8285 (1997)
2. Chen, S.E.: QuickTime VR – An Image-Based Approach to Virtual Environment Navigation. Computer Graphics 29, 29–38 (1995)
3. Derrien, S., Konolige, K.: Approximating a single viewpoint in panoramic imaging devices. In: International Conference on Robotics and Automation, pp. 3932–3939 (2000)
4. Gortler, S.J., Grzeszczuk, R., Szeliski, R., Cohen, M.F.: The Lumigraph. In: SIGGRAPH 1996, pp. 43–54 (1996)
5. Gupta, R., Hartley, R.I.: Linear Pushbroom Cameras. IEEE Trans. Pattern Analysis and Machine Intelligence 19(9), 963–975 (1997)
6. Nayar, S.K.: Catadioptric Omnidirectional Cameras. In: Proc. CVPR, pp. 482–488 (1997)
7. Pajdla, T.: Stereo with Oblique Cameras. Int'l J. Computer Vision 47(1/2/3), 161–170 (2002)
8. Pajdla, T.: Geometry of Two-Slit Camera. Research Report CTU–CMP–2002–02 (March 2002)
9. Seitz, S., Kim, J.: The Space of All Stereo Images. In: Proc. ICCV, pp. 26–33 (July 2001)
10. Shum, H., He, L.: Rendering with concentric mosaics. Computer Graphics 33, 299–306 (1999)
11. Stein, G.P.: Lens distortion calibration using point correspondences. In: Proc. CVPR, pp. 143–148 (June 1997)
12. Swaminathan, R., Grossberg, M.D., Nayar, S.K.: Caustics of Catadioptric Cameras. In: Proc. ICCV, pp. 2–9 (2001)
13. Swaminathan, R., Grossberg, M.D., Nayar, S.K.: A Perspective on Distortions. In: Proc. IEEE Computer Vision and Pattern Recognition, Wisconsin (June 2003)
14. Yu, J., McMillan, L.: Multiperspective Projection and Collineation. In: Sebe, N., Lew, M.S., Huang, T.S. (eds.) Computer Vision in Human-Computer Interaction. LNCS, vol. 3766, Springer, Heidelberg (2005)
15. Yu, J., McMillan, L.: Modelling Reflections via Multiperspective Imaging. In: Proc. IEEE Computer Vision and Pattern Recognition, San Diego (June 2005)
16. Zomet, A., Feldman, D., Peleg, S., Weinshall, D.: Mosaicing New Views: The Crossed-Slits Projection. IEEE Trans. on PAMI, 741–754 (2003)
17. Zorin, D., Barr, A.H.: Correction of Geometric Perceptual Distortions in Pictures. Computer Graphics 29, 257–264 (1995)
18. POV-Ray: The Persistence of Vision Raytracer, http://www.povray.org/
Camera Calibration from Silhouettes Under Incomplete Circular Motion with a Constant Interval Angle

Po-Hao Huang and Shang-Hong Lai
Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
{even,lai}@cs.nthu.edu.tw
Abstract. In this paper, we propose an algorithm for camera calibration from silhouettes under circular motion with an unknown constant interval angle. Unlike previous silhouette-based methods based on the surface of revolution, the proposed algorithm can be applied to sparse and incomplete image sequences. Under the assumption of circular motion with a constant interval angle, the epipoles of successive image pairs remain constant and can be determined from silhouettes. A pair of epipoles formed by a certain interval angle provides a constraint on the angle and focal length. With more pairs of epipoles recovered, the focal length can be determined as the value that best satisfies the constraints, and the interval angle is determined concurrently. The rest of the camera parameters can be recovered from image invariants. Finally, the estimated parameters are optimized by minimizing the epipolar tangency constraints. Experimental results on both synthetic and real images are shown to demonstrate its performance. Keywords: Circular Motion, Camera Calibration, Shape Reconstruction.
1 Introduction

Reconstructing 3D models from image sequences has been studied for decades [1]. In real applications, for instance 3D object digitization in digital museums, modeling from circular motion sequences is a practical and widely used approach in the computer vision and computer graphics communities. Numerous methods that focus on circular motion have been proposed, and they can be classified into two camps, namely the feature-based [2,3,4,5] and silhouette-based [6,7,8,9] approaches. In the feature-based approaches, Fitzgibbon et al. [2] proposed a method that makes use of the fundamental matrices and trifocal tensors to uniquely determine the rotation angles and determine the reconstruction up to a two-parameter family. Jiang et al. [3,4] further developed a method that avoids the computation of multiview tensors to recover the circular motion geometry by either fitting conics to tracked points in at least five images or computing a plane homography from minimally two points in four images. Cao et al. [5] aimed at the problem of varying focal lengths under circular motion with a constant but unknown rotation angle. However, it is difficult to establish accurate feature correspondences from the image sequences for objects of texture-less, semi-transparent, or reflective materials, such as jade. Instead of feature correspondences, the silhouette-based approach
integrates the object contours to recover the 3D geometry. In [6], Mendonca and Cipolla addressed the problem of estimating the epipolar geometry from apparent contours under circular motion. Under the assumption of a constant rotation angle, there are only two common epipoles of successive image pairs that need to be determined. The relation between the epipolar tangencies and the image of the rotation axis is used to define a cost function. Nevertheless, the initialization of the epipole positions can influence the final result and make the algorithm converge to a local minimum. In [7], Mendonca et al. exploited the symmetry properties of the surface of revolution (SoR) swept out by the rotating object to obtain an initial guess of the image invariants, followed by several one-dimensional searching steps to recover the epipolar geometry. Zhang et al. [8] further extended this method to achieve auto-calibration. The rotation angle is estimated from three views, which sometimes results in inaccurate estimation. In [9], they formulated the circular motion as 1D camera geometry to achieve more robust motion estimation. Most of the silhouette-based methods are based on the SoR to obtain an initial guess of the image invariants, thus making them infeasible when the image sequence is sparse (interval angle larger than 20 degrees [7]) or incomplete. In this paper, we propose an algorithm for camera calibration from silhouettes of an object under circular motion with a sparse and incomplete sequence. In our approach, we first use the same cost function as proposed in [6] to determine the epipoles of successive image pairs from silhouettes. Thus, a constant interval angle is the main assumption of our algorithm. In addition, we propose a method for initializing the positions of the epipoles, which is important in practice. A pair of epipoles formed by a certain interval angle can provide a constraint on the angle and focal length. With more pairs of epipoles recovered, the focal length can therefore be determined as the one that best satisfies these constraints, and the angle is determined concurrently. After obtaining the camera intrinsic parameters, the rotation matrix about the camera center can be recovered from the image invariants up to two sign ambiguities, which can be further resolved by making the sign of the back-projected epipoles consistent with the camera coordinate system. Finally, the epipolar tangency constraints for all pairs of views are minimized to refine the camera parameters, using all determined parameters as an initial guess in the nonlinear optimization process. The remainder of this paper is organized as follows. Section 2 describes the image invariants under circular motion. Section 3 describes the epipolar tangency constraints and explains how to extract epipoles from contours. The estimation of camera parameters is described in Section 4. Experimental results on both synthetic and real data are given in Section 5. Finally, we conclude this paper in Section 6.
2 Image Invariants Under Circular Motion

The geometry of circular motion is illustrated in Fig. 1(a). A camera C rotates about an axis Ls, and its track forms a circle on a plane Πh that is perpendicular to Ls and intersects it at the circle center Xs. Without loss of generality, we assume the world coordinate system to be centered at Xs with the Y-axis along Ls and C placed on the negative Z-axis. If the camera parameters are kept constant under circular motion, the
Fig. 1. (a) The geometry of circular motion. (b) The image invariants under circular motion.
image Πi of C will contain invariant entities of the geometry as shown in Fig. 1(b). Line lh (ls) is the projection of Πh (Ls). The three points vx, vy, and xs are the vanishing points of the X-, Y-, and Z-axes, respectively. Similar descriptions of the image invariants can also be found in [2,3,4,8]. Let the camera intrinsic parameters and the rotation matrix about the camera center, which will be referred to as the camera pose in the rest of this paper, be denoted as K and R, respectively. The camera projection matrix P can be written as:

    P = KR [R_y(θ) | −C],    (1)
where R = [r1 r2 r3], R_y(θ) is the rotation matrix about Ls with angle θ, and C = [0 0 −t]^T. In mathematical expression, the three points can be written as:

    [vx  vy  xs] ~ KR = K [r1  r2  r3],    (2)
where the symbol "~" denotes the equivalence relation in homogeneous coordinates.
3 Epipolar Geometry from Silhouettes

Epipoles can be obtained by computing the null vectors of the fundamental matrix when feature correspondences between two views are available. However, from silhouettes alone, it takes more effort to determine the epipoles. In this section, the relationship between the epipoles and the silhouettes is discussed.

3.1 Constraints on Epipoles and Silhouettes Under Circular Motion

In two-view geometry, a frontier point is the intersection of contour generators, and its projection is located at an epipolar tangency of the object contour, as shown in Fig. 2(a). Hence, the tangent points (lines) induced by the epipoles can be regarded as corresponding points (epipolar lines). In addition, as mentioned in [6], under circular motion the intersections of corresponding epipolar lines lie on ls when two views are put in the same image, as shown in Fig. 2(b). This property provides constraints on epipoles and silhouettes. In [6], the cost function is defined as the distance between the intersections of corresponding epipolar
Fig. 2. (a) Frontier point and the epipolar tangencies. (b) Epipolar tangencies and ls under circular motion. (c) Epipolar tangency constraints.
lines and ls. In general, a pair of views has two epipoles with four unknowns but provides only two constraints (intersections), which are not enough to uniquely determine the answer. Therefore, they assume the interval angle of adjacent views is kept constant, thus reducing the number of epipoles to be estimated to only two, with four unknowns. Given the epipoles, ls can be determined by line fitting the intersections. In their method, with appropriate initialization of the epipoles, the cost function is iteratively minimized to determine the epipoles (see [6] for details).

3.2 Initialization of Epipoles

In [6], they only showed experiments on synthetic data. In practice, it is crucial to obtain good initial positions of the epipoles. In our algorithm, the assumption of a constant interval angle is also adopted to reduce the unknowns. When taking an image sequence in a turn-table environment, the camera pose is usually close to the form R = Rz(0)Ry(0)Rx(θx) = Rx(θx); therefore the harmonic homography derived in [7] reduces to a bilateral symmetry as follows:

    W = I − 2 K r1 r1^T K⁻¹ =
        [ −1  0  2u0 ]
        [  0  1   0  ]
        [  0  0   1  ],    (3)
where r1 = [1 0 0]^T and u0 is the x-coordinate of the optical center. Assuming the optical center coincides with the image center, we can obtain a rough harmonic homography from (3). Using this harmonic homography, the initial positions of the epipoles can be obtained from the epipole estimation step described in [7]. In fact, given contours C1 and C2 and a harmonic homography W, the corresponding epipole e1 (e2) can be directly located from the bi-tangent lines of the contours WC1 and C2 (C1 and WC2) without performing several one-dimensional searching steps as in [7]. Note that here WC means the contour C transformed by W.

3.3 Epipolar Tangency Constraints

In the silhouette-based approach, the most common energy function for measuring the model is the epipolar tangency constraint, which can be illustrated in Fig. 2(c). In
Fig. 2(c), a pair of contours is put on the same image. The epipole e1 (e2) is the projection of C1 (C2) onto the camera C2 (C1), and x1 (x2) is the tangent point induced by epipole e2 (e1) with tangent line t1 (t2). As mentioned in Section 3.1, the tangent points x1 and x2 are considered corresponding points. The dashed line l1 (l2) is the corresponding epipolar line of x2 (x1). Ideally, l1 (l2) should be the same as t1 (t2). Let the projection matrix of camera C1 (C2) be denoted P1 (P2). The error associated with the epipolar tangency constraints can be written as:
    err(x1, x2, P1, P2) = d(x1, l1) + d(x2, l2),    (4)
where the function d(·,·) gives the Euclidean distance from a point to a line, l1 = (P1 P2⁺ x2) × e2, l2 = (P2 P1⁺ x1) × e1, and P⁺ is the pseudo-inverse of the projection matrix. Given a set of silhouettes S and its corresponding projection matrices P, the overall cost function can be written as:
    Cost(P, S) = Σ_{(Si,Sj)∈Sp} Σ_{(xa,xb)∈Tpi,j} err(xa, xb, Pi, Pj),    (5)
where the set Sp contains all contour pairs that are used in the cost function, and Tpi,j is the set of tangent points induced by epipoles (of Pi and Pj) with contours (Si and Sj).
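A direct transcription of Equations (4) and (5) in Python (points and epipoles as homogeneous 3-vectors, projection matrices 3x4; following Fig. 2(c), e2 lies in image 1 and e1 in image 2):

    import numpy as np

    def point_line_distance(x, l):
        # Euclidean distance from homogeneous point x to homogeneous line l.
        return abs(np.dot(l, x / x[2])) / np.linalg.norm(l[:2])

    def epipolar_tangency_error(x1, x2, P1, P2, e1, e2):
        # Equation (4): l1 is the epipolar line of x2 in image 1, l2 that of x1
        # in image 2.
        l1 = np.cross(P1 @ np.linalg.pinv(P2) @ x2, e2)
        l2 = np.cross(P2 @ np.linalg.pinv(P1) @ x1, e1)
        return point_line_distance(x1, l1) + point_line_distance(x2, l2)

    def total_cost(pairs):
        # Equation (5): pairs is a list of (x1, x2, P1, P2, e1, e2) tuples, one
        # per tangent-point pair over all contour pairs in Sp.
        return sum(epipolar_tangency_error(*args) for args in pairs)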
4 Camera Calibration

In the previous section, the method to extract epipoles from silhouettes under circular motion with a constant interval angle was presented. In this section, we describe how to compute the camera parameters from the epipoles. For simplicity, the camera is assumed to have zero skew, unit aspect ratio, and principal point at the image center, which is a reasonable assumption for current cameras.

4.1 Recovery of Focal Length and Interval Angle

The geometry of circular motion under a constant interval angle is illustrated in Fig. 3(a). In Fig. 3(a), Xs is the circle center, and cameras are distributed on the circle with a constant interval angle θ. With a certain interval angle, a pair of determined epipoles
Fig. 3. (a) Circular motion with a constant interval angle. (b) The image of one camera.
can provide a constraint on the angle and focal length. For instance, for the image of C3 as shown in Fig. 3(b), epipoles formed by the angle θ are e2 and e4. Therefore, we can derive a relationship of the angle and focal length with epipoles as follows:
    θ = π − ang(K⁻¹e2, K⁻¹e4),    (6)
where the function ang(·,·) gives the angle between two vectors. In Equation (6), we have one constraint but two unknowns, the angle θ and the focal length, so the constraint alone is not enough to determine them. Recall that the image sequence is taken with a constant interval angle. Take Fig. 3(a) for example: C1-C2-C3-C4-C5 is an image sequence with a constant interval angle θ. Also, C1-C3-C5 is an image sequence with a constant interval angle 2θ, which can provide another constraint as follows:
    2θ = π − ang(K⁻¹e1, K⁻¹e5).    (7)
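As a numerical illustration of how Equations (6) and (7) jointly constrain the two unknowns, the sketch below evaluates the implied angles for a hypothesized focal length (zero skew, unit aspect ratio, principal point (u0, v0), as assumed in the paper); the search procedure actually used is described in the following paragraph, where the focal length is instead obtained from Equation (6) in closed form.

    import numpy as np

    def angle_from_epipoles(ea, eb, f, u0, v0):
        # Equation (6)/(7): the angle implied by a pair of epipoles (homogeneous
        # 3-vectors) for focal length f and principal point (u0, v0).
        K_inv = np.array([[1.0 / f, 0.0, -u0 / f],
                          [0.0, 1.0 / f, -v0 / f],
                          [0.0, 0.0, 1.0]])
        ra, rb = K_inv @ ea, K_inv @ eb
        cos_ang = np.dot(ra, rb) / (np.linalg.norm(ra) * np.linalg.norm(rb))
        return np.pi - np.arccos(np.clip(cos_ang, -1.0, 1.0))

    def constraint_residual(theta, f, u0, v0, e2, e4, e1, e5):
        # How well a hypothesized (theta, f) satisfies both constraints: theta
        # from the (e2, e4) pair and 2*theta from the (e1, e5) pair (Fig. 3).
        return (abs(angle_from_epipoles(e2, e4, f, u0, v0) - theta)
                + abs(angle_from_epipoles(e1, e5, f, u0, v0) - 2.0 * theta))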
Two pairs of epipoles are sufficient to determine the unknowns. With more pairs of epipoles formed by different interval angles recovered, more constraints similar to Equations (6) and (7) can be applied to precisely determine the focal length. In our implementation, a linear search on the angle θ is performed to find the value that best satisfies these constraints. For instance, given an angle θ, the focal length can be determined by solving a quadratic equation derived from Equation (6). Substituting the estimated focal length into the right-hand side of Equation (7) yields the difference between the left-hand and right-hand sides of Equation (7). The interval angle and focal length are therefore determined as the pair that best satisfies these constraints.

4.2 Recovery of Image Invariants

From the extracted epipoles, lh can be computed by line fitting these epipoles, and ls is determined concurrently in the epipole extraction stage. Then, xs is the intersection of lh and ls. After the camera intrinsic matrix K is obtained, vx can be computed from the pole-polar relationship [1], i.e., vx ~ K K^T ls.

4.3 Recovery of Camera Pose

From Equation (2), with the camera intrinsic parameters and image invariants known, the camera pose R can be computed up to two sign ambiguities as follows:
    r1 = α × norm(K⁻¹vx),  α = ±1
    r3 = β × norm(K⁻¹xs),  β = ±1
    r2 = r3 × r1    (8)
where the function norm(·) normalizes a vector to unit norm. Notice that the sign of the rotation axis makes no difference for projection into image coordinates, but back-projection of image points leads to a sign ambiguity. This ambiguity can be resolved by back-projecting the epipole obtained from the image and checking its sign against the corresponding camera position, which is transformed from the world coordinate system to the camera coordinate system using the determined camera pose R. Because the camera position in the world
coordinate system is independent of the camera pose R, we can still recover the camera position in the presence of the rotation ambiguity. Furthermore, the Gram-Schmidt process is applied to obtain an orthogonal basis.

4.4 Summarization of the Proposed Algorithm

INPUT: n object contours, S1 to Sn, under circular motion with an unknown constant interval angle.
OUTPUT: Camera parameters and 3D model.
1. Choose a frame interval Δv; the contours Sv and Sv+Δv are considered as a contour pair for determining the two epipoles formed by the interval, where v = 1...(n−Δv).
2. Initialize the two common epipoles by the method described in Section 3.2.
3. Extract the epipoles with a nonlinear minimization step as described in Section 3.1.
4. Choose different frame intervals and repeat steps 1-3 to extract more epipoles.
5. Use the epipoles extracted from steps 1-4 as initial guesses and perform the nonlinear minimization step again to uniquely determine ls.
6. Recover the camera parameters as described in Section 4.
7. Set the projection matrices according to Equation (1) as initial guesses.
8. Minimize the overall epipolar tangency error as described in Section 3.3.
9. Generate the 3D model using the image-based visual hull technique [10].
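Step 6 above recovers the camera parameters; the pose-recovery part (Equation (8) of Section 4.3) can be sketched as follows, with the sign flags alpha and beta assumed to be resolved beforehand by the back-projection check described earlier (the function name is ours):

    import numpy as np

    def camera_pose_from_invariants(K, v_x, x_s, alpha=1, beta=1):
        # Equation (8): build R = [r1 r2 r3] from the vanishing point v_x and
        # the point x_s (both homogeneous 3-vectors), up to the signs alpha, beta.
        def norm(v):
            return v / np.linalg.norm(v)
        r1 = alpha * norm(np.linalg.solve(K, v_x))
        r3 = beta * norm(np.linalg.solve(K, x_s))
        # Gram-Schmidt step so that r3 is orthogonal to r1 before the cross product.
        r3 = norm(r3 - np.dot(r3, r1) * r1)
        r2 = np.cross(r3, r1)
        return np.column_stack([r1, r2, r3])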
5 Experimental Results

In this section, we present experimental results of applying the proposed silhouette-based algorithm to reconstruct 3D object models from sparse and incomplete image sequences under circular motion, on both synthetic and real data sets.
Fig. 4. (a) Experimental images for reconstructing the bunny model. (b) Intersections of corresponding epipolar lines and the estimated rotation axis ls before and after minimization.

Table 1. Accuracy of the recovered camera parameters

error   Δe (%)   Δθx (°)   Δθy (°)   Δθz (°)   Δθi (°)   Δf (%)
avg.    0.565    0.301     0.046     0.080     0.279     1.822
std.    0.270    0.315     0.038     0.059     0.244     1.484
5.1 Synthetic Data

In this part, we used the Stanford bunny model to randomly generate 100 synthetic data sets to test the algorithm. Each set contains 12 images of size 800×600 pixels with an interval angle θi = 30°, which means each sequence is sparsely sampled and the methods based on SoR will fail. Example images of one set are depicted in Fig. 4(a). The focal length f is generated in the range 1500–5000 pixels, and the three camera pose angles θx, θy, and θz lie within −10° to −50°, −5° to 5°, and −5° to 5°, respectively. Two different frame intervals, 2 and 3, are chosen to extract the epipoles. The comparison between the recovered parameters and the ground truth is listed in Table 1. In Table 1, the angle errors are in degrees. The error of the focal length is expressed as a percentage, i.e., the difference divided by the ground truth. The error of the epipoles is expressed as the difference divided by the distance from the ground-truth epipole to the image center. The experimental results show that the proposed algorithm provides a good initial guess for the camera parameter optimization. Fig. 4(b) shows an example result of the epipole extraction stage before and after iterative minimization. The dashed line is the initial ls fitted to the 'x' points, which are the intersections induced by the initial epipoles as described in Fig. 2(b). The solid line is the estimated ls after minimization, and the intersections ('o' points) are close to ls. The obtained ls is very close to the ground truth, shown as a dash-dot line in the enlarged figure.

5.2 Real Data

In the experiments on real data, two image sequences are used; example images are shown in Fig. 5. Fig. 5 (top) is the Oxford dinosaur sequence, which contains 36 images of size 720×576 pixels. Fig. 5 (bottom) is a sequence of a jadeite object that contains 36 images of size 2000×1303 pixels; it is very difficult to establish feature correspondences for this kind of material. In both sequences, only the silhouette information is used for reconstruction. Different views of the reconstructed models are shown in Fig. 6 and Fig. 7, respectively. After the overall optimization, the RMS errors of the recovered interval angles in the two sequences are 0.192° and 0.247°, respectively. In addition, when only the first 18 images of a sequence are used, i.e., the image sequence is incomplete, the estimated results are similar. Due to space limitations, we cannot give the details of these experimental results.
Fig. 5. Example images of (top) the Oxford dinosaur sequence and (bottom) the jadeite sequence
Fig. 6. Different views of the reconstructed Oxford dinosaur model
Fig. 7. Different views of the reconstructed jadeite model
6 Conclusion

In this paper, we propose a novel silhouette-based algorithm for camera calibration and 3D reconstruction from sparse and incomplete image sequences of objects under circular motion with an unknown but constant interval angle. Unlike previous silhouette-based methods, the proposed algorithm requires neither dense image sequences nor known camera intrinsic parameters. Under the assumption of a constant interval angle, the epipoles of successive images remain constant and can be determined from silhouettes by a nonlinear optimization process. With more pairs of epipoles recovered from silhouettes, constraints on the interval angle and focal length can be formed to determine the camera parameters. Experimental results on synthetic and real data sets are presented to demonstrate the performance of the proposed algorithm.

Acknowledgments. This work was supported by the National Science Council, Taiwan, under grant NSC 95-2221-E-007-224.
References

1. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000)
2. Fitzgibbon, A.W., Cross, G., Zisserman, A.: Automatic 3D Model Construction for Turn-Table Sequences. In: Proceedings of SMILE Workshop on 3D Structure from Multiple Images of Large-Scale Environments, pp. 155–170 (1998)
3. Jiang, G., Tsui, H.T., Quan, L., Zisserman, A.: Single Axis Geometry by Fitting Conics. IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 1343–1348 (2002)
4. Jiang, G., Quan, L., Tsui, H.T.: Circular Motion Geometry Using Minimal Data. IEEE Transactions on Pattern Analysis and Machine Intelligence 26, 721–731 (2004)
5. Cao, X., Xiao, J., Foroosh, H., Shah, M.: Self-calibration from Turn-table Sequences in Presence of Zoom and Focus. Computer Vision and Image Understanding 102, 227–237 (2006)
6. Mendonca, P.R.S., Cipolla, R.: Estimation of Epipolar Geometry from Apparent Contours: Affine and Circular Motion Cases. In: Proceedings of Computer Vision and Pattern Recognition, pp. 9–14 (1999)
7. Mendonca, P.R.S., Wong, K.-Y.K., Cipolla, R.: Epipolar Geometry from Profiles under Circular Motion. IEEE Transactions on Pattern Analysis and Machine Intelligence 23, 604–616 (2001)
8. Zhang, H., Zhang, G., Wong, K.-Y.K.: Auto-Calibration and Motion Recovery from Silhouettes for Turntable Sequences. In: Proceedings of British Machine Vision Conference, pp. 79–88 (2005)
9. Zhang, G., Zhang, H., Wong, K.-Y.K.: 1D Camera Geometry and Its Application to Circular Motion Estimation. In: Proceedings of British Machine Vision Conference, pp. 67–76 (2006)
10. Matusik, W., Buehler, C., Raskar, R., Gortler, S.J., McMillan, L.: Image-Based Visual Hulls. In: Proceedings of SIGGRAPH, pp. 369–374 (2000)
Mirror Localization for Catadioptric Imaging System by Observing Parallel Light Pairs Ryusuke Sagawa, Nobuya Aoki, and Yasushi Yagi Institute of Scientific and Industrial Research, Osaka University, 8-1 Mihogaoka, Ibaraki-shi, Osaka, 567-0047, Japan {sagawa,aoki,yagi}@am.sanken.osaka-u.ac.jp
Abstract. This paper describes a method of mirror localization to calibrate a catadioptric imaging system. While the calibration of a catadioptric system includes the estimation of various parameters, we focus on the localization of the mirror. The proposed method estimates the position of the mirror by observing pairs of parallel lights, which are projected from various directions. Although some earlier methods for calibrating catadioptric systems assume that the system is single viewpoint, which is a strong restriction on the position and shape of the mirror, our method does not restrict the position and shape of the mirror. Since the constraint used by the proposed method is that the relative angle of two parallel lights is constant with respect to the rigid transformation of the imaging system, we can omit both the translation and rotation between the camera and calibration objects from the parameters to be estimated. Therefore, the estimation of the mirror position by the proposed method is independent of the extrinsic parameters of a camera. We compute the error between the model of the mirror and the measurements, and then estimate the position of the mirror by minimizing this error. We test our method using both simulation and real experiments, and evaluate the accuracy thereof.
1 Introduction

For various applications, e.g. robot navigation, surveillance and virtual reality, a special field of view is desirable to accomplish the task. For example, omnidirectional imaging systems [1,2,3] are widely used in various applications. One of the main methods to obtain a special field of view is to construct a catadioptric imaging system, which observes rays reflected by mirrors. By using various shapes of mirrors, different fields of view are easily obtained. There are two types of catadioptric imaging systems: central and noncentral. The former has a single effective viewpoint, and the latter has multiple ones. Though central catadioptric systems have the advantage that the image can be transformed to a perspective projection image, they place strong restrictions on the shape and position of the mirror. For example, it is necessary to use a telecentric camera and a parabolic mirror whose axis is parallel to the axis of the camera. Thus, misconfiguration can be the reason that a catadioptric system is not a central one. To obtain more flexible fields of view, several noncentral systems [4,5,6,7,8] have been proposed for various purposes. For geometric analysis with catadioptric systems, it is necessary to calibrate both camera and mirror parameters. Several methods of calibration have been proposed for
central catadioptric systems. Geyer and Daniilidis [9] have used three lines to estimate the focal length, mirror center, etc. Ying and Hu [10] have used lines and spheres to calibrate the parameters. Mei and Rives [11] have used a planar marker to calibrate the parameters, based on the calibration of a perspective camera [12]. However, since these methods assume that the system has a single viewpoint, they cannot be applied to noncentral systems. On the other hand, several methods have also been proposed to calibrate noncentral imaging systems. Aliaga [13] has estimated the parameters of a catadioptric system with a perspective camera and a parabolic mirror using known 3D points. Strelow et al. [14] have estimated the position of a misaligned mirror using known 3D points. Micusík and Pajdla [15] have fitted an ellipse to the contour of the mirror and calibrated a noncentral camera by approximating it with a central camera. Mashita et al. [16] have used the boundary of a hyperboloidal mirror to estimate the position of a misaligned mirror. However, all of these methods are restricted to omnidirectional catadioptric systems. There are also some approaches for calibrating more general imaging systems. Swaminathan et al. [17] computed the parameters of noncentral catadioptric systems by estimating a caustic surface from known camera motion and the point correspondences of unknown scene points. Grossberg and Nayar [18] proposed a general imaging model and computed the ray direction for each pixel using two planes. Sturm and Ramalingam [19] calibrated the camera of a general imaging model by using unknown camera motion and a known object. Since these methods estimate both the internal and external parameters of the system, measurement error affects the estimates of all of the parameters.

In this paper, we focus on the localization of the mirror in the calibration of catadioptric systems. Our assumptions about the other parameters are as follows:
– The intrinsic parameters, such as the focal length and principal point of the camera, are known.
– The shape of the mirror is known.
The only remaining parameters to be estimated are the translation and rotation of the mirror with respect to the camera. If we calibrate the parameters of an imaging system by observing some markers, it is necessary to estimate the extrinsic parameters, such as rotation and translation, with respect to the marker. If we include these as parameters to be estimated, the calibration results are affected by them. We previously proposed a method to localize a mirror by observing a parallel light [20], which estimates the mirror parameters independently of the extrinsic parameters. Since the translation between a marker and the camera is omitted from the estimation, this method reduces the number of parameters. The method, however, needs a rotation table to observe a parallel light from various directions. Instead of using a rotation table, the method proposed in this paper observes pairs of parallel lights as calibration markers. We can therefore omit both rotation and translation from the estimation and reduce the number of parameters that are affected by measurement error in the calibration.

We describe the geometry of the projection of two parallel lights in Section 2. Next, we propose an algorithm for mirror localization using pairs of parallel lights in Section 3. We test our method in Section 4 and finally summarize this paper in Section 5.
Fig. 1. Projecting a parallel light onto a catadioptric imaging system

Fig. 2. Projecting a pair of parallel lights with two different camera positions and orientations
2 Projecting a Pair of Parallel Lights onto a Catadioptric Imaging System

In this section, we first explain the projection of a parallel light, which depends only on the rotation of a camera. Next, we describe the projection of a pair of parallel lights and the constraint on the relative angle between them.

2.1 Projecting a Parallel Light

First, we explain the projection of a parallel light. Figure 1 shows the projection of a parallel light onto a catadioptric system. Since a parallel light is not a single ray, but a bunch of parallel rays, such as sunlight, it illuminates the whole catadioptric system. v is the vector of the incident parallel light. m is the vector at the point onto which the light is projected. m is computed as follows:

m = K⁻¹ p̂,    (1)

where p̂ = (px, py, 1) is the point onto which the light is projected in the homogeneous image coordinate system. K is a 3×3 matrix that represents the intrinsic parameters of the camera. Although the incident light is reflected at every point on the mirror surface where the mirror is illuminated, the reflected light must go through the origin of the camera to be observed. Since the angle of the incident light is the same as that of the reflected light, the camera only observes the ray reflected at a point x. Therefore, the equation of projection becomes

−v = m/‖m‖ + 2 (NR,t(x) · (−m/‖m‖)) NR,t(x),    (2)

where NR,t(x) is the normal vector of the mirror surface at the point x. R and t are the rotation and translation, respectively, of the mirror relative to the camera.

2.2 Projecting a Pair of Parallel Lights

Since the direction of the incident parallel light is invariant even if it is observed from different camera positions, the direction of the light relative to the camera depends only
on the orientation of the camera. Now, if we observe two parallel lights simultaneously, the relative angle between these parallel lights does not change irrespective of the camera orientation. Figure 2 shows a situation in which a pair of parallel lights is projected onto a catadioptric system from two different camera positions and orientations. The relative position of the mirror is fixed to the camera. The two parallel lights are reflected at the points x1, x2, x2′ and x1′, respectively, and the reflected rays are projected onto the points m1, m2, m2′ and m1′ in the image plane. Since the relative angle between the pair of parallel lights is invariant, we obtain the following constraint:

v1 · v2 = v1′ · v2′,    (3)

where v1′ and v2′ are represented in a different camera coordinate system from v1 and v2, all of which are computed by (2).
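To make equations (1)–(3) concrete, the following sketch computes the incident vector for one image point and notes the relative-angle check of a pair. It is an illustration of ours only (NumPy); the surface intersection point and its unit normal N are assumed to be already known for the given pixel, and the names are not from the paper.

import numpy as np

def incident_vector(p, K, N):
    """Eqs. (1)-(2): viewing ray m = K^-1 p_hat for pixel p = (px, py),
    reflected about the unit mirror normal N at the intersection point."""
    m = np.linalg.inv(K) @ np.array([p[0], p[1], 1.0])
    u = m / np.linalg.norm(m)
    return -(u + 2.0 * np.dot(N, -u) * N)      # solve Eq. (2) for v

# Constraint (3): the dot product of the two incident vectors of a pair stays
# the same no matter how the whole camera-mirror rig is oriented, e.g.
# v1, v2 = incident_vector(p1, K, N1), incident_vector(p2, K, N2)
# assert abs(np.dot(v1, v2) - np.cos(alpha)) < tol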
3 Mirror Localization Using Pairs of Parallel Lights

This section describes an algorithm to estimate the mirror position by observing pairs of parallel lights.

3.1 Estimating Mirror Position by Minimizing Relative Angle Error

By using the constraint (3), we estimate the mirror position by minimizing the following cost function:

E1 = Σ_i ‖v_i1 · v_i2 − cos αi‖²,    (4)

where i is the index of the pair and αi is the angle of the i-th pair. If we do not know the angle between the parallel lights, we can use

E2 = Σ_{i≠j} ‖v_i1 · v_i2 − v_j1 · v_j2‖².    (5)

The parameters of these cost functions are R and t, which are the rotation and translation, respectively, of the mirror relative to the camera. Since minimizing (4) or (5) is a nonlinear minimization problem, we estimate R and t by a nonlinear minimization method, such as the Levenberg-Marquardt algorithm. Our algorithm can then be described as follows:

1. Set initial parameters of R and t.
2. Compute the intersecting point x for each image point m.
3. Compute the normal vector NR,t(x) for each intersecting point x.
4. Compute the incident vector v for each intersecting point x.
5. Compute the cost function (4) or (5).
6. Update R and t by a nonlinear minimization method.
7. Repeat steps 2–6 until convergence.
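A compact way to realize steps 1–7 is to hand the residuals of Eq. (4) or Eq. (5) to an off-the-shelf Levenberg-Marquardt solver. The sketch below is an assumption-laden illustration, not the authors' implementation: it reuses the hypothetical incident_vector() helper from Section 2, the mirror object with its normal_at() lookup is a placeholder for the ray-mirror intersection of steps 2 and 3, and the large-penalty handling of rays that miss the mirror is omitted.

import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def residuals(params, pairs, K, mirror, cos_alpha=None):
    # params: 3 axis-angle rotation + 3 translation parameters of the mirror
    R = Rotation.from_rotvec(params[:3]).as_matrix()
    t = params[3:]
    dots = []
    for p1, p2 in pairs:
        N1 = mirror.normal_at(p1, R, t, K)          # steps 2-3 (placeholder)
        N2 = mirror.normal_at(p2, R, t, K)
        v1, v2 = incident_vector(p1, K, N1), incident_vector(p2, K, N2)
        dots.append(np.dot(v1, v2))
    dots = np.asarray(dots)
    if cos_alpha is not None:                       # E1, Eq. (4)
        return dots - cos_alpha
    i, j = np.triu_indices(len(dots), k=1)          # E2, Eq. (5)
    return dots[i] - dots[j]

# result = least_squares(residuals, x0, method='lm',
#                        args=(pairs, K, mirror, cos_alpha))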
Fig. 3. Two collimators generate a pair of parallel lights. Each collimator consists of a light source, a pinhole and a concave parabolic mirror.
In the current implementation, the initial parameters are given by the user. We set them so that every image point m has an intersecting point x. As described in Section 3.2, computing the intersecting points is expensive if the mirror surface is represented by a mesh model. Therefore, we describe a GPU-based method for steps 2–4 that directly computes the incident vectors to reduce the computational time. For updating the parameters, we numerically compute the derivatives required by the Levenberg-Marquardt algorithm. To ensure that every image point keeps an intersecting point, if an image point has no intersecting point we penalize it with a large value instead of evaluating (4) or (5).

3.2 Computing the Incident Vector

The important step in this algorithm is the computation of the incident vector v, for which there are two methods. The first computes x by solving a system of equations. If the mirror surface is represented as a parametric surface, x is obtained by simultaneously solving the equations of the viewing ray and the mirror surface, because the intersecting point x lies on both. Once x is computed, the normal vector NR,t(x) is obtained from the cross product of two tangential vectors of the mirror surface at x, and the incident vector v is then computed by (2). However, solving the simultaneous equations is expensive if the mirror surface has an intricate shape or is non-parametric. If the mirror surface is represented as a mesh model, it is necessary to search for the intersecting point of each image point by solving the equations for each facet of the model.

To accommodate any mirror shape, the second method computes x by projecting the mirror shape onto the image plane of the camera using R, t and the intrinsic parameters K. Since this operation is equivalent to rendering the mirror shape onto the image plane, it can be executed easily using computer graphics techniques if the mirror shape is approximated by a mesh model. Furthermore, with recent graphics hardware the incident vector v can be computed directly during the rendering process. The source code to compute v for every pixel is shown in Appendix A.

3.3 Generating a Pair of Parallel Lights

Our proposed method requires the observation of parallel lights. A parallel light can be obtained by one of the following two approaches:
– Use a feature point of a distant marker.
– Generate a collimated light.
In the former approach, a small translation of the camera can be ignored because it is much smaller than the distance to the marker. Thus, the ray vector from the feature point is invariant even if the camera moves. The issue with this approach is lens focus: when the focus setting of the camera is not at infinity, the image must be acquired with a minimum aperture and a long shutter time to avoid blur. Instead of using distant points to obtain two parallel lights, vanishing points can be used; some methods [21,22,23] were proposed along these lines for the calibration of a perspective camera. In the latter approach, a parallel light is generated by a collimator. A simple method is to use a concave parabolic mirror and a point-light source. Figure 3 shows an example of such a system. By placing pinholes in front of the light sources, they become point-light sources, and since the pinholes are placed at the foci of the parabolic mirrors, the reflected rays are parallel. The illuminated area is indicated in yellow in the figure. The advantage of this approach is that a small and precise system can be constructed, although an optical apparatus is required.
4 Experiments

4.1 Estimating Accuracy by Simulation

We first evaluate the accuracy of our method by simulation. In this simulation, we estimate the position of a parabolic mirror relative to a perspective camera. The intrinsic parameter matrix K of the perspective camera is represented as

K = [ f  0  cx
      0  f  cy
      0  0  1 ].    (6)

The shape of the mirror is represented as z = (x² + y²)/(2h), where h is the radius of the paraboloid. In this experiment, the image size is 512×512 pixels and f = 900, cx = cy = 255 and h = 9.0. The ground truths of the rotation and translation of the mirror are R = I and t = (0, 0, 50), respectively. We tested two relative angles between the two incident parallel lights, namely 30 and 90 degrees. 24 pairs of incident lights are used, generated by rotating the camera and mirror around the y- and z-axes. We estimate R and t after adding noise to the positions of the input points. The added Gaussian noise has standard deviations of 0, 0.1, 0.5, and 1.0 pixels. As for E1, since the relative angle α between the two lights has to be given, we also add noise to α with standard deviations of 0, 0.1, and 0.5 degrees. To evaluate the accuracy of the estimated parameters, we compute the root-mean-square (RMS) errors between the input points and the reprojections of the incident lights.

Figure 4 shows the RMS errors of E1 and E2. It is clear that the results obtained with a relative angle of 90 degrees are better than those for 30 degrees. A reason for this may be that the constraint is weaker when the relative angle is smaller and the projected points are close to each other. The error depends mainly on the noise of the input points, while the effect of the noise of the relative angle is small. Since the accuracy of E2 is similar to that of E1, we can apply our method even if we do not know the relative angle. Next, we evaluate the error when the intrinsic parameter matrix K differs from the ground truth. Figure 5 shows the RMS errors of E1 with varying values of f and cx. The
Fig. 4. The RMS errors with respect to the noise of image points

Fig. 5. The RMS errors with respect to the error of the intrinsic parameters
Fig. 6. Compound parabolic mirrors attached to a camera

Fig. 7. An example image from compound parabolic mirrors
other parameters are fixed to the ground truth. The horizontal axis is the difference between the ground truth and f or cx. The results show that the reprojection error of the incident lights is significantly affected by cx, while the effect of f is small. This shows that the principal point (cx, cy) must be computed accurately before minimizing E1, and that an error in f is more tolerable than an error in the principal point.

4.2 Localizing Mirrors from Real Images

In the next experiment, we compute the mirror positions of a catadioptric system with compound parabolic mirrors [24,25], as shown in Figure 6. Figure 7 shows an example of an image obtained with this system. Our system has 7 parabolic mirrors and a perspective camera (PointGrey Scorpion), which has 1600 × 1200 pixels and a field of view of about 22.6°. The lens distortion is calibrated by the method described in [26], and the intrinsic parameters of the camera are already calibrated. With this setup, the catadioptric system is not a single-viewpoint system. The radii h of the center mirror and the side mirrors are 9.0mm and 4.5mm, respectively. The diameter and height of the center mirror are 25.76mm and 9.0mm, respectively, and the diameter and height of the side mirrors are 13.0mm and 4.5mm, respectively. The diameters of the center and side mirrors projected onto the image are 840 and 450 pixels, respectively.
Fig. 8. A distant point used as a parallel light source

Fig. 9. The mirror positions estimated by the proposed method
Table 1. The RMS errors of (7) are computed using the estimated mirror positions

Mirror   Number of Pairs   RMS Error (pixels)
Center   78                0.84
Side1    21                0.87
Side2    45                1.05
Side3    45                1.16
Side4    21                0.59
To localize the mirrors from real images, we experimented with two ways of acquiring parallel lights, namely distant markers and collimated lights. In the first case, we chose points on a distant object in the image. Figure 8 shows the chosen point, which is a point on a building about 260 meters away from the camera. We rotated the catadioptric system and obtained 78 pairs of parallel lights. The relative angles of the pairs of parallel lights vary between 15 degrees and 170 degrees. We estimated the positions of the center and the four side mirrors independently. Figure 9 shows the estimated mirror positions, obtained by rendering the mirror shapes from the viewpoint of the camera. Since we do not know the ground truth of the mirror position and the incident light vectors, we evaluate the accuracy of the estimated parameters by the following criterion. If the observed points of a pair of parallel lights are p1 and p2, and the corresponding incident vectors computed by (2) are v1 and v2, respectively, the criterion is

min_q ‖p2 − q‖²  subject to  v_q · v1 = cos α,    (7)

where v_q is the incident vector corresponding to an image point q. This criterion measures the error in pixels. Table 1 shows the estimated results. Since some of the lights are occluded by the other mirrors, the number of lights used for calibration varies for each mirror. The error is computed as the RMS of (7). Since the position of a feature point is assumed to have about 0.5 pixels of error, the error obtained with the estimated mirror positions is reasonable. Next, we tested our method by observing collimated lights generated by the system shown in Figure 3. The relative angle of the two collimated lights is 87.97 degrees.
struct VS_OUT {
    float4 Pos : POSITION;
    float3 Tex : TEXCOORD0;
};

VS_OUT VS(float4 Pos : POSITION, float4 Nor : NORMAL)
{
    VS_OUT Out = (VS_OUT)0;
    float3 tmpPos, tmpNor, v;
    float a;
    tmpPos = normalize(mul(Pos, T));    // unit viewing direction to the vertex (camera coords)
    tmpNor = mul(Nor, R);               // vertex normal rotated into camera coords
    a = dot(-tmpPos, tmpNor);
    v = tmpPos + 2 * a * tmpNor;        // reflected (incident) vector, cf. Eq. (2)
    Out.Pos = mul(Pos, KT);             // project the vertex with KT = K[R|t]
    Out.Tex = normalize(v);             // interpolated by the rasterizer
    return Out;
}

float4 PS(VS_OUT In) : COLOR
{
    float4 Col = 0;
    Col.rgb = In.Tex.xyz;               // output the incident vector as a color
    return Col;
}
Fig. 10. Top: an example of the acquired image. Bottom: the image of two collimated lights after turning off the room light.

Fig. 11. The source code for computing the incident vector in HLSL
We acquired 60 pairs of parallel lights. Figure 10 shows an example of an image, onto which two collimated lights are projected. In this experiment, we estimated the position of the center mirror. The RMS error of (7) is 0.35 pixels, which is smaller than that obtained using distant markers. This shows that the accuracy of the estimated results is improved by using the collimated lights.
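For completeness, the criterion of Eq. (7) can be approximated by a discrete search over image points near p2. The sketch below reuses the hypothetical incident_vector() helper introduced earlier and a placeholder lookup of per-pixel mirror normals; it only illustrates the evaluation and is not the authors' code.

import numpy as np

def criterion_error(p2, v1, cos_alpha, K, normals, tol=1e-3):
    """Eq. (7): distance from p2 to the nearest pixel q whose incident vector
    makes the known angle with v1. `normals` maps a pixel q to the mirror
    normal at its intersection point (placeholder for the rendering step)."""
    best = np.inf
    for q, N in normals.items():
        v_q = incident_vector(q, K, N)
        if abs(np.dot(v_q, v1) - cos_alpha) < tol:   # constraint of Eq. (7)
            best = min(best, np.linalg.norm(np.asarray(p2) - np.asarray(q)))
    return best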
5 Conclusion

This paper describes a method of mirror localization to calibrate a catadioptric imaging system; within that calibration, we focused on the localization of the mirror. By observing pairs of parallel lights, our method exploits the constraint that the relative angle between two parallel lights is invariant with respect to the translation and rotation of the imaging system. Since the translation and rotation between the camera and the calibration objects are omitted from the parameters, the only parameter to be estimated is the rigid transformation of the mirror. Our method estimates this rigid transformation by minimizing the error between the model of the mirror and the measurements. Since our method makes no assumptions about the mirror shape or its position, it can be applied to noncentral systems. If we compute the incident light vector by projecting the mirror shape onto the image, our method can accommodate any mirror shape. Finally, to validate the accuracy of our method, we tested it in simulation and in real experiments. For future work, we plan to apply the proposed method to various mirror shapes using the collimated lights and to analyze the best settings for the parallel lights.
References 1. Ishiguro, H., Yamamoto, M., Tsuji, S.: Omni-directional stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence 14(2), 257–262 (1992) 2. Yamazawa, K., Yagi, Y., Yachida, M.: Obstacle detection with omnidirectional image sensor hyperomni vision. In: IEEE The International Conference on Robotics and Automation, Nagoya, pp. 1062–1067. IEEE Computer Society Press, Los Alamitos (1995) 3. Nayar, S.: Catadioptric omnidirectional camera. In: Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 482–488. IEEE Computer Society Press, Los Alamitos (1997) 4. Gaspar, J., Decco, C., Okamoto Jr., J., Santos-Victor, J.: Constant resolution omnidirectional cameras. In: Proc. The Third Workshop on Omnidirectional Vision, pp. 27–34 (2002) 5. Hicks, R., Perline, R.: Equi-areal catadioptric sensors. In: Proc. The Third Workshop on Omnidirectional Vision, pp. 13–18 (2002) 6. Swaminathan, R., Nayar, S., Grossberg, M.: Designing Mirrors for Catadioptric Systems that Minimize Image Errors. In: Fifth Workshop on Omnidirectional Vision (2004) 7. Kondo, K., Yagi, Y., Yachida, M.: Non-isotropic omnidirectional imaging system for an autonomous mobile robot. In: Proc. 2005 IEEE International Conference on Robotics and Automation, Barcelona, Spain, IEEE Computer Society Press, Los Alamitos (2005) 8. Kojima, Y., Sagawa, R., Echigo, T., Yagi, Y.: Calibration and performance evaluation of omnidirectional sensor with compound spherical mirrors. In: Proc. The 6th Workshop on Omnidirectional Vision, Camera Networks and Non-classical cameras (2005) 9. Geyer, C., Daniilidis, K.: Paracatadioptric camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(5), 687–695 (2002) 10. Ying, X., Hu, Z.: Catadioptric camera calibration using geometric invariants. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(10), 1260–1271 (2004) 11. Mei, C., Rives, P.: Single view point omnidirectional camera calibration from planar grids. In: Proc. 2007 IEEE International Conference on Robotics and Automation, Rome, Italy, pp. 3945–3950. IEEE Computer Society Press, Los Alamitos (2007) 12. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2004) 13. Aliaga, D.: Accurate catadioptric calibration for realtime pose estimation of room-size environments. In: Proc. IEEE International Conference on Computer Vision, vol. 1, pp. 127–134. IEEE Computer Society Press, Los Alamitos (2001) 14. Strelow, D., Mishler, J., Koes, D., Singh, S.: Precise omnidirectional camera calibration. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 689–694. IEEE Computer Society Press, Los Alamitos (2001) 15. Micus´ık, B., Pajdla, T.: Autocalibration and 3d reconstruction with non-central catadioptric cameras. In: Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington US, vol. 1, pp. 58–65. IEEE Computer Society Press, Los Alamitos (2004) 16. Mashita, T., Iwai, Y., Yachida, M.: Calibration method for misaligned catadioptric camera. In: Proc. The Sixth Workshop on Omnidirectional Vision (2005) 17. Swaminathan, R., Grossberg, M., Nayar, S.: Caustics of catadioptric camera. In: Proc. IEEE International Conference on Computer Vision, vol. 2, pp. 2–9. IEEE Computer Society Press, Los Alamitos (2001) 18. Grossberg, M., Nayar, S.: The raxel imaging model and ray-based calibration. 
International Journal on Computer Vision 61(2), 119–137 (2005) 19. Sturm, P., Ramalingam, S.: A generic camera calibration concept. In: Proc. European Conference on Computer Vision, Prague, Czech, vol. 2, pp. 1–13 (2004)
20. Sagawa, R., Aoki, N., Mukaigawa, Y., Echigo, T., Yagi, Y.: Mirror localization for a catadioptric imaging system by projecting parallel lights. In: Proc. IEEE International Conference on Robotics and Automation, Rome, Italy, pp. 3957–3962. IEEE Computer Society Press, Los Alamitos (2007) 21. Caprile, B., Torre, V.: Using vanishing points for camera calibration. International Journal of Computer Vision 4(2), 127–140 (1990) 22. Daniilidis, K., Ernst, J.: Active intrinsic calibration using vanishing points. Pattern Recognition Letters 17(11), 1179–1189 (1996) 23. Guillemaut, J., Aguado, A., Illingworth, J.: Using points at infinity for parameter decoupling in camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(2), 265–270 (2005) 24. Mouaddib, E., Sagawa, R., Echigo, T., Yagi, Y.: Two or more mirrors for the omnidirectional stereovision? In: Proc. of The second IEEE-EURASIP International Symposium on Control, Communications, and Signal Processing, Marrakech, Morocco, IEEE Computer Society Press, Los Alamitos (2006) 25. Sagawa, R., Kurita, N., Echigo, T., Yagi, Y.: Compound catadioptric stereo sensor for omnidirectional object detection. In: Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems, Sendai, Japan, vol. 2, pp. 2612–2617 (2004) 26. Sagawa, R., Takatsuji, M., Echigo, T., Yagi, Y.: Calibration of lens distortion by structuredlight scanning. In: Proc. 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, Edmonton, Canada, pp. 1349–1354 (2005)
A Source Code for Rendering Incident Vectors

The reflected vector for each pixel is computed using the source code in Figure 11. It is written in the High-Level Shader Language (HLSL) and executed by graphics hardware. The shape of the mirror is represented by a mesh model that consists of vertices and triangles. The inputs of the vertex shader (VS) are the positions of the vertices of the mirror (Pos) and the normal vectors of the vertices (Nor). R, T and KT are constant matrices given by the main program: R is the rotation matrix of the mirror, T = [R|t], where t is the translation vector of the mirror, and KT is the projection matrix computed as KT = K[R|t], where K is the intrinsic matrix of the camera. The reflected vector v is computed for each vertex. Since it is interpolated by the rasterizer of the graphics hardware, the pixel shader (PS) outputs the reflected vector for each pixel.
Calibrating Pan-Tilt Cameras with Telephoto Lenses Xinyu Huang, Jizhou Gao, and Ruigang Yang Graphics and Vision Technology Lab (GRAVITY) Center for Visualization and Virtual Environments University of Kentucky, USA {xhuan4,jgao5,ryang}@cs.uky.edu http://www.vis.uky.edu/∼gravity
Abstract. Pan-tilt cameras are widely used in surveillance networks. These cameras are often equipped with telephoto lenses to capture objects at a distance. Such a camera makes full-metric calibration more difficult since the projection with a telephoto lens is close to orthographic. This paper discusses the problems caused by pan-tilt cameras with long focal length and presents a method to improve the calibration accuracy. Experiments show that our method reduces the re-projection errors by an order of magnitude compared to popular homography-based approaches.
1 Introduction

A surveillance system usually consists of several inexpensive wide field of view (WFOV) fixed cameras and pan-tilt-zoom (PTZ) cameras. The WFOV cameras are often used to provide an overall view of the scene while a few zoom cameras are controlled by a pan-tilt unit (PTU) to capture close-up views of the subject of interest. The control of a PTZ camera is typically done manually using a joystick. However, in order to automate this process, calibration of the entire camera network is necessary. One of our driving applications is to capture and identify subjects using biometric features such as iris and face over a long range. A high-resolution camera with a narrow field of view (NFOV) and a telephoto lens is used to capture the rich details of biometric patterns. For example, a typical iris image should have 100 to 140 pixels in iris radius to obtain good iris recognition performance [1]. That means that, in order to capture the iris image over three meters using a video camera (640×480), we have to use a 450mm lens, assuming the sensor size is 4.8 × 3.6 mm. If we want to capture both eyes (e.g., the entire face) at once, then the face image resolution could be as high as 5413×4060 pixels, well beyond the typical resolution of a video camera. In order to provide adequate coverage over a practical working volume, PTZ cameras have to be used. The simplest way to localize the region of interest (ROI) is to pan and tilt the PTZ camera iteratively until the region is approximately in the center of the field of view [2]. This is time-consuming and only suitable for still objects. However, if the PTZ cameras are fully calibrated, including the axes of rotation, the ROI can be localized rapidly with a single pan and tilt operation. In this paper we discuss the degeneracy caused by cameras with telephoto lenses and develop a method to calibrate such a system with significantly improved accuracy. The remainder of this paper is organized as follows. We first briefly overview the related work in Section 2. In Section 3, we describe our system and a calibration
method for long focal length cameras. Section 4 contains experimental results. Finally, a summary is given in Section 5. We also present in the appendix a simple method to calculate the pan and tilt angles when the camera coordinate system is not aligned with the pan-tilt coordinate system.
2 Related Work

It is generally considered that camera calibration reached its maturity in the late 90's, and a great deal of work has been done in this area. In the photogrammetry community, a calibration object with known and accurate geometry is required. With markers of known 3D positions, camera calibration can be done efficiently and accurately (e.g., [3], [4]). In computer vision, a planar pattern such as a checkerboard is often used to avoid the requirement of a 3D calibration object with good precision (e.g., [5], [6]). These methods estimate intrinsic and extrinsic parameters, including radial distortions, from homographies between the planar pattern at different positions and the image plane. Self-calibration estimates fixed or varying intrinsic parameters without the knowledge of special calibration objects and with unknown camera motions (e.g., [7], [8]). Furthermore, self-calibration can compute a metric reconstruction from an image sequence. Besides the projective camera model, the affine camera model, in which the camera center lies on the plane at infinity, was proposed in [9,10]. Quan presents a self-calibration method for an affine camera in [11]. However, the affine camera model should not be used when feature points lie at many different depths [12].

For the calibration of PTZ cameras, Hartley proposed a self-calibration method for stationary cameras with pure rotations in [13]. Agapito extended this method in [14] to deal with varying intrinsic parameters of a camera. Sinha and Pollefeys proposed a method for calibrating pan-tilt-zoom cameras in outdoor environments in [15]. Their method determines intrinsic parameters over the full range of zoom settings. These methods approximate PTZ cameras as rotating cameras without translations, since the translations are very small compared to the distance of scene points. Furthermore, these methods are based on computing the absolute conic from a set of inter-image homographies. In [16], Wang and Kang present an error analysis of intrinsic parameters caused by translation. They suggest self-calibrating using distant scenes, larger rotation angles, and more diverse homographies in order to reduce the effects of camera translation. The work most similar to ours is proposed in [17,18]. In these papers, Davis and Chen proposed a general pan-tilt model in which the pan and tilt axes are arbitrary axes in 3D space. They used a bright LED to create a virtual 3D calibration object and a Kalman filter tracking system to solve the synchronization between different cameras. However, they did not discuss the calibration problems caused by telephoto lenses. Furthermore, their method cannot easily be applied to digital still cameras, with which it would be tedious to capture hundreds or even thousands of frames.
3 Method

In this section, we first briefly describe the purpose of our system. Then, we discuss the calibration of long focal length cameras in detail.
3.1 System Description

The goal of our system is to capture face or iris images over a long range with a resolution high enough for biometric recognition. As shown in Fig. 1, a prototype of our system consists of two stereo cameras and a NFOV high-resolution (6M pixels) still camera. The typical focal length for the pan-tilt camera is 300mm, while previous papers dealing with pan-tilt cameras have reported the use of lenses between 1mm and 70mm. When a person walks into the working area of the stereo cameras, facial features are detected in each frame and their 3D positions can be easily determined by triangulation. The pan-tilt camera is steered so that the face is in the center of the observed image. A high-resolution image with enough biometric detail can then be captured. Since the field of view of the pan-tilt camera is only about 8.2 degrees, the ROI (e.g., the eye or entire face) is likely to fall outside the field of view if the calibration is not accurate enough.
Fig. 1. System setup with two WFOV cameras and a pan-tilt camera. Labels: 1) pan-tilt camera (Nikon 300mm), 2) laser pointer, 3) stereo camera (4mm), 4) pan axis, 5) tilt axis, 6) flash.
3.2 Calibration

In [5], given one homography H = [h1, h2, h3] between a planar pattern at one position and the image plane, two constraints on the absolute conic ω can be formulated as in Eq. (1):

h1^T ω h2 = 0,
h1^T ω h1 = h2^T ω h2.    (1)
By imaging the planar pattern n times at different orientations, a linear system Ac = 0 is formed, where A is a 2n × 6 matrix built from the observed homographies and c represents ω as a 6 × 1 vector. Once c is solved, the intrinsic matrix K can be recovered by Cholesky factorization since ω = (KK^T)⁻¹. Equivalently, one could rotate the camera instead of moving a planar pattern. This is the key idea in the self-calibration of pan-tilt cameras ([15], [13]). First, inter-image homographies are computed robustly. Second, the absolute conic ω is estimated by a
linear system ω = (H^i)^{-T} ω (H^i)^{-1}, where H^i is the homography between each view i and a reference view. Then, Cholesky decomposition of ω is applied to compute the intrinsic matrix K. Furthermore, a Maximum Likelihood Estimation (MLE) refinement can be applied using the above closed-form solution as the initial guess. However, the difference between the closed-form solution and the MLE refinement is small [12]. As mentioned in [5], a second homography does not provide any new constraints if it is parallel to the first one. In order to avoid this degeneracy and generate an over-determined system, the planar pattern has to be imaged many times with different orientations. This is also true for the self-calibration of rotating cameras. If conditions are near singular, the matrix A formed from the observed homographies will be ill-conditioned, making the solution inaccurate. Generally, the degeneracy is easy to avoid when the focal length is short; for example, we only need to change the orientation for each position of the planar pattern. However, this is not true for long focal length cameras. When the focal length increases and the field of view decreases, the camera's projection becomes less projective and more orthographic. The observed homographies then contain large depth ambiguities that make the matrix A ill-conditioned, and the solution becomes very sensitive to small perturbations. If the projection is purely orthographic, the observed homographies cannot provide any depth information no matter where we put the planar pattern or how we rotate the camera. In summary, traditional calibration methods based on observed homographies are in theory not accurate for long focal length cameras. We will also demonstrate this point with real data in the experiments section.
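The ill-conditioning argument can be checked numerically: stacking the two constraints of Eq. (1) for every observed homography gives the 2n × 6 system Ac = 0, whose condition number explodes as the projection approaches orthographic. The sketch below (NumPy, with a hypothetical list Hs of 3×3 homographies) illustrates this standard construction and is not the authors' code.

import numpy as np

def v_ij(H, i, j):
    # Row vector such that h_i^T * omega * h_j = v_ij . c,
    # with c = (w11, w12, w22, w13, w23, w33) parameterizing omega
    hi, hj = H[:, i], H[:, j]
    return np.array([hi[0]*hj[0],
                     hi[0]*hj[1] + hi[1]*hj[0],
                     hi[1]*hj[1],
                     hi[2]*hj[0] + hi[0]*hj[2],
                     hi[2]*hj[1] + hi[1]*hj[2],
                     hi[2]*hj[2]])

def conic_system(Hs):
    rows = []
    for H in Hs:
        rows.append(v_ij(H, 0, 1))                  # h1^T w h2 = 0
        rows.append(v_ij(H, 0, 0) - v_ij(H, 1, 1))  # h1^T w h1 = h2^T w h2
    return np.vstack(rows)

# A = conic_system(Hs)
# print(np.linalg.cond(A))        # grows rapidly for telephoto lenses
# c = np.linalg.svd(A)[2][-1]     # least-squares solution of Ac = 0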
Fig. 2. Pan-tilt camera model
The best way to calibrate a long focal length camera is to create 2D-3D correspondences directly. One could use a 3D calibration object, but this approach is not only costly but also impractical given the large working volume we would like to cover. In our system, 3D feature points are triangulated by the stereo cameras, so the ambiguities of the methods based on observed homographies are avoided. With a set of known 2D and 3D features, we can estimate the intrinsic parameters and the relative
transformation between the camera and the pan-tilt unit. The pan-tilt model is shown in Fig. 2 and is written as

x = K R*⁻¹ Rtilt Rpan R* [R|t] X,    (2)

where K is the intrinsic matrix, and R and t are the extrinsic parameters of the pan-tilt camera at the reference view, i.e., pan = 0 and tilt = 0 in our setting. X and x are 3D and 2D feature points. Rpan and Rtilt are rotation matrices around the pan and tilt axes. R* is the rotation matrix between the coordinate systems of the camera and the pan-tilt unit. We did not model the pan and tilt axes as two arbitrary axes in 3D space as in [17], since the translation between the two coordinate systems is very small (usually only a few millimeters in our setting) and a full-scale simulation shows that adding the translational offset yields little accuracy improvement. Based on the pan-tilt model in Eq. (2), we can estimate the complete set of parameters using MLE to minimize the re-projected geometric distances. This is given by the following functional:
n m
xij − x ˆij (K, R∗ , Rpan , Rtilt , R, t, Xij )2
(3)
i=1 j=1
The method of acquiring of calibration data in [17] is not applicable in our system because that our pan-tilt camera is not a video camera that could capture a video sequence of LED points. Typically a commodity video camera does not support both long focal length and high-resolution image. Here we propose another practical method to acquire calibration data from a still camera. We attach a laser pointer close enough to the pan-tilt camera as shown in Fig.1. The laser’s reflection on scene surfaces generates a 3D point that can be easily tracked. The laser pointer rotates with the pan-tilt camera simultaneously so that its laser dot can be observed by the pan-tilt camera at most of pan and tilt settings. In our set-up, we mount the laser pointer on the tilt axis. A white board is placed at several positions between the near plane and the far plane within the working area of two wide-view fixed cameras. For each pan and tilt step, three images are captured by the pan-tilt camera and two fixed cameras respectively. A 3D point is created by triangulation from the two fixed cameras. The white board does not need to be very large since we can always move it around during the data acquisition process so that 3D points cover the entire working area. The calibration method is summarized as Algorithm 1. In order to compare our method with methods based on observed homographies, we formulate a calibration framework similar to the methods in [15] and [12]. The algorithm is summarized in Algorithm 2. An alternative of step 4 in Algorithm 2 is to build a linear system ω = (H i )−T ω(H i )−1 and solve ω. Intrinsic matrix K is solved by Cholesky decomposition ω = KK T . However, this closed-form solution often fails since infinite homography is hard to estimate with narrow fields of view. After calibration step, R∗ , intrinsic and extrinsic parameters of three cameras are known. Hence, We can solve the pan and tilt angles easily (see Appendix A for the details) for almost arbitrary 3D points triangulated by stereo cameras.
132
X. Huang, J. Gao, and R. Yang
Algorithm 1. Our calibration method for a pan-tilt camera with a long focal length Input: observed laser point images by three cameras. Output: intrinsic matrix K extrinsic parameters R, t, and rotation matrix R∗ between coordinates of camera and PTU. 1. Calibrate the stereo cameras and reference view of the pan-tilt camera using [19]. 2. Rectify stereo images such that epipolar lines are parallel with the y-axis (optional). 3. Capture laser points on a 3D plane for three cameras at each pan and tilt setting in the working area. 4. Based on blob detection and epipolar constraint, find two laser points in the stereo cameras. Generate 3D points by triangulation of two laser points. 5. Plane fitting for each plane position using RANSAC. 6. Remove outliers of 3D points based on the fitted 3D plane. 7. Estimate R∗ , K, R, t by minimizing Eq.(3).
Algorithm 2. Calibration method for a pan-tilt camera with a long focal length based on homographies Input: images captured at each pan and tilt setting. Output: intrinsic matrix K and rotation matrix R∗ between coordinates of camera and PTU. 1. Detect features based on Scale-invariant feature transform (SIFT) in [20] and find correspondences between neighboring images. 2. Robust homography estimation using RANSAC. 3. Compute homography between each image and reference view (pan = 0, tilt = 0). 4. Estimate K using Calibration Toolbox [19]. 5. Estimate R∗ and refine intrinsic matrix K by minimizing argminR∗ ,K
n m
xij − KR∗−1 Rtilt Rpan R∗ K −1 xiref 2
(4)
i=1 j=1
where xij and xiref are ith feature point at jth and reference view respectively.
4 Experiments Here we present experimental results from two fixed cameras (Dragonfly2 DR2-HICOL with resolution 1024 × 768) and a pan-tilt still camera (Nikon D70 with resolution 3008 × 2000). First, we compare the calibration accuracy with short and long focal length lenses using traditional homograph-based method. Then, we demonstrate that our calibration method significantly improves accuracy for telephoto lenses. In order to validate the calibration accuracy, we generate about 500 3D testing points that are randomly distributed cover the whole working area following step 2 to step 4 in Algorithm 1, i.e., tracking and triangulating the laser dot. The testing points are different from the points used for calibration. First we present the calibration results of the still camera with a short (18mm) and a long focal length (300mm) lenses. For simplicity, we assume the coordinate systems
Calibrating Pan-Tilt Cameras with Telephoto Lenses
133
Table 1. The comparison between short and long focal length cameras. α and β are focal length. μ0 and ν0 are principal point. The uncertainties of principal point for 300mm camera cannot be estimated by Calibration Toolbox [19]. focal length α β μ0 ν0 RMS (in pixels) 300mm 40869.2 ± 1750.2 41081.7 ± 1735.1 1503.5 ± ∗ 999.5 ± ∗ 3.85 18mm 2331.2 ± 9.1 2339.7 ± 9.1 1550.8 ± 12.1 997.9 ± 14.4 2.11
of pan-tilt camera and pan-tilt unit are aligned perfectly. This means R∗ is an identity matrix. We use the Calibration Toolbox [19] to do the calibration for the reference view of pan-tilt camera and stereo cameras. In order to reduce the ambiguities caused by the long focal length, we capture over 40 checkerboard patterns at different orientations for pan-tilt camera. Table 1 shows results of the intrinsic matrix K and RMS of calibration data. The uncertainties of the focal length with the 300mm lens is about 10 times larger than that with a 18mm lens although the RMS of calibration data for both cases are similar. Fig.3 shows distributions of re-projection errors for the 500 testing points with 18mm and 300mm cameras. From this figure, we find that calibration is quite accurate for short focal length camera even that we assume R∗ is an identity matrix. Given the high resolution image, the relative errors from 18mm and 300mm cameras are about 1.3% and 30% respectively. This is computed as the ratio of the mean pixel error and the image width. Furthermore, many of the test points are out of field of view of the 300mm camera. Focal Length: 18mm
0
10
20
30
40
50
Focal Length: 300mm
60
Error in Pixels, Mean: 37.9 Variance: 265.9
70
0
500
1000
1500
2000
Error in Pixels, Mean:898.6, Variance: 2.5206e+05
Fig. 3. Distributions of re-projection error (in pixels) based on 500 testing data for 18mm and 300mm pan-tilt cameras
We then recalibrate the 300mm case with methods outlined in Algorithm 1 and 2, both of which include the estimation of R∗ . About 600 3D points are sampled for calibration over the working area in Algorithm 1. We pre-calibrate the reference view for the pan-tilt camera as the initial guess. After calibration, we validate the accuracy with 500 3D points. Fig. 4 shows the distributions of re-projection errors from the two different methods. Our method is about 25 times better than the homography-based one. The
134
X. Huang, J. Gao, and R. Yang Calibration Based on Algorithm 1
0
20
40
60
80
100
120
Error in Pixels, Mean: 34.6, Variance: 424.3
Calibration Based on Algorithm 2
140
0
500
1000
1500
2000
Error in Pixels, Mean: 880.0 Variance: 2.2946e+05
Fig. 4. Distributions of re-projection error based on 500 testing data (in pixels) for Algorithm 1 and 2 Table 2. The comparison between Algorithm 1 and 2. α and β are focal length. μ0 and ν0 are principal point. θx and θy are rotation angles between pan-tilt camera and pan-tilt unit. Algorithm α β μ0 ν0 θx θy 1 40320.9 39507.7 1506.6 997.7 −0.14 −1.61 2 39883.3 40374.6 1567.5 1271.5 1.99 1.41
relative errors from Algorithm 1 and 2 are about 1.2% and 29% respectively. It should be noted that R∗ can not be estimated accurately from observed homographies. Hence, the percentage error from Algorithm 2 remains very large. In fact, the improvement over assuming an identity R∗ is little. Table 2 shows the results for intrinsic matrix K, θx , and θy after MLE refinement. Here we decompose R∗ into two rotation matrices. One is the rotation around x axis for θx degree, and the other is the rotation around y axis for θy degree.
5 Conclusion This paper shows that calibration methods based on observed homographies are not suitable for cameras with telephoto (long-focal-length) lenses. This is caused by the ambiguities induced by the near-orthographic projection. We develop a method to calibrate a pan-tilt camera with long focal length in a surveillance network. In stead of using a large precisely-manufactured calibration object, our key idea is to use fixed stereo cameras to create a large collection of 3D calibration points. Using these 3D points allows full metric calibration over a large area. Experimental results show that the re-projection relative error is reduced from 30% to 1.2% with our method. In future work, we plan to extend our calibration method to auto-zoom cameras and build a complete surveillance system that can adjust zoom settings automatically by estimating the object’s size.
Calibrating Pan-Tilt Cameras with Telephoto Lenses
135
References
1. Daugman, J.: How Iris Recognition Works. In: ICIP (2002)
2. Guo, G., Jones, M., Beardsley, P.: A System for Automatic Iris Capturing. In: MERL TR2005-044 (2005)
3. Tsai, R.Y.: A Versatile Camera Calibration Technique for High-accuracy 3D Machine Vision Metrology Using Off-The-Shelf TV Cameras and Lenses. IEEE Journal of Robotics and Automation 4(3), 323–344 (1987)
4. Faugeras, O.: Three-Dimensional Computer Vision: a Geometric Viewpoint. MIT Press, Cambridge (1993)
5. Zhang, Z.: A Flexible New Technique for Camera Calibration. PAMI 22, 1330–1334 (2000)
6. Heikkila, J., Silven, O.: A Four-Step Camera Calibration Procedure with Implicit Image Correction. In: Proceedings of CVPR, pp. 1106–1112 (1997)
7. Pollefeys, M., Koch, R., Gool, L.V.: Self-Calibration and Metric Reconstruction in spite of Varying and Unknown Internal Camera Parameters. In: Proceedings of ICCV, pp. 90–95 (1997)
8. Pollefeys, M.: Self-Calibration and Metric 3D Reconstruction from Uncalibrated Image Sequences. PhD thesis, K.U.Leuven (1999)
9. Mundy, J., Zisserman, A.: Geometric Invariance in Computer Vision. MIT Press, Cambridge (1992)
10. Aloimonos, J.Y.: Perspective Approximations. Image and Vision Computing 8, 177–192 (1990)
11. Quan, L.: Self-Calibration of an Affine Camera from Multiple Views. International Journal of Computer Vision 19(1), 93–105 (1996)
12. Hartley, R.I., Zisserman, A.: Multiple View Geometry. Cambridge University Press, Cambridge (2000)
13. Hartley, R.I.: Self-Calibration of Stationary Cameras. International Journal of Computer Vision 1(22), 5–23 (1997)
14. de Agapito, L., Hayman, E., Reid, I.: Self-Calibration of a Rotating Camera with Varying Intrinsic Parameters. In: BMVC (1998)
15. Sinha, N., Pollefeys, M.: Towards Calibrating a Pan-Tilt-Zoom Camera Network. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3021, Springer, Heidelberg (2004)
16. Wang, L., Kang, S.B.: Error Analysis of Pure Rotation-Based Self-Calibration. PAMI 2(26), 275–280 (2004)
17. Davis, J., Chen, X.: Calibrating Pan-Tilt Cameras in Wide-Area Surveillance Networks. In: Proceedings of ICCV, vol. 1, pp. 144–150 (2003)
18. Chen, X., Davis, J.: Wide Area Camera Calibration Using Virtual Calibration Objects. In: Proceedings of CVPR (2000)
19. Bouguet, J.Y.: Camera Calibration Toolbox for Matlab. http://www.vision.caltech.edu/bouguetj/calib_doc/
20. Lowe, D.G.: Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 20, 91–110 (2003)
Appendix A: Solving Pan and Tilt Angles
Here we discuss how to solve for the pan and tilt angles so that the projection of an arbitrary point X in 3D space lies at the center of the image plane. We assume there is a rotation between the pan-tilt coordinate system and the camera's. Because of the dependency within the pan-tilt unit, namely that the tilt axis depends on the pan axis, the solution is not as
simple as it appears. In order to address this problem, we back-project the image center to a line L̃2. The center of projection and the point X form another line L̃1. After the calibration steps described in Section 3, L̃1 and L̃2 are transformed into L1 and L2 in the coordinate system of the pan-tilt unit. Hence, the problem is simplified to panning around the y-axis and tilting around the x-axis to make L1 and L2 coincident, or as close as possible to each other. If L1 and L2 are represented by their Plücker matrices, one method to compute the transformation of an arbitrary 3D line to another line by performing only rotations around the x and y axes is to minimize the following functional:

$$\arg\min_{R_x, R_y, \lambda} \|\mathbf{L}_1 - \lambda \mathbf{L}_2\|^2 \qquad (5)$$

where λ is a scalar, L2 is the 6 × 1 vector of Plücker coordinates of L2, and L1 is the 6 × 1 vector of Plücker coordinates of $(R_y R_x)\,L_1\,(R_y R_x)^T$, where Rx and Ry are rotation matrices around the x and y axes.
Fig. 5. Solve pan and tilt angles from L1 to L2.
However, the problem can be further simplified because L1 and L2 intersect at the origin of the pan-tilt unit in our model. As shown in Fig. 5, we want to pan and tilt the line L1 to coincide with another line L2. Assuming both lines have unit length, the tilt angles are first computed by Eq. (6):

$$\varphi_1 = \arctan\!\left(\frac{y_1}{z_1}\right) - \arctan\!\left(\frac{y_2}{r}\right), \quad
\varphi_2 = \arctan\!\left(\frac{y_1}{z_1}\right) - \arctan\!\left(\frac{y_2}{-r}\right), \quad
r = \sqrt{y_1^2 + z_1^2 - y_2^2} \qquad (6)$$
If $(y_1^2 + z_1^2 - y_2^2)$ is less than 0, the two conics C1 and C2 do not intersect, which means no exact solution exists. However, this almost never happens in practice since the rotation
between the pan-tilt unit and the camera is small. After tilting, (x1, y1, z1) is rotated to (a1, b1, c1) or (a2, b2, c2). Then the pan angles are computed by Eq. (7):

$$\vartheta_1 = \arctan\!\left(\frac{z_2}{x_2}\right) - \arctan\!\left(\frac{c_1}{a_1}\right), \quad
\vartheta_2 = \arctan\!\left(\frac{z_2}{x_2}\right) - \arctan\!\left(\frac{c_2}{a_2}\right) \qquad (7)$$
Hence, two solutions, (ϕ1, ϑ1) and (ϕ2, ϑ2), are obtained. We choose the minimum rotation angles as the final solution.
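The closed-form solution above is straightforward to implement. The following sketch is our illustration only: function and variable names are our own choices, and the sign conventions of the rotation matrices are assumptions that depend on the actual pan-tilt unit. It evaluates Eqs. (6) and (7) for unit direction vectors of L1 and L2 and keeps the candidate with the smallest total rotation.

```python
import numpy as np

def pan_tilt_solutions(p1, p2):
    """Candidate (tilt, pan) pairs that rotate unit direction p1 = (x1, y1, z1)
    onto unit direction p2 = (x2, y2, z2), tilting about the x-axis first and
    then panning about the y-axis (a sketch of Eqs. (6)-(7); names are ours)."""
    x1, y1, z1 = p1
    x2, y2, z2 = p2
    r2 = y1**2 + z1**2 - y2**2
    if r2 < 0:
        raise ValueError("no exact solution: the configuration has no intersection")
    r = np.sqrt(r2)
    # Eq. (6): two candidate tilt angles about the x-axis.
    phi = [np.arctan2(y1, z1) - np.arctan2(y2, r),
           np.arctan2(y1, z1) - np.arctan2(y2, -r)]
    solutions = []
    for ph in phi:
        # Assumed sign convention for the tilt rotation about the x-axis.
        Rx = np.array([[1, 0, 0],
                       [0, np.cos(ph), -np.sin(ph)],
                       [0, np.sin(ph),  np.cos(ph)]])
        a, b, c = Rx @ np.asarray(p1, dtype=float)   # tilted direction (a, b, c)
        # Eq. (7): pan angle about the y-axis that aligns (a, c) with (x2, z2).
        theta = np.arctan2(z2, x2) - np.arctan2(c, a)
        solutions.append((ph, theta))
    # Choose the solution with the minimum rotation, as in the paper.
    return min(solutions, key=lambda s: abs(s[0]) + abs(s[1]))
```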
Camera Calibration Using Principal-Axes Aligned Conics Xianghua Ying and Hongbin Zha National Laboratory on Machine Perception Peking University, Beijing, 100871 P.R. China {xhying,zha}@cis.pku.edu.cn
Abstract. The projective geometric properties of two principal-axes aligned (PAA) conics in a model plane are investigated in this paper by utilizing the generalized eigenvalue decomposition (GED). We demonstrate that one constraint on the image of the absolute conic (IAC) can be obtained from a single image of two PAA conics even if their parameters are unknown. If, in addition, the eccentricity of one of the two conics is given, two constraints on the IAC can be obtained. An important merit of the algorithm using PAA conics is that it can be employed to avoid the ambiguities that arise when estimating extrinsic parameters in calibration algorithms using concentric circles. We evaluate the characteristics and robustness of the proposed algorithm in experiments with synthetic and real data. Keywords: Camera calibration, Generalized eigenvalue decomposition, Principal-axes aligned conics, Image of the absolute conic.
1 Introduction
Conics are among the most important image features in computer vision, alongside points and lines. The motivation to study the geometry of conics arises from the facts that conics carry more geometric information, and can be extracted from images more robustly and more accurately than points and lines. In addition, conics are much easier to produce and identify than general algebraic curves, though general algebraic curves may carry more geometric information. Unlike points and lines, on which a large body of research has been developed, only a handful of algorithms based on conics have been proposed, for pose estimation [2][10], structure recovery [11][15][7][17][13], object recognition [8][14][5], and camera calibration [18][19][3]. Forsyth et al. [2] discovered the projective invariants for pairs of conics and then developed an algorithm to determine the relative pose of a scene plane from two conic correspondences. However, the algorithm requires solving quartics and has no closed-form solution. Ma [10] developed an analytical method based on conic correspondences for motion estimation and pose determination from stereo images. Quan [14] discovered two polynomial constraints from corresponding conics in two uncalibrated perspective images and applied them to object recognition. Weiss [18] demonstrated that two conics are sufficient for calibration under affine projection and derived a nonlinear calibration algorithm. Kahl and Heyden [7] proposed an algorithm for epipolar geometry estimation from conic correspondences. They found that one conic correspondence gives two independent constraints on the fundamental matrix, and a method to
estimate the fundamental matrix from at least four corresponding conics was presented. Sugimoto [17] proposed a linear algorithm for solving the homography from conic correspondences, but it requires at least seven correspondences. Mudigonda et al. [13] showed that two conic correspondences are enough for solving the homography, but their method requires solving polynomial equations. The works closest to the one proposed here are [19] and [3]. Yang et al. [19] presented a linear approach for camera calibration from concentric conics on a model plane. They showed that two constraints can be obtained from a single image of these concentric conics. However, their approach requires at least three concentric conics, and the equations of all these conics must be given in advance. Gurdjos et al. [3] utilized the projective and Euclidean properties of confocal conics to perform camera calibration; the key property is that the line conic consisting of the images of the circular points belongs to the conic range of these confocal conics. Two constraints on the IAC can be obtained from a single image of the confocal conics. Gurdjos et al. [3] argued that an important reason to use confocal conics for camera calibration is that there exist ambiguities in the calibration methods using concentric circles [9][6] when recovering the extrinsic parameters of the camera, and that algorithms using confocal conics can avoid such ambiguities. In this paper, we introduce a novel and useful pattern, the PAA conics, and investigate in depth the properties of two arbitrary PAA conics with unknown or known eccentricities.
2 Basic Principles
2.1 Pinhole Camera Model
Let X = [X Y Z 1]^T be a world point and x̃ = [u v 1]^T be its image point, both in homogeneous coordinates. They satisfy

$$\mu\tilde{\mathbf{x}} = P\mathbf{X}, \qquad (1)$$
where P is a 3 × 4 projection matrix describing the perspective projection process and μ is an unknown scale factor. The projection matrix can be decomposed as

$$P = K[\mathbf{R} \mid \mathbf{t}], \qquad (2)$$
where

$$K = \begin{bmatrix} f_x & s & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}. \qquad (3)$$
Here the matrix K contains the intrinsic parameters, and (R, t) denotes a rigid transformation that indicates the orientation and position of the camera with respect to the world coordinate system.
2.2 Homography Between the Model Plane and Its Image
Without loss of generality, we assume the model plane lies on Z = 0 of the world coordinate system. Let us denote the i-th column of the rotation matrix R by r_i. From (1) and (2), we have
$$\mu\tilde{\mathbf{x}} = K[\mathbf{r}_1\ \mathbf{r}_2\ \mathbf{r}_3\ \mathbf{t}]\begin{bmatrix} X \\ Y \\ 0 \\ 1 \end{bmatrix} = K[\mathbf{r}_1\ \mathbf{r}_2\ \mathbf{t}]\begin{bmatrix} X \\ Y \\ 1 \end{bmatrix}. \qquad (4)$$

We denote x = [X Y 1]^T; then a model point x and its image x̃ are related by a 2D homography H:

$$\mu\tilde{\mathbf{x}} = H\mathbf{x}, \qquad (5)$$

where

$$H = K[\mathbf{r}_1\ \mathbf{r}_2\ \mathbf{t}]. \qquad (6)$$
Obviously, H is defined up to a scale factor.
2.3 Standard Forms for Conics
All conics are projectively equivalent under projective transformations [16]. This means that any conic can be converted into any other conic by some projective transformation. A conic is an ellipse (including a circle), a parabola, or a hyperbola if and only if its intersection with the line at infinity on the projective plane consists of 2 imaginary points, 2 repeated real points, or 2 real points, respectively. In the case of central conics (ellipses and hyperbolas), by moving the coordinate origin to the center and choosing the directions of the coordinate axes coincident with the so-called principal axes (axes of symmetry) of the conic, we obtain that the equation in standard form for an ellipse is $X^2/a^2 + Y^2/b^2 = 1$, where $a^2 \geq b^2$, and the equation in standard form for a hyperbola is $X^2/a^2 - Y^2/b^2 = 1$. These equations can be written in a simpler form:

$$AX^2 + BY^2 + C = 0, \qquad (7)$$
and rewritten in matrix form, we obtain

$$\mathbf{x}^T A\mathbf{x} = 0, \qquad (8)$$

where

$$A = \begin{bmatrix} A & & \\ & B & \\ & & C \end{bmatrix}. \qquad (9)$$
For a parabola, let the unique axis of symmetry of the parabola coincide with the X-axis, and let the Y-axis pass through the vertex of the parabola; then the equation of the parabola is brought into the form

$$Y^2 = 2pX, \qquad (10)$$

or

$$\mathbf{x}^T B\mathbf{x} = 0, \qquad (11)$$

where

$$B = \begin{bmatrix} & & -p \\ & 1 & \\ -p & & \end{bmatrix}. \qquad (12)$$

Equation (12) can be rewritten in a homogeneous form:

$$B = \begin{bmatrix} & & E \\ & D & \\ E & & \end{bmatrix}. \qquad (13)$$
2.4 Equations for the Images of Conics in Standard Form
Given the homography H between the model plane and its image, from (5) and (8) we obtain that the image of a central conic in standard form satisfies

$$\tilde{\mathbf{x}}^T \tilde{A}\tilde{\mathbf{x}} = 0, \qquad (14)$$

where

$$\tilde{A} = H^{-T} A H^{-1}. \qquad (15)$$

Similarly, the image of a parabola in standard form satisfies

$$\tilde{\mathbf{x}}^T \tilde{B}\tilde{\mathbf{x}} = 0, \qquad (16)$$

where

$$\tilde{B} = H^{-T} B H^{-1}. \qquad (17)$$
3 Properties of PAA Conics
3.1 Properties of Two Conics via the GED
Conics remain conics under an arbitrary 2D projective transformation [16]. An interesting property of two conics is that the GED of the two conics is projectively invariant [12]. This property is interpreted in detail as follows. Given two point conic pairs (A1, A2) and (Ã1, Ã2) related by a plane homography H, i.e., Ãi ~ H^{-T} Ai H^{-1}, i = 1, 2, if x is a generalized eigenvector of (A1, A2), i.e.,
A1 x = λA2 x, then x̃ = Hx must be a generalized eigenvector of (Ã1, Ã2), i.e., Ã1 x̃ = λ̃ Ã2 x̃. In general, there are 3 generalized eigenvectors for two 3 × 3 matrices. Therefore, for a point conic pair, we may obtain three points (i.e., the three generalized eigenvectors of the point conic pair), which are projectively invariant under 2D projective transformations of the projective plane. Similarly, for a line conic pair, we may obtain three lines (i.e., the three generalized eigenvectors of the line conic pair), which are projectively invariant under 2D projective transformations of the projective plane.
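Numerically, these points can be obtained directly from a generalized eigenvalue routine. The following sketch is our own illustration (not code from the paper); it uses `scipy.linalg.eig` with two matrix arguments, which solves A1 x = λ A2 x, and checks the covariance property on a synthetic homography.

```python
import numpy as np
from scipy.linalg import eig

def conic_pair_eigenvectors(A1, A2):
    """Generalized eigenvectors of the point-conic pair (A1, A2), i.e. the
    solutions of A1 x = lambda A2 x, returned as unit-normalized columns."""
    _, V = eig(A1, A2)                    # solves A1 x = lambda A2 x
    return np.real_if_close(V / np.linalg.norm(V, axis=0))

# Two PAA central conics in standard form (diagonal matrices, cf. Eq. (18)).
A1 = np.diag([1.0 / 4.0, 1.0, -1.0])      # ellipse x^2/4 + y^2 = 1
A2 = np.diag([1.0 / 9.0, 0.5, -1.0])      # ellipse x^2/9 + y^2/2 = 1

H = np.array([[1.1, 0.2, 3.0],
              [-0.1, 0.9, 1.0],
              [1e-3, 2e-3, 1.0]])         # some plane homography
Hi = np.linalg.inv(H)
A1_img, A2_img = Hi.T @ A1 @ Hi, Hi.T @ A2 @ Hi   # imaged conics, Eqs. (14)-(15)

V_model = conic_pair_eigenvectors(A1, A2)
V_image = conic_pair_eigenvectors(A1_img, A2_img)
# Each column of H @ V_model is proportional to a column of V_image
# (possibly in a different order), illustrating the projective covariance.
```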
3.2 Properties of Two PAA Central Conics
Two PAA central conics (point conics) in standard form are

$$A_1 = \begin{bmatrix} A_1 & & \\ & B_1 & \\ & & C_1 \end{bmatrix}, \quad A_2 = \begin{bmatrix} A_2 & & \\ & B_2 & \\ & & C_2 \end{bmatrix}. \qquad (18)$$
The GED of the two conics is:
$$A_1\mathbf{x} = \lambda A_2\mathbf{x}. \qquad (19)$$
It is not difficult to find that the generalized eigenvalues and generalized eigenvectors of A1 and A2 are as follows:
$$\lambda_1 = \frac{A_1}{A_2},\ \mathbf{x}_1 = \begin{bmatrix}1\\0\\0\end{bmatrix}, \quad \lambda_2 = \frac{B_1}{B_2},\ \mathbf{x}_2 = \begin{bmatrix}0\\1\\0\end{bmatrix}, \quad \lambda_3 = \frac{C_1}{C_2},\ \mathbf{x}_3 = \begin{bmatrix}0\\0\\1\end{bmatrix}, \qquad (20)$$
where x1 is the direction vector of the X-axis, x2 is the direction vector of the Y-axis, and x3 gives the homogeneous coordinates of the common center of the two central conics. From the projective geometric properties of two point conics via the GED presented in Section 3.1, we obtain:
Proposition 1. From the images of two PAA central conics, we can obtain the image of the direction vector of the X-axis, the image of the direction vector of the Y-axis, and the image of the common center of the two central conics via the GED.
3.3 Properties and Ambiguities in Concentric Circles
Two concentric circles in standard form are

$$A_1 = \begin{bmatrix} A_1 & & \\ & A_1 & \\ & & C_1 \end{bmatrix}, \quad A_2 = \begin{bmatrix} A_2 & & \\ & A_2 & \\ & & C_2 \end{bmatrix}. \qquad (21)$$
It is not difficult to find that the generalized eigenvalues and generalized eigenvectors of A1 and A2 are as follows:
$$\lambda_1 = \lambda_2 = \frac{A_1}{A_2},\quad \mathbf{x}_1 = \rho_1\begin{bmatrix}1\\0\\0\end{bmatrix} + \mu_1\begin{bmatrix}0\\1\\0\end{bmatrix},\quad \mathbf{x}_2 = \rho_2\begin{bmatrix}1\\0\\0\end{bmatrix} + \mu_2\begin{bmatrix}0\\1\\0\end{bmatrix},\quad \lambda_3 = \frac{C_1}{C_2},\quad \mathbf{x}_3 = \begin{bmatrix}0\\0\\1\end{bmatrix}, \qquad (22)$$

where ρ1, μ1, ρ2, μ2 are four real constants that are only required to satisfy x1 ≠ x2 up to a scale factor. This means ρ1, μ1, ρ2, μ2 cannot be determined uniquely. There are infinitely many solutions for ρ1, μ1, ρ2, μ2, and thus infinitely many solutions for x1 and x2. Here x1 and x2 are two points at infinity, and x3 gives the homogeneous coordinates of the common center of the two circles. The ambiguity in x1 and x2 can be understood from the fact that we cannot establish a unique XY coordinate system from two concentric circles on the model plane, because there is a remaining degree of freedom in the 2D rotation around the common center. However, for two general PAA central conics it is very easy to establish an XY coordinate system in the supporting plane without any ambiguity, because we can choose the coordinate axes coincident with the principal axes of the two PAA conics.
Proposition 2. From the images of two concentric circles, we can obtain the image of the common center, and the image of the line at infinity of the supporting plane via the GED.
4 Calibration
4.1 Dual Conic of the Absolute Points from Conics in Standard Form
The eccentricity e is one of the most important parameters of a conic. If e = 0, the conic is a circle. If 0 < e < 1, the conic is an ellipse. If e = 1, it is a parabola. If e > 1, it is a hyperbola. The equation in standard form for an ellipse is $X^2/a^2 + Y^2/b^2 = 1$, with $e = c/a$, where $c^2 = a^2 - b^2$, and thus $b^2 = (1 - e^2)a^2$. Therefore, the line at infinity $\mathbf{l}_\infty = (0, 0, 1)^T$ of the supporting plane intersects the ellipse at two imaginary points:

$$\mathbf{I}_E = \begin{bmatrix} 1 \\ \sqrt{1 - e^2}\,i \\ 0 \end{bmatrix}, \quad \mathbf{J}_E = \begin{bmatrix} 1 \\ -\sqrt{1 - e^2}\,i \\ 0 \end{bmatrix}. \qquad (23)$$
The equation in standard form for a hyperbola is $X^2/a^2 - Y^2/b^2 = 1$, with $e = c/a$, where $c^2 = a^2 + b^2$, and thus $b^2 = (e^2 - 1)a^2$. Therefore, the line at infinity $\mathbf{l}_\infty = (0, 0, 1)^T$ of the supporting plane intersects the hyperbola at two real points:

$$\mathbf{I}_H = \begin{bmatrix} 1 \\ \sqrt{e^2 - 1} \\ 0 \end{bmatrix}, \quad \mathbf{J}_H = \begin{bmatrix} 1 \\ -\sqrt{e^2 - 1} \\ 0 \end{bmatrix}. \qquad (24)$$
The equation in standard form for a parabola is $Y^2 = 2pX$; it is not difficult to see that the line at infinity $\mathbf{l}_\infty = (0, 0, 1)^T$ of the supporting plane intersects the parabola
at two repeated real points; that is, the line at infinity is tangent to the parabola at one real point:
$$\mathbf{I}_P = \mathbf{J}_P = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}. \qquad (25)$$
From discussions above, we obtain:
Definition 1. The line at infinity intersects a conic in standard form at two points, which are called the absolute points of a conic in standard form:

$$\mathbf{I}_A = \begin{bmatrix} 1 \\ \sqrt{e^2 - 1} \\ 0 \end{bmatrix}, \quad \mathbf{J}_A = \begin{bmatrix} 1 \\ -\sqrt{e^2 - 1} \\ 0 \end{bmatrix}. \qquad (26)$$
For a circle (e = 0), the two absolute points are the well-known circular points, $\mathbf{I} = [1\ \ i\ \ 0]^T$ and $\mathbf{J} = [1\ {-i}\ \ 0]^T$.
Definition 2. The conic

$$C^*_\infty = \mathbf{I}_A\mathbf{J}_A^T + \mathbf{J}_A\mathbf{I}_A^T \qquad (27)$$

is the conic dual to the absolute points. The conic $C^*_\infty$ is a degenerate (rank 2 or 1) line conic, which consists of the two absolute points. In a Euclidean coordinate system it is given by

$$C^*_\infty = \mathbf{I}_A\mathbf{J}_A^T + \mathbf{J}_A\mathbf{I}_A^T = \begin{bmatrix} 1 & & \\ & 1 - e^2 & \\ & & 0 \end{bmatrix}. \qquad (28)$$
The conic $C^*_\infty$ is fixed under scale and translation transformations. The reason is as follows: under the point transformation x̃ = Hx, where H is a scale and translation transformation, one can easily verify that

$$\tilde{C}^*_\infty = HC^*_\infty H^T = C^*_\infty. \qquad (29)$$

The converse is also true, and we have:
Proposition 3. The dual conic $C^*_\infty$ is fixed under the projective transformation H if and only if H is a scale and translation transformation.
For circles, $C^*_\infty$ is fixed not only under scale and translation transformations, but also under rotation transformations [4].
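As a small numerical check (ours, not from the paper), the snippet below builds the absolute points of Eq. (26) for a given eccentricity, forms the dual conic of Eqs. (27)-(28), and verifies that it is fixed, up to scale, under a scale-and-translation transformation.

```python
import numpy as np

def dual_conic_of_absolute_points(e):
    """C*_inf = I_A J_A^T + J_A I_A^T for a conic of eccentricity e (Eqs. (26)-(28))."""
    s = np.sqrt(complex(e**2 - 1.0))      # imaginary for an ellipse (e < 1)
    I_A = np.array([1.0, s, 0.0])
    J_A = np.array([1.0, -s, 0.0])
    C = np.outer(I_A, J_A) + np.outer(J_A, I_A)
    return np.real(C)                     # equals 2 * diag(1, 1 - e^2, 0)

C = dual_conic_of_absolute_points(0.5)    # an ellipse
H = np.array([[2.0, 0.0, 5.0],
              [0.0, 2.0, -3.0],
              [0.0, 0.0, 1.0]])           # scale + translation
C_mapped = H @ C @ H.T
# C_mapped is proportional to C, i.e. the dual conic is fixed (Proposition 3).
assert np.allclose(C_mapped / C_mapped[0, 0], C / C[0, 0])
```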
4.2 Calibration from Unknown PAA Central Conics
Given the images of two PAA central conics, from Proposition 1 we can determine the images of the direction vectors of the X-axis and Y-axis; we denote them as x̃1
and x̃2, respectively. From [4] we know that the vanishing points of lines with perpendicular directions satisfy

$$\tilde{\mathbf{x}}_1^T\,\boldsymbol{\omega}\,\tilde{\mathbf{x}}_2 = 0, \qquad (30)$$
where $\omega = K^{-T}K^{-1}$ is the IAC [4]. Therefore, we have:
Proposition 4. From a single image of two PAA conics whose parameters are both unknown, one constraint can be obtained on the IAC.
Given 5 images taken in general positions, we can linearly recover the IAC ω. The intrinsic parameter matrix K can then be obtained from the Cholesky factorization of ω. Once the intrinsic parameters are known, it is not difficult to obtain the images of the circular points for each image by intersecting the image of the line at infinity with the IAC ω. From the images of the circular points, the image of the common center, and the images of the direction vectors of the X-axis and Y-axis, we can obtain the extrinsic parameters without ambiguity [4].
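One way to implement Proposition 4 is to stack one linear equation x̃1ᵀ ω x̃2 = 0 per image in the six entries of the symmetric matrix ω, solve by SVD, and recover K by a Cholesky factorization. The sketch below is our own illustration, not the authors' implementation; it assumes the vanishing-point pairs have already been extracted via the GED and that the data are clean enough for ω to come out positive definite.

```python
import numpy as np

def omega_row(u, v):
    """Row of the linear system expressing u^T omega v = 0 in the 6 parameters
    (w11, w12, w13, w22, w23, w33) of the symmetric IAC omega."""
    return np.array([u[0]*v[0],
                     u[0]*v[1] + u[1]*v[0],
                     u[0]*v[2] + u[2]*v[0],
                     u[1]*v[1],
                     u[1]*v[2] + u[2]*v[1],
                     u[2]*v[2]])

def calibrate_from_vanishing_pairs(pairs):
    """pairs: list of (x1, x2) homogeneous vanishing points of the imaged
    principal axes, one pair per image (at least 5 images for a full K)."""
    A = np.array([omega_row(x1, x2) for x1, x2 in pairs])
    w = np.linalg.svd(A)[2][-1]                  # null vector of A
    omega = np.array([[w[0], w[1], w[2]],
                      [w[1], w[3], w[4]],
                      [w[2], w[4], w[5]]])
    if omega[0, 0] < 0:                          # fix the overall sign
        omega = -omega
    # omega = K^{-T} K^{-1}; with L = chol(omega) lower triangular, K = L^{-T}.
    L = np.linalg.cholesky(omega)                # assumes omega positive definite
    K = np.linalg.inv(L).T
    return K / K[2, 2]
```

With noisy detections, a projection of ω onto the positive-definite cone or a nonlinear refinement of K would be needed before the Cholesky step.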
4.3 Calibration from Eccentricity-Known PAA Central Conics
Assume that the eccentricity of one of the PAA central conics is known. From Proposition 2, we can determine the image of the line at infinity from the images of the two PAA conics. Then we can obtain the images of the absolute points of the conic with known eccentricity by intersecting the image of the line at infinity with the image of this conic. Thus we can obtain the image of the conic dual to the absolute points, $\tilde{C}^*_\infty$. In fact, a suitable rectifying homography may be obtained directly from the identified $\tilde{C}^*_\infty$ in an image using the eigenvalue decomposition; after some manipulation, we obtain

$$\tilde{C}^*_\infty = U\begin{bmatrix} 1 & & \\ & 1 - e^2 & \\ & & 0 \end{bmatrix}U^T. \qquad (31)$$
The rectifying projectivity is H = U up to a scale and translation transformation.
Proposition 5. Once the dual conic $C^*_\infty$ is identified on the projective plane, projective distortion may be rectified up to a scale and translation transformation.
After performing the rectification, we can translate the image so that the coordinate origin coincides with the common center. Thus we obtain the 2D homography between the supporting plane and its image, where the coordinate system in the supporting plane is established with its axes coincident with the principal axes of the PAA central conics. Let us denote H = [h1 h2 h3]; from (6), we have
$$H = [\mathbf{h}_1\ \mathbf{h}_2\ \mathbf{h}_3] = K[\mathbf{r}_1\ \mathbf{r}_2\ \mathbf{t}]. \qquad (32)$$
Using the fact that r1 and r2 are orthonormal, we have [20],
$$\mathbf{h}_1^T K^{-T}K^{-1}\mathbf{h}_2 = 0, \quad \text{i.e.,} \quad \mathbf{h}_1^T\boldsymbol{\omega}\,\mathbf{h}_2 = 0, \qquad (33)$$
$$\mathbf{h}_1^T K^{-T}K^{-1}\mathbf{h}_1 = \mathbf{h}_2^T K^{-T}K^{-1}\mathbf{h}_2, \quad \text{i.e.,} \quad \mathbf{h}_1^T\boldsymbol{\omega}\,\mathbf{h}_1 = \mathbf{h}_2^T\boldsymbol{\omega}\,\mathbf{h}_2. \qquad (34)$$
These are two constraints on the intrinsic parameters from one homography. If the eccentricities of the two PAA central conics are both known, we can obtain a least-squares solution for the homography. From the discussions above, we have:
Proposition 6. From a single image of two PAA conics, if the eccentricity of one of the two conics is known, two constraints can be obtained on the IAC.
Given 3 images taken in general positions, we can obtain the IAC ω. The intrinsic parameter matrix K can be obtained from the Cholesky factorization of ω. Once K is obtained, the extrinsic parameters for each image can be recovered without ambiguity as proposed in [20].
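For the eccentricity-known case, the constraints of Eqs. (33)-(34) can be stacked in the same six unknowns of ω, exactly as in Zhang's method [20]. The helper below is a sketch with our own names: it returns the two rows contributed by one rectified homography; rows from three or more images are stacked, and the SVD null vector of the stacked matrix gives ω.

```python
import numpy as np

def v_row(hi, hj):
    """Row expressing hi^T omega hj in the 6 parameters of the symmetric omega."""
    return np.array([hi[0]*hj[0],
                     hi[0]*hj[1] + hi[1]*hj[0],
                     hi[0]*hj[2] + hi[2]*hj[0],
                     hi[1]*hj[1],
                     hi[1]*hj[2] + hi[2]*hj[1],
                     hi[2]*hj[2]])

def iac_constraints_from_homography(H):
    """The two linear constraints of Eqs. (33)-(34) contributed by one
    plane-to-image homography H = [h1 h2 h3]."""
    h1, h2 = H[:, 0], H[:, 1]
    return np.vstack([v_row(h1, h2),                   # h1^T omega h2 = 0
                      v_row(h1, h1) - v_row(h2, h2)])  # h1^T omega h1 = h2^T omega h2
```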
5 Experiments
We perform a number of experiments, both simulated and real, to test our algorithms with respect to noise sensitivity. Due to lack of space, the simulated experimental results are not shown here. In order to demonstrate the performance of our algorithm, we capture an image sequence of 209 real images, with resolution 800 × 600, to perform augmented reality. Edges were extracted using Canny's edge detector, and the ellipses were obtained using a least-squares ellipse fitting algorithm [1]. Some augmented reality examples are shown in Fig. 1 to illustrate the calibration results.
Fig. 1. Some augmented reality results
6 Conclusion
A thorough investigation of the projective geometric properties of principal-axes aligned conics is given in this paper. These properties are obtained by utilizing the generalized eigenvalue decomposition of two PAA conics. We define the absolute
points of a conic in standard form, which are the analogue of the circular points of a circle. Furthermore, we define the dual conic consisting of the two absolute points, the analogue of the dual conic consisting of the circular points. Using this dual conic, we propose a linear algorithm to obtain the extrinsic parameters of the camera. We also discovered a further example of PAA conics, consisting of a circle and a conic concentric with each other while the parameters of both are unknown, from a single image of which two constraints on the IAC can be obtained. Due to lack of space, this is not discussed in this paper. Exploring more novel patterns containing conics is our ongoing work.
Acknowledgements This work was supported in part by the NKBRPC 973 Grant No. 2006CB303100, the NNSFC Grant No. 60605010, the NHTRDP 863 Grant No. 2006AA01Z302, and the Key grant Project of Chinese Ministry of Education No. 103001.
References
1. Fitzgibbon, A.W., Pilu, M., Fisher, R.B.: Direct least squares fitting of ellipses. IEEE Trans. Pattern Analysis and Machine Intelligence 21(5), 476–480 (1999)
2. Forsyth, D., Mundy, J.L., Zisserman, A., Coelho, C., Heller, A., Rothwell, C.: Invariant descriptors for 3-D object recognition and pose. IEEE Trans. Pattern Analysis and Machine Intelligence 13(10), 971–991 (1991)
3. Gurdjos, P., Kim, J.-S., Kweon, I.-S.: Euclidean Structure from Confocal Conics: Theory and Application to Camera Calibration. In: Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 1214–1222. IEEE Computer Society Press, Los Alamitos (2006)
4. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge, UK (2003)
5. Heisterkamp, D., Bhattacharya, P.: Invariants of families of coplanar conics and their applications to object recognition. Journal of Mathematical Imaging and Vision 7(3), 253–267 (1997)
6. Jiang, G., Quan, L.: Detection of Concentric Circles for Camera Calibration. In: Proc. Int'l Conf. Computer Vision, pp. 333–340 (2005)
7. Kahl, F., Heyden, A.: Using conic correspondence in two images to estimate the epipolar geometry. In: Proc. Int'l Conf. Computer Vision, pp. 761–766 (1998)
8. Kanatani, K., Liu, W.: 3D Interpretation of Conics and Orthogonality. Computer Vision and Image Understanding 58(3), 286–301 (1993)
9. Kim, J.-S., Gurdjos, P., Kweon, I.-S.: Geometric and Algebraic Constraints of Projected Concentric Circles and Their Applications to Camera Calibration. IEEE Trans. Pattern Analysis and Machine Intelligence 27(4), 637–642 (2005)
10. Ma, S.: Conics-Based Stereo, Motion Estimation, and Pose Determination. Int'l J. Computer Vision 10(1), 7–25 (1993)
11. Ma, S., Si, S., Chen, Z.: Quadric curve based stereo. In: Proc. of the 11th Int'l Conf. Pattern Recognition, vol. 1, pp. 1–4 (1992)
12. Mundy, J.L., Zisserman, A. (eds.): Geometric Invariance in Computer Vision. MIT Press, Cambridge (1992)
13. Mudigonda, P., Jawahar, C.V., Narayanan, P.J.: Geometric structure computation from conics. In: Proc. Indian Conf. Computer Vision, Graphics and Image Processing (ICVGIP), pp. 9–14 (2004)
14. Quan, L.: Algebraic and geometric invariant of a pair of noncoplanar conics in space. Journal of Mathematical Imaging and Vision 5(3), 263–267 (1995)
15. Quan, L.: Conic reconstruction and correspondence from two views. IEEE Transactions on Pattern Analysis and Machine Intelligence 18(2), 151–160 (1996)
16. Semple, J.G., Kneebone, G.T.: Algebraic Projective Geometry. Oxford University Press, Oxford (1952)
17. Sugimoto, A.: A linear algorithm for computing the homography from conics in correspondence. Journal of Mathematical Imaging and Vision 13, 115–130 (2000)
18. Weiss, I.: 3-D curve reconstruction from uncalibrated cameras. In: Proc. of Int'l Conf. Pattern Recognition, vol. 1, pp. 323–327 (1996)
19. Yang, C., Sun, F., Hu, Z.: Planar Conic Based Camera Calibration. In: Proc. of Int'l Conf. Pattern Recognition, vol. 1, pp. 555–558 (2000)
20. Zhang, Z.: A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(11), 1330–1334 (2000)
3D Intrusion Detection System with Uncalibrated Multiple Cameras Satoshi Kawabata, Shinsaku Hiura, and Kosuke Sato Graduate School of Engineering Science, Osaka University, Japan
[email protected], {shinsaku,sato}@sys.es.osaka-u.ac.jp
Abstract. In this paper, we propose a practical intrusion detection system using uncalibrated multiple cameras. Our algorithm combines the contour based multi-planar visual hull method and a projective reconstruction method. To set up the detection system, no advance knowledge or calibration is necessary. A user can specify points in the scene directly with a simple colored marker, and the system automatically generates a restricted area as the convex hull of all specified points. To detect an intrusion, the system computes intersections of an object and each sensitive plane, which is the boundary of the restricted area, by projecting an object silhouette from each image to the sensitive plane using 2D homography. When an object exceeds one sensitive plane, the projected silhouettes from all cameras must have some common regions. Therefore, the system can detect intrusion by any object with an arbitrary shape without reconstruction of the 3D shape of the object.
1 Introduction
In this paper, we propose a practical system for detecting 3D volumetric intrusion in a predefined restricted area using uncalibrated multiple cameras. Intrusion detection techniques (e.g., person–machine collision prevention, off-limits area observation, etc.) are important for establishing safe, secure societies and environments. Today, equipment which detects the blocking of a light beam, referred to as a light curtain, is widely used for this purpose. Although the light curtain is useful for achieving very safe environments in settings previously considered dangerous, it is excessive for widespread applications. For example, the light curtain method requires us to set equipment at both sides of a rectangle for detection, which leads to higher cost, a limited shape of the detection plane, and set-up difficulty. In the meantime, surveillance cameras have been installed in many different environments; however, the scenes observed by these cameras are used only for recording or visual observation by distant human observers, and they are rarely used to warn a person in a dangerous situation or to immediately halt a dangerous machine. There are many computer-vision enhancements that recognize events in a scene [1], but it is difficult to completely detect dangerous situations, including unexpected phenomena. Furthermore, we do not have sufficient knowledge and methodologies to use the recognition results from such systems to ensure safety. Therefore, our proposed system simply detects an intrusion
in a specific area in 3D space using multiple cameras. We believe this system will help establish a safe and secure society. As mentioned above, flexibility and ease in setting up the equipment and the detection region are important factors for cost and practical use. However, there are two problems in image-based intrusion detection: one is the necessity of complex and cumbersome calibration for a multiple-camera system, and the other is the lack of an intuitive way of defining a restricted area. Thus, we propose a method to complete the calibration and the restricted-area definition simultaneously by simply moving a colored marker in front of the cameras.
2 Characterization and Simplification of the Intrusion Detection Problem
Over the last decade of computer vision research, there have been many studies on measuring or recognizing a scene captured by cameras in an environment. In particular, methods to extract or track a moving object in an image have been investigated with great effort and have rapidly progressed. In most of this research, the region of an object can be detected without consideration of the actual 3D shape. Therefore, although these techniques may be used for rough intrusion detection, they cannot handle detailed motion and deformation, such as whether a person is reaching for a dangerous machine or an object of value. On the other hand, there has been other research to reconstruct the whole shape of a target object from images taken by multiple cameras. Using this method, it is possible to detect the intrusion of an object in a scene by computing the overlapping region of the restricted area and the target object. This approach is not reasonable, however, because the reconstruction computation generally needs huge CPU and memory resources, and, as described later, the approach involves unnecessary processes for detecting an intrusion. In addition, it is not easy for users to set up such a system because the cameras must be calibrated precisely. Thus, we resolve these issues by considering two characteristics of the intrusion detection problem. The first is the projective invariance of the observed space in intrusion detection. The state of intrusion, that is, the existence of an overlapping region between the restricted area and the object, is invariant if the entire scene is projectively transformed. Hence, we can use weak calibration, instead of full calibration, to detect an intrusion. Furthermore, setting the restricted area can be done simultaneously with the calibration, because the relationship between the area and the cameras can also be represented in a projective space. Although the whole shape of an intruding object is only determined up to a projective ambiguity, this does not affect the detection of intrusion. The second characteristic is that a restricted area is always a closed region. Consequently, we do not have to check the total volume of a restricted area; it is sufficient to observe only the boundary of the restricted area. This manner of thinking is one of the standard approaches for ensuring safety, and is also adopted by the above-mentioned light curtain. Our system detects an intrusion by projecting the silhouette on each camera image onto the boundary plane, then
computing the common region of all the silhouettes. This common region on the boundary plane is equivalent to the intersection of the object shape reconstructed by the visual hull method with the boundary plane. The remainder of this paper is organized as follows. In the next section, the principle of our approach is described. We explain our approach in more detail in Section 3. In Section 4, we derive the simultaneous initialization (calibration and restricted area setting). We describe an experiment of intrusion detection in Section 5. In Section 6 we present our conclusion.
3 Detection of an Intruding Object
3.1 The Visual Hull Method
To decide whether an object exists in a specific area, the 3D shape of the object in the scene must be obtained. We adopt the visual hull method for shape reconstruction. In the visual hull method, the shape of an object is reconstructed by computing the intersection of all cones, each defined by the set of rays through a viewpoint and the points on the edge of the silhouette on the corresponding image plane. This method has the advantage that the texture of an object does not affect the reconstructed shape, because there is no need to search for corresponding points between images. However, this method tends to reconstruct a shape larger than the real one, particularly for concave surfaces. Also, an area invisible from any of the cameras makes it impossible to measure the shape there. Although this is a common problem for image-based surveillance, our approach is always safe because the proposed system treats the invisible area as part of the object. Although the visual hull method has great merit for intrusion detection, it needs large computational resources for the set operations in 3D space. Therefore, it is difficult to construct an intrusion detection system that is reasonable in cost and works in real time.
3.2 Section Shape Reconstruction on a Sensitive Plane
As mentioned above, it is sufficient to observe only the sensitive planes, i.e., the boundary of the restricted area, for intrusion detection. Accordingly, only the shape of the intersection region on a sensitive plane is reconstructed, by homography-based volume intersection [2]. In this case, the common region of the projected silhouettes on the plane is equivalent to the intersection of the visual hull and the plane. Therefore, when an object exceeds a sensitive plane, a common region appears on the plane (Fig. 1). In this way, the 3D volumetric intrusion detection problem is reduced to efficient inter-plane projection and common-region computation in 2D space.
3.3 Vector Representation of the Silhouette Boundary
The visual hull method uses only the boundary information of a silhouette. Therefore, the amount of data can be decreased by replacing the boundary with
Fig. 1. Intrusion detection based on the existence of an intersection: (a) non-intruding, (b) intruding
Fig. 2. Vector representation of silhouette contours
vector representation by tracking the edge of the silhouette in an image (Fig. 2). In the vector representation, the projection between planes is achieved by transforming only the few vertices on the edge, and the common-region computation reduces to deciding whether each vertex is inside or outside the other contour. With this representation, we are able to reduce the computational cost of the transformation and common-region calculation, and it is not necessary to adjust the resolution of the sensitive plane to compute the common region with sufficient precision. In a distributed vision system, it is also possible to reduce the amount of communication data, because the camera-connected nodes extract silhouette contours and a single host gathers the silhouette data and computes the common region.
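As an illustration of the common-region computation on a sensitive plane, the following sketch clips one projected silhouette polygon against another using Sutherland-Hodgman clipping. This is our own example and assumes the clip polygon is convex; the paper only states that vertices are tested for being inside or outside the other contour and does not prescribe a particular clipping algorithm.

```python
def clip_polygon(subject, clip):
    """Sutherland-Hodgman clipping of polygon `subject` against the convex,
    counter-clockwise polygon `clip` (both given as lists of (x, y) vertices).
    An empty result means the two projected silhouettes have no common region."""
    def inside(p, a, b):        # p lies to the left of the directed edge a->b
        return (b[0]-a[0])*(p[1]-a[1]) - (b[1]-a[1])*(p[0]-a[0]) >= 0
    def intersect(p, q, a, b):  # intersection of segment p-q with the line a-b
        d1, d2 = (q[0]-p[0], q[1]-p[1]), (b[0]-a[0], b[1]-a[1])
        denom = d1[0]*d2[1] - d1[1]*d2[0]
        t = ((a[0]-p[0])*d2[1] - (a[1]-p[1])*d2[0]) / denom
        return (p[0] + t*d1[0], p[1] + t*d1[1])
    output = list(subject)
    for i in range(len(clip)):
        a, b = clip[i], clip[(i + 1) % len(clip)]
        input_list, output = output, []
        for j in range(len(input_list)):
            p, q = input_list[j - 1], input_list[j]
            if inside(q, a, b):
                if not inside(p, a, b):
                    output.append(intersect(p, q, a, b))
                output.append(q)
            elif inside(p, a, b):
                output.append(intersect(p, q, a, b))
        if not output:
            break
    return output
```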
3.4 Procedure of the Proposed System
In summary, intrusion detection on the boundary is realized by the following steps:
1. Defining sensitive planes.
2. Extracting the silhouette of a target object.
3. Generating the vector representation from the silhouette.
4. Projecting each silhouette vector onto sensitive planes.
5. Computing the common region.
6. Deciding the intrusion.
In the next section, we discuss step 1.
4 Construction of a Restricted Area
Using the following relationship, the silhouette of an object on an image plane can be transformed onto a sensitive plane. Let x (∈ ℝ²) be the coordinates of a
Fig. 3. Homography between two planes (viewpoint, image plane, sensitive plane)
point on a sensitive plane. The corresponding point x' on the image plane can be calculated as follows:

$$\mu\tilde{\mathbf{x}}' = H\tilde{\mathbf{x}}, \qquad (1)$$

$$H = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix}, \qquad (2)$$

where x̃ denotes the homogeneous coordinates of x. The matrix H is referred to as a homography matrix, which has only 8 DOF because of scale invariance. From Eq. (2), the homography matrix can be determined from four or more pairs of corresponding points specified by a user. However, this method is a burden on users, who must provide correspondences in proportion to the product of the number of cameras and the number of sensitive planes. Also, it is not easy for users to define an arbitrary restricted area without a reference object. Therefore, in the next section, we introduce a more convenient method for setting a sensitive plane.
4.1 Relation of the Homography Matrix and Projection Matrix
Instead of specifying points on an image from the camera's view, it is easy to place a small marker in the real observed space so that we obtain the corresponding points using the cameras. However, in this case, it is difficult to indicate four points lying exactly on a plane in real 3D space. Therefore, we consider a method in which users input enough 'inner' points of the restricted area, and the system automatically generates a set of sensitive planes which cover all the input points. Now, when we know the projection matrix P, which maps a scene coordinate onto the image plane, the relationship between X, a point in 3D space, and x, a point on the image plane, is given by

$$\lambda\tilde{\mathbf{x}} = P\tilde{\mathbf{X}}. \qquad (3)$$
Likewise, as shown in Fig. 4, a point on the plane Π in 3D space is projected onto the image plane as follows.
Fig. 4. A plane in 3D space projected onto the image plane
$$\lambda\tilde{\mathbf{x}} = P(\alpha\tilde{\mathbf{e}}_1 + \beta\tilde{\mathbf{e}}_2 + \tilde{\boldsymbol{\pi}}_0) \qquad (4)$$
$$\qquad\ = P\,[\tilde{\mathbf{e}}_1\ \tilde{\mathbf{e}}_2\ \tilde{\boldsymbol{\pi}}_0]\,[\alpha\ \ \beta\ \ 1]^T, \qquad (5)$$

where e1, e2 are basis vectors of Π in 3D, and π0 and (α, β) are the origin and the plane parameters of Π, respectively. From Eq. (5), we can compute the homography matrix between an arbitrary plane in 3D and the image plane by

$$H = P\,[\tilde{\mathbf{e}}_1\ \tilde{\mathbf{e}}_2\ \tilde{\boldsymbol{\pi}}_0]. \qquad (6)$$

Therefore, when we know the projection matrices of the cameras and are given three or more points on a plane in 3D, it is possible to define the plane as a sensitive plane, except in singular cases (e.g., all points lie on a line). For example, three adjacent points X0, X1, X2 define one plane:

$$\mathbf{e}_1 := \mathbf{X}_1 - \mathbf{X}_0, \quad \mathbf{e}_2 := \mathbf{X}_2 - \mathbf{X}_0, \quad \boldsymbol{\pi}_0 := \mathbf{X}_0. \qquad (7)$$

As mentioned above, a set of homography matrices can be generated automatically from each given camera projection matrix and the vertices of the sensitive planes in 3D space. However, in our problem, we assume both the camera parameters and the 3D points are unknown. Therefore, we have to calculate both by the projective reconstruction technique [3] using the given corresponding points between cameras.
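Equations (6)-(7) translate directly into a small routine that builds, for one camera, the homography from a triangular sensitive plane to the image. The sketch below is our own illustration; the function name and the normalization are our choices, and the plane vertices are assumed to be given as 3-vectors.

```python
import numpy as np

def plane_homography(P, X0, X1, X2):
    """Homography mapping plane coordinates (alpha, beta, 1) of the sensitive
    plane through the 3D points X0, X1, X2 onto the image plane, following
    Eqs. (6)-(7). P is the 3x4 camera projection matrix."""
    X0, X1, X2 = (np.asarray(p, dtype=float) for p in (X0, X1, X2))
    e1, e2, pi0 = X1 - X0, X2 - X0, X0         # plane basis and origin, Eq. (7)
    homog = lambda v, w: np.append(v, w)       # homogenize with last entry w
    H = P @ np.column_stack([homog(e1, 0.0),   # directions have w = 0
                             homog(e2, 0.0),
                             homog(pi0, 1.0)]) # origin has w = 1
    return H / H[2, 2]                         # normalization is a convenience
```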
4.2 Generation of Sensitive Planes from Reconstructed Inner Points
Now we have the projection matrices and many reconstructed 3D points which reside in the restricted area, so we have to determine suitable sets of 3D points as the vertices of the sensitive planes. We compute the convex hull, which covers all the input points, to generate the sensitive planes. The system defines a restricted
Fig. 5. Points and their convex hull (2D case)
Fig. 6. Flow chart of the proposed system: sensitive plane setup (inputting points by a marker, projective reconstruction, convex hull calculation, generation of sensitive planes), followed by intrusion detection (silhouette extraction, silhouette vectorization, projection onto sensitive planes, common region computation)
area as the boundary of the convex hull computed using qhull [4] (Fig. 5). The reconstructed points that are not on the boundary are removed because they do not contribute to any sensitive plane.
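In practice the hull computation can be delegated to qhull, for example through SciPy. The following sketch is ours (the function name and return format are arbitrary choices); it turns a set of input points into triangular sensitive-plane facets and illustrates that interior points are discarded, as described above.

```python
import numpy as np
from scipy.spatial import ConvexHull

def sensitive_planes(points_3d):
    """Triangular sensitive planes covering all user-input points, taken as the
    boundary facets of their convex hull (computed with qhull via SciPy).
    Returns, for each facet, the tuple (X0, X1, X2) of its vertices."""
    pts = np.asarray(points_3d, dtype=float)
    hull = ConvexHull(pts)                      # qhull under the hood
    return [tuple(pts[idx] for idx in simplex) for simplex in hull.simplices]

# Example: 12 triangular planes are generated for the 8 corners of a cube,
# and a 9th point placed inside the cube does not contribute any plane.
cube = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
planes = sensitive_planes(cube + [(0.5, 0.5, 0.5)])
assert len(planes) == 12
```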
5 Experiment
We implemented the proposed intrusion detection method in a multiple-camera system. From the user's point of view, the system has two phases: one is setting the sensitive planes and the other is executing intrusion detection (see Fig. 6). Since the latter phase is completely automated, users only need to input corresponding points with a simple marker. Therefore, any complicated technical process, such as calibration of the multiple-camera system, is handled internally when setting the actual sensitive planes. In this experiment, we verify the proposed method of sensitive plane generation and intrusion detection in projective space. The system consists of three cameras (SONY DFW-VL500) and a PC (dual Intel Xeon @ 3.6 GHz with HT). We set the cameras at appropriate positions so that each camera can observe the whole region in which to detect an intrusion (Fig. 7).
5.1 Input of Sensitive Plane Using a Colored Marker
We use a simple red-colored marker to input corresponding points among all image planes. First, the user specifies the color of the marker by clicking on the marker area; the system then computes the mean and the variance of that area. According to the Mahalanobis distance between the input color at each pixel and the reference color, the system extracts similar pixels by thresholding the distance. For noise reduction, the center of gravity of the largest region is
Fig. 7. Cameras and observed space
Fig. 8. Setting of restricted area (top: camera view, bottom: extracted marker position)
Fig. 9. Inputted points and generated convex hull
calculated as the marker position (Fig. 8). The user moves the marker in the real scene to set up the restricted area. Fig. 9 shows an example of the sensitive planes generated from the input points. In this case, 16 sensitive planes are generated from 10 of the 12 input points; the remaining two points are removed because they lie inside the convex hull.
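A minimal version of the marker-extraction step described above can be written as follows. This is our own sketch: it uses a full covariance matrix and a squared-distance threshold, whereas the paper only states that a mean and variance are computed and the Mahalanobis distance is thresholded; finding the largest connected region and its center of gravity (e.g., with a labeling routine) is omitted.

```python
import numpy as np

def marker_mask(image, ref_pixels, threshold=3.0):
    """Pixels whose color lies within `threshold` Mahalanobis distance of the
    reference marker color. `image` is an HxWx3 float array, `ref_pixels` an
    Nx3 array of color samples clicked by the user; names are ours."""
    mean = ref_pixels.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(ref_pixels, rowvar=False)
                            + 1e-6 * np.eye(3))          # regularized inverse
    diff = image.reshape(-1, 3) - mean
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)   # squared Mahalanobis
    return (d2 < threshold**2).reshape(image.shape[:2])
```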
5.2 Intrusion Detection
In this experiment, we input eight points on the vertices of a hexahedron. Fig. 10 depicts the generated set of sensitive planes from the input points. In this case, 12 planes are generated by the proposed method. The result of the intrusion detection is shown in Fig. 11. In our implementation, we use a statistical background subtraction method [5] to extract a silhouette of the object from an image. The silhouette is transformed into vector representation by tracking the edge and projected onto each sensitive
Fig. 10. Generated sensitive planes
Fig. 11. Detection result (top: intrusion of a leg, bottom: intrusion of a wrist, reaching for the object)
plane. Then, the system computes the common region on each sensitive plane. In the figure, the leg or wrist of the intruder is detected on the boundary of the restricted area. Although some false-positive silhouette regions can be seen (e.g., the cast shadow in the image of the top row, third column), our method is robust against such noise because the common region is computed over all extracted silhouettes.
6 Conclusion
In this paper, we introduce an intrusion detection system for an arbitrary 3D volumetric restricted area using uncalibrated multiple cameras. Although our algorithm is based on the visual hull method, the whole shape of the intruding object does not need to be reconstructed; instead, the system can efficiently detect an intrusion by perspective projections in 2D space. In general, an intricate calibration process has been necessary for a distributed camera system, but the proposed system automatically calibrates the cameras when users input corresponding points while setting the restricted region. Furthermore, the user does not need any prior knowledge about the cameras because of the projective reconstruction, and any combination of cameras with different intrinsic parameters can be used. Therefore, non-expert users can intuitively operate the proposed system for intrusion detection simply by setting the cameras in place.
References
1. Collins, R.T., Lipton, A.J., Kanade, T., Fujiyoshi, H., Duggins, D., Tsin, Y., Tolliver, D., Enomoto, N., Hasegawa, O., Burt, P., et al.: A System for Video Surveillance and Monitoring (VSAM project final report). Technical Report CMU-RI-TR-00, CMU (2000)
2. Wada, T., Wu, X., Tokai, S., Matsuyama, T.: Homography Based Parallel Volume Intersection: Toward Real-Time Volume Reconstruction Using Active Cameras. In: Proc. Computer Architectures for Machine Perception, pp. 331–339 (2000)
3. Mahamud, S., Hebert, M.: Iterative projective reconstruction from multiple views. In: Proc. CVPR, vol. 2, pp. 430–437 (2000)
4. Barber, C.B., Dobkin, D.P., Huhdanpaa, H.: The quickhull algorithm for convex hulls. ACM Trans. Mathematical Software (TOMS) 22(4), 469–483 (1996), http://www.qhull.org
5. Horprasert, T., Harwood, D., Davis, L.S.: A statistical approach for real-time robust background subtraction and shadow detection. In: ICCV 1999, pp. 1–19 (1999)
Non-parametric Background and Shadow Modeling for Object Detection Tatsuya Tanaka1 , Atsushi Shimada1, Daisaku Arita1,2 , and Rin-ichiro Taniguchi1 1
Department of Intelligent Systems, Kyushu University, 744, Motooka, Nishi-ku, Fukuoka 819–0395 Japan 2 Institute of Systems & Information Technologies/KYUSHU 2–1–22, Momochihama, Sawara-ku, Fukuoka 814–0001 Japan
Abstract. We propose a fast algorithm to estimate background models using Parzen density estimation in non-stationary scenes. Each pixel has a probability density which approximates pixel values observed in a video sequence. It is important to estimate a probability density function fast and accurately. In our approach, the probability density function is partially updated within the range of the window function based on the observed pixel value. The model adapts quickly to changes in the scene and foreground objects can be robustly detected. In addition, applying our approach to cast-shadow modeling, we can detect moving cast shadows. Several experiments show the effectiveness of our approach.
1 Introduction
Background subtraction has traditionally been applied to the detection of objects in images. Without prior information about the objects, we can obtain object regions by subtracting a background image from an observed image. However, when a simple background subtraction technique is applied to video-based surveillance, which usually captures outdoor scenes, it often detects not only objects but also many noise regions. This is because it is quite sensitive to small illumination changes caused by moving clouds, swaying tree leaves, etc. There are many approaches to handling these background changes [1,2,3,4]. Han et al. proposed a background estimation method in which a mixture of Gaussians is used to approximate the background model and the number of Gaussians is variable at each pixel. Their method can handle variations in lighting since a Gaussian is inserted or deleted according to the illumination conditions. However, it takes a long time to estimate the background model. There are also several approaches that estimate the background model in a shorter time [5,6]. For example, Stauffer et al. proposed a fast estimation method that avoids a costly matrix inversion by ignoring the covariance components of multi-dimensional Gaussians [6]. However, the number of Gaussians is constant in their background model. When recently observed pixel values change frequently, a constant number of Gaussians is not always enough to estimate the background model accurately, and it is very difficult to determine the appropriate number of Gaussians in advance. Shimada et al. proposed a fast method in which the number of Gaussians is changed dynamically to adapt to changes in the lighting conditions [7]. However, in principle, the Gaussian Mixture Model (GMM) cannot produce a well-suited background model and cannot detect foreground objects accurately when the intensity
of the background changes frequently. Especially when the intensity distribution of the background is very wide, it is not easy to represent the distribution with a set of Gaussians. In addition, if the number of Gaussians is increased, the computation time to estimate the background model also increases. Thus, the GMM is not powerful enough to represent the various changes of the lighting conditions. To solve this problem, Elgammal et al. employed a non-parametric representation of the background intensity distribution and estimated the distribution by Parzen density estimation [2]. However, in their approach, the computational cost of the estimation is quite high, and it is not easy to apply it to real-time processing. Another problem of background subtraction is that detected foreground regions generally include not only the objects to be detected but also their cast shadows, since the shadow intensity differs from that of the modeled background. This misclassification of shadow regions as foreground objects can cause various unwanted behaviors such as object shape distortion and object merging, affecting surveillance capabilities like target counting and identification. To obtain better segmentation quality, detection algorithms must correctly separate foreground objects from the shadows they cast. To this end, various approaches have been proposed [8,9,10,11,12]. Martel-Brisson et al. proposed a shadow detection method [12] in which the detection of moving cast shadows is incorporated into a background subtraction algorithm. However, they use a GMM to model the background and shadow, and the aforementioned problem of the GMM remains. In this paper, we propose a fast algorithm to estimate a non-parametric probability distribution based on Parzen density estimation, which is applied to background modeling. Also, applying our approach to cast-shadow modeling, we can detect moving cast shadows. Several experiments show its effectiveness, i.e., its accuracy and computational efficiency.
2 Background Estimation by Parzen Density Estimation
2.1 Basic Algorithm
First, we describe the basic background model estimation and object detection process. The background model is established to represent recent pixel information of an input image sequence, reflecting changes of the intensity (or pixel-value) distribution as quickly as possible. We consider the values of a particular pixel (x, y) over time as a "pixel process", which is a time series of pixel values, e.g. scalars for gray values and vectors for color images. Each pixel is judged to be a foreground pixel or a background pixel by observing the pixel process. In Parzen density estimation (kernel density estimation), the probability density function (PDF) of a pixel value is estimated with reference to the latest pixel process, and here we assume that a pixel process consists of the latest N pixel values. Let X be a pixel value observed at pixel (x, y), and {X_1, · · · , X_N} be the latest pixel process. The PDF of the pixel value is estimated with the kernel estimator K as follows:

$$P(\mathbf{X}) = \frac{1}{N}\sum_{i=1}^{N} K(\mathbf{X} - \mathbf{X}_i). \qquad (1)$$
Usually a Gaussian distribution function N(0, Σ) is adopted for the estimator K.¹ In this case, equation (1) reduces to

$$P(\mathbf{X}) = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{X} - \mathbf{X}_i)^T\Sigma^{-1}(\mathbf{X} - \mathbf{X}_i)\right), \qquad (2)$$
where d is the dimension of the distribution (for example, d = 3 for color image pixels). To reduce the computational cost, the covariance matrix in equation (2) is often approximated as

$$\Sigma = \begin{bmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & \sigma_d^2 \end{bmatrix}. \qquad (3)$$

This means that each dimension of the distribution is treated as independent of the others. With this approximation, equation (2) reduces to

$$P(\mathbf{X}) = \frac{1}{N}\sum_{i=1}^{N}\prod_{j=1}^{d} \frac{1}{(2\pi\sigma_j^2)^{1/2}} \exp\!\left(-\frac{1}{2}\frac{(\mathbf{X} - \mathbf{X}_i)_j^2}{\sigma_j^2}\right). \qquad (4)$$
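For reference, Eq. (4) and the thresholding test of the basic algorithm can be written compactly as below. This is our own sketch with assumed names; the per-pixel bookkeeping of the pixel process is omitted.

```python
import numpy as np

def parzen_probability(x, samples, sigma):
    """P(x) of Eq. (4): Parzen estimate with an axis-aligned Gaussian kernel.
    `samples` is the pixel process (N x d array of the latest N pixel values),
    `sigma` the per-dimension smoothing bandwidths (length d)."""
    x, samples, sigma = map(np.asarray, (x, samples, sigma))
    z = (x - samples) / sigma                        # (N, d) standardized diffs
    norm = np.prod(np.sqrt(2.0 * np.pi) * sigma)
    return np.mean(np.exp(-0.5 * np.sum(z**2, axis=1))) / norm

def is_background(x, samples, sigma, threshold):
    """Step 2 of the basic algorithm: threshold the estimated probability."""
    return parzen_probability(x, samples, sigma) > threshold
```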
Here, Σ works as the smoothing parameter.
¹ Here, Σ works as the smoothing parameter.
P( X )
Pt ( X ) = Pt −1 ( X ) +
1 ⎛ | X − X N +1 | ⎞ 1 ⎛ | X − X1 | ⎞ Ȁ⎜ Ȁ⎜ ⎟− ⎟ d Nh d ⎝ h h ⎠ ⎠ Nh ⎝
Update within the range of the window function
h=5
Background
K(u)
d=1
1 h
Threshold
u
䈅䈅䈅 − h
2
0
h 2
Fig. 1. Kernel function of our algorithm
Pixel Value
Observed value
Oldest data
Fig. 2. Update of background model
At first, we use a kernel with rectangular shape, or hypercube, instead of Gaussian distribution function. For example, in 1-dimensional case, the kernel is represented as follows (see Figure 1). 1 if − 12 ≤ u ≤ h2 K(u) = h (5) 0 otherwise where h is a parameter representing the width of the kernel,i.e., some smoothing parameter [13]. Using this kernel, equation (1) is represented as follows: N |X − X i | 1 1 ψ (6) P (X) = N i=1 hd h where, |X − X i | means the chess-board distance in d-dimensional space, and ψ(u) is calculated by the following formula. 1 if u ≤ 12 ψ(u) = (7) 0 otherwise When an observed pixel value is inside of the kernel located at X, ψ(u) is 1; otherwise ψ(u) is 0. Thus, we estimate the PDF based on equation (6), and P (X) is calculated by enumerating pixels in the latest pixel process whose values are inside of the kernel located at X. However, if we calculate the PDF, in a naive way, by enumerating pixels in the latest pixel process whose values are inside of the kernel located at X, the computational time is proportional to N . Instead, we propose a fast algorithm to compute the PDF, whose computation cost does not depend on N . In background modeling we estimate P(X) referring to the latest pixel process consisting of pixel values of the latest N frames. Let us suppose that at time t we have a new pixel value X N +1 , and that we estimate an updated PDF P t (X) referring to the new X N +1 . Basically, the essence of PDF estimation is accumulation of the kernel
estimator: when a new value X_{N+1} is acquired, the kernel estimator corresponding to X_{N+1} should be accumulated. At the same time, the oldest one, i.e., the kernel estimator from N frames earlier, should be discarded, since the length of the pixel process is constant, N. This idea reduces the PDF computation to the following incremental computation:
\[
P_t(X) = P_{t-1}(X) + \frac{1}{N h^d}\, \psi\!\left( \frac{|X - X_{N+1}|}{h} \right) - \frac{1}{N h^d}\, \psi\!\left( \frac{|X - X_1|}{h} \right) \qquad (8)
\]
where P_{t−1} is the PDF estimated at the previous frame. The above equation means that the PDF when a new pixel value is observed can be acquired by:
– increasing the probabilities of the pixel values that are inside the kernel located at the new pixel value X_{N+1} by 1/(N h^d);
– decreasing the probabilities of those that are inside the kernel located at the oldest pixel value X_1 (the pixel value from N frames earlier) by 1/(N h^d).
In other words, the new PDF is acquired by a local operation on the previous PDF, assuming the latest N pixel values are stored in memory, which makes the PDF estimation quite fast. Figure 2 illustrates how the PDF, or the background model, is modified.
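A minimal sketch of this incremental update for the one-dimensional case (d = 1) is given below, keeping the PDF as a discrete table over the 256 gray values. The table resolution and the circular buffer holding the pixel process are implementation assumptions, not choices stated in the paper.

```python
import numpy as np
from collections import deque

class IncrementalParzenBackground:
    """Per-pixel background PDF with a box kernel, updated as in equation (8), d = 1."""

    def __init__(self, n_frames=500, h=5, n_values=256):
        self.N, self.h = n_frames, h
        self.pdf = np.zeros(n_values)           # P_t(X) sampled at every integer gray value
        self.process = deque(maxlen=n_frames)   # the latest N observed values

    def _add(self, value, sign):
        lo = max(0, value - self.h // 2)
        hi = min(len(self.pdf), value + self.h // 2 + 1)
        self.pdf[lo:hi] += sign / (self.N * self.h)   # +/- 1/(N h^d) inside the kernel

    def update(self, new_value):
        """Add the kernel of the new value; discard the kernel of the oldest one."""
        if len(self.process) == self.N:
            self._add(self.process[0], -1.0)    # oldest value, evicted by the deque below
        self.process.append(new_value)
        self._add(new_value, +1.0)

    def probability(self, value):
        # Properly normalized once N frames have been observed.
        return self.pdf[value]
```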
3 Cast-Shadow Modeling by Parzen Density Estimation
In this section, we propose a method to detect moving cast shadows in a background subtraction algorithm. We have developed a cast shadow detection method based on an idea similar to [12]: it relies on the observation that a shadow cast on a surface attenuates the three components of its YUV color by roughly the same ratio. We first estimate this attenuation ratio from the Y component, and then we examine whether both the U and V components are reduced by a similar ratio. More specifically, if the color vector X represents a shadow cast on a surface whose background color vector is B, we have
\[
\alpha_{\min} < \alpha_Y < 1 \quad \text{with } \alpha_Y = \frac{X_Y}{B_Y} \qquad (9)
\]
\[
\min\{|X_U|, |X_V|\} > \epsilon \qquad (10)
\]
\[
\left| \alpha_Y - \frac{X_U}{B_U} \right| < \Lambda \qquad (11)
\]
\[
\left| \alpha_Y - \frac{X_V}{B_V} \right| < \Lambda \qquad (12)
\]
where B denotes the pixel value of the highest probability. α_min is a threshold on the maximum luminance reduction. This threshold is important when the U and V components of a color are small, in which case any dark pixel value would otherwise be labeled as a shadow on a light-colored surface. ε is a threshold on the minimum value of the U and V components. If either X_U or X_V does not satisfy equation (10), we use only equation (9). Λ represents the tolerable chromaticity fluctuation around the surface values B_U,
B_V. If these conditions are satisfied, the pixel value is regarded as "pseudo-shadow," and the shadow model is updated with this pixel value using a procedure similar to the one described in section 2.2. The detailed algorithm of shadow model construction and shadow detection is summarized as follows:
1. Background subtraction is performed with the dynamic background model described in section 2.2.
2. If the pixel is labeled as foreground, P_S(X_{N+1}), the probability that X_{N+1} belongs to a cast shadow, is estimated. If P_S(X_{N+1}) is greater than a given threshold, the pixel is judged to be a shadow pixel; otherwise, it is judged to be an object pixel. In the shadow model, however, the number of "pseudo-shadow" pixel values may not be sufficient to approximate the shadow model, because the shadow model is updated only when the observed pixel is regarded as "pseudo-shadow." For such pixels, equations (9)–(12) are used directly for shadow detection.
3. When the observed pixel value satisfies equations (9)–(12), the pixel value is regarded as "pseudo-shadow," and the shadow model is updated in a way similar to that described in section 2.2.
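The per-pixel shadow test of equations (9)–(12) can be sketched as follows. The threshold values for α_min, Λ, and the U/V minimum-value threshold ε are illustrative guesses, since the paper does not state them at this point.

```python
def is_pseudo_shadow(x_yuv, b_yuv, alpha_min=0.3, lam=0.05, eps=10.0):
    """Check the cast-shadow conditions (9)-(12) for one pixel.

    x_yuv : observed (Y, U, V) value
    b_yuv : background (Y, U, V) value with the highest probability
    alpha_min, lam (Lambda) and eps are illustrative threshold guesses.
    """
    xy, xu, xv = x_yuv
    by, bu, bv = b_yuv
    if by == 0:
        return False
    alpha_y = xy / by
    if not (alpha_min < alpha_y < 1.0):            # condition (9): luminance attenuation
        return False
    if min(abs(xu), abs(xv)) <= eps:               # condition (10) fails: use (9) only
        return True
    if bu == 0 or bv == 0:
        return False
    return (abs(alpha_y - xu / bu) < lam and       # conditions (11) and (12):
            abs(alpha_y - xv / bv) < lam)          # chromaticity reduced by a similar ratio
```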
4 Experiment
4.1 Experiment 1: Experiment on the Dynamic Background Model
In our experiment verifying the effectiveness of the proposed method, we have used the PETS2001 data set, after the image resolution is reduced to 320 × 240 pixels. The data set includes images in which people and cars are passing through streets and tree leaves are flickering, i.e., the illumination conditions are varying rapidly. Using this data set, we have compared the proposed method with the adaptive GMM (a GMM in which the number of Gaussians is adaptively changed) [7] and Elgammal's method based on Parzen density estimation [2]. In this experiment, supposing that the R, G, B components of pixel values are independent of one another, we estimate a one-dimensional PDF of each component. Then, we have judged a pixel as a foreground pixel when the probability of at least one component, either R, G or B, is below a given threshold. For the evaluation of computation speed, we have used a PC with a Pentium IV 3.3 GHz and 2.5 GB memory. Next, we have evaluated the computation time to process one image frame. For the proposed algorithm, we have used h = 5 and N = 500. Figure 3 shows the comparison between the proposed method and the adaptive GMM method, where the horizontal axis is the frame-sequence number and the vertical axes are the processing time (left) and the average number of Gaussians assigned to each pixel (right). In this experiment, the number of Gaussians increases after the 2500th frame, where people and cars, i.e., foreground objects, begin to appear in the scene.
Benchmark data of the International Workshop on Performance Evaluation of Tracking and Surveillance, available from ftp://pets.rdg.ac.uk/PETS2001/.
Fig. 3. Processing time of adaptive GMM and average number of Gaussians (processing time in msec and number of Gaussians vs. frame number, for the traditional approach and the proposed method)
Fig. 4. The number of samples N and the required processing time (msec)
Fig. 5. Recall and precision (%) of the proposed method, the Gaussian mixture model, and the traditional approach
In the adaptive GMM method, the number of Gaussians is increased so that the changes of pixel values are properly represented in the GMM. However, when the number of Gaussians increases, the computation time also increases. On the other hand, the computation time of the proposed method does not change depending on the scene, which shows that the real-time characteristic, i.e., the invariance of the processing speed, of the proposed method is much better than that of the adaptive GMM method. Next, Figure 4 shows a comparison between the proposed method and Elgammal's method based on Parzen density estimation. In Elgammal's method, the computation time is almost proportional to the length of the pixel process from which the PDF is estimated, and, from the viewpoint of real-time processing, we cannot use a long image sequence to estimate the PDF. For example, in a standard PC environment like the one in our experiment, only up to 200 frames can be used for the PDF estimation in Elgammal's method. In our method, on the other hand, when we estimate the PDF, we just update it in a local region, i.e., in the kernel located at the oldest pixel value and in the kernel located at the newly observed pixel value, and the computation cost does not depend on the length of the pixel process at all. Finally, to evaluate the object detection accuracy, we examine the precision and recall rates of object detection. Precision and recall are defined as follows:
\[
\text{precision} = \frac{\#\text{ of correctly detected objects}}{\#\text{ of detected objects}} \qquad (13)
\]
\[
\text{recall} = \frac{\#\text{ of correctly detected objects}}{\#\text{ of objects which should be detected}} \qquad (14)
\]
Fig. 6. Object detection by the proposed method and the GMM-based method: (a) input image, (b) background image, (c) proposed method, (d) GMM method
When we apply our proposed method and Elgammal's method, we set N = 500. In addition, we set h = 5 in our method. Figure 5 shows the precision and recall when the data set is processed by the proposed method, the adaptive GMM method, and Elgammal's method, where the vertical axis is the recall and precision rate. This shows that the proposed method outperforms the adaptive GMM method. It is also shown that the proposed method gives almost the same performance as Elgammal's method, although the proposed method uses a simple kernel function, i.e., the rectangular function shown in Figure 1. We have achieved a recall rate of 94.38% and a precision rate of 88.89%. Figure 6 shows the results of object detection by the proposed method. Figure 6(a) is an input image frame. Figure 6(b) is the background model acquired when that input image frame is acquired, showing the pixel value having the highest probability at each pixel. Figure 6(c) shows the detected objects. Figure 6(d) shows the object detection result acquired by the adaptive GMM method. Comparing these two results, the proposed method exhibits a very good result.
4.2 Experiment 2: Experiment on the Dynamic Shadow Model
We took indoor scenes in which people were walking on the floor. These images include shadows of various darkness cast by the pedestrians. The size of each image is 320 × 240 and each pixel has a 24-bit RGB value. We have compared the proposed method with the adaptive Gaussian Mixture Shadow Model (GMSM) [12]. For object detection, the dynamic background models based on Parzen density estimation and on the GMM are used, respectively. Figure 7 shows the results of shadow detection by the proposed method. Figure 7(a) is an input image frame. Figure 7(b) shows the shadow detection result acquired by the
Fig. 7. Shadow detection by the proposed method and the GMSM-based method: (a) input image, (b) proposed method, (c) GMSM method
proposed method. Figure 7(c) shows the shadow detection result acquired by the GMSM method. The red colored pixels represent pixels judged to be shadow pixels. In Figure 7(b), the green colored pixels represent pixels judged to be shadow pixels directly by equations (9)–(12); these pixels cannot be examined by the probabilistic model, because the number of pseudo-shadow pixels is not sufficient to estimate the probability distribution of the shadow pixel value. Comparing these two results with each other, the proposed method exhibits a good result. In addition, the computation time of the proposed method is superior to that of GMSM, i.e., the former is 88 msec per image frame while the latter is 97 msec.
5 Conclusion
In this paper, we have proposed a fast computation method to estimate a non-parametric background model using Parzen density estimation. We estimate the PDF of the background pixel value at each pixel position. In general, to estimate the PDF at every image frame, a pixel value sequence of the latest N frames, or a pixel process, should be referred to. In our method, using a simple kernel function, the PDF can be estimated from the PDF at the previous frame using local operations on the PDF. This greatly reduces the computation cost of the PDF estimation. Comparison of our method with the GMM-based method and Elgammal's method based on Parzen density estimation shows that our method has the following merits: small computation cost, a real-time characteristic (invariance of computation speed), and good object detection accuracy.
In addition, applying our approach to shadow modeling, we can construct a shadow model and detect moving cast shadows correctly. Comparison of our method with the GMSM-based method shows its effectiveness, i.e., its accuracy and computation speed. Future work is summarized as follows:
– Reduction of memory space.
– Precision improvement of shadow detection.
References 1. Han, B., Comaniciu, D., Davis, L.: Sequential Kernel Density Approximation through Mode Propagation: Applications to Background Modeling. In: Asian Conference on Computer Vision 2004, pp. 818–823 (2004) 2. Elgammal, A., Duraiswami, R., Harwood, D., Davis, L.S.: Background and Foreground Modeling Using Nonparametric Kernel Density Estimation for Visual Surveillance. In: Proceedings of the IEEE, vol. 90, pp. 1151–1163 (2002) 3. Toyama, K., Krumm, J., Brumitt, B., Meyers, B.: Wallflower: Principle and Practice of Background Maintenance. In: International Conference on Computer Vision, pp. 255–261 (1999) 4. Harville, M.: A Framework for High-Level Feedback to Adaptive, Per-Pixel, Mixture-ofGaussian Background Models. In: the 7th European Conference on Computer Vision, vol. III, pp. 543–560 (2002) 5. Lee, D.-S.: Online Adaptive Gaussian Mixture Learning for Video Applications. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, pp. 105–116. Springer, Heidelberg (2004) 6. Stauffer, C., Grimson, W.E.L.: Adaptive background mixture models for real-time tracking. Computer Vision and Pattern Recognition 2, 246–252 (1999) 7. Shimada, A., Arita, D., Taniguchi, R.i.: Dynamic Control of Adaptive Mixture-of-Gaussians Background Model. In: Proceedings of IEEE International Conference on Advanced Video and Signal Based Surveillance 2006 (2006) 8. Salvador, E., Cavallaro, A., Ebrahimi, T.: SHADOW IDENTIFICATION AND CLASSIFICATION USING INVARIANT COLOR MODELS. In: Proc. of IEEE International Conference on Acoustics, vol. 3, pp. 1545–1548 (2001) 9. Cucchiara, R., Grana, C., Piccardi, M., Prati, A., Sirotti, S.: Improving Shadow Suppression in Moving Object Detection with HSV Color Information. In: IEEE Intelligent Transportation Systems Conference Proceedings, pp. 334–339 (2001) 10. Schreer, O., Feldmann, I., Golz, U., Kauff, P.: FAST AND ROBUST SHADOW DETECTION IN VIDEOCONFERENCE APPLICATION. 4th IEEE Intern. Symposium on Video Proces. and Multimedia Comm, 371–375 (2002) 11. Bevilacqua, A.: Effective Shadow Detection in Traffic Monitoring Applications. In: The 11th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision (2003) 12. Martel-Brisson, N., Zaccarin, A.: Moving Cast Shadow Detection from a Gaussian Mixture Shadow Model. IEEE Computer Society International Conference on Computer Vision and Pattern Recognition (2005) 13. Parzen, E.: On the estimation of a probability density function and mode. The Annals of Mathematical Statistics 33(3), 1065–1076 (1962)
Road Sign Detection Using Eigen Color Luo-Wei Tsai1, Yun-Jung Tseng1, Jun-Wei Hsieh2, Kuo-Chin Fan1, and Jiun-Jie Li1 1
Department of CSIE, National Central University Jung-Da Rd., Chung-Li 320, Taiwan
[email protected] 2 Department of E. E., Yuan Ze University 135 Yuan-Tung Road, Chung-Li 320, Taiwan
[email protected]
Abstract. This paper presents a novel color-based method to detect road signs directly from videos. A road sign usually has specific colors and high contrast to its background. Traditional color-based approaches need to train different color detectors if the road signs to be detected have different colors. This paper presents a novel color model derived from the Karhunen-Loeve (KL) transform to detect road sign color pixels from the background. The proposed color transform model is invariant to different perspective effects and occlusions. Furthermore, only one color model is needed to detect various road signs. After transformation into the proposed color space, an RBF (Radial Basis Function) network is trained to find all possible road sign candidates. Then, a verification process is applied to these candidates according to their edge maps. Due to the filtering effect and discriminative ability of the proposed color model, different road signs can be detected very efficiently from videos. Experimental results have proved that the proposed method is robust, accurate, and powerful in road sign detection.
1 Introduction
Traffic sign detection is an important and essential task in a driver support system. The texts on road signs carry much useful information, such as the speed limit, guided direction, and current traffic situation, to help drivers drive safely and comfortably. However, it is very challenging to detect road signs directly from still images or videos due to large changes in environmental conditions. In addition, when the camera is moving, perspective effects will make a road sign have different sizes, shapes, contrast changes, or motion blurs. Moreover, it will sometimes be occluded by natural objects such as trees. To tackle the above problems, many works [1]-[9] have been proposed for automatic road sign detection and recognition. Since a road sign usually has a high-contrast color and a regular shape, these approaches can be categorized into color-based and shape-based ones. For the color-based approach, in [1], Escalera et al. used a color thresholding technique to separate road sign regions from the background in the RGB color domain. In addition to the RGB space, other color spaces like YIQ and HSV are also good for road sign detection. For example, in [2], Kehtarnavaz and Ahmad used a discriminant analysis on the YIQ color space for
detecting desired road signs from the background. Since road signs have different colors (like red, blue, or green) for showing different warning or direction messages, different color detectors should be designed to tackle these color variations. In addition to color, shape is another important feature for detecting road signs. In [7], Barnes and Zelinsky adopted the fast radial symmetry detector to detect possible road sign candidates and then verified them using a correlation technique. Wu et al. [6] used the corner feature and a vertical plane criterion to cluster image data for finding possible road sign candidates. Blancard [9] used an edge linking technique and the contour feature to locate all possible road sign candidates and then verified them according to their perimeters and curvature features. Usually, different shapes of road signs represent different warning functions. Different shape detectors must then be designed, which makes the detection process very time-consuming. Therefore, some hybrid methods have been proposed for road sign detection. For example, Bahlmann et al. [8] used a color representation, integral features, and the AdaBoost algorithm to train a strong classifier such that a real-time traffic sign detector can be achieved. Furthermore, Fang et al. [3] used fuzzy neural networks and gradient features to locate and track road signs. The major disadvantage of the shape-based approach is that a road sign has large shape variations when the camera is moving. This paper presents a novel hybrid method to detect road signs from videos using an eigen color and shape feature. First of all, this paper proposes a novel eigen color model for searching possible road sign candidates from videos. The model makes road sign colors more compact, so that they concentrate in a small area. It is learned by observing how road sign colors change in static images under different lighting conditions and cluttered backgrounds. It is global and doesn't need to be re-estimated for any new road signs or new input images. Although no prior knowledge of surface reflectance, weather conditions, or view geometry is used in the training phase, the model still locates road sign pixels very efficiently against the background. Even though road signs have different colors, only a single model is needed. After the transformation, the RBF network is used to find the best hyper-plane to separate the road sign colors from the background. Then, a verification engine is built to verify these candidates using their edge maps. The engine records appearance characteristics of road signs and has good discriminative properties to verify the correctness of each candidate. In this system, the eigen color model can filter out most of the background pixels in advance, so only a few candidates need to be further checked. In addition, no matter what color the road sign is, only one eigen color model is needed for color classification. Due to the filtering effect and discriminative abilities of the proposed method, different road signs can be effectively detected from videos. Experimental results have proved the superiority of the proposed method in road sign detection.
2 Eigen Color Detection
A road sign usually has a specific color which is in high contrast to the background. The color information can be used to narrow down the search area for finding road signs more efficiently. For example, in Fig. 1(a), the road sign has a specific "green" color. Then, we can use a green color detector to filter out all non-green objects.
Fig. 1. Green color detection in HIS color space. (a) Original image. (b) Result of green color detection.
However, after simple green color classification, many non-road-sign objects (with green color) are detected, as shown in Fig. 1(b). A precise color modeling method is necessary for road sign detection. In addition, different road signs have different specific colors (green, red, or blue). In contrast to most previous systems, which design different "specific" color detectors, this paper presents a single eigen color model to detect all kinds of road signs.
2.1 Eigen Color Detection Using Dimension Reduction
Our idea is to design a single eigen-color transform model for detecting road sign pixels from the background. At first, we collect a lot of road sign images from various highways, roads, and natural scenes under different lighting and weather conditions. Fig. 2(a) shows part of our training samples. Assume that there are N training images. Through a statistical analysis, we can get the covariance matrix Σ of the color
Fig. 2. Parts of training samples. (a) Road sign images. (b) Non-road sign images.
distributions of the R, G, and B channels from these N images. Using the Karhunen-Loeve (KL) transform, the eigenvectors and eigenvalues of Σ can be obtained and represented as e_i and λ_i, respectively, for i = 1, 2, and 3. Then, three new color features C_i can be defined as
\[
C_i = e_{ir} R + e_{ig} G + e_{ib} B \quad \text{for } i = 1, 2, 3, \qquad (1)
\]
where e_i = (e_{ir}, e_{ig}, e_{ib}). The color feature C_1 with the largest eigenvalue is the one used for the color-to-gray transform, i.e.,
\[
C_1 = \frac{1}{3} R + \frac{1}{3} G + \frac{1}{3} B. \qquad (2)
\]
The other two color features C_2 and C_3 are orthogonal to C_1 and have the following forms:
\[
C_2 = \frac{2(R - B)}{5} \quad \text{and} \quad C_3 = \frac{R + B - 2G}{5}. \qquad (3)
\]
In [10], Healey used a similar idea for image segmentation and pointed out that the colors of homogeneous dielectric surfaces cluster closely along the axis directed by Eq. (2), i.e., (1/3, 1/3, 1/3). In other words, if we project all the road sign colors onto a plane which is perpendicular to the axis pointed to by C_1, the road sign colors will concentrate in a small area. The above principal component analysis (PCA) inspires us to analyze road signs so that a new color model can be found.
Fig. 3. Eigen color re-projection. (a) Original image. (b) Result of projection on eigen color map.
This paper defines the plane ( C2 , C3 ) as a new color space (u, v). Then, given an input image, we first use Eq.(3) to project all color pixels on the (u, v) space. Then, the problem becomes a 2-class separation problem, which tries to find a best decision boundary from the (u, v) space such that all road sign color pixels can be well separated from non-road sign ones. Fig. 3(b) shows the projection result of road sign pixels and non-road sign pixels. The green and red regions denote the results of re-projection of green and red road signs, respectively. The blue region is the result of background. We also re-project the tree region and green road signs (shown in Fig. 3(a)) on the (u, v) space. Although these two regions are both “green”, they can be easily separated on the (u, v) space if a proper classifier is designed for finding the best separation boundary. In what follows, road sign pixels are fed into the RBF network for this classification task.
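A small sketch of this projection is given below: it maps an RGB image into the (u, v) = (C2, C3) plane that is fed to the classifier. The normalization constants follow Eq. (3) as reconstructed above and may differ slightly from the authors' exact implementation.

```python
import numpy as np

def eigen_color_uv(image_rgb):
    """Project an H x W x 3 RGB image into the (u, v) = (C2, C3) space of Eq. (3)."""
    rgb = image_rgb.astype(np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    u = 2.0 * (r - b) / 5.0            # C2
    v = (r + b - 2.0 * g) / 5.0        # C3
    return np.stack([u, v], axis=-1)   # H x W x 2 feature map for pixel classification
```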
2.2 Eigen Color Pixel Classification Using Radial Basis Function Network
A RBF network’s structure is similar to multilayer perceptrons. The RBF network we used includes an input layer, one hidden layer, and an output layer. Each hidden neuron is associated with a kernel function. The most commonly used kernel function (also called an activation function) is Gaussian. The output units is approximated as a linear combination of a set of kernel functions, i.e., R
ψ i ( x ) = ∑ wijϕ j ( x ) , for i=1, …, C, j =1
where wij is the connection weight between the jth hidden neuron and ith output layer neuron, and C the number of outputs. The output of the radial basis function is limited to the interval (0, 1) by a sigmoid function:
Fi ( x ) =
1 . 1 + exp(-ψ i ( x ))
When training the RBF network, we use the back-propagation rule to adjust the output connection weights, the mean vector, and the variance vectors of the hidden layer. The parameters wij of the RBF networks are computed by the gradient descent method such that the cost function is minimized: E=
1 N
N
C
∑∑ ( y (x ) − F ( x )) k =1 i =1
i
k
i
k
2
,
where N is the number of inputs and yi ( xk ) the ith output associated with the input sample xk from the training set. Then, if a pixel belongs to the road sign class, it will be labeled to 1; otherwise, 0. When training, all pixels in the (R, G, B) domain are first transformed to the (u, v) domain using Eq. (3).
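The following sketch shows an RBF classifier of this kind operating on (u, v) pixel features. Note that the paper trains the centers, variances, and output weights jointly by back-propagation, whereas this simpler stand-in fixes the Gaussian centers with k-means and learns only the output weights; the hidden-layer size and kernel width are arbitrary choices of the sketch.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

class SimpleRBFNet:
    """RBF classifier on (u, v) pixels: Gaussian hidden units plus a sigmoid output."""

    def __init__(self, n_hidden=20, gamma=10.0):
        self.n_hidden, self.gamma = n_hidden, gamma

    def _phi(self, X):
        # Gaussian activations of every sample against every hidden-unit center.
        d2 = ((X[:, None, :] - self.centers[None, :, :]) ** 2).sum(-1)
        return np.exp(-self.gamma * d2)

    def fit(self, X, y):
        """X: (n, 2) uv features; y: 0/1 labels (1 = road sign pixel)."""
        self.centers = KMeans(self.n_hidden, n_init=10).fit(X).cluster_centers_
        self.out = LogisticRegression(max_iter=1000).fit(self._phi(X), y)
        return self

    def predict_proba_roadsign(self, X):
        return self.out.predict_proba(self._phi(X))[:, 1]
```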
3 Candidate Verification
After color segmentation, different road sign candidates can be extracted. To verify these candidates, we use the road sign's shape to filter out impossible candidates. The verification process is a coarse-to-fine scheme that gradually removes impossible candidates. At the coarse stage, two criteria are first used to roughly eliminate a large number of impossible candidates. The first criterion requires the dimension of a road sign candidate R to be large enough. The second criterion requires the road sign to have enough edge pixels; candidates with E_R / Area_R < 0.02 are eliminated, where E_R and Area_R are the number of edge pixels and the area of R, respectively.
Fig. 4. Result of distance transform. (a) Original Image. (b) Edge map. (c) Distance transform of (b).
After the coarse verification, a fine verification procedure is further applied to verify each candidate using its shape. Assume that B_R is the set of boundary pixels extracted from R. Then, the distance transform of a pixel p in R is defined as
\[
DT_R(p) = \min_{q \in B_R} d(p, q), \qquad (4)
\]
where d(p, q) is the Euclidean distance between p and q. In order to enhance the strength of distance changes, Eq. (4) is further modified as follows:
\[
\overline{DT}_R(p) = \min_{q \in B_R} d(p, q) \times \exp\bigl(\kappa\, d(p, q)\bigr), \qquad (5)
\]
where κ = 0.1. Fig. 4 shows the result of the distance transform: (a) is an image R of a road sign and (b) is its edge map; Fig. 4(c) is the result of the distance transform of Fig. 4(b). If we scan all pixels of R in row-major order, a set F_R of contour features can be represented as a vector, i.e.,
\[
F_R = [\, \overline{DT}_R(p_0), \ldots, \overline{DT}_R(p_i), \ldots\,], \qquad (6)
\]
where all p_i belong to R and i is the scanning index. In addition to the outer contour, a road sign usually contains many text patterns. To verify a road sign candidate more accurately, its outer shape is more important than its inner text patterns. To reflect this fact, a new weight w_i, which increases with the distance between the pixel p_i and the origin O, is included. Assume that O is the center of R, r_i is the distance between p_i and O, and the circumcircle of R has radius z. Then, the weight w_i is defined by
\[
w_i = \begin{cases} \exp(-|r_i - z|^2), & \text{if } r_i \le z; \\ 0, & \text{otherwise}. \end{cases} \qquad (7)
\]
Then, Eq. (6) can be rewritten as follows:
\[
F_R = [\, w_0\, \overline{DT}_R(p_0), \ldots, w_i\, \overline{DT}_R(p_i), \ldots\,]. \qquad (8)
\]
This paper assumes that only three types of road signs, i.e., circles, triangles, and rectangles, need to be verified. For each type R_i, a set of training samples is
collected in advance to capture shape characteristics. If there are N_i templates in R_i, we can calculate the mean μ_i and variance Σ_i of F_R from all samples in R_i. Then, given a road sign candidate H, the similarity between H and R_i can be measured by
\[
S(H, R_i) = \exp\!\left( -(\bar{F}_H - \mu_i)\, \Sigma_i^{-1}\, (\bar{F}_H - \mu_i)^{t} \right), \qquad (9)
\]
where t denotes the transpose of a vector.
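A sketch of the weighted distance-transform feature of Eqs. (5)–(8) for one candidate region follows, using SciPy's Euclidean distance transform. Taking the bounding-box center as O and its half-diagonal as the circumcircle radius z are simplifying assumptions of this sketch.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def contour_feature(boundary_mask, kappa=0.1):
    """Weighted distance-transform feature F_R of Eqs. (5)-(8) for one candidate.

    boundary_mask : boolean H x W array, True on the edge pixels of the candidate R.
    Returns the feature vector in row-major scanning order.
    """
    h, w = boundary_mask.shape
    dt = distance_transform_edt(~boundary_mask)        # min_q d(p, q) to the boundary
    dt = dt * np.exp(kappa * dt)                       # Eq. (5): emphasise larger distances

    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0              # take the box centre as O
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - cy, xx - cx)                     # r_i: distance of p_i to O
    z = np.hypot(cy, cx)                               # circumcircle radius of R (assumed)
    wgt = np.where(r <= z, np.exp(-(r - z) ** 2), 0)   # Eq. (7): the outer contour dominates

    return (wgt * dt).ravel()                          # Eq. (8)
```

The similarity of Eq. (9) is then a Mahalanobis-style score between this vector and the per-type mean and covariance collected from the templates.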
4 Experimental Results
To examine the performance of our proposed method, several video sequences on highways and roads were adopted. The sequences were captured under different road and weather conditions (such as sunny and cloudy). The camera was mounted at the front of the car, and its optical axis is not required to be perpendicular to the road sign. The frame rate of our system is over 20 fps. Fig. 5 shows the results of road sign color detection using the proposed method. For comparison, the color thresholding technique [1] was also implemented. Fig. 5(a) is the original image and Fig. 5(b) is the result using the color thresholding technique. Many false regions were detected in Fig. 5(b). Fig. 5(c) is the result of eigen color classification. Notice that only one eigen color model was used to detect all the desired road signs even though their colors were different. Compared with the thresholding technique, our proposed scheme has a much lower false detection rate. A lower false detection rate means that less computation time is
Fig. 5. Result of color classification. (a) Original image. (b) Detection result of color thresholding [1]. (c) Eigen color classification.
needed for candidate verification. In addition, the color thresholding technique needs several scanning passes to detect road signs if they have different colors. Thus, our method performs much more efficiently than traditional color-based approaches.
Fig. 6. Detection results of rectangular road sign
Fig. 7. Detection result when a skewed road sign or a low-quality frame was handled
Fig. 6 shows the detection results when rectangular road signs were handled. Even though the tree regions have similar color to the road signs, our method still worked very well to detect the desired road signs. Fig. 7 shows the detection results when skewed road signs or a low-quality video frame were handled. No matter how skewed and what color the road sign is, our proposed method performed well to detect it from the background.
Fig. 8. Detection results of circular road signs
Fig. 8 shows the detection results when circular road signs were captured under different lighting conditions. The conditions included low lighting, skewed shapes, and multiple signs. However, our method still worked well to detect all these circular road signs. Furthermore, we also used our scheme to detect triangular road signs. Fig. 9 shows the detection results when triangular road signs were handled. No matter what types or colors of road signs were handled, our proposed method detected them very successfully.
Fig. 9. Detection results of triangular road signs
Fig. 10. Road sign detection in a video sequence under a sunny day. (a), (b), and (c): Consecutive detection results of a road sign from a video.
The next set of experiments demonstrates the performance of our method in detecting road signs under different weather conditions in video sequences. Fig. 10 shows a series of detection results when consecutive video frames captured on a sunny day were handled. In Fig. 10(a) and (b), a smaller and darker road sign was detected. Then, its size gradually became larger. Fig. 10(c) shows the detection result of a larger road sign. Clearly, even though the road sign changed in size, all its variations were successfully detected using our proposed method. Fig. 11 shows the detection results for a series of road signs captured on a cloudy day. In Fig. 11(a), a very small road sign was detected; its color was also darker. In Fig. 11(b), (c), and (d), its size gradually became larger. No matter how the size of the road sign changes, it can still be well detected using our proposed method. Experimental results have proved the superiority of our proposed method in real-time road sign detection.
Fig. 11. Road sign detection in a video sequence under a cloudy day
5 Conclusion
This paper presents a novel eigen color model for road sign detection. With this model, different road sign candidates can be quickly located no matter what colors they have. The model is global and doesn't need to be re-estimated. Even though the road signs are lit under different illuminations, the model still works very well to identify them against the background. After that, a coarse-to-fine verification scheme is applied to effectively identify all candidates according to their edge maps. Since most impossible candidates have been filtered out in advance, desired road signs can be located very quickly. Experimental results have proved the superiority of our proposed method in real-time road sign detection.
References [1] Escalera, A.D.L., et al.: Road Traffic Sign Detection and Classification. IEEE Transaction on Industrial Electronics 44(6), 848–859 (1997) [2] Kehtarnavaz, N., Ahmad, A.: Traffic sign recognition in noisy outdoor scenes. In: Proceedings of Intelligent Vehicles 1995 Symposium, pp. 460–465 (September 1995) [3] Fang, C.-Y., Chen, S.-W., Fuh, C.-S.: Road-sign detection and tracking. IEEE Transactions on Vehicular Technology 52(5), 1329–1341 (2003) [4] Chen, X., Yang, J., Zhang, J., Waibel, A.: Automatic detection and recognition of signs from natural scenes. IEEE Transactions on Image Processing 13(1), 87–99 (2004) [5] Loy, G., Barnes, N.: Fast shaped-based road sign detection for a Driver Assistance System. In: IROS 2004 (2004)
[6] Wu, W., Chen, X., Yang, J.: Detection of Text on Road Signs From Video. IEEE Transactions on ITS 6(4), 378–390 (2005) [7] Barnes, N., Zelinsky, A.: Real-time radial symmetry for speed sign detection. In: Proc. IEEE Intelligent Vehicles Symposium, Italy, pp. 566–571 (June 2004) [8] Bahlmann, C., et al.: A system for traffic sign detection, tracking, and recognition using color, shape, and motion information. In: Proceedings of IEEE Intelligent Vehicles Symposium, pp. 255–260 (June 2005) [9] de Saint Blancard, M.: Road Sign Recognition: A Study of Vision-based Decision Making for Road Environment Recognition, ch. 7. Springer, Heidelberg (1991) [10] Healey, G.: Segmenting Images Using Normalized Color. IEEE Transactions on Systems, Man, and Cybernetics 22(1), 64–73 (1992)
Localized Content-Based Image Retrieval Using Semi-Supervised Multiple Instance Learning
Dan Zhang1, Zhenwei Shi2, Yangqiu Song1, and Changshui Zhang1
1 State Key Laboratory on Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology (TNList), Department of Automation, Tsinghua University, Beijing 100084, China
2 Image Processing Center, School of Astronautics, Beijing University of Aeronautics and Astronautics, Beijing 100083, P.R. China
[email protected], [email protected], [email protected], [email protected]
Abstract. In this paper, we propose a Semi-Supervised Multiple-Instance Learning (SSMIL) algorithm and apply it to Localized Content-Based Image Retrieval (LCBIR), where the goal is to rank all the images in the database according to the object that users want to retrieve. SSMIL treats LCBIR as a semi-supervised problem and utilizes the unlabeled pictures to help improve the retrieval performance. The comparison of SSMIL with several state-of-the-art algorithms gives promising results.
1 Introduction
Much work has been done in applying Multiple Instance Learning (MIL) to Localized Content-Based Image Retrieval (LCBIR). One main reason is that, in LCBIR, what a user wants to retrieve is often an object in a picture, rather than the whole picture itself. Therefore, in order to tell the retrieval system what he really wants, the user often has to provide several pictures with the desired object in them, as well as several pictures without this object, either directly or through relevance feedback. Each picture with the desired object is then treated as a positive bag, while the other query pictures are considered negative ones. Furthermore, after using image segmentation techniques to divide the images into small patches, each patch represents an instance. In this way, the problem of image retrieval can be converted into an MIL one. The notion of Multi-Instance Learning was first introduced by Dietterich et al. [1] to deal with drug activity prediction. A collection of different shapes of the same molecule is called a bag, while its different shapes represent different instances. A bag is labeled positive if and only if at least one of its instances is positive; otherwise, the bag is negative. This basic idea was extended by several following works. Maron et al. [2] proposed another MIL algorithm, Diverse Density (DD). They tried to find a target in the feature space that resembled the positive instances most, and this target was called a concept point. Then they applied this
The work was supported by the National Science Foundation of China (60475001, 60605002).
method to solve the task of natural scene classification [3]. Zhang and Goldman [6] combined Expectation Maximization with DD and developed EM-DD, an algorithm much more efficient than the previous DD algorithm, to search for the desired concept. They extended their idea in [7] and made some modifications to ensemble the different concept points returned by EM-DD with different initial values. This is reasonable, since the desired object cannot be described by only one concept point. Andrews et al. [10] used an SVM-based method to solve the MI problem. They then developed an efficient algorithm based on linear programming boosting [11]. Y. Chen et al. [4] combined EM-DD and SVM and devised DD-SVM. Recently, P.-M. Cheung et al. [9] gave a regularized framework for this problem. Z.-H. Zhou et al. [15] also initiated research on the Multiple-Instance Multiple-Label problem and applied it to scene classification. All the above works assume that each negative bag should not contain any positive instance, but there may be exceptions. After image segmentation, the desired object may be divided into several different patches. The pictures without this object may also contain a few patches that are similar to those of the object and should not be retrieved. So, negative bags may also contain positive instances, if we consider each patch as an instance. Based on this assumption, Y. Chen et al. [5] recently devised a new algorithm called Multiple-Instance Learning via Embedded Instance Selection (MILES) to solve multiple instance problems. So far, some developments of MIL have been reviewed. When it comes to LCBIR, one natural problem is that users are often unwilling to provide many labeled pictures, and therefore the inadequate number of labeled pictures poses a great challenge to the existing MIL algorithms. Semi-supervised algorithms are designed to handle the situation when the labeled information is inadequate. Some typical semi-supervised algorithms include Semi-Supervised SVM [13], Transductive SVM [12], graph-based semi-supervised learning [14], etc. How to convert a standard MIL problem into a semi-supervised one has received some attention. Recently, R. Rahmani and S. Goldman combined a modified version of DD and graph-based semi-supervised algorithms, and put forward the first graph-based Semi-Supervised MIL algorithm, MISSL [8]. They adopted an energy function to describe the likelihood of an instance being a concept point, and redefined the weights between different bags. In this paper, we propose a new algorithm, Semi-Supervised Multiple-Instance Learning (SSMIL), to solve the Semi-Supervised MIL problem, and the results are promising. Our paper is outlined as follows: in Section 2, the motivation of our algorithm is introduced. In Section 3, we give the proposed algorithm. In Section 4 the experimental results are presented. Finally, a conclusion is given in Section 5.
2 Motivation
A bag can be mapped into a feature space determined by the instances in all the labeled bags. To be more precise, a bag B is embedded in this feature space as follows [5]:
\[
m(B) = [\, s(x^1, B), s(x^2, B), \cdots, s(x^n, B)\,]^T \qquad (1)
\]
Here, s(x^k, B) = max_t exp(−‖b_t − x^k‖² / σ²). σ is a predefined scaling parameter, x^k is the kth instance among all the n instances in the labeled bags, and b_t denotes the tth instance in the bag B. Then, the whole labeled set can be mapped to the matrix
\[
[\, m_1^+, \cdots, m_{l^+}^+, m_1^-, \cdots, m_{l^-}^- \,]
= [\, m(B_1^+), \cdots, m(B_{l^+}^+), m(B_1^-), \cdots, m(B_{l^-}^-) \,]
= \begin{bmatrix}
s(x^1, B_1^+) & \cdots & s(x^1, B_{l^-}^-) \\
s(x^2, B_1^+) & \cdots & s(x^2, B_{l^-}^-) \\
\vdots & \ddots & \vdots \\
s(x^n, B_1^+) & \cdots & s(x^n, B_{l^-}^-)
\end{bmatrix} \qquad (2)
\]
B_1^+, ..., B_{l^+}^+ denote the bags labeled positive, while B_1^-, ..., B_{l^-}^- refer to the negatively labeled bags. Each column represents a bag. If x^k is near some positive bags and far from some negative ones, the corresponding dimension is useful for discrimination. In MILES [5], a 1-norm SVM is trained to select features and get their corresponding weights from this feature space as follows:
\[
\begin{aligned}
\min_{w,b,\xi,\eta} \quad & \lambda \sum_{k=1}^{n} |w_k| + C_1 \sum_{i=1}^{l^+} \xi_i + C_2 \sum_{j=1}^{l^-} \eta_j \\
\text{s.t.} \quad & (w^T m_i^+ + b) + \xi_i \ge 1, \quad i = 1, \ldots, l^+, \\
& -(w^T m_j^- + b) + \eta_j \ge 1, \quad j = 1, \ldots, l^-, \\
& \xi_i, \eta_j \ge 0, \quad i = 1, \ldots, l^+, \; j = 1, \ldots, l^-
\end{aligned} \qquad (3)
\]
Here, C_1 and C_2 reflect the loss penalties imposed on the misclassification of positive and negative bags, respectively. λ is a regularization parameter, which controls the trade-off between the complexity of the classifier and the hinge loss. It can be seen that this formulation does not restrict all the instances in negative bags to be negative. Since the 1-norm SVM is utilized, a sparse solution can be obtained, i.e., in this solution, only a few w_k in Eq. (3) are nonzero. Hence, MILES finds the most important instances in the labeled bags and their corresponding weights. MILES gives impressive results on several data sets and has shown its advantages over several other methods, such as DD-SVM [4], MI-SVM [10] and k-means SVM [16], both in accuracy and speed. However, the image retrieval task is itself a semi-supervised problem: only a few labeled pictures are available for searching a tremendous database. The utilization of the unlabeled pictures may therefore actually improve the retrieval performance.
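A minimal sketch of the bag embedding of Eqs. (1)–(2) is given below: each bag is described by its maximal kernel similarity to every instance collected from the labeled bags. The array shapes are assumptions of this sketch.

```python
import numpy as np

def embed_bag(bag_instances, concept_instances, sigma=0.5):
    """Eq. (1): m(B)_k = max_t exp(-||b_t - x^k||^2 / sigma^2).

    bag_instances     : (t, dim) instances of one bag
    concept_instances : (n, dim) instances collected from all labeled bags
    """
    d2 = ((concept_instances[:, None, :] - bag_instances[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2).max(axis=1)    # one similarity per instance x^k

def embed_bags(bags, concept_instances, sigma=0.5):
    """Stack the embeddings of all bags column-wise, as in Eq. (2)."""
    return np.column_stack([embed_bag(b, concept_instances, sigma) for b in bags])
```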
3 Semi-Supervised Multiple Instance Learning (SSMIL)
3.1 The Formulation of Semi-Supervised Multiple Instance Learning
In this section, we give the formulation of Semi-Supervised Multiple Instance Learning. Our aim is to maximize margins not only on the labeled bags but also on the unlabeled ones. A straightforward way is to map both the labeled and unlabeled bags into the feature space determined by all the labeled bags, using Eq. (2). Then, we try to solve the following optimization problem:
\[
\begin{aligned}
\min_{w,b,\xi,\eta,\zeta} \quad & \lambda \sum_{k=1}^{n} |w_k| + C_1 \sum_{i=1}^{l^+} \xi_i + C_2 \sum_{j=1}^{l^-} \eta_j + C_3 \sum_{u=1}^{|U|} \zeta_u \\
\text{s.t.} \quad & (w^T m_i^+ + b) + \xi_i \ge 1, \quad i = 1, \cdots, l^+ \\
& -(w^T m_j^- + b) + \eta_j \ge 1, \quad j = 1, \cdots, l^- \\
& y_u^* (w^T m_u + b) + \zeta_u \ge 1, \quad u = 1, \cdots, |U| \\
& \xi_i, \eta_j, \zeta_u \ge 0, \quad i = 1, \cdots, l^+, \; j = 1, \cdots, l^-, \; u = 1, \cdots, |U|
\end{aligned} \qquad (4)
\]
The difference between Eq. (3) and Eq. (4) is the appended penalty term imposed on the unlabeled data. C_3 is the penalty parameter that controls the effect of the unlabeled data, and y_u^* is the label assigned to the uth unlabeled bag during the training phase.
3.2 The Up-Speed of Semi-Supervised Multiple Instance Learning (UP-SSMIL)
Directly solving the optimization problem (4) is too time-consuming, because, in Eq. (4), all the unlabeled pictures are required to be mapped into the feature space determined by all the instances in the labeled bags, and most of the time will be spent on the feature mapping step (Eq. (2)). In this paper, we try to speed up this process and propose UP-SSMIL. After each labeled bag is mapped into the feature space by Eq. (2), all the unlabeled bags can also be mapped into this feature space according to Eq. (1). As mentioned in Section 2, the 1-norm SVM can find the most important features, i.e., the predominant instances in the training bags. Hence, the dimension of each bag representation can be greatly reduced, with the irrelevant features being discarded. So, we propose using MILES as a first step to select the most important instances, and mapping each bag B in both the labeled and unlabeled set into the space determined by these instances as follows:
\[
m(B) = [\, s(z^1, B), s(z^2, B), \cdots, s(z^v, B) \,]^T \qquad (5)
\]
Here, z^k is the kth selected instance and v denotes the total number of selected instances. This is a supervised step. Then, we intend to use the unlabeled bags to improve the performance by optimizing the feature weights of the selected
Table 1. UP-SSMIL Algorithm
1. Feature Mapping 1: Map each labeled bag into the feature space determined by the instances in the labeled bags, using Eq. (2).
2. MILES Training: Use the 1-norm SVM to train a classifier, utilizing only the training bags. Then, each feature in the feature space determined by the training instances is assigned a weight, i.e., w_k in Eq. (3). The regularizer in this step is denoted as λ_1.
3. Feature Selecting: Select the features with nonzero weights.
4. Feature Mapping 2: Map all the unlabeled and labeled bags into the feature space determined by the features selected in the previous step, i.e., the selected instances, using Eq. (5).
5. TSVM Training: Taking into account both the re-mapped labeled and unlabeled bags, use TSVM to train a classifier. The regularizer in TSVM is denoted as λ_2.
6. Classifying: Use this classifier to rank the unlabeled bags.
features. A Transductive Support Vector Machine (TSVM) [12] algorithm is employed to learn these weights. The whole UP-SSMIL algorithm is depicted in Table 1. In this algorithm, TSVM is a 2-norm Semi-Supervised SVM. The reason why a 1-norm Semi-Supervised SVM is not employed is that, after the feature selection step, all the selected features are relevant to the final solution, whereas a 1-norm Semi-Supervised SVM would again favor a sparse w and discard some of them. Therefore, it is not used here.
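A rough sketch of steps 1–4 of Table 1 follows. It uses scikit-learn's L1-regularized linear SVM as a stand-in for the 1-norm SVM of Eq. (3) (the losses are not identical), stops where the transductive step begins, and the parameter values are arbitrary.

```python
import numpy as np
from sklearn.svm import LinearSVC

def _embed(bags, concepts, sigma):
    """Eq. (1)/(5): one row per bag, one column per concept instance."""
    rows = []
    for b in bags:
        d2 = ((concepts[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        rows.append(np.exp(-d2 / sigma ** 2).max(axis=1))
    return np.vstack(rows)

def up_ssmil_features(labeled_bags, labels, unlabeled_bags, sigma=0.5, C=0.5):
    """Embed the labeled bags, select instances with an L1 linear SVM, then re-map
    both labeled and unlabeled bags onto the selected instances (Eq. (5))."""
    all_instances = np.vstack(labeled_bags)
    X_lab = _embed(labeled_bags, all_instances, sigma)
    selector = LinearSVC(C=C, penalty='l1', dual=False).fit(X_lab, labels)
    selected = np.flatnonzero(np.abs(selector.coef_.ravel()) > 1e-8)
    concepts = all_instances[selected]                 # the selected instances z^k
    return _embed(labeled_bags, concepts, sigma), _embed(unlabeled_bags, concepts, sigma)
```

The transductive training of step 5 would then be run on these low-dimensional bag vectors with an external TSVM solver such as SVMlin, and the resulting decision values used to rank the unlabeled bags.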
4 Experiments
We test our method on SIVAL, which can be obtained at www.cs.wustl.edu/~sg/multi-inst-data/. Some sample images are shown in Fig. 1. In this database, each image is pre-segmented into around 30 patches. Color, texture and
Fig. 1. Some sample images in the SIVAL dataset: (a) SpriteCan, (b) WD40Can
Table 2. Average AUC values with 95% confidence intervals, with 8 randomly selected positive and 8 randomly selected negative pictures
Category             UP-SSMIL    MISSL       MILES       Accio!      Accio!+EM
FabricSoftenerBox    97.2±0.7    97.7±0.3    96.8±0.9    86.6±2.9    44.4±1.1
CheckeredScarf       95.5±0.5    88.9±0.7    95.1±0.8    90.8±1.5    58.1±4.4
FeltFlowerRug        94.6±0.8    90.5±1.1    94.1±0.8    86.9±1.6    51.1±24.8
WD40Can              90.5±1.3    93.9±0.9    86.9±3.0    82.0±2.4    50.3±3.0
CockCan              93.4±0.8    93.3±0.9    91.8±1.3    81.5±3.4    48.5±24.6
GreenTeaBox          90.9±1.9    80.4±3.5    89.4±3.1    87.3±2.9    46.8±3.5
AjaxOrange           90.1±1.7    90.0±2.1    88.4±2.8    77.0±3.4    43.6±2.4
DirtyRunningShoe     87.2±1.3    78.2±1.6    85.6±2.1    83.7±1.9    75.4±19.8
CandleWithHolder     85.4±1.7    84.5±0.8    83.4±2.3    68.8±2.3    57.9±3.0
SpriteCan            84.8±1.1    81.2±1.5    82.1±2.8    71.9±2.4    59.2±22.1
JulisPot             82.1±2.9    68.0±5.2    78.8±3.5    79.2±2.6    51.2±24.5
GoldMedal            80.9±3.0    83.4±2.7    76.1±3.9    77.7±2.6    42.1±3.6
DirtyWorkGlove       81.9±1.7    73.8±3.4    80.4±2.2    65.3±1.5    57.8±2.9
CardBoardBox         81.1±2.3    69.6±2.5    78.4±3.0    67.9±2.2    57.8±2.9
SmileyFaceDoll       80.7±1.8    80.7±2.0    77.7±2.8    77.4±3.2    48.0±25.8
BlueScrunge          76.7±2.6    76.8±5.2    73.2±2.8    69.5±3.3    36.3±2.5
DataMiningBook       76.6±1.9    77.3±4.3    74.0±2.3    74.7±3.3    37.7±4.9
TranslucentBowl      76.3±2.0    63.2±5.2    74.0±3.1    77.5±2.3    47.4±25.9
StripedNoteBook      75.1±2.6    70.2±2.9    73.2±2.5    70.2±3.1    43.5±3.1
Banana               69.2±3.0    62.4±4.3    66.4±3.4    65.9±3.2    43.6±3.8
GlazedWoodPot        68.6±2.8    51.5±3.3    69.0±3.0    72.7±2.2    51.0±2.8
Apple                67.8±2.7    51.1±4.4    64.7±2.8    63.4±3.3    43.4±2.7
RapBook              64.9±2.8    61.3±2.8    64.6±2.3    62.8±1.7    57.6±4.8
WoodRollingPin       64.1±2.1    51.6±2.6    63.5±2.0    66.7±1.7    52.5±23.9
LargeSpoon           58.6±1.9    50.2±2.1    57.7±2.1    57.6±2.3    51.2±2.5
Average              80.6        74.8        78.6        74.6        50.3
neighborhood features have already been extracted for each patch, and they form a set of 30-dimensional feature vectors. In our experiments, these features are normalized to lie exactly in the range from 0 to 1, and the scaling parameter σ is chosen to be 0.5. We treat each picture as a bag, and each patch in this picture as an instance in this bag. The source code of MILES is obtained from [17], and TSVM is obtained from [18]. During each trial, 8 positive pictures are randomly selected from one category, and another 8 negative pictures are randomly selected as background pictures from the other 24 categories. The retrieval speed of UP-SSMIL is quite fast: on our computer, for each round, UP-SSMIL takes only 25 seconds while SSMIL takes around 30 minutes. For convenience, only the results of UP-SSMIL are reported here. We will demonstrate below that it achieves the best performance on the SIVAL database. In UP-SSMIL's MILES Training step in Table 1 and in MILES (see Eq. (3)), λ_1 is set to 0.2, and C_1 and C_2 are set to 0.5. In UP-SSMIL's TSVM Training step in Table 1 (for a detailed description of the parameters, see the reference for SVMlin [18]),
Fig. 2. The comparison result between UP-SSMIL and MILES: AUC values on the SpriteCan and WD40Can categories as the number of labeled pictures |L| and the number of unlabeled pictures |U| vary
λ_2 is set to 0.1. The positive class fraction of the unlabeled data is set to 0.01. The other parameters in SVMlin are all set to their default values. In image retrieval, the ROC curve is a good measure of performance, so the area under the ROC curve (the AUC value) is used here to measure the performance. All the results reported here are averaged over 30 independent runs, with a 95% confidence interval being calculated. The final comparison result is shown in Table 2. From this table, it can be seen that, among all the 25 categories, UP-SSMIL performs better than MISSL for most categories and worse for only a few. This may be due to two reasons. For one thing, MISSL uses an inadequate number of pictures to learn the likelihood of each instance being positive, and the "steepness factor" in MISSL is relatively hard to determine; these may lead to an inaccurate energy function. For another, on the graph level, MISSL uses just one vertex to represent all the negative training vertices, and assumes the weights connecting this vertex to all the unlabeled vertices to be the same, which results in some inaccuracy as well. Furthermore, after the pre-calculation of the distances between different instances, MISSL takes 30-100 seconds to get a retrieval result, while UP-SSMIL takes no more than 30 seconds without the need to calculate these distances. This
is quite understandable: in the first Feature Mapping step in Table 1, UP-SSMIL only needs to calculate the distances within the training bags. Since the number of query images is small, this calculation burden is relatively light. Then, after the features have been selected, the unlabeled bags only need to be mapped into the space determined by these few selected features. In our experiments, this dimension can be reduced to around 10, so the calculation cost of the second Feature Mapping step in Table 1 is very low. With the dimensions greatly reduced, TSVM obtains the solution relatively fast. Compared with other supervised methods, such as MILES, Accio! [7] and Accio!+EM [7], the performance of UP-SSMIL is also quite promising. Some comparison results with its supervised counterpart, MILES, are provided in Fig. 2. We illustrate how the learning curve changes when both the number of labeled pictures (|L|) and the number of unlabeled pictures (|U|) vary. It can be seen that UP-SSMIL always outperforms its supervised counterpart.
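For reference, the evaluation protocol described above (AUC averaged over independent runs with a 95% confidence interval) can be reproduced with a few lines. The normal-approximation interval below is an assumption of this sketch, since the paper does not state how its intervals are computed.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mean_auc_with_ci(run_scores, run_labels):
    """Average AUC over independent runs with a normal-approximation 95% interval.

    run_scores / run_labels: one array of ranking scores and 0/1 relevance labels per run.
    """
    aucs = np.array([roc_auc_score(y, s) for y, s in zip(run_labels, run_scores)])
    half_width = 1.96 * aucs.std(ddof=1) / np.sqrt(len(aucs))
    return aucs.mean(), half_width
```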
5 Conclusion
In this paper, we propose a semi-supervised SVM framework for Multiple Instance learning, SSMIL. It uses the unlabeled pictures to help improve the performance. Then, UP-SSMIL is presented to accelerate the retrieval speed. Finally, we demonstrate its superior performance on the SIVAL database.
References 1. Dietterich, T.G., Lathrop, R.H., Lozano-P¨erez, T.: Solving the Multiple Instance Problem with Axis-Parallel Rectangles. Artificial Inteligence 1446, 1–8 (1998) 2. Maron, O., Lozano-P¨erez, T.: A Framework for Multiple-Instance Learning. Advances in Neural Information Processing System 10, 570–576 (1998) 3. Maron, O., Ratan, A.L.: Multiple-Instance Learning for Natural Scene Classification. In: Proc. 15th Int’l. Conf. Machine Learning, pp. 341–349 (1998) 4. Chen, Y., Wang, J.Z.: Image Categorization by Learning and Reasoning with Regions. J. Machine Learning Research 5, 913–939 (2004) 5. Chen, Y., Bi, J., Wang, J.Z.: MILES: Multiple-Instance Learning via Embedded Instance Selection. IEEE Transatctions on Pattern Analysis and Machine Intelligence 28(12) (2006) 6. Zhang, Q., Goldman, S.: EM-DD: An improved Multiple-Instance Learning. In: Advances in Neural Information Processing System, vol. 14, pp. 1073–1080 (2002) 7. Rahmani, R., Goldman, S., Zhang, H., et al.: Localized Content-Based Image Retrieval. In: Proceedings of ACM Workshop on Multimedia Image Retrieval, ACM Press, New York (2005) 8. Rahmani, R., Goldman, S.: MISSL: Multiple-Instance Semi-Supervised Learning. In: Proc. 23th Int’l. Conf. Machine Learning, pp. 705–712 (2006) 9. Cheung, P.-M., Kwok, J.T.: A Regularization Framework for Multiple-Instance Learning. In: ICML (2006) 10. Andrews, S., Tsochantaridis, I., Hofmann, T.: Support Vector Machines for Multiple-Instance Learning. In: Advances in Neural Information Processing System, vol. 15, pp. 561–568 (2003)
11. Andrews, S., Hofmann, T.: Multiple Instance Learning via Disjunctive Programming Boosting. In: Advances in Neural Information Processing System, vol. 16, pp. 65–72 (2004)
12. Joachims, T.: Transductive Inference for Text Classification using Support Vector Machine. In: Proc. 16th Int'l. Conf. Machine Learning, pp. 200–209 (1999)
13. Bennett, K.P., Demiriz, A.: Semi-supervised support vector machines. In: Advances in Neural Information Processing System, vol. 11, pp. 368–374 (1999)
14. Zhu, X.: Semi-supervised learning literature survey. Technical Report 1530, Department of Computer Sciences, University of Wisconsin at Madison (2006)
15. Zhou, Z.H., Zhang, M.L.: Multi-Instance Multi-Label Learning with Application to Scene Classification. In: Advances in Neural Information Processing System (2006)
16. Csurka, G., Bray, C., Dance, C., Fan, L.: Visual Categorization with Bags of Keypoints. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, pp. 59–74. Springer, Heidelberg (2004)
17. http://john.cs.olemiss.edu/~ychen/MILES.html
18. http://people.cs.uchicago.edu/~vikass/svmlin.html
Object Detection Combining Recognition and Segmentation
Liming Wang1, Jianbo Shi2, Gang Song2, and I-fan Shen1
1 Fudan University, Shanghai, PRC, 200433 {wanglm,yfshen}@fudan.edu.cn
2 University of Pennsylvania, 3330 Walnut Street, Philadelphia, PA 19104
[email protected],
[email protected]
Abstract. We develop an object detection method combining top-down recognition with bottom-up image segmentation. There are two main steps in this method: a hypothesis generation step and a verification step. In the top-down hypothesis generation step, we design an improved Shape Context feature, which is more robust to object deformation and background clutter. The improved Shape Context is used to generate a set of hypotheses of object locations and figure-ground masks, which have a high recall and a low precision rate. In the verification step, we first compute a set of feasible segmentations that are consistent with top-down object hypotheses; then we propose a False Positive Pruning (FPP) procedure to prune out false positives. We exploit the fact that false positive regions typically do not align with any feasible image segmentation. Experiments show that this simple framework is capable of achieving both high recall and high precision with only a few positive training examples and that this method can be generalized to many object classes.
1 Introduction Object detection is an important, yet challenging vision task. It is a critical part in many applications such as image search, image auto-annotation and scene understanding; however it is still an open problem due to the complexity of object classes and images. Current approaches [1,2,3,4,5,6,7,8,9,10] to object detection can be categorized by top-down, bottom-up or combination of the two. Top-down approaches [2,11,12] often include a training stage to obtain class-specific model features or to define object configurations. Hypotheses are found by matching models to the image features. Bottomup approaches start from low-level or mid-level image features, i.e. edges or segments [5,8,9,10]. These methods build up hypotheses from such features, extend them by construction rules and then evaluate by certain cost functions. The third category of approaches combining top-down and bottom-up methods have become prevalent because they take advantage of both aspects. Although top-down approaches can quickly drive attention to promising hypotheses, they are prone to produce many false positives when features are locally extracted and matched. Features within the same hypothesis may not be consistent with respect to low-level image segmentation. On the other hand, bottom-up approaches try to keep consistency in low level image segmentation, but usually need much more efforts in searching and grouping. Y. Yagi et al. (Eds.): ACCV 2007, Part I, LNCS 4843, pp. 189–199, 2007. c Springer-Verlag Berlin Heidelberg 2007
Fig. 1. Method overview. Our method has three parts (shaded rectangles). Codebook building (cyan) is the training stage, which generates codebook entries containing improved SC features and object masks. Top-down recognition (blue) generates multiple hypotheses via improved SC matching and voting in the input image. The verification part (pink) aims to verify these top-down hypotheses using bottom-up segmentation. Round-corner rectangles are processes and ordinary rectangles are input/output data.
Wisely combining these two can avoid exhaustive searching and grouping while maintaining consistency in object hypotheses. For example, Borenstein et al. enforce continuity along segmentation boundaries to align matched patches [2]. Levin et al. take into account both bottom-up and top-down cues simultaneously in the framework of CRF [3]. Our detection method falls into this last category of combining top-down recognition and bottom-up segmentation, with two major improvements over existing approaches. First, we design a new improved Shape Context (SC) for the top-down recognition. Our improved SC is more robust to small deformation of object shapes and background clutter. Second, by utilizing bottom-up segmentation, we introduce a novel False Positive Pruning (FPP) method to improve detection precision. Our framework can be generalized to many other object classes because we pose no specific constraints on any object class. The overall structure of the paper is organized as follows. Sec. 2 provides an overview to our framework. Sec.3 describes the improved SCs and the top-down hypothesis generation. Sec.4 describes our FPP method combining image segmentation to verify hypotheses. Experiment results are shown in Sec.5, followed by discussion and conclusion in Sec.6.
2 Method Overview Our method contains three major parts: codebook building, top-down recognition using matching and voting, and hypothesis verification, as depicted in Fig.1. The object models are learned by building a codebook of local features. We extract improved SC as local image features and record the geometrical information together with object figure-ground masks. The improved SC is designed to be robust to shape variances and background clutters. For rigid objects and objects with slight articulation, our experiments show that only a few training examples suffice to encode local shape information of objects. We generate recognition hypotheses by matching local image SC features to the codebook and use SC features to vote for object centers. A similar top-down voting scheme is described in the work of [4], which uses SIFT point features for pedestrian
Fig. 2. Angular Blur. (a) and (b) are different bin responses of two similar contours. (c) are their histograms. (d) enlarges angular span θ to θ , letting bins be overlapped in angular direction. (e) are the responses on the overlapped bins, where the histograms are more similar.
detection. The voting result might include many false positives due to small context of local SC features. Therefore, we combine top-down recognition with bottom-up segmentation in the verification stage to improve the detection precision. We propose a new False Positive Pruning (FPP) approach to prune out many false hypotheses generated from top-down recognition. The intuition of this approach is that many false positives are generated due to local mismatches. These local features usually do not have segmentation consistency, meaning that pixels in the same segment should belong to the same object. True positives are often composed of several connected segments while false positives tend to break large segments into pieces.
3 Top-Down Recognition In the training stage of top-down recognition, we build up a codebook of improved SC features from training images. For a test image, improved SC features are extracted and matched to codebook entries. A voting scheme then generates object hypotheses from the matching results. 3.1 Codebook Building For each object class, we select a few images as training examples. Object masks are manually segmented and only edge map inside the mask is counted in shape context histogram to prune out edges due to background clutter. The Codebook Entries (CE) are a repository of example features: CE = {cei }. Each codebook entry cei = (ui , δi , mi , wi ) records the feature for a point i in labelled objects of the training images. Here ui is the shape context vector for point i. δi is the position of point i relative to the object center. mi is a binary mask of figure-ground segmentation for the patch centered at point i. wi is the weight mask computed on mi , which will be introduced later. 3.2 Improved Shape Context The idea of Shape Context (SC) was first proposed by Belongie et al. [13]. The basic definition of SC is a local histogram of edge points in a radius-angle polar grid. Following works [14,15] improve its distinctive power by considering different edge orientations. Besides SC, other local image features such as wavelets, SIFT and HOG have been used in keypoint based detection approaches [4,12].
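For concreteness, a codebook entry can be held in a small record like the following (a sketch with our own field names mirroring (ui, δi, mi, wi); not code from the paper):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CodebookEntry:
    """One codebook entry ce_i = (u_i, delta_i, m_i, w_i) as described in Sec. 3.1."""
    u: np.ndarray      # shape context vector of point i
    delta: np.ndarray  # offset of point i relative to the object center, e.g. (dx, dy)
    m: np.ndarray      # binary figure-ground mask of the patch centered at point i
    w: np.ndarray      # per-bin weight mask computed from m (cf. Eq. 2)

# the codebook is then simply a list of such entries:
# codebook = [CodebookEntry(u, delta, m, w) for (u, delta, m, w) in training_points]
```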
Suppose there are nr (radial) by nθ (angular) bins and the edge map E is divided into E1, ..., Eo by o orientations (similar to [15]). For a point at p, its SC is defined as u = {h1, ..., ho}, where

hi(k) = #{q ≠ p : q ∈ Ei, vector pq ∈ bin(k)},  k = 1, 2, ..., nr nθ    (1)
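For illustration, the orientation-split histogram of Eq. (1) can be computed along the following lines (a sketch with our own names and illustrative bin counts, not the authors' implementation):

```python
import numpy as np

def shape_context(p, edge_points, edge_orients, n_r=3, n_theta=12, n_o=4, r_max=50.0):
    """Oriented shape context of point p (cf. Eq. 1), as an (n_o, n_r*n_theta) histogram.

    edge_points : (N, 2) float array of edge-point coordinates.
    edge_orients: (N,) int array of orientation-channel indices in [0, n_o).
    The log-polar binning, radius and channel count are illustrative choices.
    """
    d = np.asarray(edge_points, dtype=float) - np.asarray(p, dtype=float)   # vectors p -> q
    dist = np.hypot(d[:, 0], d[:, 1])
    keep = (dist > 0) & (dist <= r_max)                     # q != p, inside the descriptor radius
    r_bin = np.minimum((np.log1p(dist[keep]) / np.log1p(r_max) * n_r).astype(int), n_r - 1)
    a_bin = ((np.arctan2(d[keep, 1], d[keep, 0]) + np.pi) / (2 * np.pi) * n_theta).astype(int) % n_theta
    o = np.asarray(edge_orients)[keep]
    hist = np.zeros((n_o, n_r * n_theta))
    np.add.at(hist, (o, r_bin * n_theta + a_bin), 1)        # h_i(k) = #{q : q in E_i, pq in bin(k)}
    return hist
```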
Angular Blur. A common problem for the shape context is that when dense bins are used or contours are close to the bin boundaries, similar contours have very different histograms (Fig. 2-(c)). This leads to a large distance for two similar shapes if the L2-norm or χ2 distance function is used. EMD [16] alleviates this by solving a transportation problem, but it is computationally much more expensive. The way we overcome this problem is to overlap the spans of adjacent angular bins: bin(k) ∩ bin(k + 1) ≠ ∅ (Fig. 2-(d)). This amounts to blurring the original histogram along the angular direction. We call such an extension Angular Blur. An edge point in the overlapped regions is counted in both of the adjacent bins, so two contours close to the original bin boundary will have similar histograms for the overlapping bins (Fig. 2-(e)). With angular blur, even the simple L2-norm can tolerate slight shape deformation. It improves the basic SC without the expensive computation of EMD.
Mask Function on Shape Context. In real images, objects' SCs always contain background clutter. This is a common problem for matching local features. Unlike learning methods [1,12] which use a large number of labeled examples to train a classifier, we propose to use a mask function to focus only on the parts inside the object while ignoring the background during matching. For ce = (u, δ, m, w) and an SC feature f in the test image, each bin of f is masked by the figure-ground patch mask m of ce to remove the background clutter. Formally, we compute the weight w for bin k and the distance function with mask as:

w(k) = Area(bin(k) ∩ m) / Area(bin(k)),  k = 1, 2, ..., nr nθ    (2)

Dm(ce, f) = D(u, w · v) = ||u − w · v||²    (3)
where (·) is the element-wise product. D can be any distance function computing the dissimilarity between histograms (We simply use L2 -norm). Figure 3 gives an example for the advantage of using mask function. 3.3 Hypothesis Generation The goal of hypothesis generation is to predict possible object locations as well as to estimate the figure-ground segmentation for each hypothesis. Our hypothesis generation is based on a voting scheme similar to [4]. Each SC feature is compared with every codebook entry and makes a prediction of the possible object center. The matching scores are accumulated over the whole image and the predictions with the maximum scores are the possible object centers. Given a set of detected features {fi } at location {li }, we define the probability of matching codebook entry cek to fi as P (cek |li ) ∝ exp(−Dm (cek , fi )). Given the match of cek to fi , the probability of an object o with
Fig. 3. Distance function with mask. In (a), a feature point v has the edge map of a1 around it. Using object mask b1 , it succeeds to find a good match to u in B (object model patch), whose edge map is b2 . a2 is the object mask b1 over a1 . Only the edge points falling into the mask area are counted for SC. In (b), histograms of a1 , a2 and b2 are shown. With the mask function, a2 is much closer to b2 , thus got well matched.
center located at c is defined as P(o, c | cek, li) ∝ exp(−||c + δk − li||²). Now the probability of the hypothesis of object o with center c is computed as:

P(o, c) = Σi,k P(o, c | cek, li) P(cek | li) P(li)    (4)
P(o, c) gives a voting map V over the different locations c for the object class o. Extracting local maxima in V gives a set of hypotheses {Hj} = {(oj, cj)}. Furthermore, the figure-ground segmentation for each Hj can be estimated by backtracing the matching results. For those fi giving the correct prediction, the patch mask m in the codebook is “pasted” to the corresponding image location as the figure-ground segmentation. Formally, for a point p in the image at location pl, we define P(p = fig | cek, li) as the probability of point p belonging to the foreground when the feature at location li is matched to the codebook entry cek: P(p = fig | cek, li) ∝ exp(−||pl − li||) mk(vector pl li). We further assume that P(cek, li | Hj) ∝ P(oj, cj | cek, li) and P(fi | cek) ∝ P(cek | fi). The figure-ground probability for hypothesis Hj is estimated as

P(p = fig | Hj) ∝ Σk,i exp(−||pl − li||) mk(vector pl li) P(fi | cek) P(cek, li | Hj)    (5)

Eq. (5) gives the estimation of the top-down segmentation. The whole process of top-down recognition is shown in Fig. 4. The binary top-down segmentation (F, B) into figure (F) and background (B) is then obtained by thresholding P(p = fig | Hj).
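The voting of Eq. (4) amounts to accumulating matching scores at the predicted centers. A minimal sketch (our own simplification, with Dm replaced by a plain L2 distance and all names ours) is:

```python
import numpy as np

def vote_object_centers(features, codebook, img_shape):
    """Accumulate the voting map V of Eq. (4) -- an illustration, not the authors' code.

    features : list of (sc_vector, (x, y)) pairs extracted from the test image.
    codebook : list of (sc_vector, (dx, dy)) entries, where (dx, dy) is the offset delta_k
               from the feature location to the object center.
    """
    V = np.zeros(img_shape, dtype=float)
    for f_vec, (fx, fy) in features:
        for c_vec, (dx, dy) in codebook:
            p_match = np.exp(-np.linalg.norm(np.asarray(f_vec) - np.asarray(c_vec)))  # ~ P(ce_k | l_i)
            cx, cy = int(round(fx + dx)), int(round(fy + dy))    # predicted center c = l_i + delta_k
            if 0 <= cy < img_shape[0] and 0 <= cx < img_shape[1]:
                V[cy, cx] += p_match                             # P(o, c) accumulates over i, k
    return V   # local maxima of V give the hypotheses {H_j}; backtracing the voters gives Eq. (5)
```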
4 Verification: Combining Recognition and Segmentation
From our experiments, the top-down recognition using the voting scheme will produce many False Positives (FPs). In this section, we propose a two-step procedure of False Positive Pruning (FPP) to prune out FPs. In the first step, we refine the top-down hypothesis mask by checking its consistency with the bottom-up segmentation. Second, the final score on the refined mask is recomputed by considering spatial constraints.
Fig. 4. Top-down recognition. (a) An input image; (b) A matched point feature votes for 3 possible positions; (c) The vote map V; (d) The hypothesis Hj traces back to find its voters {fi}; (e) Each fi predicts the figure-ground configuration using Eq. (5).
Combining Bottom-up Segmentation. The basic idea for local feature voting is to make global decision by the consensus of local predictions. However, these incorrect local predictions using a small context can accumulate and confuse the global decision. For example, in pedestrian detection, two trunks will probably be locally taken as human legs and produce a human hypothesis (in Fig. 5-(a)); another case is the silhouettes from two standing-by pedestrians.
Fig. 5. Combining bottom-up segmentation. FPs tend to spread out as multiple regions from different objects. In the example of (a), an object O consists of five parts (A, B, C, D, E). (A′ ∈ O1, D′ ∈ O2, E′ ∈ O3) are matched to (A, D, E) because locally they are similar, and the hypothesis O′ = (A′, D′, E′) is generated. (b) shows the boundaries of a FP (in green) and a TP (in red) in a real image. (c) is the layered view of the TP in (b): the top layer is the top-down segmentation, which forms a force (red arrows) to pull the mask out from the image; the bottom layer is the background force (green arrows); the middle layer is the top-down segmentation (thresholded to a binary mask) over the segmentation results. (d) is the case for the FP.
In pedestrian detection, the top-down figure-ground segmentation masks of the FPs usually look similar to a pedestrian. However we notice that such top-down mask is not consistent with the bottom-up segmentation for most FPs. The bottom-up segments share bigger contextual information than the local features in the top-down recognition and are homogenous in the sense of low-level image feature. The pixels in the same segment should belong to the same object. Imagine that the top-down hypothesis mask(F, B) tries to pull the object F out of the whole image. TPs generally consists of several well-separated segments from the background so that they are easy to be pulled
out (Fig. 5-(c)). However, FPs often contain only part of the segments. In the example of the tree trunks, only part of the tree trunk is recognized as foreground while the whole tree trunk forms one bottom-up segment. This makes pulling out FPs more difficult because they have to break the homogeneous segments (Fig. 5-(d)). Based on these observations, we combine the bottom-up segmentation to update the top-down figure-ground mask. Incorrect local predictions are removed from the mask if they are not consistent with the bottom-up segmentation. We give each bottom-up segment Si a binary label. Unlike the work in [17], which uses graph cut to propose the optimized hypothesis mask, we simply define the ratio Λ = Area(Si ∩ F) / Area(Si ∩ B) as a criterion to assign Si to F or B. We try further segmentation when such an assignment is uncertain, to avoid the case of under-segmentation in a large area. The Normalized Cut (NCut) cost [18] is used to determine whether such further segmentation is reasonable. The procedure to refine the hypothesis mask is formulated as follows (a code sketch of this loop is given at the end of this section):
Input: top-down mask (F, B) and bottom-up segments {Si, i = 1, ..., N}. Output: refined object mask (F, B). Set i = 0.
1) If i > N, exit; else, i = i + 1.
2) If Λ = Area(Si ∩ F) / Area(Si ∩ B) > κup, then F = F ∪ Si, go to 1); else if Λ < κdown, then F = F − (F ∩ Si), go to 1); otherwise, go to 3).
3) Segment Si into (Si1, Si2). If ζ = NCut(Si) > Υup, then F = F − (F ∩ Si), go to 1); else set SN+1 = Si1, SN+2 = Si2, S = S ∪ {SN+1, SN+2}, N = N + 2, go to 1).
Re-evaluation. There are two advantages of the updated masks. The first is that we can recompute more accurate local features by masking out the background edges. The second is that the shapes of the updated FP masks will change much more than those of TPs, because FPs are usually generated by locally similar parts of other objects, which will probably be taken away through the above process. We require that TPs have voters from all the different locations around the hypothesis center; this eliminates hypotheses with little region support or with only a partial matching score. The final score is the summation of the average scores over the different spatial bins in the mask. The shape of the spatial bins is predefined: for pedestrians we use radius-angle polar ellipse bins; for other objects we use rectangular grid bins. For each hypothesis, SC features are re-computed over the edge map masked by F, and a feature fi is only allowed to be matched to a cek in the same bin location. For each bin j, we compute an average matching score Ej = Σ P(cek | fi) / #(cek, fi), where both cek and fi come from bin j. The final score of this hypothesis is defined as:
E = Σj E′j ,  where  E′j = { Ej,  if Ej > α
                           { −α,  if Ej = 0 and #{cek : cek ∈ bin(j)} > 0.    (6)
The term α is used to penalize the bins which have no match with the codebook. This decreases the scores of FPs that contain only part of a true object, e.g. a bike hypothesis with only one wheel. Experiments show that our FPP procedure can prune out FPs effectively.
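The mask-refinement procedure above can be written as a short loop. The following is the sketch referred to earlier, under our own naming (the thresholds κup, κdown, Υup and the NCut/split helpers are illustrative stand-ins, not values or code from the paper):

```python
import numpy as np

def refine_mask(F, segments, ncut_cost, split, k_up=2.0, k_down=0.5, ncut_up=0.1):
    """Sketch of the mask-refinement loop in Sec. 4 (our paraphrase, not the authors' code).

    F         : boolean top-down foreground mask (the background B is its complement).
    segments  : list of boolean masks of the bottom-up segments S_i.
    ncut_cost : callable returning the NCut cost of splitting a segment (stand-in for [18]).
    split     : callable returning the two sub-segment masks (S_i1, S_i2).
    The thresholds k_up, k_down and ncut_up are illustrative values only.
    """
    F = F.copy()
    queue = list(segments)
    i = 0
    while i < len(queue):
        S = queue[i]
        i += 1
        in_f = np.logical_and(S, F).sum()
        in_b = np.logical_and(S, ~F).sum()
        ratio = in_f / max(in_b, 1)            # Lambda = Area(S_i & F) / Area(S_i & B)
        if ratio > k_up:
            F |= S                             # the whole segment joins the figure
        elif ratio < k_down:
            F &= ~S                            # the whole segment is removed from the figure
        else:                                  # uncertain assignment: try splitting the segment
            if ncut_cost(S) > ncut_up:
                F &= ~S                        # splitting is unreasonable, drop the segment
            else:
                queue.extend(split(S))         # recurse on the two halves
    return F
```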
5 Results
Our experiments test different object classes including pedestrian, bike, human riding bike, umbrella and car (Table 1). These pictures were taken from scenes around campus and urban streets. Objects in the images are roughly at the same scale. For pedestrians, the range of heights is from 186 to 390 pixels.

Table 1. Dataset for detection task

#Object    Pedestrian   Bike   Human on bike   Umbrella   Car
Training       15         3          2             4        4
Testing       345        67         19            16       60
For our evaluation criteria, a hypothesis whose center falls into an ellipse region around ground truth center is classified as true positive. The radii for ellipse are typically chosen as 20% of the mean width / height of the objects. Multiple detections for one ground truth object are only counted once. Angular Blur and Mask Function Evaluation. We compare the detection algorithm on images w/ and w/o Angular Blur (AB) or mask function. The PR curves are plotted in Fig.6. For pedestrian and umbrella detection, it is very clear that adding Angular Blur and mask function can improve the detection results. For other object classes, AB+Mask outperforms at high-precision/low-recall part of the curve, but gets no significant improvement at high-recall/low-precision part. The reason is that AB+Mask can improve the cases where objects have deformation and complex background clutter. For bikes,
Fig. 6. PR-Curves of object detection results. Panels: (a) Pedestrian, (b) Bike, (c) Umbrella, Human on bike, and Car; legend: Angular Blur+Mask Function, w/o Angular Blur, w/o Mask Function, w/ FPP, and HOG (only in (a)).
Fig. 7. Detection result on real images. The color indicates different segments. The last row contains cases of FPs for bikes and pedestrians.
the inner edges dominate the SC histogram, so adding the mask function makes only a little difference.
Pedestrian Detection Compared with HOG. We also compare with HOG, using the implementation of the authors of [12]. Figure 6-(a) shows that our method with the FPP procedure is better than the results of HOG. Note that we only use a very limited number of training examples, as shown in Table 1, and we did not utilize any negative training examples.
6 Conclusion and Discussion
In this paper, we developed an object detection method combining top-down model-based recognition with bottom-up image segmentation. Our method not only detects object positions but also gives the figure-ground segmentation mask. We designed an improved Shape Context feature for recognition and proposed a novel FPP procedure to verify hypotheses. This method can be generalized to many object classes. Results show that our detection algorithm can achieve both high recall and high precision rates. However, there are still some FP hypotheses that cannot be pruned. They are typically very similar to objects, like a human-shaped rock or some tree trunks. More information, such as color or texture, should be explored to prune out these FPs. Another failure case of the SC detector is very small scale objects; these objects have very few edge points and thus are not suitable for SC. Also, our method does not work for severe occlusion, where most local information is corrupted.
Acknowledgment. This work is partially supported by the National Science Foundation through grants NSF-IIS-04-47953 (CAREER) and NSF-IIS-03-33036 (IDLP). We thank Qihui Zhu and Jeffrey Byrne for polishing the paper.
References 1. Viola, P.A., Jones, M.J.: Rapid object detection using a boosted cascade of simple features. In: CVPR (2001) 2. Borenstein, E., Ullman, S.: Class-specific, top-down segmentation. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2351, Springer, Heidelberg (2002) 3. Levin, A., Weiss, Y.: Learning to combine bottom-up and top-down segmentation. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, Springer, Heidelberg (2006) 4. Leibe, B., Seemann, E., Schiele, B.: Pedestrian detection in crowded scenes. In: CVPR (2005) 5. Ferrari, V., Tuytelaars, T., Gool, L.J.V.: Object detection by contour segment networks. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, Springer, Heidelberg (2006) 6. Kokkinos, I., Maragos, P., Yuille, A.L.: Bottom-up & top-down object detection using primal sketch features and graphical models. In: CVPR (2006) 7. Zhao, L., Davis, L.S.: Closely coupled object detection and segmentation. In: ICCV (2005)
8. Ren, X., Berg, A.C., Malik, J.: Recovering human body configurations using pairwise constraints between parts. In: Sebe, N., Lew, M.S., Huang, T.S. (eds.) Computer Vision in Human-Computer Interaction. LNCS, vol. 3766, Springer, Heidelberg (2005) 9. Mori, G., Ren, X., Efros, A.A., Malik, J.: Recovering human body configurations: Combining segmentation and recognition. In: CVPR (2004) 10. Srinivasan, P., Shi, J.: Bottom-up recognition and parsing of the human body. In: CVPR (2007) 11. Felzenszwalb, P.F., Huttenlocher, D.P.: Pictorial structures for object recognition. International Journal of Computer Vision 61(1) (2005) 12. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR (2005) 13. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE Trans. Pattern Anal. Mach. Intell. 24(4) (2002) 14. Mori, G., Belongie, S.J., Malik, J.: Efficient shape matching using shape contexts. IEEE Trans. Pattern Anal. Mach. Intell 27(11) (2005) 15. Thayananthan, A., Stenger, B., Torr, P.H.S., Cipolla, R.: Shape context and chamfer matching in cluttered scenes. In: CVPR (2003) 16. Rubner, Y., Tomasi, C., Guibas, L.J.: A metric for distributions with applications to image databases. In: ICCV (1998) 17. Ramanan, D.: Using segmentation to verify object hypotheses. In: CVPR (2007) 18. Shi, J., Malik, J.: Normalized cuts and image segmentation. In: CVPR (1997)
An Efficient Method for Text Detection in Video Based on Stroke Width Similarity
Viet Cuong Dinh, Seong Soo Chun, Seungwook Cha, Hanjin Ryu, and Sanghoon Sull
Department of Electronics and Computer Engineering, Korea University, 5-1 Anam-dong, Seongbuk-gu, Seoul, 136-701, Korea
{cuongdv,sschun,swcha,hanjin,sull}@mpeg.korea.ac.kr
Abstract. Text appearing in video provides semantic knowledge and significant information for video indexing and retrieval systems. This paper proposes an effective method for text detection in video based on the similarity in stroke width of text (the stroke width is defined as the distance between the two edges of a stroke). From the observation that text regions can be characterized by a dominant fixed stroke width, edge detection with local adaptive thresholds is first devised to keep text regions while reducing background regions. Second, a morphological dilation operator with an adaptive structuring element size determined by the stroke width value is exploited to roughly localize text regions. Finally, to reduce false alarms and refine the text location, a new multi-frame refinement method is applied. Experimental results show that the proposed method is not only robust to different levels of background complexity, but also effective for different fonts (size, color) and languages of text.
1 Introduction
The need for efficient content-based video indexing and retrieval has increased due to the rapid growth of video data available to consumers. For this purpose, text in video, especially superimposed text, is the most frequently used since it provides high-level semantic information about video content and it has distinctive visual characteristics. Therefore, success in video text detection and recognition would have a great impact on multimedia applications such as image categorization [1], video summarization [2], and lecture video indexing [3]. Many efforts have been made for text detection in images and video. Regarding the way used to locate text regions, text detection methods can be classified into three approaches: the connected component (CC)-based method [4, 5, 6], the texture-based method [7, 8], and the edge-based method [9, 10]. The CC-based method is based on the analysis of the geometrical arrangement of edges or homogeneous colors that belong to characters. Alternatively, the texture-based method treats a text region as a special type of texture and employs learning algorithms, e.g., a neural network [8] or a support vector machine (SVM) [11], to extract text. In general, the texture-based method is more robust than the CC-based method in dealing with complex backgrounds. However, the main drawbacks of this method are its high complexity and inaccurate localization.
Another popularly studied method is the edge-based method, which is based on the fact that text regions have abundant edges. This method is widely used due to its fast performance in detecting text and its ability to keep geometrical structure of text. The method in [9] detects edges in an image and then uses the fixed size horizontal, vertical morphological dilation operations to form text line candidate. Real text regions are identified by using the SVM. Two disadvantages of this method are its poor performance in case of complex background and the use of fixed size structuring element in dilation operations. To deal with the background complexity problem, edge detection-based method should be accompanied by a local threshold algorithm. In [10], the image is first divided into small windows. A window is considered to be complex if the “number of blank rows” is smaller than a certain specific value. Then, in the edge detection step, a higher threshold is assigned for these complex windows. However, the “number of blank rows” criterion appears sensitive to noise and not strong enough to handle different text sizes. Therefore, how to design an effective local threshold algorithm for detecting edge is still a challenging problem of text detection in video. The main problem of the above existing methods is that they are not robust to different text colors, sizes, and background complexity, since they simply use either general segmentation method or some prior knowledge. In this paper, we attempt to discover the intrinsic characteristic of text (namely the stroke width similarity) and then exploit it to build a robust method for text detection in video. From the knowledge of font system, it turns out that, if characters are in the same font type and size, their stroke widths are almost constant. In another view, a text region can be considered as a region with a dominant fixed stroke width value. Therefore, the similarity in stroke width can be efficiently used as a critical characteristic to describe the text region in video frame. The contributions of this paper can be summarized as follow: • Exploiting the similarity in stroke width characteristic of text to build an effective edge detection method with local adaptive threshold algorithm. • Implementing a stroke-based method to localize text regions in video. • Designing a multi-frame refinement method which can not only refine the text location but also enhance the quality of the detected text. The rest of this paper is organized as follows: Section 2 presents the proposed method for text detection in video. To demonstrate its effectiveness, experimental results are given in Section 3. In Section 4, the concluding remarks are drawn.
2 Proposed Method In the proposed method, text regions in video are detected through three processes. First, edge detection with local adaptive threshold algorithm is applied to reveal text edge pixels. Second, dilation morphological operator with adaptive structuring element size is exploited in the stroke-based localization process to roughly localize text regions. Finally, a multi-frame refinement process is applied to reduce false alarm, refine the location, and enhance the quality of each text region. Figure 1 shows the flow chart of the proposed system.
[Flowchart: video frames → edge detection with local adaptive thresholds → stroke-based text localization → multi-frame text refinement → detected text regions]
Fig. 1. Flowchart of the proposed text detection method
2.1 Motivation
From the knowledge of font systems, it turns out that if characters are in the same font type and font size, their stroke widths are almost constant. Therefore, in the proposed method, the stroke width similarity is used as a clue to characterize text regions in a frame. Generally, the width of any stroke (of both text and non-text objects) can be calculated as the distance (measured in pixels) in the horizontal direction between its double-edge pixels. Figure 2(a) shows an example of double-edge pixels (A and B). It can be seen from the figure that the stroke widths of different characters are almost similar.
Fig. 2. An example of text image. (a) Text image. (b) Edge values for the scan line in (a), wt is the stroke width value.
In general, the color of text often contrasts to its local background. Therefore, for any double-edge pixels of a stroke, this contrast makes an inversion in sign of the edge values, i.e. the gradient magnitude of edge pixels, in horizontal direction (dx) between two pixels on the left- and right-hand side of the stroke. Figure 2(b) shows the corresponding edge values in horizontal direction of a given horizontal scan line in Fig. 2(a); it is clear that the stroke can be modeled as double-edge pixels within a certain range, delimited by a positive and a negative peak nearby. By using the doubleedge pixel model to describe the stroke, we can take the advantages of: 1) Reducing the effect of noise; 2) Applicability even with low-quality edge image. 2.2 Edge Detection with Local Adaptive Threshold Algorithm First, the Canny edge detector with a low threshold is applied to video frame to keep all possible text edge pixels and each frame is divided into M × N blocks, typically 8 × 8 or 12 × 8. Second, by analyzing the similarity in stroke width corresponding to each block, blocks are classified into two types: simple blocks and complex blocks. Then, a suitable threshold algorithm for each block type is used to determine the proper threshold for each block. Finally, the final edge image is created by applying each block with the new proper threshold.
2.2.1 Block Classification For each block, we create a stroke width set which is the collection of all stroke width candidates contained in this block. Due to the similarity in stroke width of characters, the values in the stroke width set of the text region on the simple background are concentrated on some close values. Whereas stroke width candidates of the text region on the complex background or background region may also be created by other background objects. As a result, the element values in this set may spread over a wide range of values. Therefore, text regions on a simple background can be characterized by a smaller value of the standard deviation of stroke width than those on other regions. Based on this different characteristic, blocks in the frame are classified into two types: simple blocks and complex blocks. A block is classified as a simple one if the standard deviation of stroke width values is smaller than a given specific value. Otherwise, it will be classified as a complex one. For the simple block, the threshold of the edge detector should be relatively low to detect both low-contrast and high-contrast texts. On the contrary, the threshold for the complex block should be relatively high to eliminate background and highlight text. 2.2.2 Local Adaptive Threshold Algorithm In each block, the stroke width value corresponding to text objects often dominates in population of the stroke width set. Therefore, it can be estimated by calculating the stroke width with the maximum stroke width histogram value. Let wt denote the stroke width value of text, wt can be defined as:
wt = arg maxl H(l),    (1)

where H(l) is the value of the block's stroke width histogram at stroke width l. From the set of all double-edge pixels, we construct two rough sets: the text set St and the background set Sbg. St represents the set of all pixels which are predicted as text edge pixels, whereas Sbg represents the set of all predicted background edge pixels. St and Sbg are constructed as follows:

St = {(i, j) | i, j ∈ E, w(i, j) = wt},    (2)

Sbg = {(i, j) | i, j ∈ E, w(i, j) ≠ wt},    (3)
where E is the edge map of the block and w(i, j) denotes the stroke width between the double-edge pixels i and j. Note that St and Sbg are only rough sets of the text edge pixels and background edge pixels, since only edge pixels with a horizontal gradient direction are considered during the stroke width calculation. Thresholds for the simple block and the complex block are determined as follows:
• In the simple block case, the text lies on a clear background. Therefore, the threshold is determined as the minimum edge value of all edge pixels belonging to St in order to keep text information and simplify the computation.
• In the complex block case, determining a suitable threshold for the edge detector is much more difficult. Applying general thresholding methods often does not give
a good result, since these methods are designed for general classification problems, not for such a specific problem as separating text from background. In this paper, by exploiting the similarity in stroke width of text, we can roughly estimate the text set and the background set as St and Sbg. Therefore, the problem of finding an appropriate threshold in this case can be converted into another, easier problem: finding an appropriate threshold to correctly separate the two sets St and Sbg.
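The classification of a block and the rough split into St and Sbg can be sketched as follows (our own helper names; the standard-deviation threshold is an assumed value, since the paper does not state it here):

```python
import numpy as np

def classify_block(widths, std_thresh=3.0):
    """Label a block 'simple' or 'complex' from its stroke-width set (Sec. 2.2.1)."""
    if len(widths) == 0:
        return "simple"
    return "simple" if np.std(widths) < std_thresh else "complex"

def split_edge_pairs(pairs, widths):
    """Rough text / background sets of Eqs. (2)-(3).

    pairs : list of double-edge pixel pairs, aligned one-to-one with widths.
    Pairs whose width equals the dominant stroke width w_t go to S_t, the rest to S_bg.
    """
    if len(widths) == 0:
        return [], []
    w_t = int(np.argmax(np.bincount(widths)))                 # Eq. (1): peak of the width histogram
    s_t = [p for p, w in zip(pairs, widths) if w == w_t]
    s_bg = [p for p, w in zip(pairs, widths) if w != w_t]
    return s_t, s_bg
```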
[Flowchart: image block → calculate the stroke width wt → estimate the text and background sets St and Sbg → (simple block case) set the threshold as the smallest edge value of St / (complex block case) construct the edge value histograms ht(r) and hbg(r) and set the threshold as the edge value satisfying equation (4) → edge detection with the new threshold value → edge image block]
Fig. 3. Flowchart of the proposed local adaptive threshold algorithm
Fig. 4. Edge detection results. (a) Original Image. Edge detection using (b) constant threshold, (c) proposed local adaptive threshold algorithm.
Let r denote the edge value (gradient magnitude) of a pixel in a block, and let ht(r) and hbg(r) denote the histograms of the edge values corresponding to the text set St and the background set Sbg, respectively. According to [12], if the form of the two distributions is known or assumed, it is possible to determine an optimal threshold (in terms of minimum error) for segmenting the image into the two distinct sets. The optimal threshold, denoted T, can be obtained as the root of the equation:

pt × ht(T) = pbg × hbg(T),    (4)
where pt and pbg (pbg = 1 − pt) are the probabilities of a pixel to be in the St and Sbg sets, respectively. Consequently, the appropriate threshold for the complex block is determined as the value which satisfies or approximately satisfies equation (4). Figure 3 shows the flowchart of the local adaptive threshold algorithm. Figure 4 shows the results of the edge detection method on the video frame in Fig. 4(a) using only one constant threshold (Fig. 4(b)), in comparison with using the proposed local adaptive thresholds (Fig. 4(c)). The pictures show that the proposed method eliminates more background pixels while still preserving text pixels.
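One simple way to satisfy Eq. (4) approximately is to discretize both edge-value distributions and pick the bin where the two weighted histograms are closest. The sketch below is our illustration, not the authors' implementation:

```python
import numpy as np

def complex_block_threshold(edges_t, edges_bg, n_bins=64):
    """Approximate root of Eq. (4): p_t * h_t(T) = p_bg * h_bg(T).

    edges_t, edges_bg : NumPy arrays of gradient magnitudes of the pixels in S_t and S_bg.
    """
    lo = min(edges_t.min(), edges_bg.min())
    hi = max(edges_t.max(), edges_bg.max())
    bins = np.linspace(lo, hi, n_bins + 1)
    h_t, _ = np.histogram(edges_t, bins=bins, density=True)
    h_bg, _ = np.histogram(edges_bg, bins=bins, density=True)
    p_t = len(edges_t) / float(len(edges_t) + len(edges_bg))
    diff = np.abs(p_t * h_t - (1.0 - p_t) * h_bg)      # |p_t h_t(T) - p_bg h_bg(T)|
    k = int(np.argmin(diff))
    return 0.5 * (bins[k] + bins[k + 1])               # bin centre used as the threshold T
```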
2.3 Stroke-Based Text Localization
After the edge detection process, the dilation morphological operator is applied to the edge-detected video frame to highlight text regions. The size of the structuring element is adaptively determined by the stroke width value. When applying the dilation operator, one of the most important factors that needs to be considered is the size of the structuring element. If this size is set too small, the text area cannot be filled wholly; as a result, this area can be regarded as a non-text area. In contrast, if this size is set too large, text can be mixed with the surrounding background, which increases the number of false alarms. Moreover, using only a fixed size of the structuring element, as in Chen et al.'s [9] method, is not applicable to texts of different sizes.
Fig. 5. Structure element of the dilation operation (wt is the stroke width value)
In this paper, we determine the size of the structuring element based on the stroke width value, which is already revealed in the edge detection process. More specifically, for each block whose stroke width is wt, we apply a dilation operator of size (2 × wt + 1) × 1, as shown in Fig. 5. This size is sufficient to wholly fill the characters as well as connect neighboring characters together. Moreover, using block-based dilation with a suitable structuring element shape makes it applicable to text of different sizes at different locations in the video frame. Figure 6(a) shows the image after applying the proposed dilation operators.
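The adaptive dilation itself is straightforward; the following pure-NumPy sketch (ours, not the authors' code) applies the (2·wt + 1) × 1 structuring element of Fig. 5 to one block:

```python
import numpy as np

def dilate_horizontal(edge_block, w_t):
    """Dilate an edge block with a (2*w_t + 1) x 1 horizontal structuring element.

    A pixel becomes foreground if any edge pixel lies within w_t columns of it on
    the same row (equivalent to morphological dilation with the element of Fig. 5).
    """
    h, w = edge_block.shape
    padded = np.zeros((h, w + 2 * w_t), dtype=bool)
    padded[:, w_t:w_t + w] = edge_block.astype(bool)
    out = np.zeros((h, w), dtype=bool)
    for s in range(2 * w_t + 1):                  # OR together the shifted copies
        out |= padded[:, s:s + w]
    return out
```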
Fig. 6. Text localization and refinement process (a) Dilated image (b) Text regions candidates (c) Text regions after being refined by multi-frame refinements
After the dilation process, connected component analysis is performed to create text region candidates. Then, based on the characteristics of text, the following simple criteria for filtering out non-text regions are applied: 1) the height of the region is between 8 and 35 pixels; 2) the width of the region must be larger than the height; 3) the number of edge pixels must be at least two times larger than the width, based on the observation that a text
region should have abundant edge pixels. Figure 6(b) shows the text region candidates after applying these criteria. 2.4 Multi-frame Refinement
Multi-frame integration has been used for the purpose of text verification [13] or text enhancement [14]. However, temporal information for the purpose of text refinement in a frame, which often plays an important role in increasing the accuracy of the subsequent text segmentation and recognition steps, has not been utilized so far. In this paper, we propose a multi-frame based method to refine the location of text by further eliminating background pixels in the rough text regions detected in the previous steps. Moreover, the quality of the text is also improved by selecting the most suitable frame, i.e. the frame in which the text is displayed most clearly, in the frame sequence. By using our method, the enhanced text region does not suffer from the blurring problem of the text enhancement in Li et al.'s method [14]. First, a multi-frame verification [13] is applied to reduce the number of false alarms. For each of m consecutive frames in a video sequence, a text region candidate is considered as a true text region only if there exist at least n (n < m) similar text regions T0, T1, ..., Tn-1 appearing in n different frames. Tk (k = 0, 1, ..., n-1) is the region of the corresponding frame obtained after the edge detection process. Let us call T the stationary edge image of the corresponding text region candidate. The pixel value at location (x, y) of T is determined as follows:

T(x, y) = { edge pixel,      if Σk=0..n−1 Ik(x, y) > θ
          { non-edge pixel,  otherwise,    (5)
where θ is a specific threshold and Ik(x, y) is defined as:

Ik(x, y) = { 1,  if Tk(x, y) is an edge pixel
           { 0,  otherwise.    (6)
Referring to (5), T(x, y) is an edge pixel if an edge pixel appears more than θ times at location (x, y); otherwise, T(x, y) is a non-edge pixel. In the proposed method, θ is set equal to [n × 3/4] in order to reduce the effect of noise. Based on the stationary characteristic of text, almost all background pixels are removed in T. However, this integration process may also remove some text edge pixels. In order to recover the lost text edge pixels, a simple edge recovery process is performed. A pixel in T is marked as an edge pixel if its two neighbors in the horizontal, vertical, or diagonal direction are edge pixels. After the recovery process, T can be seen as the edge image of the true text regions. Therefore, the precise text location of the corresponding text region can be obtained by calculating the bounding box of the edge pixels contained in T. In order to enhance the quality of the text, we extract the most suitable frame in the frame sequence, i.e. the one where the text appears clearest. Based on the fact that a text region is clearest if the corresponding edge image contains mostly text pixels, the most suitable frame is extracted if the edge image of its text region is the best match with T. In
other words, we choose the frame whose edge image Tk (k = 0, ..., n-1) is the most similar to T. The MSE (Mean Squared Error) measurement is used to measure the similarity between the two regions. The effectiveness of the multi-frame refinement is shown in Fig. 6(c). Compared to Fig. 6(b), two false alarms are removed and all of the true text regions have more precise bounding boxes.
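The stationary edge image of Eqs. (5)-(6), the recovery step, and the MSE-based frame selection can be sketched as follows (our simplified version: the recovery checks only horizontal neighbors, and the rounding of θ = [n × 3/4] is our assumption):

```python
import numpy as np

def stationary_edge_map(edge_maps):
    """Multi-frame refinement of Sec. 2.4 (a sketch with our own variable names).

    edge_maps : (n, H, W) boolean array holding the edge images T_k of one text-region
                candidate over n consecutive frames.
    Returns the stationary edge image T and the index of the frame whose edge image
    is closest to T in the MSE sense (the 'clearest' frame).
    """
    edge_maps = np.asarray(edge_maps, dtype=bool)
    n = edge_maps.shape[0]
    theta = int(round(3 * n / 4.0))                          # theta = [n * 3/4]
    T = edge_maps.sum(axis=0) > theta                        # Eq. (5): keep pixels that repeat
    recovered = T.copy()                                     # simple recovery (horizontal only)
    recovered[:, 1:-1] |= T[:, :-2] & T[:, 2:]
    mse = ((edge_maps.astype(float) - recovered.astype(float)) ** 2).mean(axis=(1, 2))
    return recovered, int(np.argmin(mse))
```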
3 Experimental Result
Due to the lack of a standard database for the problem of text detection in video, in order to evaluate the effectiveness of the proposed method we have collected a number of videos from various sources for a test database. Text appearance varies with different colors, orientations, languages, and character font sizes (from 8pt to 90pt). The video frame formats are 512×384 and 720×480 pixels. The test database can be divided into three main categories: news, sport, and drama. Table 1 shows the video length and the number of ground-truth text regions contained in each video category. In total, there are 553 ground-truth text regions in the whole video test database.

Table 1. Properties of video categories

                Drama        Sport        News
Video length    15 minutes   32 minutes   38 minutes
Text regions    126          202          225
For quantitative evaluation, the detected text region is considered as a correct one if the intersection of the detected text region (DTR) and the ground-truth text region (GTR) covers more than 90% of this DTR and 90% of this GTR. The efficiency of our detection method is assessed in terms of three measurements (which are defined in [10]): Speed, Detection Rate, and Detection Accuracy. In order to assess the effectiveness of the proposed method, we compare the performance of the proposed method with that of the typical edge-based method proposed by Lyu et al. [10], and with a method using three processes: edge detection with a constant threshold, text localization with fixed-size dilation operations (similar to the algorithm in [9]), and multi-frame refinement. We call it the “constant threshold” method. Table 2 shows the number of correct and false DTRs for the three video categories. It can be seen from the table that not only does the proposed method create the highest number of correct DTRs, but it also produces the smallest number of false DTRs in every case. Our method is clearly stronger than the others even for the news category (the number of false DTRs is only about half that of the other methods). It is more difficult to detect text in news video since the background changes fast and texts have variable sizes with different contrast levels to the background. The proposed method overcomes these problems since it successfully exploits the intrinsic characteristic of text (the stroke width similarity), which is invariant to the background complexity as well as to different font sizes and colors of text. Table 3 gives a summary of the detection rate and the detection accuracy of the three methods tested on the whole video test database. The proposed method achieved
Table 2. Number of correct and false DTRs

                     Lyu et al. [10]        Constant threshold      Proposed Method
                     Correct    False       Correct    False        Correct    False
Drama                   96        16          109        19           114        11
Sport                  154        26          152        32           179        20
News                   185        38          189        46           205        21
the highest accuracy, with a detection rate of 90.1% and a detection accuracy of 90.5%. This encouraging result shows that our proposed method is an effective solution to the background complexity problem of text detection in video. It can also be seen from the table that the proposed method is faster than Lyu et al.'s [10] method and slightly slower than the constant threshold method, which is expected since the frame must be scanned with different thresholds. Moreover, the processing time of 0.18s per frame meets the requirement for real-time applications. Figure 7 shows some more examples of our results. In these pictures, all the text strings are detected and their bounding boxes are relatively tight and accurate.

Table 3. Text detection accuracy
                      Lyu et al. [10]   Constant threshold   Proposed Method
Correct DTRs               435                 450                 498
False DTRs                  80                  97                  52
Detection Rate            78.7%               81.4%               90.1%
Detection Accuracy        84.5%               82.3%               90.5%
Speed (sec/frame)         0.23s               0.16s               0.18s

Fig. 7. Some pictures of detected text regions in frames
4 Conclusion This paper presents a comprehensive method for text detection in video. Based on the similarity in stroke width of text, an effective edge detection method with local adaptive thresholds is applied to reduce the background complexity. The stroke width information is further utilized to determine the structure element size of the dilation operator in the text localization process. To reduce the false alarm as well as refine the text location, a new multi-frame refinement method is applied. Experimental results with a large set of videos demonstrate the efficiency of our method with the detection rate of 90.1% and detection accuracy of 90.5%. Based on these encouraging results, we plan to continue research on text tracking and recognition for a real time text-based video indexing and retrieval system.
References 1. Zhu, Q., Yeh, M.C., Cheng, K.T.: Multimodal fusion using learned text concepts for image categorization. In: Proc. of ACM Int’l. Conf. on Multimedia, pp. 211–220. ACM Press, New York (2006) 2. Lienhart, R.: Dynamic video summarization of home video. In: Proc. of SPIE, vol. 3972, pp. 378–389 (1999) 3. Fan, J., Luo, H., Elmagarmid, A.K.: Concept-oriented indexing of video databases: toward semantic sensitive retrieval and browsing. IEEE Trans. on Image Processing 13, 974–992 (2004) 4. Zhong, Y., Karu, K., Jain, A.K.: Locating text in complex color images. Pattern Recognition 28, 1523–1536 (1995) 5. Jain, A.K., Yu, B.: Automatic text location in images and video frames. In: Proc. of Int’l. Conf. on Pattern Recognition, vol. 2, pp. 1497–1499 (August 1998) 6. Ohya, J., Shio, A., Akamatsu, S.: Recognition characters in scene images. IEEE Trans. on Pattern Analysis and Machine Intelligence 16, 214–220 (1994) 7. Qiao, Y.L., Li, M., Lu, Z.M., Sun, S.H.: Gabor filter based text extraction from digital document images. In: Proc. of Int’l. Conf. on Intelligent Information Hiding and Multimedia Signal Processing, pp. 297–300 (December 2006) 8. Li, H., Doermann, D., Kia, O.: Automatic text detection and tracking in digital video. IEEE Trans. on Image Processing, 147–156 (2000) 9. Chen, D., Bourlard, H., Thiran, J.P.: Text identification in complex background using SVM. In: Proc. of Int’l. Conf. on Document Analysis and Recognition, vol. 2, pp. 621–626 (December 2001) 10. Lyu, M.R., Song, J., Cai, M.: A comprehensive method for multilingual video text detection, localization, and extraction. IEEE Trans. on Circuits Systems Video Technology, 243–255 (2005) 11. Jung, K.C., Han, J.H., Kim, K.I., Park, S.H.: Support vector machines for text location in news video images. In: Proc. of Int’l. Conf. on System Technology, pp. 176–189 (September 2000) 12. Gonzalez, R.-C., Woods, R.E.: Digital Image Processing, 2nd edn., pp. 602–608. PrenticeHall, Englewood Cliffs (2002) 13. Lienhart, R., Wernicke, A.: Localizing and segmenting text in images and videos. IEEE Trans. on Circuits Systems Video Technology, 256–268 (2002) 14. Li, H., Doermann, D.: Text enhancement in digital video using multiple frame integration. In: Proc. of ACM Int’l. Conf. on Multimedia, pp. 19–22. ACM Press, New York (1999)
Multiview Pedestrian Detection Based on Vector Boosting
Cong Hou1, Haizhou Ai1, and Shihong Lao2
1 Computer Science and Technology Department, Tsinghua University, Beijing 100084, China
2 Sensing and Control Technology Laboratory, Omron Corporation, Kyoto 619-0283, Japan
[email protected]
Abstract. In this paper, a multiview pedestrian detection method based on the Vector Boosting algorithm is presented. The Extended Histograms of Oriented Gradients (EHOG) features are formed via dominant orientations, in which gradient orientations are quantized into several angle scales that divide the gradient orientation space into a number of dominant orientations. Blocks of combined rectangles with their dominant orientations constitute the feature pool. The Vector Boosting algorithm is used to learn a tree-structure detector for multiview pedestrian detection based on EHOG features. Furthermore, a detector pyramid framework over several pedestrian scales is proposed for better performance. Experimental results are reported to show its high performance.
Keywords: Pedestrian detection, Vector Boosting, classification.
1 Introduction Pedestrian detection researches originated in the requirement of intelligent vehicle system such as driver assistance systems [1] and automated unmanned car systems [13], and become more popular in recent research activities including visual surveillance [2], human computer interaction, and video analysis and content extraction, of which the last two are in more general sense that involve full-body human detection and his movement analysis [14]. Pedestrian, by definition, means a person traveling on foot, that is, a walker. Pedestrian detection is to locate all pedestrian areas in an image, usually in the form of bounding rectangles. We all know as a special case in more general research domain “object detection or object category”, face, car and pedestrian are most researched targets. Nowadays, although face detection or at least frontal face detection is well accepted solved problem in academic society, car detection and pedestrian detection are not so well solved; they remain a big challenge to achieve a comparable performance to face detection in order to meet the requirement of practical applications in visual surveillance etc. In general, object detection or object category is still in its early research stage that is very far from real application. For previous works before 2005 see a survey [3] and an experimental study [4]. Recent works are mainly machine learning based approaches among which the edgelets method [5] and the HOG method [7] are most representative. The edgelets method [5] uses a new type of silhouette oriented feature called an edgelet that is a short segment of line or curve. Based on edgelets features, part (full-body, Y. Yagi et al. (Eds.): ACCV 2007, Part I, LNCS 4843, pp. 210–219, 2007. © Springer-Verlag Berlin Heidelberg 2007
head-shoulders, torso, and legs) detectors are learned by Real AdaBoost. Responses of part detectors are combined to form a joint likelihood model and the maximum a posteriori (MAP) method is used for post processing to deal with multiple, possibly inter-occluded humans detection problem. The HOG method [7] uses the histograms of oriented gradients features to characterize gradient orientation distribution in a rectangular block. A detection window is divided into several rectangular blocks and each block is divided into several small spatial regions called cells in which the HOGs are computed and combined to form the features for classification. A linear SVM is used for detector training. This method is improved in [9] for speed up by AdaBoost learning of a cascade detector, in which variable-size blocks are used to enrich the feature pool and each feature is a 36D vector (concatenated by 4 cells’ 9 orientation bins of histogram in a block) that is fed into a linear SVM to form a weak classifier for AdaBoost learning. The above works are for still images and there are also recent advance for video [6][8][12]. Pedestrian detection in video is quite different from that in still images although techniques developed in later case can be help to pedestrian detection in video, for example to initialize or trigger a tracking module. Anyway pedestrian detection in still images is more fundamental. In this paper, we will focus on multiview pedestrian detection (MVPD) in still images and present a method using extended HOG features and Vector Boosting originally developed for multiview face detection (MVFD) for MVPD. Although pedestrian detection seems similar to face detection, it is more difficult due to large variation caused by clothes in addition to other common factors like pose, illumination, etc. In last several years MVFD has achieved great success and found its ways in practical applications. Many MVFD methods have been developed including parallel cascades [15], pyramid [16], decision tree [17] and WFS tree [18], of which the WFS tree together with the Vector Boosting algorithm is proved to be one of the most efficient methods. In this paper, we develop a method to apply this technique to the MVPD problem. We quantify gradient orientations into three angle scales that divide gradient orientation space into totally 27 dominant orientations. The EHOG features of a block of rectangle or non-regular rectangle are used to represent statistical information of edges in that block. Therefore blocks with their dominant orientations constitute the feature pool. The Vector Boosting learning [18] is used to construct a tree-structure detector [18] for MVPD. Further a detector pyramid framework over several pedestrian scales is proposed for better performance. The main contributions are (1) Dominant orientation in combined rectangle block is introduced into HOG features to form a feature pool; (2) A high performance tree-structure detector is developed for MVPD based on Vector Boosting. The rest of the paper is organized as follows: in Section 2, an extension to the HOG feature is introduced. In Section 3, the tree-structure detector training is described. Experiments are reported in Section 4 and conclusions are given in Section 5.
2 Extended HOG Feature The HOG feature has been proved effective in pedestrian detection [7][9]. The feature makes statistics about magnitude of gradient in several orientations, which are called
bins. In [7][9], the orientation range 0°–180° is divided into 9 bins. The HOG is collected in local regions of the image called cells. A block contains several cells whose HOGs are concatenated into a higher-dimensional vector, and an SVM is then used to construct a classifier for each corresponding block over the training set. However, the SVM detector is computationally intensive at detection time. Therefore, a boosted detector, which has proved successful in face detection, can be a good choice. In [9], HOG features are fed into a linear SVM to form a weak classifier, which results in a much faster detector. In this paper, in order to achieve better performance in both detection rate and speed, we make an extension to the HOG feature which outputs a scalar value. The feature can be directly used in boosting learning as a weak classifier, which avoids the time-consuming inner products of high-dimensional vectors as in SVM or LDA types of weak classifiers. First, we calculate the HOG over a block itself without dividing it into smaller cells as in [7]. Therefore the block functions in fact as an ensemble cell:
$$G_b = (g_b(1), g_b(2), \ldots, g_b(n))^T$$

where $n$ is the dimension of the HOG (in [7][9], $n = 9$), and $b$ is a block in an image. Then, we introduce the concept of dominant orientation $D$ (for details, see Section 2.1), defined as a subset of the above basic level of bins, that is, $D \subseteq \{1, 2, \ldots, n\}$, and calculate the EHOG feature corresponding to $D$ as

$$F_b(D) = \sum_{i \in D} g_b(i) / Z_b$$

where $Z_b$ is the normalizing factor

$$Z_b = \sum_{i=1}^{n} g_b(i).$$
With the help of the integral image of HOG [9], $g_b(1), g_b(2), \ldots, g_b(n)$ and $Z_b$ can be calculated very fast. We will explain two important concepts in more detail: the dominant orientation $D$ and the non-rectangle block $b$.

2.1 Dominant Orientation

The dominant orientation is a set of representative bins of the HOG. We have observed that in an area containing simple edges, most gradients concentrate in a relatively small range of orientations. Therefore, we can use a small subset of bins to represent these edges. In most situations, this treatment is acceptable, as shown in Fig. 1. In training, the dominant orientation is found by feature selection. In our implementation, we also divide the orientation range 0°–180° into 9 bins as in [7], and the dominant orientation of each feature may contain 1, 2 or 3 neighboring bins as shown in Fig. 2. Therefore, there are in total 27 different dominant orientations for each block.
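To make the EHOG construction concrete, the following sketch (our illustration, not the authors' code; the function names, the bin wrap-around at 180° and the small ε are assumptions) builds the block-level 9-bin HOG, enumerates the 27 dominant-orientation subsets, and evaluates $F_b(D)$.

```python
import numpy as np

N_BINS = 9  # orientation bins over 0-180 degrees, as in [7][9]

# The 27 dominant orientations: 1, 2 or 3 neighbouring bins (wrapping at 180 deg).
DOMINANT_SETS = [tuple((start + k) % N_BINS for k in range(width))
                 for width in (1, 2, 3) for start in range(N_BINS)]

def block_hog(grad_mag, grad_bin, block):
    """Accumulate the 9-bin HOG over a whole block, used here as one 'ensemble cell'."""
    y0, x0, y1, x1 = block
    mag = grad_mag[y0:y1, x0:x1].ravel()
    bins = grad_bin[y0:y1, x0:x1].ravel()
    return np.bincount(bins, weights=mag, minlength=N_BINS)

def ehog(g, dominant):
    """EHOG output F_b(D): normalized sum of the HOG entries in the dominant set D."""
    z = g.sum() + 1e-12          # normalizing factor Z_b (epsilon avoids division by zero)
    return g[list(dominant)].sum() / z
```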
Fig. 1. (a) A picture with a pedestrian. (b) The HOG calculated in the red rectangle in (a). The length of each line denotes the magnitude of gradient in each bin. It can be seen there are three main orientations (lines of these orientations are in bold). (c) We only pick these three bins out and use the normalized summation of their values as the output of the EHOG feature.
Fig. 2. Three levels of orientation partition between 0◦-180◦, and each partition has 9 different orientations (note that there are some overlaps between neighboring parts in (b) (c)). The dominant orientation in each level covers (a) 20◦ (1 bin), (b) 40◦ (2 bins), (c) 60◦ (3 bins).
2.2 Non-rectangle Blocks

The HOG and EHOG features are both calculated in a local region of an image called a block. In [7], the size of the block is fixed, and in [9] it is variable. We also use variable-size blocks, and make a further extension: in addition to the rectangle blocks used in [7][9], we also adopt blocks with non-rectangular shapes as in Fig. 3(a), called combined blocks, to enrich the feature pool in order to reflect the geometric structure of the feature representation. In addition, we add block pairs to capture the symmetry of pedestrians (see Fig. 7 for block pair examples). To avoid feature space explosion, we manage the feature space by selecting and expanding with a heuristic search strategy. The initial feature space contains only
Fig. 3. (a) Some blocks with irregular shapes. (b) Two types of expanding operators.
rectangle blocks. After feature selection, we obtain a small set of the best rectangle features as seeds for generating additional non-rectangle blocks. Two kinds of operations on these seeds to change their shapes are defined, as illustrated in Fig. 3(b): sticking and pruning. To describe the operations, we distinguish two types of rectangle blocks: positive and negative. To stick is to add a positive rectangle block beside the seed block, and to prune is to add a negative rectangle block inside the seed block. After several such operations, a seed can be propagated into thousands of new blocks, which constitute the new feature space for further training.
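The stick/prune seed expansion can be sketched as follows; the data layout (a combined block as lists of positive and negative rectangles) and the expansion depth are our assumptions, not details from the paper.

```python
def stick(seed, rect):
    """Combined block = seed plus a positive rectangle added beside it."""
    return {"pos": seed["pos"] + [rect], "neg": list(seed["neg"])}

def prune(seed, rect):
    """Combined block = seed with a negative rectangle added inside it."""
    return {"pos": list(seed["pos"]), "neg": seed["neg"] + [rect]}

def expand_seeds(seeds, candidate_rects, rounds=2):
    """Propagate each selected rectangle seed into many combined blocks."""
    pool, frontier = list(seeds), list(seeds)
    for _ in range(rounds):
        frontier = [op(s, r) for s in frontier for r in candidate_rects
                    for op in (stick, prune)]
        pool.extend(frontier)
    return pool
```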
3 Multi-view Pedestrian Detection

Although pedestrians of different poses are not as discriminable as the sub-views in the MVFD problem, in which frontal, left-half-profile, left-full-profile, right-half-profile and right-full-profile are commonly divided, we can still divide pedestrians into three relatively separable classes according to their views: frontal/rear, left-profile and right-profile. We use Vector Boosting to learn a tree-structure detector for multiview pedestrian detection.

3.1 Vector Boosting

The Vector Boosting algorithm was first proposed by Huang et al. in [19] to deal with multi-view face detection. It handles multi-class classification problems by means of the vectorization of the hypothesis output space and a flexible loss function defined by intrinsic projection vectors; for details, see [19].

3.2 Tree-Structure Detector

The tree-structure detector is illustrated in Fig. 4. Before the branching node, a series of nodes tries to separate different views of positive samples and at the same time discard as many negative samples as possible. They function as a cascade detector [11] in which each node performs a binary decision: positive or negative. The branching node outputs a 3D vector whose components determine which branch or branches the sample should be sent to. For example, the output (1, 1, 0) means the sample may be a left profile pedestrian or a frontal/rear one. After the branching node, there again comes a cascade detector for each branch.
Fig. 4. The tree-structure multi-view pedestrian detector. The gray node is a branching node which outputs a 3D binary decision.
3.3 Training Process

There are three kinds of tree nodes to train: the nodes before the branching node, the branching node, and the nodes after the branching node. Each node is a strong classifier learned by Vector Boosting, denoted by $\mathbf{F}(x)$. In our problem,

$$\mathbf{F}(x) = (F_l(x), F_f(x), F_r(x))^T.$$

The decision boundaries, as stated in [19], become in our problem:

$$P(\omega_N \mid x) = \frac{1}{1 + \exp(2F_l(x)) + \exp(2F_f(x)) + \exp(2F_r(x))}$$

$$P(\omega_L \mid x) = \exp(2F_l(x))\, P(\omega_N \mid x)$$
$$P(\omega_F \mid x) = \exp(2F_f(x))\, P(\omega_N \mid x)$$
$$P(\omega_R \mid x) = \exp(2F_r(x))\, P(\omega_N \mid x)$$

where $P(\omega_N \mid x)$, $P(\omega_L \mid x)$, $P(\omega_F \mid x)$ and $P(\omega_R \mid x)$ are, respectively, the posterior probabilities of the negative class and of the positive classes of the three views. The first kind of node only cares whether the sample is positive or negative, so it only needs to calculate $P(\omega_N \mid x)$. In training, we find a threshold $P_t(\omega_N)$
Fig. 5. Distributions of 3 classes (negative samples, positive samples of left profile and frontal/rear views) in the output space of the first 9 nodes before the branching. It can be seen that after 6 nodes pedestrians of different views can be separated rather well.
according to the detection rate and false alarm rate of the node. If $P(\omega_N \mid x) > P_t(\omega_N)$, the sample is regarded as negative, otherwise positive. The second kind of node, that is, the branching node, tries to separate positive samples of different views, so $P(\omega_L \mid x)$, $P(\omega_F \mid x)$ and $P(\omega_R \mid x)$ are all needed. The output is a 3D vector, in which each dimension is a binary decision made with a corresponding threshold. The nodes in the branches deal with a two-class classification problem, so normal Real AdaBoost [10] learning can be used. One remaining question is how to determine when to branch. In our practice this is done by experiments. Fig. 5 shows, for the first 9 nodes before the branching node, the distributions of negative samples and of positive samples with frontal/rear and left profile views. It can be seen that the pedestrians of different views are well separated at the 9th node; therefore we choose this node as the branching node.

3.4 The Detector Pyramid Framework

Generally speaking, the size of the training samples has a great impact on the performance of the learned detector, both in detection accuracy and in speed. In face detection research, a commonly used size is 24×24 pixels (19×19 and 20×20 were also used in earlier work), which has been demonstrated to be very effective. In pedestrian detection research, 15×20 [12], 24×58 [5] and 64×128 [7][9] have been used; different from face detection research, there is no widely accepted common size. In practice, we found that detectors trained with larger samples have better performance when detecting larger pedestrians, possibly because larger samples offer clearer information for the classification of more complex objects like pedestrians. We therefore use samples of different scales (sizes) to build a detector pyramid. The small-size detector in the pyramid deals with small pedestrians and the large-size detector deals with large ones. The number of layers of the scale pyramid of the input image to be scanned decreases accordingly, which can speed up detection compared with the single-scale detector case.
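For concreteness, the branching-node decision implied by the posteriors of Sect. 3.3 can be written as below; the per-view thresholds are placeholders, not values used in the paper.

```python
import numpy as np

def posteriors(F_l, F_f, F_r):
    """Posterior of the negative class and of the three positive views."""
    denom = 1.0 + np.exp(2 * F_l) + np.exp(2 * F_f) + np.exp(2 * F_r)
    p_neg = 1.0 / denom
    return (p_neg,
            np.exp(2 * F_l) * p_neg,   # P(omega_L | x)
            np.exp(2 * F_f) * p_neg,   # P(omega_F | x)
            np.exp(2 * F_r) * p_neg)   # P(omega_R | x)

def branch(F_l, F_f, F_r, t_l=0.2, t_f=0.2, t_r=0.2):
    """3D binary output, e.g. (1, 1, 0): maybe left profile or frontal/rear."""
    _, p_l, p_f, p_r = posteriors(F_l, F_f, F_r)
    return int(p_l > t_l), int(p_f > t_f), int(p_r > t_r)
```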
4 Experiments

Our training set contains 9,006 positive samples for the frontal/rear view and 2,817 positive samples for the left/right profile views. Pedestrians in the samples are upright, standing or walking. Some samples are shown in Fig. 6. The negative samples are sampled from more than 10,000 images without any humans.
Fig. 6. Positive training samples: (a) frontal/rear views; (b) left profile view; (c) right profile view
The detector pyramid has 3 layers whose sizes are 24×58, 36×87 and 48×116 pixels, respectively. The number of features in each node of the three detectors decreases as the size of the detector increases; for example, the total numbers of features in the first 5 nodes of these three detectors are 75, 53 and 46, respectively. So the speed of the detector increases as its size grows. Because EHOG features with non-rectangle blocks are slower to compute than those with rectangle blocks, for efficiency the feature pool for the first several nodes contains only the rectangle ones, which guarantees a faster speed. Fig. 7 shows the first three (pairs of) features selected in the 24×58 detector. It can be seen that the second feature captures the edges of the shoulders and the third captures the edges of the feet. The detection speed of our detector is about 1.2 FPS on a 320×240 pixel image with a 3.06 GHz CPU.
Fig. 7. The first three features selected and their corresponding dominant orientations
We evaluate our detector on two testing sets: one is Wu et al.'s testing set [5], which contains pedestrians of frontal/rear view, and the other is the INRIA testing set [7]. Wu's testing set contains 205 photos with 313 humans of frontal and rear view. Fig. 8(a) shows the ROC curves of our detector (including the detector pyramid and a
Fig. 8. (a) ROC curves of evaluation on Wu’s testing set [5]. (b) Miss-rate/FPPW curves on INRIA testing set [7].
Fig. 9. Some detection results on Wu’s frontal/rear testing set [5]
Fig. 10. Some detection results on INRIA testing set [7]
24×58 detector), Wu's edgelet full-body detector, and their combined detector. It can be seen that our detector pyramid is better in accuracy than the full-body detector and the combined detector, and is better than the single detector too. Some detection results on Wu's test set are shown in Fig. 9. The INRIA testing set contains 1805 64×128 images of humans with a wide range of variations in pose, appearance, clothing, illumination and background. Fig. 8(b) shows the comparative results as miss-rate/FPPW (False Positives Per Window) curves. We can see that our method is comparable with Zhu's method when the FPPW is low. At 10^{-4} FPPW, the detection rate is 90%. Some detection results are shown in Fig. 10.
5 Conclusion

In this paper, a multiview pedestrian detection method based on the Vector Boosting algorithm is presented. The HOG features are extended to form EHOG features via dominant orientations. Blocks of combined rectangles with their dominant orientations constitute the feature pool. The Vector Boosting algorithm is used to learn a tree-structure detector for multiview pedestrian detection based on EHOG features. Further, a detector pyramid framework over several pedestrian scales is proposed for better performance. This results in a high-performance MVPD system that can be very useful in many practical applications including visual surveillance. We are planning to extend this research to video for pedestrian tracking in the future.
Acknowledgement This work is supported in part by National Science Foundation of China under grant No.60673107 and it is also supported by a grant from Omron Corporation.
References

1. Gavrila, D.M.: Sensor-based Pedestrian Protection. IEEE Intelligent Systems, 77–81 (2001)
2. Zhao, T.: Model-based Segmentation and Tracking of Multiple Humans in Complex Situations. In: CVPR 2003 (2003)
3. Ogale, N.A.: A Survey of Techniques for Human Detection from Video. University of Maryland, Technical report (2005)
4. Munder, S., Gavrila, D.M.: An Experimental Study on Pedestrian Classification. TPAMI 28(11) (2006)
5. Wu, B., Nevatia, R.: Detection of Multiple, Partially Occluded Humans in a Single Image by Bayesian Combination of Edgelet Part Detectors. In: Sebe, N., Lew, M.S., Huang, T.S. (eds.) Computer Vision in Human-Computer Interaction. LNCS, vol. 3766, Springer, Heidelberg (2005)
6. Wu, B., Nevatia, R.: Tracking of Multiple, Partially Occluded Humans Based on Static Body Part Detection. In: CVPR 2006 (2006)
7. Dalal, N., Triggs, B.: Histograms of Oriented Gradients for Human Detection. In: CVPR 2005 (2005)
8. Dalal, N., Triggs, B., Schmid, C.: Human Detection Using Oriented Histograms of Flow and Appearance. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, Springer, Heidelberg (2006)
9. Zhu, Q., Avidan, S., et al.: Fast Human Detection Using a Cascade of Histograms of Oriented Gradients. In: CVPR 2006 (2006)
10. Schapire, R.E., Singer, Y.: Improved Boosting Algorithms Using Confidence-rated Predictions. Machine Learning 37, 297–336 (1999)
11. Viola, P., Jones, M.: Rapid Object Detection Using a Boosted Cascade of Simple Features. In: CVPR 2001 (2001)
12. Viola, P., Jones, M., Snow, D.: Detecting Pedestrians Using Patterns of Motion and Appearance. In: ICCV 2003 (2003)
13. Zhao, L., Thorpe, C.E.: Stereo- and Neural Network-Based Pedestrian Detection. IEEE Trans. on Intelligent Transportation Systems 1(3) (2000)
14. Gavrila, D.M.: The Visual Analysis of Human Movement: A Survey. Computer Vision and Image Understanding 73(1), 82–98 (1999)
15. Wu, B., Ai, H., et al.: Fast Rotation Invariant Multi-View Face Detection Based on Real AdaBoost. In: FG 2004 (2004)
16. Li, S.Z., Zhu, L., et al.: Statistical Learning of Multi-View Face Detection. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2352, Springer, Heidelberg (2002)
17. Jones, M., Viola, P.: Fast Multi-view Face Detection. MERL-TR2003-96 (July 2003)
18. Huang, C., Ai, H.Z., et al.: Vector Boosting for Rotation Invariant Multi-View Face Detection. In: Sebe, N., Lew, M.S., Huang, T.S. (eds.) Computer Vision in Human-Computer Interaction. LNCS, vol. 3766, Springer, Heidelberg (2005)
19. Huang, C., Ai, H.Z., et al.: High-Performance Rotation Invariant Multiview Face Detection. TPAMI 29(4), 671–686 (2007)
Pedestrian Detection Using Global-Local Motion Patterns Dhiraj Goel and Tsuhan Chen Department of Electrical and Computer Engineering Carnegie Mellon University, U.S.A.
[email protected],
[email protected]
Abstract. We propose a novel learning strategy called Global-Local Motion Pattern Classification (GLMPC) to localize pedestrian-like motion patterns in videos. Instead of modeling such patterns as a single class that alone can lead to high intra-class variability, three meaningful partitions are considered - left, right and frontal motion. An AdaBoost classifier based on the most discriminative eigenflow weak classifiers is learnt for each of these subsets separately. Furthermore, a linear threeclass SVM classifier is trained to estimate the global motion direction. To detect pedestrians in a given image sequence, the candidate optical flow sub-windows are tested by estimating the global motion direction followed by feeding to the matched AdaBoost classifier. The comparison with two baseline algorithms including the degenerate case of a single motion class shows an improvement of 37% in false positive rate.
1 Introduction
Pedestrian detection is a popular research problem in the field of computer vision. It finds applications in surveillance, fast automatic video browsing for pedestrians, activity monitoring, etc. The problem of localizing pedestrians in image sequences, however, is extremely challenging owing to the variations in pose, articulation and clothing. The resulting high intra-class variability of the pedestrian class is further exaggerated by background clutter and the presence of pedestrian-like upright objects in the scene such as trees and windows. Traditionally, appearance and shape cues have been the popular discernible features to detect pedestrians in a single image. Oren et al. [1] devised one of the first appearance-based algorithms using wavelet responses, while more recently, histograms of oriented gradients [2] have been used to learn a shape-based model to segment out humans. However, in an uncontrolled environment the appearance cues alone aren't faithful enough for reliable detection. Recently, motion cues have been gaining a lot of interest for pedestrian detection. In general, pedestrians need to be detected in videos, where the high correlation between consecutive frames can be used to good effect. While human appearances can be deceptive in a single image, their motion patterns are significantly different from other kinds of motions like those of vehicles (Fig. 2). The articulation of the human body while in motion, due to the movement of limbs and torso, can
Fig. 1. Overview of the proposed system
provide useful cues to localize moving pedestrians, especially against a stationary cluttered background. To model such a phenomenon, spatio-temporal filters based on shifted frame differences were used by Viola et al. [3], thus combining the advantages of both shape and motion cues. Fablet and Black [4] used dense optical flow to learn a generative human-motion model, while a discriminative model based on Support Vector Machines was trained by Hedvig [5]. The common feature of all the above techniques is that they consider pedestrians as a single class. Though on one hand using human motion patterns circumvents many problems posed by appearance cues, considering all such patterns as a single class can still lead to a very challenging classification problem. In this paper, we present a novel learning strategy to partition the human motion patterns into natural subsets with less variability. The rest of the paper is organized as follows: Sect. 2 provides an overview of the proposed method, Sect. 3 introduces the learning strategy based on partitioning the human motion pattern space, Sect. 4 reports the comparison with two baseline algorithms and detection results, and Sect. 5 concludes with a discussion.
2 Overview
Figure 1 gives an overview of the proposed system to detect pedestrian-like motion patterns in image sequences. Figure 2 illustrates some examples of such patterns. Due to the high intra-class variability of the flow patterns generated by pedestrians, modeling all such patterns with a single classifier is difficult. Hence, these are divided into meaningful subsets according to the global motion direction - left, right and frontal. As a result, the classification is divided into two stages. A linear three-class Support Vector Machine (SVM) classifier is trained to estimate the global motion direction. Next, a cascade of AdaBoost classifiers with the most discriminative eigenflow vectors is learnt for each of the global motion subsets. The motion patterns in the same partition share some similarity, and hence the intra-class variability of each of these subsets is less than that of the whole set, rendering the classification less challenging.
Fig. 2. (a) Pedestrian sample images along with their horizontal optical flow for right, left and frontal motion subsets. (b) Sample labeled images from the non-pedestrian data and examples of non-pedestrian horizontal flow.
At the time of testing, the dense optical flow image is searched for pedestrian-like motion patterns using sub-windows of different sizes. For every candidate sub-window, the global motion direction is first estimated using the linear three-class SVM classifier. Thereafter, it is tested against the matching AdaBoost classifier.
2.1 Computing Dense Optical Flow
Dense optical flow is used as a measure to estimate motion between consecutive frames. Though numerous methods exist in the literature to compute dense flow, the 2-D Combined Local Global method [8] was chosen since it has been shown to provide very accurate flow. Furthermore, using a bidirectional multi-grid strategy, it can work in real time [9] at up to 40 fps for a 200×200 pixel image. The final implementation used for pedestrian detection incorporates a slight modification in the weighting function of the regularization term, as mentioned in [6].
2.2 Training Data
The anatomy of the learning algorithm necessitates a pedestrian data set labeled according to the global motion. For this purpose, the CASIA Gait database [7] was chosen. A total of eight global motion directions were considered, which were merged to give three dominant motions - left, right and frontal (Fig. 2(a)). The left and right motion subsets capture lateral motion, while motion perpendicular to the camera plane is contained in the frontal motion subset. Dense optical flow was computed for the videos, and the horizontal (u) and vertical (v) flows for the labeled pedestrians were cropped. The collection of these flow patterns formed the training and test data for the classification. Specifically, the frontal motion subset had 2500 training data samples and 1000 test data samples. The other two motion subsets had 4800 training data samples and 2000 test data samples each. The cropped data samples were resized to 16×8 pixels, normalized to lie in the range [−1, 1] and concatenated to form a 256-dimensional feature vector [u1, u2, ..., u128, v1, v2, ..., v128]. The non-pedestrian data was generated by hand-labeling sub-windows with non-zero flow in videos containing moving vehicles. To automate the process,
an AdaBoost classifier was trained on the set of all pedestrian and non-pedestrian data and was run on other videos to generate additional non-pedestrian flow patterns (from the false positives). The non-pedestrian data samples are resized and normalized in the same way as the pedestrian data. Approximately 120,000 such samples were generated, with some examples shown in Fig. 2(b).
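The flow-to-feature preprocessing described above might look as follows; OpenCV is an assumed dependency for resizing, and the max-absolute-value normalization is only one plausible way to map the flow into [−1, 1].

```python
import numpy as np
import cv2  # assumed dependency for resizing

def flow_feature(u_crop, v_crop):
    """Resize a cropped (u, v) flow pair to 16x8 and build the 256-D feature."""
    u = cv2.resize(u_crop.astype(np.float32), (8, 16))   # 16 rows x 8 columns
    v = cv2.resize(v_crop.astype(np.float32), (8, 16))
    feat = np.concatenate([u.ravel(), v.ravel()])          # 128 + 128 = 256 dims
    return feat / (np.abs(feat).max() + 1e-12)             # values in [-1, 1]
```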
3 Classification Strategy
This section describes the classification strategy to distinguish the motion patterns of pedestrians from other kinds of motions, such as those of vehicles. As illustrated in Fig. 1, it is divided into two stages: estimating the global motion direction (Section 3.1) followed by testing against the discriminative classifier (Section 3.2). The training procedure for the latter has been described in [6]. The final detection performance depends on the accuracy of both stages and is greatly influenced by the taxonomy of the pedestrian motion patterns. A maximum of eight possible motion classes were considered, as shown in Fig. 2(a). Building a discriminative classifier for each of them results in a group of classifiers that are highly discriminative for the motion direction they are trained for. Thus, the accuracy in estimating the motion direction becomes crucial to the overall performance, i.e. a sub-window containing a strictly left moving pedestrian should be fed to the classifier trained to detect strictly left moving pedestrians. However, it is very difficult to reliably estimate the motion direction among these eight subsets, so the detection rate of the classifier as a whole degrades. The natural modification is to merge the different motion subsets such that the motion direction can be estimated faithfully, but at the same time intra-class variability is kept low. Splitting the motion patterns into three subsets - left, right and frontal - gave the best performance.
3.1 Estimating Global Motion
In order to decide which motion-specific discriminative classifier to use, it is important to first estimate the global motion. The mean motion direction of the pedestrian data was found to be unreliable for achieving such an objective. Hence, a linear three-class SVM classifier was trained. This classifier acts as a switch that assigns the queried data samples to their appropriate classifiers, which have been specifically trained to handle those particular flow patterns. The labeled pedestrian data is used to train this switch classifier. The same number of training data samples, about 2000 each, was used for all three classes to obviate bias towards any particular class. Further, each of the classes contains the same proportion of the different motions merged within it. For example, the left class contains the same number of samples for strict left motion, left front at 45° and left back at 45°. Figure 3 shows the class confusion matrix for the learned model. 348 support vectors were chosen by the model, which is less than 6% of the number of training data samples, indicating a well-generalized classifier.
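A minimal sketch of the switch classifier, assuming scikit-learn's linear SVM as the solver (the paper does not name an implementation):

```python
import numpy as np
from sklearn.svm import LinearSVC  # assumed solver

def train_switch(X, labels):
    """X: (n_samples, 256) flow features; labels in {'left', 'right', 'frontal'}."""
    clf = LinearSVC(C=1.0)          # one-vs-rest linear three-class SVM
    clf.fit(X, labels)
    return clf

def global_motion(clf, feat):
    """Route a candidate sub-window to its motion-specific classifier."""
    return clf.predict(feat.reshape(1, -1))[0]
```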
Fig. 3. Class confusion matrix for estimating the global motion direction using the three-class linear SVM classifier:

             Frontal   Right   Left
  Frontal     0.964    0.023   0.013
  Right       0.022    0.978   0.00
  Left        0.019    0.00    0.981
Fig. 4. Magnitude of the mean and the first two eigenflow vectors of the horizontal optical flow for the training pedestrian data
The trained switch classifier is used to allocate non-pedestrian data to each of the motion classes for training the discriminative motion-specific classifiers. Out of 120,000 data samples, about 75,000 were classified as belonging to the frontal motion class, 25,000 were categorized as left motion, and the remaining 20,000 as right motion.
3.2 Learning the Discriminative Classifiers
This section describes the learning procedure to train the discriminative motion-specific classifiers. In total, three separate classifiers are learnt, one for each global motion. The learning process is the same for all of them. Hence, for the sake of clarity, the motion-specific term is dropped in this section, and whenever pedestrian and non-pedestrian data are mentioned, they refer to the data belonging to a particular global motion, unless stated otherwise. It is worth mentioning that the symmetry of the left and right classifiers can be exploited by training the classifier for one and using its mirror image (after changing the sign of the horizontal motion) for the other.

Weak Classifier. Principal Component Analysis was done separately on the pedestrian and non-pedestrian data to obtain the eigenvectors of the optical flow, known as eigenflow [10]. Figure 4 shows the magnitude of the mean and the first two u-flow eigenvectors for each of the three global motions. As is evident, the mean flows represent the global motion, while the eigenflow vectors capture the poses and the articulation of the human body, especially the movement of the limbs. For the frontal motion, the mean is not that informative since it contains both forward and backward moving pedestrians. Using all the eigenflow vectors, 256 for each of the pedestrian and non-pedestrian data, we have a total of 512 eigenflow vectors that act as a pool of features for AdaBoost. Taking the magnitude of the correlation between the training
Table 1. Feature selection and training the AdaBoost classifier

– Given the training data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ where $x_i$ is the eigenflow and $y_i$ is 0 for non-pedestrian and 1 for pedestrian examples.
– Initialize the weights $w_{1,i} = \frac{1}{2l}, \frac{1}{2m}$ for $y_i = 0, 1$ respectively, where $l$ and $m$ are the number of pedestrian and non-pedestrian examples.
– For $t = 1, \ldots, T$:
  1. Normalize the weights $w_{t,i} \leftarrow \frac{w_{t,i}}{\sum_{j=1}^{n} w_{t,j}}$
  2. Select the best weak classifier $h_t$ with respect to the weighted error: $\epsilon_t = \min_j \sum_i w_i\, |h_j(x_i) - y_i|$
  3. Update the weights: $w_{t+1,i} = w_{t,i}\,\beta_t^{1-e_i}$, where $e_i = 0$ if example $x_i$ is correctly classified by $h_t$, $e_i = 1$ otherwise, and $\beta_t = \frac{\epsilon_t}{1 - \epsilon_t}$.
– The strong classifier is given by:
$$C(x) = \begin{cases} 1, & \text{if } \sum_{t=1}^{T}\alpha_t h_t(x) \geq \frac{1}{2}\sum_{t=1}^{T}\alpha_t \\ 0, & \text{otherwise,} \end{cases} \quad (2)$$
where $\alpha_t = \log\frac{1}{\beta_t}$.
data $x$ and an eigenflow vector $z_j$, and finding the optimum threshold $\theta_j$ that minimizes the overall classification error, yields a weak classifier $h_j$:

$$h_j(x) = \begin{cases} 1, & \text{if } |x^T z_j| \gtrless \theta_j \\ 0, & \text{otherwise.} \end{cases} \quad (1)$$

Feature Selection and AdaBoost. The procedure to choose the most discriminative of the weak classifiers, as illustrated in Fig. 5(a), is motivated by the face detection algorithm proposed in [11]. Table 1 describes the complete algorithm. The final strong classifier is a weighted vote of the weak classifiers (Eq. (2)). Figures 5(b), (c) and (d) depict the horizontal component of the two eigenflow features selected by this algorithm for each of the global motion subsets. The selection of the most discriminative vectors follows a similar trend in all three cases. While the first one responds to motion near the boundary, the second one captures the motion within the window. It is also interesting to note the pattern at the bottom of the first eigenflow vectors - those belonging to the right and left subsets take into account the spread of the legs in lateral motion, while the one for the frontal motion restricts any such articulation. Individually, they may perform poorly, but in combination they can perform much better. Table 2 juxtaposes the false positive rate (FPR) of the GLMPC classifier with two other classifiers for a fixed detection rate of 98%. The first one is the linear SVM classifier, which is clearly outperformed in both speed and accuracy. 13,313 support vectors were chosen by the linear SVM, which is more than 50% of the
Fig. 5. (a) Feature selection using AdaBoost. (b), (c) and (d) Two u-eigenflow vectors selected by AdaBoost for the Right, Left and Frontal subsets, respectively.

Table 2. False positive rate of the different classifiers at a detection rate of 98%

                        SVM    LMPC   GLMPC
  False Positives (%)   62.3   1.16   0.74
training data, an indication of a poorly generalized classifier. Besides, such a high number of support vectors would result in about 1.3 million dot products per frame, assuming 100 candidate sub-windows in a frame. On the other hand, classification using GLMPC requires only 348 dot products for the three-class SVM switch and 35 dot products for the AdaBoost cascade (full cascade in the worst case). The other classifier considered for comparison is the degenerate case of the proposed algorithm, which we refer to as the Local Motion Pattern Classifier (LMPC) [6], when all the pedestrian data is considered as one single class. GLMPC provides a reduction of 37% in FPR, which is further amplified by the fact that there may be hundreds of candidate sub-windows in a frame.

Cascade of AdaBoost Classifiers. In general, in any scene, flow patterns that share no resemblance with human motion should be discarded quickly, while those that share greater similarity require more complex analysis. A cascade of AdaBoost classifiers [11] can achieve this. The early stages in the cascade have a smaller number of weak classifiers and hence aren't very discriminative, but are really fast at classification. The later stages consist of more complex classifiers with a larger number of weak classifiers. To be labeled as a detection, a candidate data sample has to pass through all the stages. Hence, the classifier spends most of its time analyzing difficult motion patterns and rejects easy ones quickly. In our implementation, there are two stages in the cascade for each of the global motion classifiers. The same pedestrian data was used across all stages. For training the classifier, the ratio of pedestrian to non-pedestrian data (for both training and test data) was kept at one for the left and right motion subsets and 0.5 for the frontal motion. Non-pedestrian data for the next stage in the cascade is generated by collecting the false positives after running the existing classifier on different videos taken from both static and moving cameras. The
final frontal classifier has 5 weak learners in the first stage and 20 in the second. The corresponding numbers for the right and the left motion classifiers are 10 and 25, and 10 and 20 respectively.
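Putting Eq. (1), Eq. (2) and the cascade together, a hedged test-time sketch could look like the following; the data layout and the fixed inequality direction in the weak classifier are our simplifications.

```python
import numpy as np

def weak_response(x, z, theta):
    """Eigenflow weak classifier h_j of Eq. (1): threshold on |x^T z_j|."""
    return 1 if abs(x @ z) > theta else 0

def strong_response(x, weak_params, alphas):
    """Strong classifier C(x) of Eq. (2): weighted vote of the weak classifiers."""
    score = sum(a * weak_response(x, z, th) for a, (z, th) in zip(alphas, weak_params))
    return score >= 0.5 * sum(alphas)

def cascade_detect(x, stages):
    """A window must pass every (weak_params, alphas) stage, cheap to complex."""
    return all(strong_response(x, wp, al) for wp, al in stages)
```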
4 Experiments
For detecting human motion patterns, the dense optical flow image is searched with sub-windows of different scales, seven in total. Every scale also has an associated step size. Naturally, larger sub-windows have bigger step sizes to prevent redundancy due to excessive overlap between neighboring sub-windows. Knowing the camera orientation a priori can greatly reduce the search space, since pedestrians need to be looked for only on the ground plane. Exploiting such information reduced the total number of scanned sub-windows in the image by almost half. Finally, only the candidate sub-windows that satisfy the minimum flow thresholds are resized and normalized before being fed to the classifier. Again, these thresholds vary with the scale, as larger sub-windows search for nearby pedestrians, which should appear to move faster due to parallax. Figure 6 depicts the detection results of the linear SVM, LMPC and GLMPC classifiers after the first stage in the cascade. The overlapping windows have not been merged, to show all the detected sub-windows. As is evident, GLMPC is able to localize the pedestrians much better than either of the two other methods and, in addition, gives fewer false positives. The full-cascade GLMPC classifier was tested for pedestrian patterns in different test videos and works at 2 fps on a Core 2 Duo 2 GHz PC. Figure 7 shows some of the relevant results. The algorithm was tested with multiple moving pedestrians in the presence of other moving objects, mainly cars, and is able to detect humans in different poses and moving at different paces (Fig. 7(a)). Occluding objects can lead to false rejections, since the flow in the concerned sub-window doesn't conform to the pedestrian motion. This is evident in the
Fig. 6. Comparison of the performance of the GLMPC classifier with linear SVM and LMPC after Stage 1 in the cascade (panels: (a) SVM, (b) LMPC, (c) GLMPC). Color coding - white if the direction is not known, red for right-moving pedestrians, yellow for left and black for frontal motion.
Fig. 7. Final detection results (panels (a)-(d)) without merging the overlapping detections
second image in Fig. 7(a). Stationary and far-off pedestrians that are moving very slowly can also be missed owing to their negligible optical flow. The system is also robust to illumination changes (Fig. 7 (b)) and can detect moving children (Fig. 7(c)) even though the training data was composed of only adult pedestrians. Moreover, notice the panning of the camera over time in the image sequence, illustrating the robustness of the system towards small camera motion. The videos captured from a slow moving car were also tested and the system still manages to detect pedestrians (Fig. 7 (d)).
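The multi-scale scan described at the start of this section can be sketched as below, reusing the helpers from the earlier sketches; the scale list, step sizes and minimum-flow thresholds are illustrative placeholders only.

```python
import numpy as np

SCALES  = [(32, 16), (48, 24), (64, 32)]   # (height, width), placeholder values
STEP    = {32: 4, 48: 6, 64: 8}            # bigger windows take bigger steps
MINFLOW = {32: 0.5, 48: 1.0, 64: 1.5}      # nearer pedestrians appear to move faster

def scan(u, v, switch, cascades):
    """Yield detections (y, x, h, w, direction) over a dense flow field (u, v)."""
    H, W = u.shape
    for h, w in SCALES:
        s = STEP[h]
        for y in range(0, H - h, s):
            for x in range(0, W - w, s):
                win_u, win_v = u[y:y+h, x:x+w], v[y:y+h, x:x+w]
                if np.abs(win_u).mean() + np.abs(win_v).mean() < MINFLOW[h]:
                    continue                       # not enough motion in this window
                feat = flow_feature(win_u, win_v)  # from the earlier sketch
                d = global_motion(switch, feat)    # three-class SVM switch
                if cascade_detect(feat, cascades[d]):
                    yield y, x, h, w, d
```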
5 Discussion
A novel learning strategy to detect moving pedestrians in videos using motion patterns was introduced in this paper. Instead of considering all human motion patterns as one class, they were split into three meaningful subsets dictated by the global motion direction. A cascade of AdaBoost classifiers with the most discriminative eigenflow vectors was learnt for each of these global motion
subsets. Further, a linear three-class SVM classifier was trained that acts as a switch to decide which Adaboost classifier to choose to determine if a pedestrian is contained in the candidate sub-window. It was shown that the proposed algorithm is far superior to the linear SVM and provides an improvement of 37% in FPR as compared to LMPC. Moreover, the proposed system has been shown to be robust to slow illumination changes, camera motion and can even detect children. Apart from conspicuous advantages of accuracy, GLMPC allows for extensibility to incorporate new pedestrian motion like jumping without retraining the whole classifier again. Only a couple of changes would be required. The first would be to retrain the motion switch multi-class SVM classifier to take into account the new motion type. The next would be to train a new AdaBoost classifier to discriminate between the jumping motion of the pedestrians and other kinds of motions. The already trained classifiers for left, right and frontal motion can be used in their original form. An important area of research for the future work would be to compute the ROC curve for the classifiers like GLMPC that don’t have a single global threshold. Work on similar lines has been done by Xiaoming et al. [10].
References

1. Oren, M., Papageorgiou, C., Sinha, P., Osuna, E., Poggio, T.: Pedestrian Detection Using Wavelet Templates. CVPR, 193–199 (1997)
2. Dalal, N., Triggs, B.: Histograms of Oriented Gradients for Human Detection. CVPR 1, 886–893 (2005)
3. Viola, P., Jones, M., Snow, D.: Detecting Pedestrians Using Patterns of Motion and Appearance. ICCV 2, 734–741 (2003)
4. Fablet, R., Black, M.J.: Automatic Detection and Tracking of Human Motion with a View-Based Representation. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2352, pp. 476–491. Springer, Heidelberg (2002)
5. Sidenbladh, H.: Detecting Human Motion with Support Vector Machines. ICPR 2, 188–191 (2004)
6. Goel, D., Chen, T.: Real-time Pedestrian Detection Using Eigenflow. In: IEEE International Conference on Image Processing, IEEE Computer Society Press, Los Alamitos (2007)
7. http://www.cbsr.ia.ac.cn/Databases.htm
8. Bruhn, A., Weickert, J., Schnörr, C.: Lucas/Kanade Meets Horn/Schunck: Combining Local and Global Optic Flow Methods. IJCV 61, 211–231 (2005)
9. Bruhn, A., Weickert, J., Kohlberger, T., Schnörr, C.: A Multigrid Platform for Real-Time Motion Computation with Discontinuity-Preserving Variational Methods. IJCV 69, 257–277 (2006)
10. Liu, X., Chen, T., Kumar, B.V.: Face Authentication for Multiple Subjects Using Eigenflow. Pattern Recognition 36, 313–328 (2003)
11. Viola, P., Jones, M.: Rapid Object Detection Using a Boosted Cascade of Simple Features. CVPR (2001)
Qualitative and Quantitative Behaviour of Geometrical PDEs in Image Processing Arjan Kuijper Radon Institute for Computational and Applied Mathematics, Linz, Austria
Abstract. We analyse a series of approaches to evolve images. It is motivated by combining Gaussian blurring, the Mean Curvature Motion (used for denoising and edge-preserving), and maximal blurring (used for inpainting). We investigate the generalised method using the combination of second order derivatives in terms of gauge coordinates. For the qualitative behaviour, we derive a solution of the PDE series and mention its properties briefly. Relations with general diffusion equations are discussed. Quantitative results are obtained by a novel implementation whose stability and convergence is analysed. The practical results are visualised on a real-life image, showing the expected qualitative behaviour. When a constraint is added that penalises the distance of the results to the input image, one can vary the desired amount of blurring and denoising.
1 Introduction
Already in the early years of image analysis the Gaussian filter played an important role. As a side effect of Koenderink's observation that this filter relates to human observation due to the causality principle [1], it opened the way for the application of diffusion processes in image analysis. This is due to the fact that the Gaussian filter is the Green's function of the heat equation, a linear partial differential equation (PDE). Because of its linearity, details are blurred during evolution. Therefore, various non-linear PDEs were developed to analyse and process images. A desirable aspect in the evolution of images is independence of the Cartesian coordinate system, by choosing one that relates directly to image properties. One can think of the famous Perona-Malik equation [2] using edge strength. Using such so-called gauge coordinates, Alvarez et al. derived the Mean Curvature Motion [3] by blurring only along edges. On the other hand, the opposite approach can be used in inpainting [4,5]: blurring perpendicular to edges. Perhaps surprisingly, when combining these two methods one obtains the heat equation (see Section 2). In this paper we propose a series of PDEs obtained by a parameterised linear combination of these two approaches. By doing so, one is able to influence the
This work was supported by FFG, through the Research & Development Project ‘Analyse von Digitaler Bilder mit Methoden der Differenzialgleichungen’, and the WWTF ‘Five senses-Call 2006’ project ‘Mathematical Methods for Image Analysis and Processing in the Visual Arts’.
evolution of the methods discussed above by adjusting the parameters. This relates to increasing or decreasing blurring that is locally either tangent or normal to isophotes. Although one cannot obtain a filter as the Green's function for the general case, solutions give insight into the qualitative behaviour of the PDE. This is done in Section 2. Relations with general diffusion processes are also given. The PDEs need a stable numerical implementation, which depends on the parameters. In Section 3 a novel numerical scheme is given, including a stability analysis. This scheme allows larger time steps than conventional finite difference schemes, and remains stable at corner points, in contrast to standard finite difference schemes.
2 Geometric PDEs: Second Order Gauge Derivatives
An image can be thought of as a collection of curves with equal value, the isophotes. Most isophotes are non-self-intersecting. At extrema an isophote reduces to a point; at saddle points the isophote is self-intersecting. At the non-critical points gauge coordinates $(v, w)$ (or $(T, N)$, or $(\xi, \eta)$, or ...) can be chosen [6,7,8]. Gauge coordinates are locally set such that the $v$ direction is tangent to the isophote and the $w$ direction points in the direction of the gradient vector. Consequently, $L_v = 0$ and $L_w = \sqrt{L_x^2 + L_y^2}$. Of interest are the following second order structures:

$$L_{vv} = \frac{L_x^2 L_{yy} + L_y^2 L_{xx} - 2 L_x L_y L_{xy}}{L_x^2 + L_y^2} \quad (1)$$

$$L_{ww} = \frac{L_x^2 L_{xx} + L_y^2 L_{yy} + 2 L_x L_y L_{xy}}{L_x^2 + L_y^2} \quad (2)$$

These gauge derivatives can be expressed as a product of gradients and the Hessian matrix $H$ of second order derivatives:

$$L_{ww} L_w^2 = \nabla L \cdot H \cdot \nabla^T L \quad (3)$$
$$L_{vv} L_w^2 = \nabla L \cdot \tilde{H}^{-1} \cdot \nabla^T L, \quad (4)$$

with $\nabla L = (L_x, L_y)$, $H$ the Hessian matrix, and $\tilde{H}^{-1} = \det H \cdot H^{-1}$. Note that the expressions are invariant with respect to the spatial coordinates. Combining the two different expressions for the second order derivatives in gauge coordinates, Eqs. (1)-(2), yields

$$L_t = p L_{vv} + q L_{ww}. \quad (5)$$
Several parameter settings have relations to PDEs and histogram operations [9]:
– (p, q) = (1, 1): Gaussian scale space [1], repeated infinitesimal mean filtering,
– (p, q) = (1, 0): Mean Curvature Motion [3,10], repeated infinitesimal median filtering,
– (p, q) = (1, −2): infinitesimal mode filtering,
– (p, q) = (0, 1): maximal blurring, used for inpainting [4].
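A direct, hedged implementation of Eqs. (1)-(5) on a discrete image using Gaussian derivatives; the smoothing scale and the regularizing ε are our choices (Section 3 analyses a different, kernel-based scheme):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gauge_flow(L, p, q, sigma=1.0, eps=1e-8):
    """Right-hand side p*L_vv + q*L_ww of Eq. (5) for an image L."""
    Lx  = gaussian_filter(L, sigma, order=(0, 1))   # derivative along axis 1 = x
    Ly  = gaussian_filter(L, sigma, order=(1, 0))   # derivative along axis 0 = y
    Lxx = gaussian_filter(L, sigma, order=(0, 2))
    Lyy = gaussian_filter(L, sigma, order=(2, 0))
    Lxy = gaussian_filter(L, sigma, order=(1, 1))
    w2  = Lx**2 + Ly**2 + eps                           # L_w^2, regularized
    Lvv = (Lx**2*Lyy + Ly**2*Lxx - 2*Lx*Ly*Lxy) / w2    # Eq. (1)
    Lww = (Lx**2*Lxx + Ly**2*Lyy + 2*Lx*Ly*Lxy) / w2    # Eq. (2)
    return p * Lvv + q * Lww
```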
2.1 A General Solution

In this section we derive the general solution for Eq. (5). As is the case with gauge coordinates, it is assumed that the solution is independent of direction, size, dimension, and orientation. Therefore the dimensionless variable $\xi = \frac{x^2+y^2}{t}$ is used. Second, an additional $t$-dependency is assumed. This is inspired by the observation that the solution for $p = q = 1$ (the Gaussian filter) contains the factor $t^{-1}$. So the starting assumption is $L(x, y; t) = t^n f(\xi)$, and equation (5) becomes

$$t^{n-1}\left(-n f(\xi) + (2(p+q) + \xi) f'(\xi) + 4 q \xi f''(\xi)\right) = 0. \quad (6)$$

The solution of this ODE with respect to $f(\xi)$ and $\xi$ is given by

$$f(\xi) = e^{-\frac{\xi}{4q}}\left(c_1\, U\!\left(n + \tfrac{p+q}{2q}, \tfrac{p+q}{2q}, \tfrac{\xi}{4q}\right) + c_2\, L^{\frac{p-q}{2q}}_{-\frac{p+2nq+q}{2q}}\!\left(\tfrac{\xi}{4q}\right)\right) \quad (7)$$

Here $U(a, b, z)$ is a confluent hypergeometric function and $L^b_a(z)$ is the generalised Laguerre polynomial [11]. Taking $r = \frac{p+q}{2q}$, we find

$$L(x, y; t) = e^{-\frac{x^2+y^2}{4qt}}\, t^n \left(c_1\, U\!\left(n + r, r, \tfrac{x^2+y^2}{4qt}\right) + c_2\, L^{r-1}_{-n-r}\!\left(\tfrac{x^2+y^2}{4qt}\right)\right). \quad (8)$$

The formula reduces dramatically for $n = -r$, since $U(0, \cdot, \cdot) = L^{\cdot}_0(\cdot) = 1$. This gives the following positive solutions of Eq. (5):

$$L(x, y; t) = \frac{t^{-\frac{p+q}{2q}}}{4\pi q}\, e^{-\frac{x^2+y^2}{4qt}} \quad (9)$$

The simplified diffusion $(p, q) = (b-1, 1)$,

$$L_t = L_{ww} + (b-1) L_{vv} \quad (10)$$

has solution

$$L(x, y; t) = \frac{1}{4\pi t^{b/2}}\, e^{\frac{-x^2-y^2}{4t}}. \quad (11)$$

Qualitatively these types of flows are just a rescaling of standard Gaussian blurring, albeit that the linearity between subsequent images in a sequence with increasing scale $t$ is lost. Only for $b = 2$ is the filter linear, resulting in the Gaussian filter. For $b = 1$ one obtains maximal blurring. Note that a solution can only be obtained when $q \neq 0$. This implies that the $L_{ww}$ direction (i.e. blurring) must be present in the flow. Solutions for the pure $L_{vv}$ flow - mean curvature motion - are given by $L(x, y; t) = L\left(\sqrt{x^2 + y^2 + 2t}\right)$, which is not dimensionless.
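For reference, the closed-form solutions (9) and (11) written out numerically (a convenience for inspection, not part of the paper):

```python
import numpy as np

def solution_eq9(x, y, t, p, q):
    """Positive solution (9) of L_t = p L_vv + q L_ww (requires q > 0)."""
    return (t ** (-(p + q) / (2.0 * q)) / (4.0 * np.pi * q)
            * np.exp(-(x**2 + y**2) / (4.0 * q * t)))

def solution_eq11(x, y, t, b):
    """Solution (11) of the simplified diffusion L_t = L_ww + (b-1) L_vv."""
    return np.exp(-(x**2 + y**2) / (4.0 * t)) / (4.0 * np.pi * t ** (b / 2.0))
```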
2.2 Nonlinear Diffusion Filtering

The general diffusion equation [12] reads

$$L_t = \nabla \cdot (D \cdot \nabla L). \quad (12)$$
The diffusion tensor $D$ is a positive definite symmetric matrix. With $D = 1$ (when $D$ is considered a scalar, i.e. an isotropic flow), or better, $D = I_n$, we have Gaussian scale space. When $D$ depends on the image, the flow is nonlinear, e.g. in the case of the Perona-Malik equation [7,2] with $D = k^2/(k^2 + \|\nabla L\|^2)$. For $D = L_w^{p-2}$ we have the p-Laplacian [13,14]. To force the equality Eq. (5) = Eq. (12),

$$\nabla \cdot (D \cdot \nabla L) = p L_{vv} + q L_{ww}, \quad (13)$$

$D$ must be a matrix that is dimensionless and that contains only first order derivatives. The most obvious choice for $D$ is $D = \nabla L \cdot \nabla^T L / L_w^2$. This yields, perhaps surprisingly, the Gaussian scale space solution. This is, in fact, the only possibility, as one can verify.
2.3 Constraints
An extra condition may occur in the presence of noise (assume zero mean, variance $\sigma^2$):

$$I = \frac{1}{2}\int_\Omega (L - L_0)^2\, d\Omega = \sigma^2 \quad (14)$$

where $L_0$ is the input image and $L$ the denoised one. The solution of $\min E$ s.t. $I$ is obtained by the Euler-Lagrange equation $\delta E + \lambda \delta I = 0$ with $\delta I = L - L_0$, $\lambda = \frac{\langle \delta E, \delta I\rangle}{\langle \delta I, \delta I\rangle}$, and $\langle \delta I, \delta I\rangle = 2\sigma^2$ (see Eq. (14)). The solution can be reached by a steepest descent evolution $L_t = -(\delta E + \lambda \delta I)$. When we set $\lambda = 0$, an unconstrained blurring process is obtained. Alternatively, $\lambda$ can be regarded as a penalty parameter that limits the $L_2$ difference between the input and output images. A too small value will cause an evolution that forces the image to stay close to the input image.
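One explicit steepest-descent step of the constrained evolution might be sketched as below, reusing gauge_flow() from the earlier sketch for $-\delta E$; the time step is a placeholder.

```python
import numpy as np

def constrained_step(L, L0, p, q, dt, sigma):
    """One step of L_t = -(dE + lambda*dI) with dE = -(p L_vv + q L_ww), dI = L - L0."""
    flow = gauge_flow(L, p, q)                      # -dE
    dI = L - L0
    lam = -np.sum(flow * dI) / (2.0 * sigma**2)     # lambda = <dE, dI> / <dI, dI>
    return L + dt * (flow - lam * dI)
```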
3 Numerical Implementation

The PDE is implemented using Gaussian derivatives [6,15]. As a consequence, larger time steps can be taken. When the spatial derivatives are computed as a convolution ($\ast$) of the original image $L$ with derivatives of a Gaussian $G$, the following results hold: $(G \ast L)_x = (\frac{-x}{2t} G) \ast L$, $(G \ast L)_y = (\frac{-y}{2t} G) \ast L$, $(G \ast L)_{xx} = (\frac{x^2 - 2t}{4t^2} G) \ast L$, $(G \ast L)_{xy} = (\frac{xy}{4t^2} G) \ast L$, and $(G \ast L)_{yy} = (\frac{y^2 - 2t}{4t^2} G) \ast L$. Consequently,

$$(G \ast L)_{xx} + (G \ast L)_{yy} = \left(\frac{x^2 + y^2 - 4t}{4t^2}\, G\right) \ast L \quad (15)$$

$$(G \ast L)_{vv} = \left(\frac{-1}{2t}\, G\right) \ast L \quad (16)$$

$$(G \ast L)_{ww} = \left(\frac{x^2 + y^2 - 2t}{4t^2}\, G\right) \ast L \quad (17)$$

Then we have

$$p L_{vv} + q L_{ww} = \left(\frac{q(x^2 + y^2) - 2t(p+q)}{4t^2}\, G\right) \ast L \quad (18)$$

and Eq. (5) is numerically computed by

$$\frac{L^{n+1}_{j,k} - L^n_{j,k}}{\Delta t} = \left(\frac{q(x^2 + y^2) - 2t(p+q)}{4t^2}\, G\right) \ast L^n_{j,k} \quad (19)$$

where $L^n_{j,k} = \xi^n e^{i(jx+ky)}$ is the Von Neumann solution. The double integral on the right hand side of Eq. (19) reads

$$\int\!\!\int \frac{q\left((\alpha - x)^2 + (\beta - y)^2\right) - 2t(p+q)}{16\pi t^3}\, \xi^n\, e^{i(j\alpha + k\beta) - \frac{(\alpha - x)^2 + (\beta - y)^2}{4t}}\, d\alpha\, d\beta$$

and evaluates to $-\frac{1}{2t}\left(p + q\left(2t(j^2 + k^2) - 1\right)\right)\xi^n e^{-t(j^2+k^2) + i(xj+ky)}$, which equals

$$-\frac{1}{2t}\left(p + q\left(2t(j^2 + k^2) - 1\right)\right) e^{-t(j^2+k^2)} \cdot L^n_{j,k} = \Psi \cdot L^n_{j,k}. \quad (20)$$

Consequently, after dividing by $L^n_{j,k}$ ($\neq 0$!), Eq. (19) reduces to

$$\xi - 1 = \Delta t \cdot \Psi \quad (21)$$

For stability we require $|\xi| \leq 1$, so $|\Delta t \cdot \Psi + 1| \leq 1$. The minimum of $\Psi$ is obtained from $\partial_j \Psi = 0$, $\partial_k \Psi = 0$, i.e. $t = \frac{-p + 3q}{2q(j^2 + k^2)}$, yielding the value $\Psi_{\min} = \frac{-q}{t}\, e^{\frac{p - 3q}{2q}}$. For the maximum step size we find $\xi_{\max} = \frac{2t}{q}\, e^{-\frac{p - 3q}{2q}}$. Obviously, as the implementation is based on the solution of the heat equation, the maximum step size is limited by the case $p = q = 1$, i.e. $\xi_{\max} \leq 2te$. So for the $L_{ww}$ flow ($p = 0$, $q = 1$) the step size $2te^{3/2}$ would yield instabilities. Secondly, for the $L_{vv}$ flow ($p = 1$, $q = 0$), $\Psi$ reduces to $-\frac{e^{-t(j^2+k^2)}}{2t}\, p$. The minimum is obtained at $(j, k) = (0, 0)$, which obviously makes no sense, as the Von Neumann solution then simplifies to $\xi^n$. We can therefore assume $j^2 + k^2 \geq 1$. Then $\Psi_{\min} = \frac{-1}{2t} e^{-t}$ and the maximum step size is $\min\{4te^t, 2te\}$.
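A hedged sketch of the scheme of Eq. (19): build the kernel of Eq. (18) once and apply explicit Euler steps; the kernel radius, scale t and step size are placeholder choices that must respect the bounds derived above.

```python
import numpy as np
from scipy.signal import fftconvolve

def gauge_kernel(p, q, t, radius=None):
    """Kernel of Eq. (18): ((q(x^2+y^2) - 2t(p+q)) / (4 t^2)) * G, with sigma^2 = 2t."""
    if radius is None:
        radius = int(4.0 * np.sqrt(2.0 * t)) + 1       # roughly 4 sigma
    x, y = np.meshgrid(np.arange(-radius, radius + 1),
                       np.arange(-radius, radius + 1))
    G = np.exp(-(x**2 + y**2) / (4.0 * t)) / (4.0 * np.pi * t)
    return (q * (x**2 + y**2) - 2.0 * t * (p + q)) / (4.0 * t**2) * G

def evolve(L, p, q, t=0.32, dt=1.0, steps=10):
    """Explicit Euler steps of Eq. (19) on an image L."""
    K = gauge_kernel(p, q, t)
    for _ in range(steps):
        L = L + dt * fftconvolve(L, K, mode='same')
    return L
```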
3.1 An Alternative Approach
Niessen et al. [15, p. 196] used $\frac{\nabla L}{\|\nabla L\|} = (\cos\theta, \sin\theta)$ to derive a maximal time step of $2et$ for the $L_{vv}$ flow. Here we follow their line of reasoning for the more general Eq. (5). Firstly, the derivatives become

$$G_{vv} = \cos^2(\theta) \ast G_{yy} + \sin^2(\theta) \ast G_{xx} - 2\sin(\theta)\cos(\theta) \ast G_{xy} \quad (22)$$

$$G_{ww} = \cos^2(\theta) \ast G_{xx} + \sin^2(\theta) \ast G_{yy} + 2\sin(\theta)\cos(\theta) \ast G_{xy}. \quad (23)$$

Strictly, the Von Neumann stability analysis is only suitable for linear differential equations with constant coefficients. However, we can apply it to equations with variable coefficients by introducing new constant coefficients equal to the frozen values of the original ones at some specific point of interest and testing the modified problem instead. Let $\theta'$ denote $\theta^n_{j,k}$. We then find:

$$G_{vv} = \cos^2(\theta')\,\frac{y^2 - 2t}{4t^2} + \sin^2(\theta')\,\frac{x^2 - 2t}{4t^2} - 2\sin(\theta')\cos(\theta')\,\frac{xy}{4t^2} \quad (24)$$

$$G_{ww} = \cos^2(\theta')\,\frac{x^2 - 2t}{4t^2} + \sin^2(\theta')\,\frac{y^2 - 2t}{4t^2} + 2\sin(\theta')\cos(\theta')\,\frac{xy}{4t^2}. \quad (25)$$

Numerically, with $L^n_{j,k}$ as above, we derive

$$(L^n_{j,k} \ast G)_{vv} = (j\sin(\theta') - k\cos(\theta'))^2\, e^{-(j^2+k^2)t}\, L^n_{j,k} \quad (26)$$

$$(L^n_{j,k} \ast G)_{ww} = (j\cos(\theta') + k\sin(\theta'))^2\, e^{-(j^2+k^2)t}\, L^n_{j,k}. \quad (27)$$

Since

$$p\,(j\sin(\theta') - k\cos(\theta'))^2 + q\,(j\cos(\theta') + k\sin(\theta'))^2 \leq \max(p, q)(j^2 + k^2) \quad (28)$$

we derive for the stability criterion

$$\xi = 1 - \Delta t\, \max(p, q)(j^2 + k^2)\, e^{-(j^2+k^2)t} \quad (29)$$

where again the optimum is obtained over $s = j^2 + k^2$, yielding $\Delta t \leq \frac{2et}{\max(p, q)}$. This derivation holds for all points $L^n_{j,k}$, and we find the same stability criterion for the $L_{ww}$ and $L_{vv}$ flows and for Gaussian blurring.
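The step-size bounds can be collected in a small helper (our convenience, directly transcribing the formulas above):

```python
import numpy as np

def max_step_gauge(p, q, t):
    """Bound from Sect. 3 for q > 0: (2t/q) exp(-(p-3q)/(2q)), capped by 2et."""
    return min(2.0 * t / q * np.exp(-(p - 3.0 * q) / (2.0 * q)), 2.0 * np.e * t)

def max_step_frozen(p, q, t):
    """Bound from the frozen-coefficient analysis (Eq. 29): 2et / max(p, q)."""
    return 2.0 * np.e * t / max(p, q)
```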
4 Results
Figure 1 shows two standard shapes used to evaluate the given numerical recipes. Firstly, the results of applying 10 time steps of a finite difference scheme are shown in Figure 2. Clear artifacts can be seen at the corners, due to the directional preference of the first order derivatives. The corner behaving "well" under the $L_{vv}$ flow behaves "badly" under the $L_{ww}$ flow, and vice versa. Secondly, the Gaussian derivative implementations for the disk and square are shown in Figures 3-4. The scale is chosen as $\sigma = 0.8$, so $\Delta t = 2et = 2e\cdot\frac{1}{2}\sigma^2 = 1.74$. The predicted critical scale for the $L_{vv}$ flow is $4te^t = 1.28\, e^{0.32} = 1.76$. One clearly
Fig. 1. Disk and square with values 0, 1, with uniform random noise on (0,1), and the results of a Gaussian filter at σ = √128, i.e. t = 64
Fig. 2. Results of 10 time steps in a finite difference scheme for, from left to right, the Lvv flow for the noisy disk and square, and the Lww flow for these images
[Fig. 3 panels: time steps 1.83, 1.78, 1.73, 1.68, 1.64 for each of the three flows]
Fig. 3. Results of the noisy disk for $L_{vv}$ flow (top row), Gaussian flow (middle row), and $L_{ww}$ flow (bottom row), for various time step ranges around the critical value $\Delta t = 2et = 2e\cdot\frac{1}{2}\sigma^2 = 1.74$
timestep 1.77778
timestep 1.72973
timestep 1.68421
timestep 1.64103
timestep 1.82857
timestep 1.77778
timestep 1.72973
timestep 1.68421
timestep 1.64103
timestep 1.82857
timestep 1.77778
timestep 1.72973
timestep 1.68421
timestep 1.64103
Fig. 4. Results of the noisy square for $L_{vv}$ flow (top row), Gaussian flow (middle row), and $L_{ww}$ flow (bottom row), for various time step ranges around the critical value $\Delta t = 2et = 2e\cdot\frac{1}{2}\sigma^2 = 1.74$
sees the change around the critical values. Since a relatively large amount of noise is added, the observed value is a bit lower than the predicted one. If too large a time step is taken, instability artifacts are visible: for the $L_{vv}$ flow the results become peaky, the $L_{ww}$ flow shows ringing, and the Gaussian blurring is completely disastrous. Note that the rounding effect for the $L_{vv}$ flow and the peaky results for the $L_{ww}$ flow are intrinsic to these flows.
Fig. 5. Original image and a noisy one, σ = 20
Fig. 6. Geometrical evolution of $L_t = pL_{vv} + qL_{ww}$ for several values of $p$ and $q$. The noise variance σ is set to 20. The result satisfies the noise constraint up to an error of $10^{-7}$.
To see the effect of Lt = pLvv + qLww for several values of p and q, Figure 5 is used. The result of applying the Gaussian derivatives implementation in 50 time steps is shown in Figure 6 (with noise constraint) and Figure 7 (without one). As one can see in Figure 6, the choice of p and q enables one to steer between
Fig. 7. Geometrical evolution of Lt = pLvv + qLww for several values of p and q. There is no constraint. For negative p there are spiky artifacts, for positive ones there is blurring. For negative q one sees the edges.
denoising regions and deblurring around edges (where the artifacts occurred). The evolution converges within 50 time steps; the error in the constraint is of order $10^{-7}$. The unconstrained evolution shows spiky artifacts for $p \leq 0$, while $q < 0$ gives the edges. Note that for these values $\Psi$ may become negative and local stability problems may occur. The diagonal gives Gaussian (de)blurring. Visually, $q = 0$ gives the best result, although here the number of time steps heavily influences the results.
5 Summary and Discussion
We presented a line of approaches to evolve images that unifies existing methods in a general framework, via a weighted combination of second order derivatives in terms of gauge coordinates. The series incorporates the well-known Gaussian blurring, Mean Curvature Motion and Maximal Blurring. For the qualitative
behaviour, a solution of the series was derived and its properties were briefly mentioned. Relations with general diffusion equations were given. Quantitative results were obtained by a novel implementation and its stability was analysed. The practical results are visualised on artificial images to study the method in detail, and on a real-life image showing the expected qualitative behaviour. The examples showed that positive values for p and q are indeed necessary to guarantee numerical stability (Fig. 7). Theoretically, this relates to the fact that q < 0 implies deblurring, notoriously ill-posed and unstable. However, when a reasonable constraint is added, this deblurring is possible (Fig. 6). Choosing optimal values of p and q depends on the underlying image and is beyond the scope of this paper.
References 1. Koenderink, J.J.: The structure of images. Biological Cybernetics 50, 363–370 (1984) 2. Perona, P., Malik, J.: Scale space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence 12, 629–639 (1990) 3. Alvarez, L., Lions, P., Morel, J.: Image selective smoothing and edge detection by nonlinear diffusion. SIAM Journal on Numerical Analysis 29, 845–866 (1992) 4. Caselles, V., Morel, J.M., Sbert, C.: An axiomatic approach to image interpolation. IEEE Transactions on Image Processing 7, 376–386 (1996) 5. Bertalmio, M., Vese, L., Sapiro, G., Osher, S.: Simultaneous structure and texture image inpainting. IEEE Transactions on Image Processing 12, 882–889 (2003) 6. Haar Romeny, B.M.t.: Front-end vision and multi-scale image analysis. Kluwer Academic Publishers, Dordrecht, The Netherlands (2003) 7. Aubert, G., Kornprobst, P.: Mathematical Problems in Image Processing: Partial Differential Equations and the Calculus of Variations, 2nd edn. Springer, Heidelberg (2006) 8. Kornprobst, P., Deriche, R., Aubert, G.: Image coupling, restoration and enhancement via PDE’s. In: Proc. Int. Conf. on Image Processing, vol. 4, pp. 458–461 (1997) 9. Griffin, L.: Mean, median and mode filtering of images. Proceedings of the Royal Society Series A 456, 2995–3004 (2000) 10. Yezzi, A.: Modified curvature motion for image smoothing and enhancement. IEEE Transactions on Image Processing 7, 345–352 (1998) 11. Abramowitz, M., Stegun, I.A.: Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, 9th edn. Dover, New York (1972) 12. Weickert, J.A.: Anisotropic Diffusion in Image Processing. Teubner, Stuttgart (1998) 13. Aronsson, G.: On the partial differential equation u2x uxx + 2ux uy uxy + u2y uyy = 0. Arkiv f¨ ur Matematik 7, 395–425 (1968) 14. Kuijper, A.: p-laplacian driven image processing. In: ICIP 2007 (2007) 15. Niessen, W.J., ter Haar Romeny, B.M., Florack, L.M.J., Viergever, M.A.: A general framework for geometry-driven evolution equations. International Journal of Computer Vision 21, 187–205 (1997)
Automated Billboard Insertion in Video Hitesh Shah and Subhasis Chaudhuri Indian Institute of Technology Bombay, Powai, Mumbai 400 076, India
Abstract. The paper proposes an approach to superimpose virtual contents for advertising in an existing image sequence with no or minimal user interaction. Our approach automatically recognizes planar surfaces in the scene over which a billboard can be inserted for seamless display to the viewers. The planar surfaces are segmented in the image frame using a homography dependent scheme. In each of the segmented planar regions, a rectangle with the largest area is located to superimpose a billboard into the original image sequence. It can also provide a viewing index based on the occupancy of the virtual real estate for charging the advertiser.
1
Introduction
Recent developments in computer vision algorithms have paved the way for a novel set of applications in the field of augmented reality [1]. Among these, virtual advertising has gained considerable attention on account of its commercial implications. The objective of virtual advertising is to superimpose computer mediated advertising images or videos seamlessly into the original image sequence so as to give the appearance that the advertisement was part of the scene when the images were taken. It introduces possibilities to capitalize on the virtual space. Conventionally, augmentation of video or compositing has been done by skilled animators by painting 2D images onto each frame. This technique ensures that the final composite is visually credible, but is enormously expensive, and is also limited to relatively simple effects. Current state-of-art methods for introducing virtual advertising broadly fall into three categories. The first category consists of approaches which utilize pattern recognition techniques to track the regions over which the advertisement is to be placed. Patent [2] is an example of such an approach. It depends on human assistance to initially locate the region for placement of billboard which is tracked in subsequent frames using a Burt pyramid. The approaches in this category face problems when the region leaves the field of view and later reappears, requiring complete and accurate re-initialization. Medioni et al. [3] present an interesting approach which addresses this issue, but the approach is limited to substitution of billboards. The second category comprises of the methods which require access to the scene and/or to the equipment prior to filming like in [4,5,6,7]. In these approaches special markers are set up in the scene to identify the places for future billboard insertions. They may also require physical sensors on the camera to track the changes in the view (pan, Y. Yagi et al. (Eds.): ACCV 2007, Part I, LNCS 4843, pp. 240–250, 2007. c Springer-Verlag Berlin Heidelberg 2007
tilt, zoom). However, this renders such approaches incapable of augmenting an existing video. The final category assumes the knowledge about the structure and geometry of the scene, e.g. the system proposed in patent [8] depends on a landmark model describing the set of natural landmarks in a given scene. Similarly, the techniques proposed in [9,10] assume the image sequence to be of a soccer and tennis game, respectively to make the best use of scene property. This makes such a solution very case specific. As opposed to the above methods, the proposed approach automatically locates a possible region for superimposing a billboard and does not require access or strong assumptions on the structure of the scene or equipment. Our approach exploits the inherent constraints introduced due to requirement of mapping of a planar surface (billboard) onto a planar surface (for e.g. wall) in this particular context. Further, we provide a viewing index for the price fixation for the advertisement.
2
Problem Formulation
An arbitrary sequence of n images I1 , ..., In of a scene and the billboard(s) to be placed are given. We need to automatically locate dominant planar regions in the scene and superimpose geometrically corrected billboards over each of the planar regions in each image frame (Figure 1). The scene is assumed to have textured and planar dominant regions which are not occluded in majority of the frames in the sequence. As indoor scene with walls, outdoor scenes of buildings or pre-existing physical billboards are the target for the approach, the requirement of dominant planar region is not at all restrictive.
Fig. 1. Illustration of virtual billboard insertion: (a) is a frame from an input image sequence. The frames after augmentation by placing a virtual advertisement at a planar region in the scene are shown in (b) and (c).
3
Proposed Approach
Our approach consists of three stages: image analysis, reconstruction of a quadrilateral in 3D space and image synthesis. Image analysis stage is responsible for
finding and segmenting the planar surfaces in the scene. It consists of weak calibration, plane fitting, planar segmentation and then locating the largest rectangular area on each of the segmented regions. Back projecting each rectangle on the corresponding planar surface, a projective reconstruction of a quadrilateral in 3D space is obtained. The image synthesis stage maps texture on the quadrilateral with that of the required billboard and performs augmentation by projecting them on each of the given image frames. It also calculates the viewing index for each billboard inserted in the image sequence as a measure of price to be paid by the sponsor. 3.1
Weak Calibration
In weak calibration the structure and motion will be determined up to an arbitrary projective transformation. For this, the interest points in the image sequence, obtained by Harris corner detector are tracked over all the frames using normalized correlation based matching. The tracked interest points are then utilized to solve for the projection matrices P1 , ..., Pn corresponding to each image frame and for recovering the 3D positions of the interest points X1 , ..., Xm with projective ambiguity as explained in Beardsley et al. [11] or by Triggs [12]. In our approach the projection matrices and the recovered positions of the interest points are used to evaluate the homography between image frames. As co-planarity of points is preserved under any projective transformation, a projective reconstruction suffices; updating the reconstruction to affine or Euclidean is not needed to deal with the planar regions. 3.2
Plane Fitting
For recovering the planar surface in the scene, interest points X1 , ..., Xm are divided on the basis of the plane they support. Thus from a point cloud in 3D space, points which are coplanar are to be identified and grouped. Hough transform and RANdom SAmple Consensus (RANSAC) [13] are powerful tools to detect specified geometrical structures among a cluster of data points. However, any one of them when used individually has the following limitations. Accuracy of the parameters recovered using Hough transform is dependent on the bin size. To obtain higher accuracy the bin size has to be smaller implying a large number of bins and thus is computationally more expensive. RANSAC, on the other hand requires many iterations when the fraction of outliers is high and trying all possible combinations can be also computationally expensive. It is able to calculate the parameters for the plane with higher accuracy in reasonable time when it is to fit one instance of the model albeit with a few outliers in the data points, as in our case there might be multiple instances of the model, i.e. plane, in the data it performs poorly on its own. To overcome the above limitations, a Hough transform followed by RANSAC on the partitioned data is adopted for recognizing planes. In the first stage Hough transform with a coarse bin size is utilized to obtain the parameters of the planes. These parameters are then utilized to partition the input points
X1 , ..., Xm into subsets of points belonging to individual planar regions. Each one of these subset of points support a plane whose parameters are calculated using the Hough transform. Note that there will be a number of points which cannot be fit to a planar surface and they should be discarded. Each subset of data forms the input to the RANSAC algorithm which then fits a plane to recover the accurate parameters for the plane. Such an approach can efficiently calculate the equations of planes fitting the data points. Thus at the end of plane fitting operation, equations of the dominant planes in the scene are obtained. In the following subsections we explain the details of Hough transform and RANSAC method used in this study. Data Partitioning. A plane P in XY Z space can be expressed with the following equation: ρ = xsinθcosφ + ysinθsinφ + zcosθ
(1)
where (ρ, θ, φ) helps define a vector from the origin to the nearest point on the plane. This vector is perpendicular to the plane. Thus under the Hough transform each plane in XY Z space is represented by a point in (ρ, θ, φ) parameter space. All the planes passing through a particular point B(xb , yb , zb ) in XY Z space can be expressed with the following equation from eq. (1) ρ = xb sinθcosφ + yb sinθsinφ + zb cosθ.
(2)
Accordingly all the planes that pass through the point B(xb , yb , zb ) can be expressed with a curved surface described by the eq. (2) in (ρ, θ, φ) space. A three dimensional histogram in (ρ, θ, φ) space is set up to find the planes to which a group of 3D data points belong. For each 3D data point B(xb , yb , zb ) ∈ Xi , all histogram bins that the curved surface passes through are incremented. To obtain the parameters of a particular plane a search for the local maxima in the (ρ, θ, φ) space is performed. The top k significant local maxima are obtained in the (ρ, θ, φ) space the input point cloud is divided into k + 1 subsets, each containing the points that satisfy the plane eq. (1) with a certain tolerance. The last subset contains points that do not fit into any of the above k planes. Accurate plane fitting is carried out on each set using RANSAC as explained in the next section. Plane Fitting Using RANSAC. The basic idea of RANSAC method is to compute the free parameters of the model from an adequate number of randomly selected samples. Then all samples vote whether they agree with the proposed hypothesis. This process is repeated until a sufficiently broad consensus is achieved. The major advantage of this approach is its ability to ignore outliers without explicit handling. We proceed as follows to detect a plane in the subsets of points obtained using the previous step: – Choose a candidate plane by randomly drawing three samples from the set. – The consensus on this candidate is measured.
– Iterate the above steps.
– The candidate having the highest consensus is selected.
At the end of the above iterations, equations for k planes π1, ..., πk, corresponding to each subset, are obtained.
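As an illustration, a minimal NumPy sketch of this RANSAC stage, applied to one subset of 3D points produced by the coarse Hough partitioning, is given below. The iteration count and inlier tolerance are assumed values, not taken from the paper.

import numpy as np

def plane_from_points(p1, p2, p3):
    # Plane through three points, as a unit normal n and offset d with n . x = d.
    n = np.cross(p2 - p1, p3 - p1)
    if np.linalg.norm(n) < 1e-12:
        return None
    n = n / np.linalg.norm(n)
    return n, float(n @ p1)

def ransac_plane(points, iters=500, tol=0.01, seed=0):
    # points: (N, 3) array, one Hough-partitioned subset.
    rng = np.random.default_rng(seed)
    best_model, best_count = None, 0
    for _ in range(iters):
        candidate = plane_from_points(*points[rng.choice(len(points), 3, replace=False)])
        if candidate is None:
            continue
        n, d = candidate
        count = int(np.sum(np.abs(points @ n - d) < tol))   # consensus vote
        if count > best_count:
            best_model, best_count = (n, d), count
    return best_model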
3.3
Segmentation of Planar Regions
Having estimated the dominant planar structures in the scene, we now need to segment these regions irrespective of its texture. For a given plane πi , the image frame in which the minimum foreshortening of the plane occurs is selected as the reference image Iref . This ensures maximum visibility of the region on the plane in the image Iref . When any other image frame Iother from the sequence is mapped onto the reference image using a homography for the plane πi , the region of the image frame Iother on the plane πi is pixel aligned with the region of the image on the plane πi in Iref and the rest of it gets misaligned due to non belongingness to the selected plane πi . Figure 2(b) shows the resulting image obtained by applying homography (calculated for the plane coincident with the top of the box) to images in the sequences and then taking the average color at each pixel location over all back projected frames. The pixels in the region on the top of the box in the image frames projected using homography are aligned well with the region of the box top in Iref . Thus in the averaged image the top of the box appears sharp in contrast to the surrounding region. Hence to segment the region on the plane πi in image Iref , at each pixel location a set of color is obtained by mapping equally time spaced image frame (for e.g., every 10th frame) in the sequence using their respective homography for the plane πi . Homography calculation is explained in Appendix A. For each pixel in image Iref lying on the plane, variance of image texture for all re-projected points in the scene at this pixel will be very small as compared to pixels which are not on this planar region due to misalignment. Hence pixel wise variance over all re-projected image frames can be used as a measure to segment the regions in the image on the plane. For each pixel in the image Iref the variance of the above set is calculated and compared against a threshold to obtain a binary segmentation
Fig. 2. Illustration of homography based segmentation: (a) reference image to be segmented. (b) is obtained by projecting and averaging all the frames in the sequence on (a). Notice the extensive blurring of the region not coplanar to the top of the box. (c) Variance measured at each pixel (white represents larger variance). (d) Segmented planar region obtained after performing thresholding on the basis of variance.
Fig. 3. (a), (b) and (c) are the augmented image frames where separate billboards have been placed on two dominant planes which were automatically identified by the approach
of the reference image for the particular planar region. Figure 2(c) represents the per pixel variance of the image frame in figure 2(a). It can be readily seen that the variance of the pixels on the top of the box is lower than that of the surrounding region, which appears white. Finally, figure 2(d) is the segmented planar region obtained by thresholding the variance image. There may be occasional holes in the segmented region, as seen in figure 2(d). Such small regions are filled up using a morphological closing operation. Regions corresponding to each of the planes π1, ..., πk are obtained similarly.
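The segmentation step can be summarized by the following sketch, assuming OpenCV is available. The plane-induced homographies (one per sampled frame, computed as in Appendix A) and the variance threshold are taken as given; the function name and parameter values are illustrative, not the authors' implementation.

import numpy as np
import cv2

def segment_plane_region(ref, frames, homographies, var_thresh=60.0):
    # ref: reference image I_ref (color); frames: equally spaced frames I_other;
    # homographies: 3x3 matrices mapping each frame onto I_ref for the plane pi_i.
    h, w = ref.shape[:2]
    warped = [cv2.warpPerspective(f, H, (w, h)) for f, H in zip(frames, homographies)]
    stack = np.stack([ref] + warped).astype(np.float32)
    # Pixels on the plane stay aligned after warping, so their per-pixel variance is low.
    variance = stack.var(axis=0).mean(axis=-1)        # average over the color channels
    mask = (variance < var_thresh).astype(np.uint8) * 255
    # Fill occasional holes with a morphological closing, as described above.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    return cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)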
3.4
Billboard Placement and Augmentation
Having obtained the segmented regions corresponding to each of the dominant planes, the largest inscribed rectangular area within each of them is located using a dynamic programming approach. Billboards are usually rectangular in shape and are horizontally (or vertically) oriented. Hence we try to fit the largest virtual real estate possible in the segmented region. In the absence of any extrinsic camera calibration, it is assumed that the reference frame is vertically aligned. The end points of these rectangles are back-projected, using the projection matrix of the corresponding reference image as explained in Appendix B, onto the corresponding plane to obtain a quadrilateral in 3D space. Each quadrilateral represents a possible planar region for insertion of a billboard in the 3D projective space. The quadrilaterals can be texture mapped [14] with the required advertising material and then projected onto each of the image frames in the sequence using the respective projection matrices. To reduce aliasing artifacts and to increase rendering speed, mipmapping [15] is used for texture mapping. 3.5
Calculation of the Viewing Index
Total viewing index can also be calculated during augmentation for each billboard inserted in the video. The total viewing index for a billboard is directly proportional to the amount of time the billboard is on the screen and is equal to the sum of the viewing index calculated per frame. The per frame viewing index
Fig. 4. (a) Calculated viewing index for billboard on the top and front of the box for each frame. (b) & (c) are the frames with highest viewing indices (encircled in (a)) for billboard on the front and the top, respectively.
depends on the amount of area the billboard is occupying in the image frame as well as the part of the image frame where it appears, i.e. top, middle, bottom, corners as the location matters for advertising purposes. The total viewing index for a particular billboard reflects roughly the amount of impact the billboard has on the viewer. Thus, it can be utilized to develop a fair pricing policy for the sponsor of the advertisement billboard. To calculate viewing index per frame each pixel Pi,j in the image frame is assigned a weight
Weight(Pi,j) = (1 / (2π σx σy)) exp( −(1/2) ( (μx − i)^2 / σx^2 + (μy − j)^2 / σy^2 ) )   (3)
where μx = (height of frame)/2, μy = (width of frame)/2, σx = (height of frame)/6, σy = (width of frame)/6. This selection of parameters assigns higher weights to the pixels in the center of the frame, and the weight slowly decreases as the pixels move away from the center. Also, the selection of σx and σy ascertains that the sum of the weights of all the pixels in a frame is almost equal to unity. The viewing index for the billboard in a frame is then equal to the sum of the weights of each pixel over which it is projected. This ensures that the viewing index is directly proportional to the area of the frame occupied by the billboard and also to the billboard's position in the frame. Figure 4 shows the computed viewing indices for the billboard on the top and front sides of the box calculated in the above manner. Using the viewing index per frame, a total viewing index for a particular billboard can be calculated by summing the viewing index in each frame. It can be observed that the billboard on the top of the box has a higher total viewing index as compared to the one
on the front side. Hence the sponsor of the billboard on the top can be charged relatively higher to account for the larger occupancy of prime virtual real estate.
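A short sketch of the viewing-index computation of Eq. (3) is given below; the billboard mask (the set of pixels the projected billboard covers in a frame) is assumed to be available from the augmentation step, and the function names are illustrative.

import numpy as np

def frame_weights(height, width):
    # Per-pixel weights of Eq. (3): a 2D Gaussian centred on the frame,
    # with sigma = extent / 6 so that the weights sum to roughly one.
    mu_x, mu_y = height / 2.0, width / 2.0
    sg_x, sg_y = height / 6.0, width / 6.0
    i = np.arange(height)[:, None]
    j = np.arange(width)[None, :]
    g = np.exp(-0.5 * (((mu_x - i) / sg_x) ** 2 + ((mu_y - j) / sg_y) ** 2))
    return g / (2.0 * np.pi * sg_x * sg_y)

def viewing_index(billboard_mask):
    # billboard_mask: boolean image, True where the billboard is projected.
    return float(frame_weights(*billboard_mask.shape)[billboard_mask].sum())

# Total index for one billboard: sum of viewing_index over all frames of the sequence.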
4
Experimental Results
The proposed approach has been implemented in MATLAB and the image sequences used for tests have been captured using a hand held camera. Each of the image sequences had 300-700 frames. In our experimentation we used factorization method proposed in [12] for weak calibration. Figure 1 shows the qualitative result of our approach on two image sequences. In the first sequence one dominant plane was detected corresponding to the wall whereas in the second sequence two dominant planes, top and front of the box, were located. Figure 1 (b,c,e,f) are resulting frames with billboard added on one plane and figure 3 shows the altered image frames with separate billboard over the two dominant planes generated by the proposed approach. Videos captured using a mobile phone camera are difficult to augment using existing approaches due to inherent jitters, low frame rate and less spatial resolution. However, the proposed approach is able to insert billboards seamlessly into such videos also. Few results obtained by augmenting videos captured using a mobile phone camera are shown in figure 5.
Fig. 5. Augmented frames from three distinct videos captured using a mobile phone camera are shown in (a,b,c), (d,e,f) respectively. In each of the video one dominant planar region is identified and augmented with a billboard.
5
Conclusion
In this paper we have presented an automated approach for locating planar regions in a scene. The planar regions recovered are used to augment the image sequence with a billboard which may be a still image or a video. One of the
possibilities for the application of the approach is to use it in conjunction with the set top box. Before transmission the video is analyzed for the planar regions in the scene. Information about the identified planar regions is stored as meta data in each frame. At the receiving end before display the set top box can augment each of the image frames with billboards in real time. The billboards in this case may be adaptively selected by the set top box depending on the viewer habits learned by it or the video being shown, e.g. a health video may be augmented with a health equipment advertisement. While, evaluating the results of the current experimentation it is observed that the placement of the billboard is geometrically correct in each of the image frame. No significant drift or jitter has been observed. However, there may be photometric mismatches for the inserted billboard with its surrounding. We are currently looking into the photometric issues related to illumination, shadow and focus correction of the augmented billboard.
References 1. Azuma, R., Baillot, Y., Behringer, R., Feiner, S., Julier, S., MacIntyre, B.: Recent advances in augmented reality. IEEE Comput. Graph. Appl. 21(6), 34–47 (2001) 2. Rosser, R.J., Leach, M.: Television displays having selected inserted indicia. In: US Patent 5,264,933 (2001) 3. Medioni, G., Guy, G., Rom, H., Francois, A.: Real-time billboard substitution in a video stream. In: Proceedings of the 10th Tyrrhenian International Workshop on Digital Communications (1998) 4. Rosser, R., Tan, Y., Kennedy Jr., H., Jeffers, J., DiCicco, D., Gong, X.: Image insertion in video streams using a combination of physical sensors and pattern recognition. In: US Patent 6,100,925 (2000) 5. Wilf, I., Sharir, A., Tamir, M.: Method and apparatus for automatic electronic replacement of billboards in a video image. In: US Patent 6,208,386 (2001) 6. Gloudemans, J.R., Cavallaro, R.H., Honey, S.K., White, M.S.: Blending a graphic. In: US Patent 6,229,550 (2001) 7. Bruno, P., Medioni, G.G., Grimaud, J.J.: Midlink virtual insertion system. In: US Patent 6,525,780 (2003) 8. DiCicco, D.S., Fant, K.: System and method for inserting static and dynamic images into a live video broadcast. In: US Patent 5,892,554 (1999) 9. Xu, C., Wan, K., Bui, S.H., Tian, Q.: Implanting virtual advertisement into broadcast soccer video. In: Advances in Multimedia Information Processing - PCM, vol. 2, pp. 264–271 (2004) 10. Tien, S.C., Chia, T.L.: A fast method for virtual advertising based on geometric invariant-a tennis match case. In: Proc. of Conference on Computer Vision, Graphics, and Image Processing (2001) 11. Beardsley, P.A., Zisserman, A., Murray, D.W.: Sequential updating of projective and affine structure from motion. Int. J. Comput. Vision 23(3), 235–259 (1997) 12. Triggs, B.: Factorization methods for projective structure and motion. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 845–851. IEEE Computer Society Press, San Francisco, California, USA (1996) 13. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24(6), 381–395 (1981)
14. Foley, J.D., van Dam, A., Feiner, S.K., Hughes, J.F.: Computer graphics: principles and practice. Addison-Wesley Longman Publishing Co. Inc., USA (1996) 15. Williams, L.: Pyramidal parametrics. In: SIGGRAPH 1983, pp. 1–11. ACM Press, New York (1983) 16. Hartley, R., Zisserman, A.: Multiple view geometry in computer vision. Cambridge University Press, New York, USA (2000)
A
Homography Calculation
Consider images Im and In with projection matrices given by Pm and Pn , respectively. Let A be any point on the plane π which projects to am and an on the images Im and In , respectively, thus πT A = 0 am = Pm A
(4) (5)
an = Pn A
(6)
where π is a 4x1 vector and A is a 4x1 homogeneous representation of the point. Due to eq. (4), A lies in the null space (NS) of π^T:
A = NS(π^T) C   (7)
where C are the coordinates of A with respect to the basis of the null space of π^T. From eq. (7) and eq. (6),
C = [Pn NS(π^T)]^(−1) an.   (8)
Using eq. (5), eq. (7) and eq. (8),
am = Pm NS(π^T) [Pn NS(π^T)]^(−1) an   (9)
am = Hmn an   (10)
where Hmn = Pm NS(π^T) [Pn NS(π^T)]^(−1) is a 3x3 homography matrix mapping an to am.
B
Back-Projection of a Point on a Plane
Let x be the homogenous representation of the point in the image, which is to back-projected on the plane. Let P be a 3x4 projection matrix of the image. It can be written as
P = [M |p4 ]
(11)
where M is a 3x3 matrix consisting of the first three columns and p4 is the last column of P. As per [16], the camera center C for the image is given by
C = [ (−M^(−1) p4)^T  1 ]^T   (12)
and if D is the point at infinity in the direction of the ray from C passing through x, then
D = [ (M^(−1) x)^T  0 ]^T   (13)
All the points on the line from C and D can be expressed parametrically by X(t) = C + tD
(14)
Let π = [ a b c d ]^T be the equation of the plane onto which the point x is to be back-projected. Thus for a point with homogeneous representation Y on the plane,
π^T Y = 0   (15)
It can also be written as π = [ z^T d ]^T, where z = [ a b c ]^T. The back-projected point is on the line and on the plane. Thus, using eq. (14) and eq. (15),
π^T X(t) = π^T C + t π^T D
0 = [ z^T  d ] [ (−M^(−1) p4)^T  1 ]^T + t [ z^T  d ] [ (M^(−1) x)^T  0 ]^T
t z^T M^(−1) x = z^T M^(−1) p4 − d
t = (z^T M^(−1) p4 − d) / (z^T M^(−1) x)
Thus the back-projection of a point x on a plane is given by
X* = [ (−M^(−1) p4)^T  1 ]^T + ( (z^T M^(−1) p4 − d) / (z^T M^(−1) x) ) [ (M^(−1) x)^T  0 ]^T .
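The two appendices translate directly into a few lines of NumPy; the sketch below is an illustration only (the null space is obtained via an SVD, degenerate configurations are not handled, and the function names are hypothetical).

import numpy as np

def null_space(pi):
    # 4x3 basis of the null space of pi^T, for a plane pi = (a, b, c, d).
    _, _, vt = np.linalg.svd(pi.reshape(1, 4))
    return vt[1:].T

def plane_homography(Pm, Pn, pi):
    # Hmn of eq. (10): maps image points of plane pi from image n to image m.
    N = null_space(np.asarray(pi, dtype=float))
    return Pm @ N @ np.linalg.inv(Pn @ N)

def back_project(x, P, pi):
    # Back-projection of homogeneous image point x onto plane pi (Appendix B).
    M, p4 = P[:, :3], P[:, 3]
    z, d = np.asarray(pi[:3], dtype=float), float(pi[3])
    Minv = np.linalg.inv(M)
    C = np.append(-Minv @ p4, 1.0)          # camera centre, eq. (12)
    D = np.append(Minv @ x, 0.0)            # point at infinity along the ray, eq. (13)
    t = (z @ Minv @ p4 - d) / (z @ Minv @ x)
    return C + t * D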
Improved Background Mixture Models for Video Surveillance Applications Chris Poppe, Gaëtan Martens, Peter Lambert, and Rik Van de Walle Ghent University - IBBT Department of Electronics and Information Systems - Multimedia Lab Gaston Crommenlaan 8, B-9050 Ledeberg-Ghent, Belgium
Abstract. Background subtraction is a method commonly used to segment objects of interest in image sequences. By comparing new frames to a background model, regions of interest can be found. To cope with highly dynamic and complex environments, a mixture of several models has been proposed. This paper proposes an update of the popular Mixture of Gaussian Models technique. Experimental analysis shows a lack of this technique to cope with quick illumination changes. A different matching mechanism is proposed to improve the general robustness and a comparison with related work is given. Finally, experimental results are presented to show the gain of the updated technique, according to the standard scheme and the related techniques.
1
Introduction
The detection and segmentation of objects of interest in image sequences is the first processing step in many computer vision applications, such as visual surveillance, traffic monitoring, and semantic annotation. Since this is often the input for other modules in computer vision applications, it is desirable to achieve very high accuracy with the lowest possible false alarm rates. The detection of moving objects in dynamic scenes has been the subject of research for several years and different approaches exist [1]. One popular technique is background subtraction. During the surveillance of a scene, a background model is created and dynamically updated. Foreground objects are represented by the pixels that differ significantly from this background model. Many different models have been proposed for background subtraction, of which the Mixture of Gaussian Models (MGM) is one of the most popular [2]. However, there are a number of important problems when using background subtraction algorithms (quick illumination changes, initialization with moving objects, ghosts and shadows), as was reported in [3]. Sect. 2 elaborates on a number of techniques which improve the traditional MGM and try to deal with the above mentioned problems. This paper presents a new approach to deal with the problem of quick illumination changes (like clouds gradually changing the lighting conditions of the environment). We propose an updated matching mechanism for MGM. As such, Sect. 3 elaborates on the conventional mixture technique and its observed shortcomings. Subsequently, Sect. 4 shows the adjustments of the original scheme Y. Yagi et al. (Eds.): ACCV 2007, Part I, LNCS 4843, pp. 251–260, 2007. c Springer-Verlag Berlin Heidelberg 2007
and presents experimental results. Finally, some conclusions are formulated in Sect. 5.
2
Related Work
Toyama et al. discussed in detail several known problems when using background subtraction algorithms [4]. This list was adopted by Javed et al. and they selected a number of important problems which have not been addressed by most background subtraction algorithms [3]. Wu et al. gave a concise overview of background subtraction algorithms, of which they have chosen MGM to compare with their own technique [2]. They use a global gaussian mixture model, built upon a difference image between the current image and an estimated background. Although their system is better for localization and contour preserving, it is more sensitive to complex environmental movements (such as waving trees). Lee et al. improved MGM by introducing means to initialize the background models when moving objects are present in the environment [5]. They presented an online expectation maximization learning algorithm for training adaptive Gaussian mixtures. Their system allows to initialize the mixture models much faster then the original approach. Related to this topic, Zhang et al. presented an online background reconstruction method to cope with the initialization problem [6]. Additionally, they presented a change history map to control the foreground mergence time and make it independent of the learning rate. As such, they deal with the initialization problem and the problem of ghosts (background objects which are moved) but can not deal with quick illumination changes. In [3], Javed et al. presented a number of important problems when using background subtraction algorithms. Accordingly, they proposed a solution using pixel, region and frame level processing. They proposed a solution to the problem of the quick illumination changes, but their technique is based on a complex gradients-based algorithm. Conveniently, the paper does not provide any information about the additional processing times needed for this technique. Tian et al. [7] used a similar approach as the one used in [3]. They presented a texture similarity measure based on gradient vectors, obtained by the Sobel operator. A fixed window is used for the retrieval of the gradient vectors, which largely determines the performance (both in processing time and accuracy ) of their system. We present a conceptual simpler approach by extending the MGM algorithm. The results presented in Sect. 4.2 show that we obtain similar successes in coping with quick illumination changes. Numerous techniques have been proposed to deal with shadows. Since these are used to deal with the effects of the lighting of a scene, an evaluation is made of available shadow detection techniques to see how they can be used to manage the quick illumination changes. An interesting overview on the detection of moving shadows is given in [8]. Prati et al. divide the shadow detection techniques in four classes, of which the deterministic non-model based approach shows the
best results for the entire evaluation set used in the overview. Since the two critical requirements of a video surveillance system are accuracy and speed, not every shadow removal technique is appropriate. MGM is an advanced object detection technique, however, the maintenance of several models for each pixel is computational expensive. Therefore, every additional processing task should be minimal. Furthermore, MGM was created to cope with highly dynamic environments, with the only assumption being the static camera. According to these constraints and following the results presented by Prati et al., we have chosen the technique described in [9] for a comparison with our system. Results hereof are presented in Sect. 4.2.
3
Background Subtraction Using Mixture of Gaussian Models
3.1
The Mixture of Gaussian Models
MGM was first proposed by Stauffer and Grimson in [10]. It is a time-adaptive per-pixel subtraction technique in which every pixel is represented by a vector, called Ip, consisting of three color components (red, green, and blue). For every pixel a mixture of Gaussian distributions, which are the actual models, is maintained and each of these models is assigned a weight.
G(Ip, μp, Σp) = (1 / ((2π)^(n/2) |Σp|^(1/2))) exp( −(1/2) (Ip − μp)^T Σp^(−1) (Ip − μp) )
(1)
(1) depicts the formula for a Gaussian distribution G. The parameters are μp and Σp , which are the mean and covariance matrix of the distribution respectively. For computational simplicity, the covariance matrix is assumed to be diagonal. For every new pixel a matching, an update, and a decision step are executed. The new pixel value is compared with the models of the mixture. A pixel is matched if its value occurs inside a confidence interval within 2.5 standard deviations from the mean of the model. In that case, the parameters of the corresponding distribution are updated according to (2), (3), and (4). μp,t = (1 − ρ) μp,t−1 + ρ (Ip,t ) .
(2)
Σp,t = (1 − ρ) Σp,t−1 + ρ (Ip,t − μp,t ) (Ip,t − μp,t )T .
(3)
ρ = αG (Ip,t , μp,t−1 , Σp,t−1 ) .
(4)
α is the learning rate, which is a global parameter, and introduces a trade-off between fast adaptation and detection of slow moving objects. Each model has a weight, w, which is updated for every new image according to (5). wt = (1 − α) wt−1 + αMt .
(5)
If the corresponding model introduced a match, M is 1 , otherwise it is 0. Formulas (2) to (5) represent the update step. Finally, in the decision step, the models are sorted according to their weights. MGM assumes that background pixels occur more frequently then actual foreground pixels. For that reason a threshold based on these weights is used to define which models of the mixture depict background or foreground. Indeed, if a pixel value occurs recurrently, the weight of the corresponding model increases and it is assumed to be background. If no match is found with the current pixel value, then the model with the lowest weight is discarded and replaced by a normal distribution with a small weight, a mean equal to the current pixel value, and a large covariance. In this case the pixel is assumed to be a foreground pixel. 3.2
Problem Description
Fig. 1 shows the results of applying MGM to the PetsD2Tec2 sequence (with a resolution of 384x288) provided by IBM Research at several time points [11]. Black pixels depict the background, white pixels are assumed to be foreground. The figure shows a fragment of the scene being subject of changing illumination circumstances causing a repetitive increase of certain pixel values in a relatively short time period (about 30s). The illumination change results in relatively small differences between the pixel values of consecutive frames. However, the consistent nature of these differences causes the new pixel values to eventually exceed the acceptance range of the mixture models. This is because the acceptance decision is based on the difference with the average of the model, regardless the difference with the previous pixel value. The learning rate of MGM is typically very small (α is usually less then 0.01), so gradual changes spread over long periods (e.g., day turning into night) can be learned into the models. However, the small learning rate makes the adaptation of the current background models not quick enough to encompass the short consistent gradual changes. Consequently, these changes, which are hard to distinguish by the human eye, result in numerous false detections. The falsely detected regions can range from small regions of misclassified pixels to regions encompassing almost half of the image.
Fig. 1. MGM output during quick illumination change
4
Improved Background Mixture Models
4.1
Advanced MGM
MGM uses only the current pixel value and the mixture model in the matching, update and decision steps. The pixel values of the previous image are not stored since they are only used to update the models. We propose to make the technique aware of the immediate past by storing the previous pixel value (prevI) and the previously matched model number (prevM odel). The matching step is then altered according to the following pseudocode: If (Model == prevModel) If (|I - prevI| < cgc * stdev) Match = true; Else checkMatch(Model,I); Else checkMatch(Model,I); update(Model,I); decide(Model); If ((Match == true) and (Decision == background)) { prevModel = Model; prevI = I; } In the matching step for each pixel, it is checked if the pixel value matches with one of the models in the mixture. For the model which matched the pixel values in the previous frame, the difference between the previous and current pixel value is taken. If this difference is small enough, a match is immediately effectuated. Otherwise the normal matching step is executed. If the matched model is considered to represent part of the background, then the model number and the current pixel value are stored, otherwise they remain unchanged. This way, passing foreground objects do not affect the recent history. If a new pixel value differs slightly from the previously matched one, but would fall out of the matching range of the model, a different outcome, compared with the original algorithm, will be obtained. Since the normal matching process is dependent on the specific model, more specifically on the standard deviation, it is better to enforce this for the threshold as well. Therefore, we have chosen for a per pixel threshold dependent on the standard deviation. We introduce a new parameter, cgc (from Consistent Gradual Change), to control the threshold. In Fig. 2 we have recorded the number of detection failures and false alarms for several values of cgc for the PetsD1TeC2 sequence (another sequence from the IBM benchmark which shows similar situations for the consistent gradual changes). A manual ground
Fig. 2. ROC curve (detection failures vs. false alarms) for different values of cgc
truth annotation has been used to calculate the false positives and negatives. The average values over the entire sequence were then plotted in the curve to find the optimal value for the parameter. A cgc of about 1.8 gives the best results. Lower values result in too many false alarms, since many of the consistent gradual changes will not be dealt with then. If we increase the value of cgc, we see that the number of detection failures increases drastically; if the threshold is too high, too many foreground pixels will be mistaken for background pixels. Consequently, cgc = 1.8 is chosen and is further used in all experiments.
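For clarity, the modified matching step can be sketched as follows for a single pixel and a single frame. The model representation and the background-weight test are illustrative simplifications, and ρ is approximated by the learning rate α rather than by Eq. (4); none of these choices are prescribed by the paper.

import numpy as np

CGC, ALPHA = 1.8, 0.005          # cgc from the ROC analysis above; learning rate

def match_and_update(models, pixel, state):
    # models: list of dicts with 'mean', 'var', 'weight' (one Gaussian per entry);
    # state:  per-pixel memory holding 'prev_model' and 'prev_pixel'.
    matched = None
    k = state.get('prev_model')
    if k is not None and np.all(np.abs(pixel - state['prev_pixel'])
                                < CGC * np.sqrt(models[k]['var'])):
        matched = k                                      # consistent gradual change
    else:
        for i, m in enumerate(models):                   # standard MGM matching
            if np.all(np.abs(pixel - m['mean']) < 2.5 * np.sqrt(m['var'])):
                matched = i
                break
    for i, m in enumerate(models):                       # weight update, Eq. (5)
        m['weight'] = (1 - ALPHA) * m['weight'] + ALPHA * (i == matched)
    if matched is None:
        return False              # foreground: the weakest model is replaced elsewhere
    m = models[matched]
    m['mean'] = (1 - ALPHA) * m['mean'] + ALPHA * pixel
    m['var'] = (1 - ALPHA) * m['var'] + ALPHA * (pixel - m['mean']) ** 2
    background = m['weight'] > 0.25                      # illustrative weight threshold
    if background:
        state['prev_model'], state['prev_pixel'] = matched, pixel.astype(float)
    return background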
4.2
Experimental Results
We adopt the evaluation means of the related work [3,7] and compare the updated algorithm with the original scheme. The result of the proposed algorithm for the example frame of Fig. 1 is shown in Fig. 3. The left side shows the results of the original MGM, the right side shows the results of our system. The new matching process gives significantly less false positives, while it still detects foreground objects. In this case, no morphological post-processing has been applied, so further refinements can be done. Fig. 4 and 5 show a quantitative comparison of the regular MGM and the proposed scheme for the PetsD2TeC2 sequence (with a framerate of 30 frames per second). A manual ground truth annotation has been done for every 50th frame of the sequence. For each of these frames the ground truth annotation is matched with the output of the detection algorithms to find the amount of pixels which are erroneously considered to be foreground. As can be seen, a sudden increase occurs at the end of the video (which corresponds to a quick illumination change in the scene). We notice that
Fig. 3. Results of MGM and the proposed scheme
Fig. 4. False Positives for MGM, a shadow removal technique and our proposed system
the proposed scheme succeeds in dealing with the gradual lighting changes (frames 2100 to 2850) much better than the original scheme. The amount of false positives is largely reduced; in the best case we obtain a reduction of up to 95% of the false positives compared with the normal scheme. The figure also shows that the updated technique obtains the same results as the original technique in scenes without gradual changes (frames 0 to 2050). Fig. 5 shows the false negatives recorded during the sequence. Our updated algorithm gives only a slight increase in the number of false negatives. In Sect. 2 we elaborated on alternative techniques which also give a solution for the quick illumination change problem. These methods are based on complex region-level processing, whereas our technique is solely pixel-based. Javed et al. presented their results on their website (http://www.cs.ucf.edu/∼vision/projects/Knight/background.html). Fig. 6 shows a scene in which a sudden
Fig. 5. False Negatives for MGM, a shadow removal technique and our proposed system
Fig. 6. From left to right: captured image, MGM output, result of [3], result of proposed scheme
illumination change occurs. The second image is the output from MGM and the third is the result of the system of Javed et al. The fourth image is the output of our proposed system. As can be seen, our conceptually simpler approach achieves similar results in coping with the illumination changes. As discussed in Sect. 2, some shadow techniques might provide a solution for the problem of quick illumination changes. We have evaluated the technique described in [9]. This technique uses the HSV color space since this corresponds closely to the human perception of color. Since the hue of a pixel does not change significantly when a shadow is cast and the saturation is lowered in shadowed points, the HSV color space indeed looks interesting for shadow detection. Consequently, the decision process is based on the following equation:
Sp = ( α ≤ IpV / BgpV ≤ β ) ∧ ( IpS − BgpS ≤ τs ) ∧ ( IpH − BgpH ≤ τh )   (6)
In (6), Bgp are the pixel values for the current background model. If Sp = true the pixel is assumed to be covered by a shadow. α should be adjusted according
to the strength of the light source causing the shadows, β is needed to cope with certain aspects of noise, and τs and τh are thresholds which respectively decide how large the difference in saturation and hue can be. This technique is therefore vastly dependent on the actual environment, but works well for shadow detection if the individual parameters can be fine-tuned according to the scene. In highly dynamic scenes as discussed in this paper, this approach would not be optimal. The illumination changes, in our situation, can cause shadows, but will mostly result in the opposite effect; pixel values get lighter color values. Therefore, we use the adjusted formula (7) for the detection of the lighting change.
Sp = ( 1/β ≤ IpV / μpV ≤ 1/α ) ∧ ( IpS − μpS ≥ −τs ) ∧ ( IpH − μpH ≤ τh )   (7)
Fig. 4 and 5 also show the false positives and false negatives of the adjusted shadow removal technique (Shadow hsv), respectively. We see that the shadow detection results in fewer false positives than the original scheme, but it cannot manage the entire change. Moreover, there is a strong increase of the false negatives.
5
Conclusion
This paper presents an updated scheme for object detection using a Mixture of Gaussian Models. The original scheme has been discussed in more detail and the incapability of dealing with quick illumination changes has been detected. An update of the matching mechanism has been presented. Furthermore, a comparison has been made with existing relevant object detection techniques which are able to deal with the problem. Experimental results show that our algorithm has significant improvements compared with the standard scheme, while only introducing minor additional processing.
Acknowledgments The research activities that have been described in this paper were funded by Ghent University, the Interdisciplinary Institute for Broadband Technology (IBBT), the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT), the Fund for Scientific Research-Flanders(FWO-Flanders), the Belgian Federal Science Policy Office (BFSPO), and the European Union.
References 1. Dick, A., Brooks, M.J.: Issues in automated visual surveillance. In: Proceedings of International Conference on Digital Image Computing: Techniques and Applications, pp. 195–204 (2003)
2. Wu, J., Trivedi, M.: Performance Characterization for Gaussian Mixture Model Based Motion Detection Algorithms. In: Proceedings of the IEEE International Conference on Image Processing, pp. 97–100. IEEE Computer Society Press, Los Alamitos (2005) 3. Javed, O., Shafique, K., Shah, M.: A Hierarchical Approach to Robust Background Subtraction using Color and Gradient Information. In: Proceedings of the Workshop on Motion and Video Computing, pp. 22–27 (2002) 4. Toyama, K., Krumm, J., Brumitt, B., Meyers, B.: Wallflower: Principles and Practice of Background Maintenance. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 255–261. IEEE Computer Society Press, Los Alamitos (1999) 5. Lee, D.: Online Adaptive Gaussian Mixture Learning for Video Applications. Statistical Methods in Video Processing. LNCS, pp. 105–116 (2004) 6. Zhang, Y., Liang, Z., Hou, Z., Wang, H., Tan, M.: An Adaptive Mixture Gaussian Background Model with Online Background Reconstruction and Adjustable Foreground Mergence Time for Motion Segmentation. In: Proceedings of the IEEE International Conference on Industrial Technology, pp. 23–27. IEEE Computer Society Press, Los Alamitos (2005) 7. Tian, Y., Lu, M., Hampapur, A.: Robust and Efficient Foreground Analysis for Real-time Video Surveillance. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1182–1187. IEEE Computer Society Press, Los Alamitos (2005) 8. Prati, A., Mikic, I., Trivedi, M.M., Cucchiara, R.: Detecting Moving Shadows: Algorithms and Evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 918–923 (2003) 9. Cucchiara, R., Grana, C., Neri, G., Piccardi, M., Prati, A.: The Sakbot System for Moving Object Detection and Tracking. Video-Based Surveillance Systems Computer Vision and Distributed Processing, pp. 145–157 (2001) 10. Stauffer, C., Grimson, W.E.L.: Learning Patterns of Activity Using Real-Time Tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 747–757 (2000) 11. Brown, L.M., Senior, A.W., Tian, Y., Connell, J., Hampapur, A., Shu, C., Merkl, H., Lu, M.: Performance Evaluation of Surveillance Systems Under Varying Conditions. In: Proceedings of IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (2005) http://www.research.ibm.com/peoplevision/performanceevaluation.html
High Dynamic Range Scene Realization Using Two Complementary Images Ming-Chian Sung, Te-Hsun Wang, and Jenn-Jier James Lien Robotics Laboratory, Dept. of Computer Science and Information Engineering National Cheng Kung University, No. 1, Ta-Hsueh Road, Tainan, Taiwan {qwer,jjlien}@csie.ncku.edu.tw http://robotics.csie.ncku.edu.tw
Abstract. Many existing tone reproduction schemes are based on the use of a single high dynamic range (HDR) image and are therefore unable to accurately recover the local details and colors of the scene due to the limited information available. Accordingly, the current study develops a novel tone reproduction system which utilizes two images with different exposures to capture both the local details and color information of the low- and high-luminance regions of a scene. By computing the local region of each pixel, whose radius is determined via an iterative morphological erosion process, the proposed system implements a pixel-wise local tone mapping module which compresses the luminance range and enhances the local contrast in the low-exposure image. And a local color mapping module is applied to capture the precise color information from the high-exposure image. Subsequently, a fusion process is then performed to fuse the local tone mapping and color mapping results to generate highly realistic reproductions of HDR scenes. Keywords: High dynamic range, local tone mapping, local color mapping.
1 Introduction In general, a tone reproduction problem occurs when the dynamic range of a scene exceeds that of the recording or display device. This problem is typically resolved by applying some form of tone mapping technique, in which the high dynamic range (HDR) luminance of the real world is mapped to the low dynamic range (LDR) luminance of the display device. Various uniform (or global) tone mapping methods have been proposed [19], [21]. However, while these systems are reasonably successful in resolving the tone reproduction problem and avoid visual artifacts such as halos, the resulting images tend to lose the local details of the scene. By contrast, non-uniform (or local) tone mapping methods such as those presented in [1], [3], [4], [5], [7], [16] and [18] not only provide a good tone reproduction performance, but also preserve the finer details of the original scene. Such approaches typically mimic the human visual system by computing the local adaptation luminance in the scene. When computing the local adaptation luminance, the size of the local region is a crucial consideration and is generally estimated using some form of local contrast measure. Center-surround Y. Yagi et al. (Eds.): ACCV 2007, Part I, LNCS 4843, pp. 261–270, 2007. © Springer-Verlag Berlin Heidelberg 2007
functions such as the difference of Gaussian blurring images in [2] and [16] provides one means of estimating the size of this region. However, the local region size determined using this method is generally too local (or small) to reveal sufficient details. By contrast, piece-wise methods tend to improperly emphasize the local details. Furthermore, if the dynamic range of the real-world scene is very large, some of the image regions will be over-exposed, while some will be under-exposed, and hence the details in these regions will be lost. When processing such scenes using only a single image, the use of luminance compression techniques to recover the scene details achieves only limited success due to the lack of scene information available. Modern digital cameras invariably feature a built-in exposure bracketing functionality which allows a series of images to be obtained at different exposure levels via a single depression of the shutter release mechanism. This functionality is exploited to generate two simultaneous images of a high contrast scene with different levels of exposure such that the color and details of the scene can be accurately reproduced. Let the term IL refer to the low-exposure image, in which the brighter region is well-exposed, but the darker region is under-exposed. The brighter region contains good detail definition and has an abundance of color information. However, in the dark region of the image, the scene details are hidden and the true color of the scene cannot be recovered. Furthermore, let IH denote the high-exposure image, in which the darker region is well-exposed, but the brighter region is over-exposed. In this case, the darker region retains good detail definition and accurately reproduces the true color, while the brighter region is saturated such that the scene details cannot be perceived and the true color is not apparent. Although IL and IH have different exposure levels and may be not perfectly overlapped geometrically due to taken by unstable hand-held camera, coherence nevertheless exists in their color statistics and spatial constraints because they are taken simultaneously of the same scene [10]. The basic principle of the tone reproduction system developed in this study is to exploit these coherences in order to derive an accurate reproduction of the scene. In many image processing applications, the performance can be enhanced by using multiple input images to increase the amount of available information. Typical applications which adopt this approach include noise removal using flash and no-flash pairs [6], [12]; motion deblurring using normal and low-exposure pairs [10]; color transfer [15], [17]; and gray scale image colorization [11]. Goshtasby proposed an excellent method for realizing HDR reduction via the maximization of image information by using many images with different exposures [8]. However, the proposed method required the contents of the two images to be perfectly overlapped. Thus, the use of a fixed tripod was required, with the result that the practicality of the method was rather limited. In an attempt to resolve the problems highlighted above, the current study develops a novel tone reproduction system in which two input images acquired under different exposure conditions are input to a local adaptation mechanism which takes into account both the color statistics and the spatial constraint between the two images in order to reproduce the details and color of the scene. 
The proposed system performs a local tone mapping and a local color mapping process to exploit the respective advantages of IL and IH. Subsequently, a fusion process is applied to make a compromise between the local optimum and the global optimum.
Fig. 1. (a) Low-exposure image IL; (b) high-exposure image IH; (c) segmentation of IL into four regions based upon entropy; (d) illustration of morphological erosion operation following three iterations; (e) iteration map of IL; and (f) example of pixels and their corresponding local regions as determined by their iteration map values
2 Iteration Map Creation and Local Region Determination

In order to perform the local mechanism, this study commences by finding the local region of each pixel. A histogram-based segmentation process is applied to group the luminance values of the various pixels in the image into a small number of discrete intervals. The radius of the local region of each pixel is then determined by an iteration map which is derived from a morphological erosion operation.

2.1 Histogram-Based Region Segmentation Using Entropy Theorem

The entropy theorem provides a suitable means of establishing an optimal threshold value T when separating the darker and brighter regions of an image [13]. According to this theorem, the optimal threshold maximizes the total entropy of the image, i.e. the sum of the entropies of the darker and brighter regions:

T = \arg\max_{t} \Big( -\sum_{i=0}^{t} p_i^d \log p_i^d \;-\; \sum_{i=t}^{255} p_i^b \log p_i^b \Big), (1)

where t is the candidate threshold, and p_i^d and p_i^b are the probabilities of the darker and brighter pixels with luminance value i, respectively. Adopting a dichotomy approach, the segmentation procedure is repeated three times, yielding three separate threshold values, i.e. L_low, L_middle and L_high, which collectively segment the histogram into four subintervals, namely [L_min, L_low], [L_low, L_middle], [L_middle, L_high] and [L_high, L_max].
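As an illustration of Eq. (1), the following minimal Python/NumPy sketch selects the threshold by maximizing the summed entropies of the two histogram halves and then applies the dichotomy three times. It is not the authors' implementation; the function names and the assumption of integer luminance in [0, 255] are ours.

import numpy as np

def entropy_threshold(lum, bins=256):
    """Eq. (1): pick T maximizing the entropy of the darker plus brighter halves."""
    hist, _ = np.histogram(lum, bins=bins, range=(0, bins))
    p = hist.astype(float) / max(hist.sum(), 1)

    def entropy(pp):
        pp = pp[pp > 0]
        pp = pp / pp.sum()                  # probabilities within the sub-interval
        return -(pp * np.log(pp)).sum()

    scores = []
    for t in range(1, bins):
        if p[:t].sum() == 0 or p[t:].sum() == 0:
            scores.append(-np.inf)          # ignore degenerate splits
        else:
            scores.append(entropy(p[:t]) + entropy(p[t:]))
    return int(np.argmax(scores)) + 1

def segment_four_regions(lum):
    """Dichotomy applied three times, yielding L_low, L_middle, L_high."""
    l_mid = entropy_threshold(lum)
    l_low = entropy_threshold(lum[lum < l_mid]) if np.any(lum < l_mid) else 0
    l_high = entropy_threshold(lum[lum >= l_mid]) if np.any(lum >= l_mid) else 255
    return l_low, l_mid, l_high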
2.2 Iteration Map Creation Using Morphological Erosion Operation Having segmented the image, the proposed tone reproduction system then determines the circular local region Rx,y of each pixel (x, y). The radius of this region is found by performing an iterative morphological erosion operation in each luminance region, and creating an iteration map to record the iteration number at which each pixel is eroded. Clearly, for pixels located closer to the region boundary, the corresponding iteration value is lower, while for those pixels closer to the region center, the iteration value is higher. Hence, by inspection of the values indicated on the iteration map, it is possible not only to determine the radius of the circular local region of each pixel, but also to modulate the tone mapping function as described later.
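A minimal sketch of this iteration-map construction is given below (Python with scipy.ndimage; an illustrative reimplementation rather than the authors' code). Each luminance region is repeatedly eroded and every pixel records the iteration at which it disappears, so boundary pixels receive small values and interior pixels large ones.

import numpy as np
from scipy.ndimage import binary_erosion

def iteration_map(region_labels):
    """Iterative morphological erosion per region; returns the iteration map."""
    it_map = np.zeros(region_labels.shape, dtype=int)
    for label in np.unique(region_labels):
        mask = region_labels == label
        it = 0
        while mask.any():
            it += 1
            it_map[mask] = it            # surviving pixels take the current count
            mask = binary_erosion(mask)  # shrink the region by one pixel layer
    return it_map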
3 Tone Reproduction System

Since the light range in the brighter regions of an image is greater than that in the darker regions, the details in the over-exposed regions of IH are usually lost. Hence, in the current approach, the more detailed IL is processed using the color and tone information associated with IH.

3.1 Luminance: Pixel-Wise Local Tone Mapping

The proposed tone reconstruction method commences by applying a non-uniform luminance scaling process to IL to generate an initial middle-gray tone image. Due to the under-exposed darker region and well-exposed brighter region of IL, it is necessary to apply a greater scaling effect to the darker region to brighten the concealed details and a reduced scaling to the well-exposed brighter region, i.e.

\bar{L}_k = \exp\Big( \frac{1}{N} \sum_{x,y} \log(\delta + L_k(x, y)) \Big), \quad L_k \in \{L_L, L_H\}, (2)

L(x, y) = \Big( 2\,\frac{\bar{L}_H}{\bar{L}_L} \Big) \Big( 1 - \frac{L_L(x, y)}{2 L_{white}} \Big) L_L(x, y), (3)

where \bar{L}_L and \bar{L}_H are the log-average luminances (referred to as the key values in [9], [19] and [20]) of IL and IH, respectively, and are used to objectively measure whether the scene is of low-gray, middle-gray or high-gray tone. Furthermore, L_L(x, y) is the luminance value of pixel (x, y) in IL, normalized to the interval [0, 1], and L_{white} is the maximum luminance value in IL. By applying Eqs. (2) and (3), the luminance L_L can be scaled to an overall luminance L. To mimic the human visual system, which attains visual perception of a scene by locally adapting to luminance differences, the system proposed in this study performs a local tone mapping process which commences by computing the local adaptation luminance. Since the radius of the circular local region R_{x,y} has already been determined for each pixel (x, y), the value of the local adaptation luminance can be obtained
Fig. 2. (a) Local adaptation luminance result. Note result is normalized into interval [0, 255] for display purposes; (b) detailed term H; and (c) luminance compression term V’.
simply by convolving the luminance values in the local region with a weighted mask, i.e.

V(x, y) = \frac{1}{Z_{x,y}} \sum_{(i,j)\in R_{x,y}} L(i, j)\, G_{x,y}(i, j)\, K_{x,y}(i, j), (4)
where the significance of each neighborhood pixel (i, j) in this convolution is evaluated using G_{x,y} and K_{x,y}, which are Gaussian weights corresponding to the spatial distance between pixels (x, y) and (i, j) and to the difference in luminance of the two pixels, respectively, and Z_{x,y} in Eq. (4) is a normalization term. A method known as local tone mapping was proposed by Reinhard et al. [16] for addressing the tone reproduction problem. This simple non-uniform mapping technique compresses the luminance range of the scene such that all of the luminance values fall within the interval [0, 1]. The system presented in the current study goes a step further in modulating the local contrast and luminance compression by extracting the detail term (denoted as H) and the local adaptation luminance compression term (denoted as V'), illustrated in Fig. 2(a~c), and then modulating them in accordance with [3]:

L_d = \frac{L}{1 + V} = H \times V' = \left(\frac{L}{V}\right)^{\rho} \times \left(\frac{V}{1 + V}\right)^{\gamma}, (5)

where 0 < \rho < 2 and 0 < \gamma \leq 1. The value of \rho controls the degree of sharpness of the reproduced image, with a larger value generating a sharper result. In the current study, the value of \rho is varied in direct proportion to the iteration value of each pixel specified in the iteration map to ensure a smooth boundary while simultaneously revealing most of the image details near the region center. Meanwhile, the value of \gamma determines the degree of luminance compression. As its value is reduced, the luminance of the darker regions in the image is compressed into a larger display range. As a result, \gamma is inversely proportional to the range of the interval [L_min, L_middle]. Finally, in order to obtain the final local tone mapping result I_t, the value of each RGB channel is scaled according to the change ratio of the luminance, obtained by dividing the output luminance L_d by the input scaled luminance L, i.e.

I_t = I_L \times \frac{L_d}{L}. (6)
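A compact sketch of this modulation step (Eqs. (5)-(6)) is shown below in Python/NumPy. The arrays L, V, rho and gamma are assumed to be per-pixel maps of identical size, with rho and gamma derived from the iteration map as described above; this is an illustration under those assumptions, not the published implementation.

import numpy as np

def local_tone_mapping(L, V, rho, gamma, I_L, eps=1e-6):
    """Modulate the detail term H = L/V and the compression term V' = V/(1+V),
    then rescale the RGB channels of the low-exposure image by L_d / L."""
    H = (L + eps) / (V + eps)            # detail term
    V_prime = V / (1.0 + V)              # luminance compression term
    L_d = (H ** rho) * (V_prime ** gamma)
    ratio = (L_d + eps) / (L + eps)      # change ratio of the luminance
    return I_L * ratio[..., None]        # Eq. (6): scale each RGB channel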
3.2 Color: Pixel-Wise Local Color Mapping

The tone mapping method described in the previous section modulates only the luminance values, and hence the original imprecise RGB color information of IL is retained. If a low-exposure image is exposed such that the color and detail information of the brighter regions is captured, the darker regions of the image will inevitably be under-exposed. If the scene is characterized by an extremely broad light range, the darker regions may become so dark that the true color information is not recorded by the camera at all, and thus only an unsatisfactory local tone mapping result can be obtained. Moreover, the darker regions generally contain considerable noise, and thus obtaining a crisp result is difficult. To resolve these problems, the current approach modulates IL using the color information of IH, and a pixel-wise local color mapping module is applied to acquire the ground-truth color. Reinhard et al. [15] proposed the following simple but highly effective method, implemented in the lαβ color space [14], for transferring the color statistics from IH to IL:

I_L'(x, y) = g(I_L(x, y)) = \mu_H + \frac{\sigma_H}{\sigma_L}\,(I_L(x, y) - \mu_L), (7)

where I_L(x, y) is the color value of pixel (x, y) in each lαβ channel of the low-exposure image, \mu_L and \sigma_L are the mean and standard deviation of the color values in IL, respectively, and \mu_H and \sigma_H are the mean and standard deviation of the color values in IH, respectively. In the current system, this color transfer process is performed on each lαβ channel and the results are then converted back to the RGB color space to obtain the preliminary result IL'. However, this global color transfer approach fails to obtain the precise local color when the source or target image contains many different color regions, because it cannot distinguish which particular region each color statistic derives from and thus mixes them all up, yielding a uniform transfer [17]. Hence, the current system applies a further pixel-wise local color mapping process in the RGB channels to find the true local color of pixel (x, y) from a pixel (i*, j*) of IH within the local color mapping region S_{x,y}, whose radius is inversely proportional to the value of pixel (x, y) in the iteration map, because the color near the region center is more reliable.
Fig. 3. Local color mapping region Sx,y is inversely proportional to the value of pixel (x, y) in the iteration map
Fig. 4. Finding fusion weight αx,y using double sigmoid function
reliable, so we can search its precise color in IH within a smaller local color mapping region Sx,y as shown in Fig. 3.. This suggests following equation: ⎛ ( I ' ( x, y ) − I H (i, j )) 2 ⎞⎟ (i* , j* ) = arg max ⎜ exp(− L ) , ⎜ ⎟ 2σ 2 (i , j )∈S x , y ⎝ ⎠
(8)
σ
is specified as half the value of pixel (x, y) in the iteration map. where the value of When (i*, j*) in the Eq. (8) is obtained, the color value of IH (i*, j*) is used in place of IL’(x, y) to construct the local mapping result Ic. This local color mapping method resolves the image shift problem and gives the ground-truth color appearance of IH. Input Image Pairs
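The following short Python/NumPy sketch illustrates Eqs. (7)-(8) in a simplified form: a per-channel statistics transfer followed by the local search for the best-matching pixel of IH. It operates directly on RGB arrays for brevity (the paper performs the global transfer in the lαβ space), and the helper names are ours.

import numpy as np

def global_color_transfer(I_L, I_H):
    """Eq. (7) per channel: match mean and standard deviation of I_L to I_H."""
    mu_L, sd_L = I_L.mean(axis=(0, 1)), I_L.std(axis=(0, 1)) + 1e-6
    mu_H, sd_H = I_H.mean(axis=(0, 1)), I_H.std(axis=(0, 1))
    return mu_H + (sd_H / sd_L) * (I_L - mu_L)

def local_color_match(I_L_prime, I_H, x, y, radius, sigma):
    """Eq. (8): search S_{x,y} of I_H for the color closest to I_L'(x, y).
    (x, y) is used as (column, row); rows/cols are clipped to the image."""
    h, w = I_H.shape[:2]
    y0, y1 = max(0, y - radius), min(h, y + radius + 1)
    x0, x1 = max(0, x - radius), min(w, x + radius + 1)
    window = I_H[y0:y1, x0:x1].reshape(-1, I_H.shape[2])
    target = I_L_prime[y, x]
    dist2 = ((window - target) ** 2).sum(axis=1)
    best = np.argmax(np.exp(-dist2 / (2.0 * sigma ** 2)))   # same as argmin(dist2)
    return window[best]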
Fig. 5. Flowchart of the proposed system
I_d = \alpha_{x,y} I_t + (1 - \alpha_{x,y}) I_c, (9)

\alpha_{x,y} = \begin{cases} \dfrac{\beta}{1 + \exp\!\left( -2\,\dfrac{L_L(x, y) - L_{middle}}{|L_{middle} - L_{low}|} \right)}, & \text{if } L_L(x, y) < L_{middle}, \\[2ex] \dfrac{\beta}{1 + \exp\!\left( -2\,\dfrac{L_L(x, y) - L_{middle}}{|L_{middle} - L_{high}|} \right)}, & \text{otherwise.} \end{cases} (10)
The weight \alpha_{x,y} is generated by the double sigmoid function shown in Fig. 4. This function provides a virtually linear fusion over the interval [L_low, L_high] and a non-linear fusion over the intervals [L_min, L_low] and [L_high, L_max], respectively. In the darker regions, Ic is assigned a higher weight in order to obtain the true color appearance and to reduce the noise present in It. Meanwhile, in the brighter regions, It is assigned a greater weight to compensate for the color and details in the saturated regions of Ic. The parameter \beta in Eq. (10) is inversely proportional to the value of pixel (x, y) in the iteration map and serves to further control the color appearance of the fusion result. The flowchart of the proposed method is shown in Fig. 5.
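A minimal sketch of this fusion step (Eqs. (9)-(10)) is given below in Python/NumPy. It assumes L_L is the normalized luminance map of IL and that beta has already been derived from the iteration map; it is an illustration, not the authors' code.

import numpy as np

def fusion_weight(L_L, L_low, L_mid, L_high, beta):
    """Eq. (10): double sigmoid weight alpha_{x,y}."""
    denom = np.where(L_L < L_mid,
                     np.abs(L_mid - L_low) + 1e-6,
                     np.abs(L_mid - L_high) + 1e-6)
    return beta / (1.0 + np.exp(-2.0 * (L_L - L_mid) / denom))

def fuse(I_t, I_c, alpha):
    """Eq. (9): I_d = alpha * I_t + (1 - alpha) * I_c, applied per RGB channel."""
    return alpha[..., None] * I_t + (1.0 - alpha[..., None]) * I_c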
4 Experimental Results and Conclusion The current tone reproduction experiments were performed on a PC equipped with an Intel Pentium 4 (3.2GHz) processor. The execution time of the proposed method depends on the number and size of segmentation regions. In general, the results showed that a 640 x 480 image could be processed within 10 seconds on average. When implementing the local tone reproduction methods presented in the literature, the size of the local region is a crucial factor. Figure 6 compares the local regions estimated by the schemes presented in [16] and [3], respectively, with that estimated by the current method. It can be seen that the local region size derived using the method proposed by Reinhard et al. [16] may be too small to reveal sufficient local details. Conversely, the region size estimated using the region-wise method presented in [3] is not always adaptive for each pixel in that region and may therefore result in an unnatural emphasis. However, the morphological erosion method proposed in the current study enables the derivation of a more adaptive local region size.
Fig. 6. Example of pixel and corresponding local region measured using: (a) local region method proposed by Reinhard et al. [16]; (b) region-wise method [3]; and (c) current morphological erosion operation
Fig. 7. Typical image pairs and corresponding tone reproduction results: (a.1~c.1) input image pairs IL and IH; (a.2~c.2) tone mapping results It; (a.3~c.3) local color mapping results Ic; and (a.4~c.4) fusion results Id
In addition to successfully processing HDR images, as shown in Fig. 7(a.1~a.4), the proposed method is also able to handle LDR images and non-overlapping image pairs, as shown in Fig. 7(b.1~b.4). The local color mapping process effectively overcomes the positional shift between the image pairs while simultaneously obtaining the true color from IH. In Fig. 7(c.1~c.4), it is seen that the proposed method effectively smoothes the noise within the darker regions. In conclusion, tone reproduction methods are essential techniques for realizing HDR scenes on LDR display devices. Many previous tone reproduction techniques fail to accurately recover the color and details of HDR scenes since they use only a single image and therefore have only limited information at their disposal. In contrast, in the tone reproduction technique presented in this paper, two images are acquired at different exposures and are supplied to an automatic local adaptation mechanism which takes account of both the color statistics and the spatial constraint between the images in order to maintain the color and detail information of the original scene. The experimental results confirm that the proposed system
provides promising results with rich detail and color information and is capable of generating highly realistic reproductions of HDR scenes.
References
1. Ashikhmin, M.: A Tone Mapping Algorithm for High Contrast Images. 13th Eurographics, 145-156 (2002)
2. Blommaert, F.J.J., Martens, J.-B.: An Object-Oriented Model for Brightness Perception. Spatial Vision 5(1), 15-41 (1990)
3. Chen, H.T., Liu, T.L., Fuh, C.S.: Tone Reproduction: A Perspective from Luminance-Driven Perceptual Grouping. IJCV 65, 73-96 (2005)
4. DiCarlo, J.M., Wandell, B.A.: Rendering High Dynamic Range Images. SPIE: Image Sensors 3965, 392-401 (2000)
5. Durand, F., Dorsey, J.: Fast Bilateral Filtering for the Display of High-Dynamic-Range Images. ACM SIGGRAPH, 257-266 (2002)
6. Eisemann, E., Durand, F.: Flash Photography Enhancement via Intrinsic Relighting. ACM Trans. Graph. 23(3), 673-678 (2004)
7. Fattal, R., Lischinski, D., Werman, M.: Gradient Domain High Dynamic Range Compression. ACM SIGGRAPH, 249-256 (2002)
8. Goshtasby, A.: High Dynamic Range Reduction via Maximization of Image Information, http://www.cs.wright.edu/~agoshtas/hdr.html
9. Holm, J.: Photographic Tone and Colour Reproduction Goals. CIE Expert Symposium '96 on Colour Standard for Image Technology, 51-56 (1996)
10. Sun, J., Tang, J.: Bayesian Correction of Image Intensity with Spatial Consideration. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3024, pp. 342-354. Springer, Heidelberg (2004)
11. Levin, A., Lischinski, D., Weiss, Y.: Colorization Using Optimization. ACM Trans. Graph. 23(3), 689-694 (2004)
12. Petschnigg, G., Szeliski, R., Agrawala, M., Cohen, M., Hoppe, H., Toyama, K.: Digital Photography with Flash and No-Flash Image Pairs. ACM Trans. Graph. 23(3), 664-672 (2004)
13. Pardo, A., Sapiro, G.: Visualization of High Dynamic Range Images. IEEE Trans. on Image Proc. 12(6), 639-647 (2003)
14. Ruderman, D.L., Cronin, T.W., Chiao, C.C.: Statistics of Cone Responses to Natural Images: Implications for Visual Coding. J. Optical Soc. of America 15(8), 2036-2045 (1998)
15. Reinhard, E., Ashikhmin, M., Gooch, B., Shirley, P.: Color Transfer between Images. IEEE CG&A 21, 34-41 (2001)
16. Reinhard, E., Stark, M., Shirley, P., Ferwerda, J.: Photographic Tone Reproduction for Digital Images. ACM SIGGRAPH, 267-276 (2002)
17. Tai, Y.W., Jia, J., Tang, C.K.: Local Color Transfer via Probabilistic Segmentation by Expectation-Maximization. In: CVPR, vol. 1, pp. 747-754 (2005)
18. Tumblin, J., Turk, G.: LCIS: A Boundary Hierarchy for Detail-Preserving Contrast Reduction. ACM SIGGRAPH, 83-90 (1999)
19. Tumblin, J., Rushmeier, H.: Tone Reproduction for Computer Generated Images. IEEE CG&A 13(6), 42-48 (1993)
20. Ward, G.: A Contrast-Based Scale Factor for Luminance Display. In: Heckbert, P. (ed.) Graphics Gems IV, pp. 415-421. Academic Press, Boston (1994)
21. Ward, G., Rushmeier, H.E., Piatko, C.D.: A Visibility Matching Tone Reproduction Operator for High Dynamic Range Scenes. IEEE Trans. Visualization and Computer Graphics 3(4), 291-306 (1997)
Automated Removal of Partial Occlusion Blur Scott McCloskey, Michael Langer, and Kaleem Siddiqi Centre for Intelligent Machines, McGill University {scott,langer,siddiqi}@cim.mcgill.ca
Abstract. This paper presents a novel, automated method to remove partial occlusion from a single image. In particular, we are concerned with occlusions resulting from objects that fall on or near the lens during exposure. For each such foreground object, we segment the completely occluded region using a geometric flow. We then look outward from the region of complete occlusion at the segmentation boundary to estimate the width of the partially occluded region. Once the area of complete occlusion and width of the partially occluded region are known, the contribution of the foreground object can be removed. We present experimental results which demonstrate the ability of this method to remove partial occlusion with minimal user interaction. The result is an image with improved visibility in partially occluded regions, which may convey important information or simply improve the image’s aesthetics.
1 Introduction
Partial occlusions arise in natural images when an occluding object falls nearer to the lens than the plane of focus. The occluding object will be blurred in proportion to its distance from the plane of focus, and contributes to the exposure of pixels that also record background objects. This sort of situation can arise, for example, when taking a photo through a small opening such as a cracked door, fence, or keyhole. If the opening is smaller than the lens aperture, some part of the door/fence will fall within the field of view, partially occluding the background. This may also arise when a nearby object (such as the photographer's finger, or a camera strap) accidentally falls within the lens' field of view. Whatever its cause, the width of the partially-occluded region depends on the scene geometry and the camera settings. Primarily, the width increases with increasing aperture size (decreasing f-number), making partial occlusion a greater concern in low lighting situations that necessitate a larger aperture. Fig. 1 (left) shows an image with partial occlusion, which has three distinct regions: complete occlusion (outside the red contour), partial occlusion (between the green and red contours), and no occlusion (inside the green contour). As is the case in this example, the completely occluded region often has little high-frequency structure because of the severe blurring of objects far from the focal plane. In addition, the region of complete occlusion can be severely underexposed when the camera's settings are chosen to properly expose the background. In [7], it was shown that it is possible to remove the partial occlusion when the location and width of the partially occluded region are found by a user. Because
Fig. 1. (Left) Example image taken through a keyhole. Of the pixels that see through the opening, more than 98% are partially occluded. (Right) The output of our method, with improved visibility in the partially-occluded region.
of the low contrast and arbitrary shape of the boundary between regions of complete and partial occlusion, this task can be challenging, time consuming, and prone to user error. In the current paper we present an automated solution to this vision task for severely blurred occluding objects and in doing so significantly extend the applicability of the method in [7]. Given the input image of Fig. 1 (left), the algorithm presented in this paper produces the image shown in Fig. 1 (right). The user must only click on a point within each completely-occluded region in the image, from which we find the boundary of the region of complete occlusion. Next, we find the width of the partially occluded band based on a model of image formation under partial occlusion. We then process the image to remove the partial occlusion, producing an output with improved visibility in that region. Each of these steps is detailed in Sec. 4.
2 Previous Work
The most comparable work to date was presented by Favaro and Soatto [3], who describe an algorithm which reconstructs the geometry and radiance of a scene, including partially-occluded regions. While this restores the background, it requires several registered input images taken at different focal positions. Tamaki and Suzuki [12] presented a method for the detection of completely occluded regions in a single image. Unlike our method, they assume that the occluding region has high contrast with the background, and that there is no adjacent region of partial occlusion. A more distantly related technique is presented by Levoy et al. in [5], where synthetic aperture imaging is used to see around occluding objects. Though this
ability is one of the key features of their system, no effort is made to identify or remove partial occlusion in the images. Partial occlusion also occurs in matte recovery which, while designed to extract the foreground object from a natural scene, can also recover the background in areas of partial occlusion. Unlike our method, matte recovery methods require either additional views of the same scene [8,9] or substantial user intervention [1,11]. In the latter category, users must supply a trimap, a segmentation of the image into regions that are either definitely foreground, definitely background, or unknown/mixed. Our method is related to matte recovery, and can be viewed as a way of automatically generating, from a single image, a trimap for images with partial occlusion due to blurred foreground objects.
3 Background and Notation
In [7], it was shown that the well-known matting equation,

R_{input}(x, y) = \alpha(x, y)\,R_f + (1 - \alpha(x, y))\,R_b(x, y), (1)

describes how the lens aperture combines the radiance R_b of the background object with the radiance R_f of the foreground object. The blending parameter \alpha describes the proportion in which the two quantities combine, and R_{input} is the observed radiance. Notionally, the quantity \alpha is the fraction of the pixel's viewing frustum that is subtended by the foreground object. Since that object is far from the plane of focus, the frustum is a cone and \alpha is the fraction of that cone subtended by the occluding object. In order to remove the contribution of the occluding object, the values of \alpha and R_f must be found at each pixel. Given the location of the boundary between regions of complete and partial occlusion, the distance d between each pixel and the nearest point on the boundary can be found. From d and the width w of the partially occluded region, the value of \alpha is well-approximated [7] by

\alpha = \frac{1}{2} - \frac{1}{\pi}\left( l\sqrt{1 - l^2} + \arcsin(l) \right), \quad \text{where } l = \min\!\left(2, \frac{2d}{w}\right) - 1. (2)

This can be done if the user supplies both w and the boundary between the regions of complete and partial occlusion, as in [7]. Unfortunately, this task is time consuming, difficult, and prone to user error. In this paper, we present an automated solution to this vision problem, from which we compute the values \alpha and R_f.
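Assuming the reconstruction of Eq. (2) above, the blending value can be computed directly from the distance map, as in the following small Python/NumPy sketch (our illustration, not the authors' code):

import numpy as np

def alpha_from_distance(d, w):
    """Eq. (2): alpha at a pixel at distance d from the boundary of complete
    occlusion, given the width w of the partially occluded band."""
    l = np.minimum(2.0, 2.0 * np.asarray(d, dtype=float) / w) - 1.0
    return 0.5 - (l * np.sqrt(1.0 - l ** 2) + np.arcsin(l)) / np.pi

# Sanity check: alpha = 1 on the boundary of complete occlusion (d = 0)
# and alpha = 0 at the outer edge of the partially occluded band (d = w).
print(alpha_from_distance(0.0, 100.0), alpha_from_distance(100.0, 100.0))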
4 Method
To state the vision problem more clearly, we refer to the example¹ in Fig. 2. In this example, the partial occlusion is due to the handle of a fork directly in front
¹ The authors of [7] have made this image available to the public at http://www.cim.mcgill.ca/~scott/research.html
Fig. 2. To remove partial occlusion from a foreground object, the vision problem is to determine the boundary of the completely occluded region (green curve) and the width of the partially-occluded region (the length of the red arrow)
of the lens. In order to remove the contribution of the occluding object, we must automatically find the region of complete occlusion (outlined in green) and the width of the partially occluded band (the length of the red arrow). In order to find the region of complete occlusion within the image, we assume that the foreground image appears as a region of nearly constant intensity. Note that this does not require that the object itself have constant radiance. Because the object is far from the plane of focus, high-frequency radiance variations will be lost due to blurring in the image. Moreover, when objects are placed against the lens they are often severely under-lit, as they fall in the shadow of the camera or photographer. As such, many occluding objects with texture may appear to have constant intensity in the image. A brief overview of the method is as follows. Given the location of a point p that is completely occluded (provided by the user), we use a geometric flow (Sec. 4.2) to produce a segmentation with a smooth contour such as the one outlined in Fig. 2, along which we find normals facing outward from the region of complete occlusion. The image is then re-sampled (Sec. 4.3) to uniformly-spaced points on these normals, reducing an arbitrarily-shaped occluding contour to a linear contour. Low variation rows in the resulting image are averaged to produce a profile from which the blur width is estimated (Sec. 4.4). Once the blur width is estimated, the method of [7] is used to remove the partial occlusion (Sec. 4.5). 4.1
Preprocessing
Two pre-processing steps are applied before attempting segmentation:
1. Because our model of image formation assumes that the camera's response is linear, we use the method of [2] to undo the effects of its tone mapping function, transforming the camera image I_{input} to a radiance image R_{input}.
2. Before beginning the segmentation, we force the completely occluded region to be the darkest part of the image by subtracting R_p, the radiance of the user-selected point, and taking the absolute value. This gives a new image

R = |R_{input} - R_p|, (3)
which is nearly zero at points in the region of complete occlusion, and higher elsewhere. As a result of this step, points on the boundary between the regions of partial and complete occlusion will have gradients ∇R that point out of the region of complete occlusion. This property will be used to find the boundary between the regions of complete and partial occlusion. 4.2
Foreground Segmentation
While the region of complete occlusion is assumed to have nearly constant intensity, segmenting this region is nontrivial due to the extremely low contrast at the boundary between complete and partial occlusion. In order to produce good segmentations in spite of this difficulty, we use two cues. The first cue is that pixels on the boundary of the region of complete occlusion have gradients of R that point into the region of partial occlusion. This is assured by the pre-processing of Eq. 3, which causes the foreground object to have the lowest radiance in R. The second cue is that points outside of the completely occluded region will generally have intensities that differ from the foreground intensity. To exploit these two cues, we employ the flux maximizing geometric flow of [13], which evolves a 2D curve to increase the outward flux of a static vector field through its boundary. Our cues are embodied in the vector field

\vec{V} = \phi\,\frac{\nabla R}{|\nabla R|}, \quad \text{where } \phi = (1 + R)^{-2}. (4)
The vector field \nabla R / |\nabla R| embodies the first cue, representing the direction of the gradient, which is expected to align with the desired boundary as well as be orthogonal to it.² The scalar field \phi, which embodies the second cue, is near 1 in the completely-occluded region and smaller elsewhere. As noted in [6], an exponential form for \phi can be used to produce a monotonically-decreasing function of R, giving similar results. The curve evolution equation works out to be

C_t = \mathrm{div}(\vec{V})\,\vec{N} = \left( \left\langle \nabla\phi, \frac{\nabla R}{|\nabla R|} \right\rangle + \phi\,\kappa_R \right) \vec{N}, (5)

where \kappa_R is the Euclidean mean curvature of the iso-intensity level set of the image. The flow cannot leak outside the completely occluded region since by construction both \phi and \nabla\phi are nearly zero there. This curve evolution, which starts from a small circular region containing the user-selected point, may produce a boundary that is not smooth in the presence of noise. In order to obtain a smooth curve, from which outward normals can be robustly estimated, we apply a few iterations of the Euclidean curve-shortening flow [4]. While it is possible to include a curvature term in the flux-maximizing flow to evolve a smooth contour, we separate the terms into different flows which are computed in sequence. Both flows are implemented using level set methods [10]; details are given in the Appendix.

² It is important to normalize the gradient of the image so that its magnitude does not dominate the measure outside of the occluded region.
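The static quantities driving this flow can be precomputed on the pixel grid, as in the following Python/NumPy sketch. It only illustrates the speed term of Eq. (5) under finite differences and is not the authors' level-set implementation; the mild pre-smoothing is our own assumption.

import numpy as np
from scipy.ndimage import gaussian_filter

def flow_speed(R, sigma_grad=1.0, eps=1e-8):
    """Return <grad(phi), grad(R)/|grad(R)|> + phi * kappa_R for Eq. (5)."""
    Rs = gaussian_filter(R, sigma_grad)          # smooth before differencing
    gy, gx = np.gradient(Rs)
    mag = np.sqrt(gx ** 2 + gy ** 2) + eps
    nx, ny = gx / mag, gy / mag                  # unit gradient field of R

    phi = (1.0 + R) ** -2                        # scalar field of Eq. (4)
    py, px = np.gradient(phi)

    # curvature of the iso-intensity level sets: kappa_R = div(grad R / |grad R|)
    kappa = np.gradient(nx, axis=1) + np.gradient(ny, axis=0)

    return px * nx + py * ny + phi * kappa       # speed multiplying the normal N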
Fig. 3. (Left) Original image with segmentation boundary (green) and outward-facing normals (blue) along which the image will be re-sampled. (Right) The re-sampled image (scaled), which is used to estimate the blur width.
Once the curve-shortening flow has terminated, we can recover the radiance Rf of the foreground (occluding) object by simply taking the mean radiance value within the segmented (completely occluded) region. Note that we use this instead of Rp , the radiance of the user-selected point, as there may be some low-frequency intensity variation within the region of complete occlusion. 4.3
Boundary Rectification and Profile Generation
One of the difficulties in measuring the blur width is that the boundary of the completely occluded region can have an arbitrary shape. In order to handle this, we re-sample the image R along outward-facing normals to the segmentation boundary, reducing the shape of the occluding contour to a line along the left edge of a re-sampled image Rl . The number of rows in Rl is determined by the number of points on the segmentation boundary, and pixels in the same row of Rl come from points on the same outward-facing normal. Pixels in the same column come from points the same distance from the segmentation boundary on different normals and thus, recalling Eq. 2, have the same α value. The number of columns in the image depends on the distance from the segmentation boundary to the edge of the input image. We choose this quantity to be the largest value such that 80% of the normals remain within the image frame and do not re-enter the completely-occluded region (this exact quantity is arbitrary and the method is not sensitive to variations in this choice). Fig. 3 shows outward-facing surface normals from the contour in Fig. 2, along with the re-sampled image. The task of measuring the width of the partially occluded region is also complicated by the generality of the background intensity. In the worst case, it is impossible (for human observers or our algorithm) to estimate the width if the background has an intensity gradient in the opposite direction of the intensity gradient due to partial occlusion. The measurement is straightforward if the background object had constant intensity, though this assumption is too strong. Given that the blurred region is a horizontal feature in the re-sampled image, we average rows of Rl in order to smooth out high-frequency background
Fig. 4. [Left] Profile generated from the re-sampled image in Fig. 3 (black curve). Model profile P_m^50 with relatively high error (red curve). Model profile P_m^141 with minimum error (green curve). [Right] Fitting error as a function of w'.
texture. While we do not assume a uniform background, we have found it useful to eliminate rows with relatively more high-frequency structure before averaging. In particular, for each row of R_l we compute the sum of its absolute horizontal derivatives,

\sum_{x} \left| R_l(x + 1, y) - R_l(x - 1, y) \right|. (6)

Rows with an activity measure in the top 70% are discarded, and the remaining rows are averaged to generate the one-dimensional blur profile P.
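A minimal sketch of this profile generation step (Eq. (6) plus row selection and averaging) follows, written in Python/NumPy under the assumption that R_l is the re-sampled 2-D array described above; the helper name is ours.

import numpy as np

def blur_profile(R_l, keep_fraction=0.3):
    """Score rows by summed absolute horizontal derivatives, keep the quiet
    30%, and average them into the 1-D profile P."""
    activity = np.abs(R_l[:, 2:] - R_l[:, :-2]).sum(axis=1)   # Eq. (6) per row
    n_keep = max(1, int(keep_fraction * R_l.shape[0]))
    quiet_rows = np.argsort(activity)[:n_keep]                # discard the top 70%
    return R_l[quiet_rows].mean(axis=0)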
4.4 Blur Width Estimation
Given a 1D blur profile P, like the one shown in Fig. 4 (black curve), we must estimate the width w of the partially occluded region. We do this by first expressing P in terms of \alpha. Recalling Eq. 3 and the fact that R_f \approx R_p, we rearrange Eq. 1 to get

R_l(x, y) = (1 - \alpha(x, y))\,R_b^l(x, y) - R_f, (7)

where R_b^l is the radiance of the background object defined on the same lattice as the re-sampled image. The profile P(x) is the average of many radiances from pixels with the same \alpha value, so

P(x) = (1 - \alpha(x))\,R_b^l(x) - R_f, (8)

where R_b^l(x) is the average radiance of background points a fixed distance from the segmentation boundary (which fall in a column of the re-sampled image). As we have removed rows with significant high-frequency structure and averaged the rows of the re-sampled image, we assume that the values R_b^l(x) are relatively constant over the partially-occluded band, and thus

P(x) = (1 - \alpha(x))\,R_b^l - R_f. (9)

Based on this, the blur width w is taken to be the value that minimizes the average fitting error between the measured profile P and model profiles.
The model profile P_m^w for a given width w is constructed by first generating a linear ramp l and then transforming these values into \alpha values by Eq. 2. An example is shown in Fig. 4, where the green curve shows the model profile for which the error is minimized with respect to the measured profile (black curve), and the red curve shows another model profile which has higher error. A plot of the error as a function of w is shown in Fig. 4 (right). We see that it has a well-defined global minimum, which is at w = 141 pixels.
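A sketch of this width search is given below in Python/NumPy. It follows Eq. (9) as printed above with R_b^l and R_f treated as known scalars (in practice R_f comes from the segmented region and R_b^l would itself be estimated); the function name and error metric are our own choices.

import numpy as np

def estimate_blur_width(P, R_bl, R_f, w_range):
    """Return the candidate width w whose model profile best fits P."""
    x = np.arange(P.size, dtype=float)
    best_w, best_err = None, np.inf
    for w in w_range:
        l = np.minimum(2.0, 2.0 * x / w) - 1.0                       # ramp of Eq. (2)
        alpha = 0.5 - (l * np.sqrt(1.0 - l ** 2) + np.arcsin(l)) / np.pi
        model = (1.0 - alpha) * R_bl - R_f                           # Eq. (9)
        err = np.abs(model - P).mean()
        if err < best_err:
            best_w, best_err = w, err
    return best_w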
4.5 Blur Removal
Once the segmentation boundary and the width w of the partially-occluded region have been determined, the value of \alpha can be found using Eq. 2. In order to compute \alpha at each pixel, we must find its distance to the nearest point on the segmentation boundary. We employ the fast marching method of [10]. Recall that the radiance R_f of the foreground object was found previously, so we can recover the radiance of the background at pixel (x, y) according to

R_b(x, y) = \frac{R_{input}(x, y) - \alpha(x, y)\,R_f}{1 - \alpha(x, y)}. (10)
Finally, the processed image Rb is tone-mapped to produce the output image. This tone-mapping is simply the inverse of what was done in section 4.1.
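Eq. (10) amounts to a single vectorized operation, as in the following Python/NumPy sketch. Clipping alpha away from 1 is our own safeguard against amplifying noise near the boundary of complete occlusion and is not part of the paper.

import numpy as np

def remove_partial_occlusion(R_input, alpha, R_f, alpha_max=0.95):
    """Eq. (10): recover the background radiance in partially occluded pixels."""
    a = np.clip(alpha, 0.0, alpha_max)
    return (R_input - a * R_f) / (1.0 - a)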
5 Experimental Results
Fig. 5 shows the processed result from the example image in Fig. 2. The user-selected point was near the center of the completely occluded region, though in our experience the segmentation is insensitive to the location of the initial point. We also show enlargements of a region near the occluding contour to illustrate the details that become clearer after processing. Near the contour, as α → 1, noise becomes an issue. This is because we are amplifying a small part of the input signal, namely the part that was contributed by the background.
Fig. 5. (Left) Result for the image shown in Fig. 2. (Center) Enlargement of processed result. (Right) Enlargement of the corresponding region in the input image.
Fig. 6. Example scene through a small opening. (Left) Input wide-aperture image. (Middle) Output wide-aperture image. (Right) Reference small-aperture image. Notice that more of the background is visible in our processed wide-aperture image.
Fig. 6 shows an additional input and output image pair, along with a reference image taken through a smaller aperture. The photos were taken through a slightly opened door. It is important to note that processing the wide aperture photo reveals background detail in parts of the scene where a small aperture is completely occluded. Namely, all pixels where α > .5 are occluded in a pinhole aperture image, though many of them can be recovered by processing a wide aperture image. In this scene, there are two disjoint regions of complete occlusion, each of which has an adjacent region of partial occlusion. This was handled by having the user select two starting points from which the segmentation flow was initialized, though the method could also have been applied separately to each occluded region. The method described in this paper can also be extended to video processing. In the event that the location of the camera and the occluding object are fixed relative to one another, we need only perform the segmentation and blur estimation on a single frame of the video. The recovered value of α at each pixel (the matte) can be used to process each frame of the video separately. A movie, keyholevideo.mpg, is included in the supplemental material with this submission, and shows the raw and processed frames side-by-side (as in Fig. 1).
6 Conclusion
The examples in the previous section demonstrate how our method automatically measures the blur parameters and removes partial occlusion due to nearby objects. Fig. 6 shows that pictures taken through small openings (such as a fence,
keyhole, or slightly opened door) can be processed to improve visibility. In this and the case of the text image shown in Fig. 5, this method reveals important image information that was previously difficult to see. The automated nature of this method makes the recovery of partially-occluded scene content accessible to the average computer user. Users need only specify a single point in each completely occluded region, and the execution time of 10-20 seconds is likely acceptable. Given such a tool, users could significantly improve the quality of images with partial occlusions. In order to automate the recovery of the necessary parameters, we have assumed that the combination of blurring and under-exposure produces a foreground region with nearly constant intensity. Methods that allow us to relax this assumption are the focus of ongoing future work, and must address significant additional complexity in each of the segmentation, blur width estimation, and blur removal steps.
References
1. Chuang, Y., Curless, B., Salesin, D.H., Szeliski, R.: A Bayesian Approach to Digital Matting. In: CVPR 2001, pp. 264-271 (2001)
2. Debevec, P., Malik, J.: Recovering High Dynamic Range Radiance Maps from Photographs. In: SIGGRAPH 1997, pp. 369-378 (1997)
3. Favaro, P., Soatto, S.: Seeing Beyond Occlusions (and other marvels of a finite lens aperture). In: CVPR 2003, pp. 579-586 (2003)
4. Grayson, M.: The Heat Equation Shrinks Embedded Plane Curves to Round Points. Journal of Differential Geometry 26, 285-314 (1987)
5. Levoy, M., Chen, B., Vaish, V., Horowitz, M., McDowall, I., Bolas, M.: Synthetic Aperture Confocal Imaging. In: SIGGRAPH 2004, pp. 825-834 (2004)
6. Perona, P., Malik, J.: Scale-Space and Edge Detection using Anisotropic Diffusion. IEEE Trans. on Patt. Anal. and Mach. Intell. 12(7), 629-639 (1990)
7. McCloskey, S., Langer, M., Siddiqi, K.: Seeing Around Occluding Objects. In: Proc. of the Int. Conf. on Patt. Recog., vol. 1, pp. 963-966 (2006)
8. McGuire, M., Matusik, W., Pfister, H., Hughes, J.F., Durand, F.: Defocus Video Matting. ACM Trans. Graph. 24(3) (2005)
9. Reinhard, E., Khan, E.A.: Depth-of-field-based Alpha-matte Extraction. In: Proc. 2nd Symp. on Applied Perception in Graphics and Visualization 2005, pp. 95-102 (2005)
10. Sethian, J.A.: Level Set Methods and Fast Marching Methods. Cambridge University Press, Cambridge (1999)
11. Sun, J., Jia, J., Tang, C., Shum, H.: Poisson Matting. ACM Trans. Graph. 23(3) (2004)
12. Tamaki, T., Suzuki, H.: String-like Occluding Region Extraction for Background Restoration. In: Proc. of the Int. Conf. on Patt. Recog., vol. 3, pp. 615-618 (2006)
13. Vasilevskiy, A., Siddiqi, K.: Flux Maximizing Geometric Flows. IEEE Trans. on Patt. Anal. and Mach. Intell. 24(12), 1565-1578 (2002)
Appendix: Implementation Details

For the experiments shown here, we down-sample the original 6MP images to 334 by 502 pixels for segmentation and blur width estimation. Blur removal is
performed on the original 6MP images. Based on this, blur estimation and image processing take approximately 10 seconds (on a 3 GHz Pentium IV) to produce the output in Fig. 5. Other images take more or less time, depending on the size of the completely-occluded region. Readers should note that some of the code used in this implementation was written in Matlab, implying that the execution time could be further reduced in future versions. As outlined in Sec. 4.2, we initially use a flux-maximizing flow to perform the segmentation, followed by a Euclidean curve-shortening flow to produce a smooth contour. For the flux-maximizing flow, we evolve the level set function with time step Δt = 0.1. This parameter was chosen to ensure stability for our 6MP images; in general it depends on image size. The evolution's running time depends on the size of the foreground region. The curve evolution is terminated if it fails to increase the segmented area by 0.01% over 10 iterations. As the flux-maximizing flow uses an image-based speed term, we use a narrow-band implementation [10] with a bandwidth of 10 pixels.
High Capacity Watermarking in Nonedge Texture Under Statistical Distortion Constraint Fan Zhang, Wenyu Liu, and Chunxiao Liu Huazhong University of Science and Technology, Wuhan, 430074, P.R. China {zhangfan,liuwy}@hust.edu.cn
[email protected]
Abstract. A high-capacity image watermarking scheme aims to maximize the bit rate of the hidden information without eliciting perceptible image distortion or facilitating specialized watermark attacks. Texture, in preattentive vision, reveals itself through concise high-order statistics and can hold a large watermark capacity. However, traditional distortion constraints, e.g. just-noticeable-distortion (JND), cannot evaluate texture distortion in visual perception and thus impose an overly strict constraint. Inspired by recent work on image representation [9], which suggests texture extraction and a mixture of probabilistic principal component analyzers for learning texture features, we propose a distortion measure in the subspace spanned by the texture principal components, together with an adaptive distortion constraint that depends on the local roughness of the image. The proposed spread-spectrum watermarking scheme generates watermarked images with a larger SNR than JND-based schemes at the same allowed distortion level, and its watermark has a power spectrum approximately directly proportional to that of the host image, making it more robust against Wiener filtering attacks.
1 Introduction

Image watermarking is applied to copyright protection and covert communication. The efficiency of a watermarking scheme can be measured by its hiding capacity. Generally, a watermarking scheme achieves a high hiding capacity by balancing the tradeoff between the achievable information-hiding rate and the allowed distortion constraints for the information hider and the attacker [1]. In particular, for an additive watermarking scheme, e.g. the spread-spectrum scheme [2], the embedding algorithm is designed to add a watermark of maximal intensity into the host image while satisfying the distortion constraint of the information hider. In this way, the hiding capacity is restricted by the allowed distortion level; information-theoretic analysis [1] has proved this argument. A variety of distortion constraints have been proposed to incorporate certain psychovisual properties of the human visual system (HVS). Perhaps the most popular is just-noticeable-distortion (JND), which was first applied to image compression [3]. JND provides each signal with a threshold level of error visibility below which reconstructed signals are rendered without noticeable distortion. The JND profile of a still image is a function of local signal properties, including background intensity, activity of luminance changes and dominant spatial frequency. For a more accurate JND estimation, edge regions and nonedge regions should be distinguished [4]. An edge is
directly related to image content that demarcates object boundaries, surface creases, and reflectance changes, and thus distortion around edges is more easily noticed. Nonedge texture, therefore, can hide more error than smooth or edge areas. Psychovisual studies by Julesz [5] suggest that image textures with the same second-order statistics are perceived as identical by the human visual system. Among the work on texture synthesis, synthesis-by-analysis methodologies [6~8] verify that texture can be reconstructed by matching feature statistics, where the features are generally the texture's projections onto a set of directions determined by a bank of suitable filters. Meanwhile, resampling approaches [11,12] imply that local neighborhoods can be replaced without introducing perceptible distortion. These works suggest that a distortion constraint should tolerate local diversity while guarding against global differences in statistics. This paper adopts the hypothesis that local texture can be warped more intensely along the directions in which the texture itself has larger variance. The argument may be explained as follows: (1) noise (a watermark) having the same statistics as the texture will cause less perceptible distortion than white noise, and (2) when warped along a direction of large variance, the texture patch is more likely to resemble another patch somewhere else, so the warp is equivalent to a replacement and may be allowable. Inspired by HIMPA (Hybrid ICA-Mixture of PPCA Algorithm) [9], which uses an independent component analysis (ICA) model for edge representation followed by a mixture of probabilistic principal component analyzers (MPPCA) for textural surface representation, this paper proposes (1) a nonedge texture extraction method based on a three-factor image model, and (2) a texture distortion measure based on the texture's principal components, together with an adaptive distortion constraint that depends on local image roughness, so that a spread-spectrum watermark of maximal intensity can be embedded into the nonedge regions of the host image to achieve a high-capacity watermarking scheme. The watermarking scheme is introduced in Section 2. Selected experimental results are presented in Section 3. The paper closes with concluding remarks in Section 4.
2 Proposed Scheme

Our scheme is shown in Fig. 1. Without loss of generality, we embed one bit of message m, whose value is either -1 or +1, into the image block vector x. Corresponding to a secret key k, p is a pseudorandom noise sequence with zero mean whose values are equal to +1 or -1. Modulated by m, the sequence p is added to or subtracted from x with pixel-wise weights to form the watermarked image block vector y as

y = x + m\,\mathrm{diag}(p)\,\gamma, (1)

where diag(A) is a diagonal matrix having vector A as its diagonal. The pixel-wise weight vector \gamma represents the intensity of the watermark, and it is maximized under the adaptive distortion constraint

d(y, x) < D_w, (2)

which will be analyzed in Section 2.1.
The watermarked image is possibly corrupted by an attacker's noise. We only consider additive noise, as in Eq. (3):
z = y + n. (3)
The receiver knows the secret key k and can regenerate p, and then the detection is performed. First, the (normalized) correlation is calculated as

r = \langle z, p \rangle = m \sum \gamma + \langle x, p \rangle + \langle n, p \rangle, (4)

where \langle A, B \rangle denotes the correlation, i.e., the inner product of sequences A and B. The two latter terms in (4) can be neglected due to the independence between x and p and between n and p; the watermark is then estimated as the sign of r:

\hat{m} = \mathrm{sign}(r). (5)
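The embedding and detection steps of Eqs. (1)-(5) reduce to a few lines of code, as in the Python/NumPy sketch below. The key-seeded generation of the +/-1 sequence is our own assumption about how p would be reproduced at the detector.

import numpy as np

def embed(x, m, gamma, key):
    """Eq. (1): y = x + m * diag(p) * gamma, with p a +/-1 PN sequence."""
    rng = np.random.default_rng(key)
    p = rng.integers(0, 2, size=x.shape) * 2 - 1     # pseudorandom +/-1 sequence
    return x + m * p * gamma, p

def detect(z, key):
    """Eqs. (4)-(5): correlate the received block with p and take the sign."""
    rng = np.random.default_rng(key)
    p = rng.integers(0, 2, size=z.shape) * 2 - 1
    return int(np.sign(np.dot(z, p)))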
Fig. 1. The proposed watermarking scheme
2.1 Nonedge Texture Extraction
For the extraction of nonedge texture, we propose a three-factor image model. Three factors are assumed to contribute to the appearance of a natural image: (a) the objects' intrinsic luminance and shading effects, (b) dominant edges, and (c) nonedge texture. Differently from the HIMPA model [9], our model isolates the low-frequency band of the image, x_L, as factor (a), and uses the HIMPA model to analyze the high-frequency band x_H so as to discriminate factor (b) from factor (c). Factor (b) is the edge region of x_H, while factor (c) is the surface region. The reason for isolating factor (a) is that the clustering of MPPCA will then not be affected by the local average luminance and will thereby depend more on local contrast. Following HIMPA [9], we use independent component analysis (ICA) to extract nonedge texture from the high-frequency band of the image. An image patch x_H is represented as a linear superposition of the columns of the mixing matrix A plus a residual noise term which can be neglected:

x_H = A s. (6)
The columns of A are learned from natural images by making the components of the vector s statistically independent. In the learning algorithm, A is constrained to be an orthogonal transformation, so the vector s is calculated by
s = A^T x_H. (7)
The elements of s with large magnitudes signify the presence of structure in an image block, corresponding to edges, oriented curves, etc. The elements of s with smaller magnitudes represent the summation of a number of weak edges yielding nonedge texture. For each image block, we divide A into two groups, A_E and A_T, such that the absolute values of the elements of s corresponding to the columns of A_E are all larger than those corresponding to A_T. The image block is then decomposed into two subcomponents,

x_H = x_E + x_T, (8)

where

x_E = A_E (A_E^T A_E)^{-1} A_E^T x_H. (9)
The ratio between the number of columns of A_T and of A is determined by the ICA threshold. In this paper, the ICA threshold is 0.375 and A has 160 columns; therefore A_E has 100 columns, while A_T has 60 columns. Owing to this ratio control by the ICA threshold, the nonedge texture is the residual after removing the relatively dominant edges, rather than only the absolutely sharp edges as in HIMPA [9]; therefore, much more intense nonedge texture is extracted from rough regions than from smooth regions. Figure 2 illustrates the process of nonedge texture extraction.

2.2 Statistical Distortion Measure
The basic idea of the proposed measure is to evaluate distortion in the subspace spanned by the texture's principal components. PPCA is a suitable stochastic model for a homogeneous nonedge region, as suggested by HIMPA [9]. MPPCA is able to model more complex nonedge regions in a real image scene thanks to its use of clustering. We describe the salient features of the MPPCA model here; the details are contained in [9]. The nonedge texture is assumed to be sampled from a limited number of clusters. Each cluster is assumed homogeneous, and an efficient basis can be constructed, where a texture block from cluster k is generated using the generative model

x_k = W_k s_k + \mu_k + \varepsilon_k, \quad k = 1, \ldots, K, (10)

where x_k \in R^a is a column vector elongated from the host image block and has dimension a, and s_k \in R^q is the lower-dimensional source manifold, assumed to be Gaussian distributed with zero mean and identity covariance. Note that, in this section, an image block always refers to a nonedge texture block. The dimension of s_k is q, with a > q. W_k is an a-by-q mixing matrix, \mu_k is the cluster observation mean, and \varepsilon_k is Gaussian white isotropic noise with zero mean, i.e. \varepsilon_k \sim N(0, \sigma^2 I). Hence, x_k conforms to the distribution N(\mu_k, W_k W_k^T + \sigma^2 I). W_k, s_k, \mu_k, and \varepsilon_k are hidden variables that need to be estimated from the observed texture data. \mu and the columns of W for one cluster are visualized as 8×8 blocks at the bottom right of Figure 2.
Fig. 2. Three-factor model for nonedge extraction and features learned by MPPCA
We define the distortion between image block vectors x and y as the quadratic form

d(y, x) = E_{xy}^2 = (x - y)^T C^{-1} (x - y), (11)
where C is the covariance of the maximum-likelihood cluster to which x belongs, and

C = \sigma^2 I + W W^T. (12)
Formula (11) can be transformed into a standard quadratic form according to [10]:

d(y, x) = v^T D^{-1} v, (13)
where v is the projection of x - y, i.e. v = U^T(x - y). If the sample covariance matrix learned by MPPCA has its eigenvalues arranged in descending order, denoted by diag(\lambda_1, \lambda_2, \ldots, \lambda_q, \ldots, \lambda_a), then U is the corresponding eigenvector matrix, D^{-1} is the diagonal matrix diag(1/\lambda_1^2, 1/\lambda_2^2, \ldots, 1/\lambda_q^2, 1/\sigma^2, \ldots, 1/\sigma^2), and \sigma is obtained by averaging the last a - q eigenvalues. Similarly to a principal component analyzer (PCA), the transform U^T projects the observation onto a set of orthogonal directions across which the observations have large variation. In fact, formula (13) reduces to the mean square error metric when D is the identity matrix, so the measure is a generalization of the mean square error metric. It is noteworthy that formula (13) resembles a Karhunen-Loeve Transform (KLT) norm, but the substantial difference is that formula (13) is a measurement in an orthogonal space estimated from the covariance of the whole set of observations, rather than from a single observation's covariance. Therefore, formula (13) is a statistical measurement.

2.3 Adaptive Distortion Constraint
We define the adaptive distortion constraint by setting D_w as

D_w = \alpha\,(x_T - \mu)^T C^{-1} (x_T - \mu), (14)
where \alpha is a positive coefficient scaling the upper limit of the distortion, and \mu is the mean of the maximum-likelihood cluster to which x_T belongs. Because x_T tends to be closer to \mu in smooth regions than in rough regions, D_w for a smooth region is generally smaller than for a rough region; D_w is therefore loosened for rough regions, which are usually full of intense texture. According to (4), optimizing the watermark hiding capacity amounts to maximizing the sum of \gamma under the quadratic inequality constraint given by (2) and (13). Using a Lagrange multiplier, we obtain the solution

\tilde{\gamma} = \frac{D_w\,\mathrm{diag}(p)\,C p}{p^T C p}. (15)
We also consider the luminance masking effect of JND theory, where the luminance masking term T_L conforms to the definition in [4]. The final solution \gamma for each pixel is then

\gamma = \tilde{\gamma} + T_L. (16)
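The per-block computation of Eqs. (12)-(16) is summarized in the Python/NumPy sketch below. It follows the fraction form of Eq. (15) as printed above; the function name and argument layout are ours, and mu, W and sigma2 are assumed to come from the maximum-likelihood MPPCA cluster of the block.

import numpy as np

def block_watermark_intensity(x_T, p, mu, W, sigma2, alpha, T_L):
    """Build C (Eq. 12), the adaptive bound D_w (Eq. 14), and gamma (Eqs. 15-16)."""
    C = sigma2 * np.eye(len(x_T)) + W @ W.T          # Eq. (12)
    C_inv = np.linalg.inv(C)
    D_w = alpha * (x_T - mu) @ C_inv @ (x_T - mu)    # Eq. (14)
    Cp = C @ p
    gamma_tilde = D_w * (p * Cp) / (p @ Cp)          # Eq. (15): diag(p) C p scaled
    return gamma_tilde + T_L                         # Eq. (16)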
3 Experimental Results

The ICA mixing matrix is estimated beforehand using samples from a training set of 13 natural images downloaded with the FastICA package [17]. Note that the ICA mixing matrix is independent of the host images and need not be held by the watermark detector. At the watermark encoder, after 8×8 averaging filtering and extraction by ICA coding, the nonedge texture is obtained and then partitioned into 8×8 blocks and vectorized into 64×1 vectors. The MPPCA model has eight clusters with 4 principal components in each cluster; the program code was developed based on the NETLAB Matlab package [13]. Then, the adaptive distortion constraint D_w and the maximal intensity \gamma for each pixel are calculated, taking into account the luminance masking of JND. Finally, the spread-spectrum watermark is embedded. At the watermark decoder, correlation detection alone is enough to extract the watermark message, without any image analysis or attack estimation. The proposed scheme is therefore a blind watermarking scheme that needs no side information.

3.1 Distortion

We compare our scheme with the spread-spectrum watermarking schemes based on the spatial JND model [4] and the wavelet JND model [14]. We measure the watermark intensity by the signal-to-noise ratio (SNR) of the watermarked image relative to the host image. SNR is also used as an indirect measure of the hiding capacity for a spread-spectrum watermark, because a scheme with a more intense watermark supports a higher bit ratio of watermark to host image under the same detection error rate at the watermark decoder. In our experiment, the watermarked images of the three schemes are given the same SNR (Baboon 20.1 dB, Bridge 21.7 dB, and Lena 27.3 dB) by adjusting their parameters, so that experimenters can make a subjective distortion evaluation and further compare the distortion levels allowed under the different constraints. The watermarked image patches are shown in Figure 3.
Fig. 3. Watermarked patches (rows: Baboon, Bridge, Lena). (a) Host image, (b) our scheme, (c) spatial JND, (d) wavelet JND.
Fig. 4. Watermark of our scheme. The top row contains host images, and the bottom contains watermarks, where dark pixels denote negative signals and light pixels denote positive signals.
In the experiment, the two JND-based schemes expose more distortion than our scheme when embedding a saturated watermark. The spatial JND scheme (Figure 3c) reveals noticeable noise both across sharp edges and on smooth surfaces. The wavelet JND scheme (Figure 3d) fails to keep image quality around fine edges, which correspond to high-frequency wavelet bands. Moreover, the wavelet JND scheme generates glaring dark or light speckles in texture regions where the inverse wavelet transform of the heavily watermarked coefficients exceeds the valid range. Our scheme has the following advantages. First, it does not raise the watermark intensity at sharp edges, owing to the three-factor image coding performed before watermarking. Second, it prefers to embed the watermark into rough regions rather than smooth regions, owing to the adaptive distortion constraint D_w. Last, it boosts the intensity of watermark components that conform to the principal components of the texture and suppresses nonconforming components. These characteristics are clearly exhibited in the watermark signal shown in Figure 4. Since our scheme exploits the texture region, its advantages are less obvious for images with sparse texture, e.g., the image Lena.

3.2 Robustness Against Estimation-Based Attack

Sophisticated attackers can mount an estimation-based attack if they can obtain some prior knowledge of the host image or the watermark's statistics [16]. Image denoising provides a natural way to develop estimation-based attacks, where the watermark is treated as noise. Given the power spectra of the host image and the watermark, one of the most malevolent attacks is denoising by an adaptive Wiener filter. Hence the most robust watermark should have a power spectrum directly proportional to the power spectrum of the host image [16]. We compare the power spectra of the watermarks produced by the three schemes in Figure 5.
Fig. 5. Power spectrum of image Baboon by Fourier analysis. (a) Host image. (b)–(d) Watermark signals: (b) our scheme, (c) spatial JND scheme, (d) wavelet JND scheme. Each signal is normalized to zero mean and unit variance before Fourier analysis. The luminance denotes the logarithmic amplitude of the Fourier components, and the zero-frequency component is shifted to the center.
It is clear that the spatial JND scheme generates a nearly white watermark, and the wavelet JND scheme embeds its watermark mainly in the middle-frequency components. Among the three schemes, the watermark power spectrum of our scheme most closely resembles the power spectrum of the host image and is nearly directly proportional to it. Our scheme therefore offers a suboptimal solution against the estimation-based attack of the Wiener filter.
4 Conclusion

The proposed watermarking scheme can hide more information than traditional schemes because it loosens the distortion constraint for nonedge texture, and it is also robust against the Wiener filtering attack because the watermark power spectrum resembles that of the host image. Meanwhile, the watermark detection can be blind and fast. A high-capacity watermark should balance the tradeoff between the bit ratio of watermark to host image and the detector's error rate. Although this paper presents a one-bit watermarking scheme with maximal watermark intensities, it also provides a potential design for high-capacity watermarking, since a multiple-bit watermark only requires choosing a suitable number of image blocks for embedding each bit. Our scheme only considers the attack of additive noise. Well-known strategies against geometric attacks and smarter detection algorithms may be merged into the proposed scheme to resist more sophisticated attacks.

Acknowledgement. This work was supported by NSFC (No. 60572063), the Doctoral Fund of the Ministry of Education of China (No. 20040487009), and the Cultivation Fund of the Key Scientific and Technical Innovation Project of the Ministry of Education of China (No. 705038). The authors would like to thank Nikhil Balakrishnan and Yizhi Wang for discussions.
References

1. Moulin, P., O'Sullivan, J.A.: Information-theoretic analysis of information hiding. IEEE Trans. Information Theory 49, 563–593 (2003)
2. Cox, I.J., Kilian, J., Leighton, F.T., Shamoon, T.: Secure spread spectrum watermarking for multimedia. IEEE Trans. Image Proc. 6, 1673–1687 (1997)
3. Jayant, N.: Signal compression: Technology targets and research directions. IEEE J. Select. Areas Commun. 10, 314–323 (1992)
4. Yang, X.K., Li, W.S., Lu, Z.K., et al.: Motion-compensated residue preprocessing in video coding based on just-noticeable-distortion profile. IEEE Trans. Circuits & System Video Tech. 15, 742–752 (2005)
5. Julesz, B.: Visual pattern discrimination. IRE Trans. Inf. Theory 8, 84–92 (1962)
6. Heeger, D., Bergen, J.: Pyramid-based texture analysis/synthesis. In: Proc. ACM SIGGRAPH, pp. 229–238 (1995)
7. Zhu, S.C., Wu, Y.N., Mumford, D.B.: Filters, Random Fields, and Maximum Entropy (FRAME) – Towards a Unified Theory for Texture Modeling. Int'l Journal of Computer Vision 27, 107–126 (1998)
8. Portilla, J., Simoncelli, E.P.: A parametric texture model based on joint statistics of complex wavelet coefficients. Int'l Journal of Computer Vision 40, 49–71 (2000)
9. Balakrishnan, N., Hariharakrishnan, K., Schonfeld, D.: A new image representation algorithm inspired by image submodality models, redundancy reduction, and learning in biological vision. IEEE Trans. Pattern Analysis & Machine Intelligence 27, 1367–1378 (2005)
10. Tipping, M., Bishop, C.: Mixtures of Probabilistic Principal Component Analyzers. Neural Computation 11(2), 443–482 (1999)
11. Lefebvre, S., Hoppe, H.: Parallel controllable texture synthesis. In: ACM SIGGRAPH, pp. 777–786 (2005)
12. De Bonet, J.S.: Multiresolution sampling procedure for analysis and synthesis of texture images. In: ACM SIGGRAPH, pp. 361–368 (1997)
13. Nabney, I., Bishop, C.: Netlab Neural Network Software (July 2003), http://www.ncrg.aston.ac.uk/netlab
14. Huang, X.L., Zhang, B.: Perceptual watermarking using a wavelet visible difference predictor. IEEE ICASSP 2, 817–820 (2005)
15. Liu, W.Y., Zhang, F., Liu, C.X.: Spread-Spectrum Watermark by Synthesizing Texture. In: Pacific-Rim Conf. on Multimedia (to be published, 2007)
16. Voloshynovskiy, S., Pereira, S., Pun, T., et al.: Attacks on Digital Watermarks: Classification, Estimation-Based Attacks, and Benchmarks. IEEE Communications Magazine 39, 118–126 (2001)
17. Hyvärinen, A.: FastICA Matlab package (April 2003), http://www.cis.hut.fi/projects/ica/fastlab
Attention Monitoring for Music Contents Based on Analysis of Signal-Behavior Structures
Masatoshi Ohara1,2, Akira Utsumi1, Hirotake Yamazoe1, Shinji Abe1, and Noriaki Katayama2
1 ATR Intelligent Robotics and Communication Laboratories, 2-2-2 Hikaridai, Seikacho, Sorakugun, Kyoto 619-0288, Japan
2 Osaka Prefectural College of Technology, 26-12 Saiwaicho Neyagawashi, Osaka 572-8572, Japan
Abstract. In this paper, we propose a method to estimate user attention to displayed content signals through a temporal analysis of the users' exhibited behavior. Detecting user attention and controlling contents are key issues in our "networked interaction therapy system" that effectively attracts the attention of memory-impaired people. In our proposed method, user behavior, including body motions (beat actions), is detected with auditory/vision-based methods. This design is based on our observations of the behavior of memory-impaired people under video watching conditions. User attention to the displayed content is then estimated based on body motions synchronized to audio signals. Estimated attention levels can be used for content control to attract deeper attention of viewers to the display system. Experimental results suggest that the proposed method effectively extracts user attention to musical signals.
1 Introduction
Human behavior is considered to have a close relation with mental and/or physical states, intentions, and individual interests. For instance, when a person is shopping, eye movement may provide significant information regarding interest in specific products in the shop. Since the issue of human state/intention estimation based on behavior is related to a wide range of application domains, estimation mechanisms have been widely investigated in the fields of computer vision and pattern recognition [1,2]. Generally speaking, however, achieving reliable behavior-based state/intention estimation is hard without domain-specific knowledge, because there is a wide variety of human behaviors and environment/context dependencies. On the other hand, estimating human attention and interest in a specific stimulus is relatively easy when the timing, position, and content of the stimulus are known. In this paper, we propose a system to estimate a viewer's interest in displayed content based on viewer behavior. We focus on a method to estimate viewer attention levels to music contents by observing the synchronous behavior of users to the music signals. In a human-computer interaction task, for instance, attracting and retaining the motivation of users becomes significant for extracting positive reactions. To
achieve this, the system has to estimate individual concentration levels and control the style and amount of displayed information. Since reactions vary from user to user, such control must occur dynamically. The same situation prevails in our "networked interaction therapy system," which effectively attracts the attention of memory-impaired people and lightens the burden on helpers or family members [3]. "Networked interaction therapy" requires that the system provide remote communication with helpers and family members as well as video contents and other services. To attract the attention of users for long periods of time, the system has to detect user behaviors and control the order and timing of the provided services based on estimates of their concentration levels. Prior to this research, we presented several video contents to memory-impaired people and analyzed their behaviors while they watched video and music contents. We observed the following positive reactions: head nodding, laughing, singing, and clapping. Based on these results, we implemented a content switching mechanism based on user head movements: we estimated user concentration levels on the displayed video contents from head directions and controlled the switching of contents to retain more user attention to the video. However, human behavior can change quickly depending on the content category, and this tendency is especially remarkable for musical contents, which do not require user attention to the video display. Therefore, in this paper, we examine the possibility of estimating viewer concentration levels on presented musical contents using user "beat" actions. "Beat" is a basic property by which humans recognize music, and there are close relations between a music's "beat" and human behavior. For example, Shiratori et al. analyzed the relation between the "beat" structure in music and human dance behavior. Since synchronized motion to music is a commonly observed human behavior when listening to music, it should serve as a useful observable feature for estimating user attention to and interest in music. As we mention below, synchronized motion to music contents is not always observed for all users. However, we consider that this cue usefully complements other cues, such as head movements, in concentration level estimation. In the next section, we briefly summarize our "networked interaction therapy" project. Section 3 describes our observations of the TV watching behavior of dementia patients. Section 4 explains the framework of our attention estimation, and Section 5 shows the experimental results. Section 6 concludes this paper.
2 Information Therapy Project
Memory is frequently impaired in people with such acquired brain damage problems as encephalitis, head trauma, subarachnoid haemorrhage, dementia, and cerebral vascular injury. Such people have difficulty leading normal lives due to memory impairment or higher brain dysfunction; consequently, constant care and attention impose a heavy burden on their families. Networked Interaction Therapy, the name of our method that relieves the stress suffered by memory-impaired
people and their family members, provides easy access to the services of networked support groups [3]. The main goals of Networked Interaction Therapy include supporting the daily activities of memory-impaired people and reducing the burden on their families. Its primary function is to automatically detect the intentions of a memory-impaired person. The therapy provides the necessary information to guide the individual to comfortable situations on behalf of family members before he or she begins to experience behavioral problems such as wandering at night, incontinence, temper tantrums, and so on. These anxieties are often caused by a lack of information. A system based on Networked Interaction Therapy needs to detect situations where communication can help the individual overcome the difficulties encountered in daily activities. In this paper, we describe a method to estimate user attention/interest levels in presented contents. For instance, the estimation results can be used to switch contents effectively so as to hold user interest/attention. In the next section, we describe typical behavior of memory-impaired people observed in experiments with video content display.
3 Video Watching Behavior of Dementia People
Before system implementation, we examined the behavior of actual dementia patients watching TV through observation experiments [4]. Here we briefly summarize the results. In the experiments, we prepared both interesting and uninteresting video programs for the subjects based on their pre-examined preferences, as well as personalized reminiscence videos produced from photographs selected from their photo albums [5], and sequentially displayed these video contents to them. We performed experiments with eight subjects, including three (A, B, C) living at home with their families and five in nursing facilities (D-H). Table 1 shows brief profiles of the subjects. Though the observed behavior depends on the individuals and their symptom levels, behavior in which subjects look away from the TV during "uninteresting" contents is very common. We observed significant differences in watching time for particular subjects. Based on this observation, we consider face orientation a crucial behavior related to attention/interest in video contents. The following are additional typical behaviors related to the level of attention/interest.

Examples of utterances:
– presenting positive or negative utterances about the displayed video contents.
– explaining the contents of the reminiscence video to family members.
– singing songs from the video.

Both positive and negative words are observed in the subjects' utterances. Thus, it is difficult to estimate subject interest in the displayed contents only from the existence of utterances.
Table 1. Brief subject profiles

Subject  Age  Case History          Preference
A        62   cerebral contusion    Japanese chess, music
B        81   Alzheimer's disease   train travel, music
C        69   cerebral infarction   baseball games, music
D        83   cerebral dementia     children's songs
E        90   senile dementia       N/A
F        89   Alzheimer's disease   movies
G        89   Alzheimer's disease   music
H        92   senile dementia       N/A
Examples of hand motions:
– beating time with the hands to music contents.
– pointing at the TV while viewing reminiscence videos and news programs.

The "beating time with hands to music" action is considered a typical positive reaction to the displayed contents. However, subject A, for instance, presented the "beat" action without gazing at the TV and left the room just after making this reaction. In addition, in one case a "pointing" action was observed together with a negative utterance about the contents. So hand motions can be either "positive" or "negative" reactions, and estimating user concentration levels is difficult from the existence of hand motions alone. The overall summary of the observations is as follows.
– Effective content switching attracts user attention to the displayed contents longer and can be realized by measuring user attention levels with face directions.
– Although user utterances contain information regarding user interest in the displayed contents, estimating the user interest level is hard from only the presence of utterances.
– Although the frequency depends on individuals, reactions such as hand beckoning, clapping, and singing were observed while subjects watched the video contents.

The above results suggest that user interest in contents can be estimated from the direction of user faces, the content of utterances, and hand motions. We previously developed a content switching system based on user face orientations and confirmed its effectiveness through experiments. However, in some cases, such as music contents that can be enjoyed without being constantly watched, estimating user interest levels from face orientations is difficult. Therefore, the following sections examine a method to estimate user interest levels from another cue, i.e., body motions synchronous to the displayed music signals.
4 Attention Monitoring Using Image and Audio Analysis
From the results of the previous section, we found that laughing, singing, and clapping can be observed as "positive" reactions. Therefore, we developed a system to estimate user attention to displayed music contents based on the synchronization between user behavior and the displayed music. Here, the "tempo" of user behavior is extracted with image and audio analysis. Assuming a known temporal structure for music signals, such as MIDI signals, we could directly compare user behavior with such structures. However, since temporal structures are generally unknown for normal TV/radio programs, CDs, DVDs, etc., we first extract the "beat" structure from the music signal through frequency analysis and then compare it with the viewer's body motions.

4.1 System Configurations
Figure 1 shows a diagram of the proposed system. The music signal output from the speaker system is simultaneously fed into the system through an A/D converter, processed with an FFT to extract the frequency elements, and sent to the beat extraction process. During beat extraction, a voting process determines the dominant (fundamental) tempo value of the input music signal. Cyclic user behavior is detected as follows. User motions are observed through both image and audio analysis. Image analysis detects user movements as the number of changed pixels in inter-frame subtraction, and user "beat" actions can be extracted as cyclic changes. Audio analysis extracts the sound related to user motions (e.g., clapping) from the input audio signal by subtracting the original music signal with an LMS algorithm. The "beat" of the user motion is then determined from the extracted sound signal. Finally, we compare the tempos of the user behavior and the music signal and estimate the user's attention (concentration) level to the music contents from the degree of synchronicity. The sections below explain the details of the above processes.
Fig. 1. System diagram (speaker, microphone, and camera inputs feed FFT, beat extraction, clapping-noise extraction, frame subtraction, and rhythm estimation modules)
Fig. 2. Input music signal and results of frequency analysis (bands: 0–250 Hz, 250–500 Hz, 500 Hz–1 kHz, 1–2 kHz, 2–4 kHz)
Fig. 3. Voting results for tempo estimation (Music No. 3)

4.2 Beat Detection
First, we describe how to detect the beat information in a music signal. The proposed method employs a frequency-based beat extraction algorithm [6,7]. The input signal is converted into the frequency domain with an FFT analysis every 1/32 s and separated into the following five frequency bands: 0–250 Hz, 250–500 Hz, 500 Hz–1 kHz, 1–2 kHz, and 2–4 kHz. We detect the envelope of each frequency band and extract the "beat" as common rising points across multiple frequency bands. Figure 2 shows example results for a pop song.
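A minimal sketch of this band-wise beat extraction is given below, assuming a mono audio array and its sample rate; the window length (equal to the hop), the envelope definition (per-band energy), and the requirement that at least three bands rise simultaneously are illustrative assumptions not specified in the text.

```python
import numpy as np

BANDS = [(0, 250), (250, 500), (500, 1000), (1000, 2000), (2000, 4000)]

def band_envelopes(audio, sr, hop=1 / 32):
    """Short-time band energies computed every 1/32 s."""
    n = int(sr * hop)
    frames = audio[: len(audio) // n * n].reshape(-1, n)
    spec = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(n, 1 / sr)
    return np.stack([spec[:, (freqs >= lo) & (freqs < hi)].sum(axis=1)
                     for lo, hi in BANDS])          # shape (5, n_frames)

def detect_beats(env, min_bands=3):
    """Mark frames where the envelope rises simultaneously in several bands."""
    rising = np.diff(env, axis=1) > 0
    return np.flatnonzero(rising.sum(axis=0) >= min_bands) + 1
```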
4.3 Tempo Estimation
In this section, we explain our tempo estimation method. The current implementation assumes that the music tempo stays in a range of 60 to 145 bpm (the range into which most pop music falls), and the tempo and the phase of the "beat" are determined using a voting algorithm. Figure 3 shows the result for a country song of 81 bpm.
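A possible realization of the voting step is sketched below: for every candidate tempo in 60–145 bpm and every phase offset, it counts how many detected beat frames fall near the predicted beat grid and keeps the best pair. The scoring rule and tolerance are assumptions; the paper does not detail its voting function.

```python
import numpy as np

def vote_tempo(beat_frames, frame_rate=32.0, bpm_range=(60, 145), tol=1):
    """Vote for the (tempo, phase) pair that best explains the detected beat frames."""
    beat_frames = np.asarray(beat_frames)
    best, best_score = None, -1
    for bpm in range(bpm_range[0], bpm_range[1] + 1):
        period = 60.0 * frame_rate / bpm                # beat period in frames
        for phase in range(int(period)):
            # distance of each detected beat to the nearest predicted beat position
            d = np.abs(((beat_frames - phase) + period / 2) % period - period / 2)
            score = int(np.sum(d <= tol))
            if score > best_score:
                best, best_score = (bpm, phase), score
    return best, best_score
```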
4.4 Detection of Body Behavior by Image
For body motion detection, silhouette-based methods [8,9], model-based methods [10], etc. have previously been proposed. However, since it is difficult to target a specific body part for motion extraction in advance, and only a relatively low-frequency one-dimensional signal must be extracted for "beat" extraction, we employ simple motion analysis based on inter-frame subtraction. Here, the magnitude of motion is measured as the number of moving pixels N_t. A "beat action," in which the user moves hands or feet, can be detected at the times when the number of moving pixels N_t reaches a minimum:

beat = 1 if dN_{t-1}/dt < 0, dN_t/dt ≥ 0, and N_t ≈ 0; 0 otherwise.    (1)
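The detection of Eq. (1) can be sketched as follows on a stack of grayscale frames; the motion threshold and the "close to zero" limit are illustrative assumptions.

```python
import numpy as np

def beat_actions(frames, motion_thresh=20, near_zero=0.05):
    """Inter-frame subtraction gives the moving-pixel count N_t; a beat is
    flagged where N_t stops decreasing and is close to zero (Eq. (1))."""
    diffs = np.abs(frames[1:].astype(int) - frames[:-1].astype(int))
    n_t = (diffs > motion_thresh).reshape(len(diffs), -1).sum(axis=1)
    limit = near_zero * n_t.max() if n_t.max() > 0 else 0
    d = np.diff(n_t)
    beats = [t for t in range(1, len(n_t) - 1)
             if d[t - 1] < 0 and d[t] >= 0 and n_t[t] <= limit]
    return n_t, beats
```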
Fig. 4. Examples of "beat" action extractions (top: hand, middle: foot, bottom: head)

Fig. 5. Separation of user origin sound through the LMS algorithm ((1) original music signal, (2) sound data captured with microphone, (3) extracted user-origin sound)
Figure 4 shows examples of extracted beat actions. Here, the person expressed the beat by using a hand (left), a foot (middle), and nodding (right).
4.5 Extraction of User Origin Sound
The observed audio signal contains the displayed music signal output from the speaker in addition to sound related to user actions such as clapping. Here, we separate these signals with an LMS algorithm [11], which is commonly used for echo canceling in TV conferencing systems. Generally, sound output from a speaker travels along multiple reflection paths before it reaches a microphone. The LMS algorithm models this echo path as a linear system and determines its impulse response with a steepest-descent method as follows:

h_N(k + 1) = h_N(k) + α e(k) x(k − N).    (2)
Here, x(k) is the observation at time k, h_N(k) is the N-th filter coefficient at time k, e(k) is the estimation error for the observation at time k, and α is a step gain. We estimate the music signal level in the input audio signal through the estimated model and extract the sound related to user motions by a subtraction process. Figure 5 shows the results of the extraction process: Figure 5 (top) shows the original music signal, Figure 5 (middle) shows the signal observed with the microphone, and Figure 5 (bottom) shows the subtraction result based on the LMS algorithm. As can be seen, the sound signal related to user motion is properly extracted by the above process. We then estimate the "tempo" of user motions with a process similar to that in Section 4.2.
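A minimal sketch of this subtraction process is given below, assuming the reference (loudspeaker) signal and the microphone signal are time-aligned sample arrays of the same length; the filter length and step gain are illustrative choices, and the loop implements the plain update of Eq. (2) rather than a tuned echo canceller.

```python
import numpy as np

def lms_cancel(reference, observed, taps=256, alpha=1e-4):
    """Subtract the loudspeaker music from the microphone signal with the LMS
    update of Eq. (2), leaving the user-origin sound (clapping, etc.)."""
    h = np.zeros(taps)
    residual = np.zeros(len(observed))
    for k in range(taps, len(observed)):
        x = reference[k - taps:k][::-1]      # most recent reference samples
        e = observed[k] - h @ x              # estimation error = user-origin sound
        h += alpha * e * x                   # h_N(k+1) = h_N(k) + alpha * e(k) * x(k - N)
        residual[k] = e
    return residual
```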
4.6 Synchronous Judgment
Finally, we judge the synchronicity between the estimated tempos of the music signal and the user's "beats." First, we estimate the "tempo" of user motions by applying the voting algorithm described in Section 4.3 to the "beat" information (image and sound). Then we compare the estimation results with the music "tempo," and if the difference is less than 10 bpm, we determine that both signals are synchronized. As mentioned in the next section, we estimate the song to which a user is listening as the one whose tempo differs from the user's motion tempo by less than 10 bpm, with the difference being minimal among multiple candidates (if any):

result = i, if |b_music,i − b| < threshold and |b_music,i − b| < |b_music,j − b| for all j ≠ i; −1 otherwise.    (3)

In addition, a user's "beat" action sometimes occurs once every two beats of the music signal. Therefore, we may have to consider two signals synchronized if the tempo of one signal is an integral multiple of the other.
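Eq. (3) amounts to a nearest-tempo test with a rejection threshold, as in the following sketch (the 10 bpm threshold comes from the text; handling of integral-multiple tempos is left out):

```python
def judge_sync(user_bpm, music_bpms, threshold=10):
    """Return the index of the music whose tempo is both within the threshold
    of the user's tempo and closest among all candidates; -1 otherwise."""
    diffs = [abs(b - user_bpm) for b in music_bpms]
    best = min(range(len(diffs)), key=diffs.__getitem__)
    return best if diffs[best] < threshold else -1
```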
5 Experimental Results
First, we examined whether user "beat" motions synchronized to the displayed music signals could be extracted. Figure 6 shows the voting results of tempo extraction for clapping while listening to the music of Figure 3 (Figure 6, left) and for random clapping (Figure 6, right). As can be seen, the "beat" action case has a clear peak, while the non-"beat" action case does not. This suggests that the presence of synchronized behavior can be distinguished from other behavior. Figure 7 shows the difference in the number of detected synchronous "beats" between "beat" and non-"beat" actions over a longer sequence. Since a non-"beat" action has no cyclic property and is not synchronized to the music signal, "beat" and non-"beat" motions can be distinguished with the proposed method. Next, we performed the following experiments to confirm the possibility of behavior-based estimation of user attention to music signals. We prepared five
Fig. 6. Results of beat and non-beat actions (body motion)

Fig. 7. Discrimination of beat/non-beat actions

Table 2. Genre and extracted tempo for sample music

Music No.  Genre         Tempo extraction results [bpm]
1          country       113
2          fusion        74
3          country       81
4          Japanese pop  130
5          popular       113
music signals. Table 2 shows their genres and estimated tempos, which range from 74 to 130 bpm; music 1 and 5 have similar tempos. We employed five subjects (healthy adults) who listened and clapped to the displayed music, and the tempos of their body motions were estimated by the proposed method. Table 3 shows the tempo of subject 1's motion estimated with both image and sound cues; we obtained similar estimation results for both cues. Table 4 shows the estimation results of all subjects (sound cues). Here, ○ denotes cases where the system correctly estimated the music from user behavior, × denotes cases where estimation failed, and △ denotes other cases. In the following, we discuss the "△" cases. Music 1 has two instruments related to the "beat," and the extracted tempo changes along with the instrument part to which the user clapped. As in this case, the extracted tempo may change depending on the part (instrument) to which the user is attracted. The result for music 2 with subject 3 is similar; therefore, we may have to consider two signals synchronized if the tempo of one signal is an integral multiple of the other. On the other hand, since music 1 and 5 have similar tempos, deciding which song the user is listening to is difficult. However, since the main purpose of the proposed method is to estimate attention levels to the currently displayed music contents, synchronization with other music does not matter. Consequently, we accurately estimated the synchronous properties between user "beat" actions and the displayed music in more than 90% of all cases. This suggests the effectiveness of the proposed method for estimating user attention levels to displayed music contents.
Table 3. Tempo extraction results (subject 1, image and sound cues)

Music No.                        1    2    3    4    5
Tempo extracted with image cues  61   77   88   130  130
Tempo extracted with sound cues  61   77   81   130  113

Table 4. Tempo extraction for all subjects (sound cues)

Music No.  Subject 1  Subject 2  Subject 3  Subject 4  Subject 5
1          61         67         60         61         61
2          77         74         145        77         74
3          81         130 ×      81         81         81
4          130        64 ×       135        135        135
5          113        61         61         113        113
Furthermore, since the two false cases belong to the same subject, the temporal accuracy of this subject's clapping was perhaps lower than that of the others. We will investigate this point in the future.
6 Summary
In this paper, we examined a method to estimate user interest in music contents from the synchronicity between the presented music and human behavior. This builds on our earlier observations, in which two or more video contents were switched for memory-impaired people while their attention behavior was observed; user reactions differ depending on the presented contents and include clapping, speaking, and turning the face sideways. We also examined an effective content-switching technique that reflects user intention. Future work includes improving the estimation accuracy of user interest by integrating other user behaviors with beat actions, and developing a content control system that switches displayed contents based on user attention levels to entertain users longer. This research was supported in part by the National Institute of Information and Communications Technology.
References

1. Boiman, O., Irani, M.: Detecting irregularities in images and in video. In: Proceedings of International Conference on Computer Vision, pp. 462–469 (2005)
2. Osawa, T., Wu, X., Wakabayashi, K., Yasuno, T.: Human tracking by particle filtering using full 3D model of both target and environment. In: Proceedings of International Conference on Pattern Recognition, pp. 25–28 (2006)
3. Kuwahara, N., Kuwabara, K., Utsumi, A., Yasuda, K., Tetsutani, N.: Networked interaction therapy: Relieving stress in memory-impaired people and their family members. In: Proc. of IEEE Engineering in Medicine and Biology Society, IEEE Computer Society Press, Los Alamitos (2004)
4. Utsumi, A., Kawato, S., Abe, S.: Attention monitoring based on temporal signal-behavior structures. In: Sebe, N., Lew, M.S., Huang, T.S. (eds.) Computer Vision in Human-Computer Interaction. LNCS, vol. 3766, pp. 100–109. Springer, Heidelberg (2005)
5. Kuwahara, N., Kuwabara, K., Tetsutani, N., Yasuda, K.: Reminiscence video helping at-home caregivers of people with dementia. In: HOIT 2005, pp. 145–154 (2005)
6. Scheirer, E.D.: Tempo and beat analysis of acoustic musical signals. Journal of the Acoustical Society of America 103(1), 588–601 (1998)
7. Goto, M.: An audio-based real-time beat tracking system for music with or without drum-sounds. Journal of New Music Research 30(2), 159–171 (2001)
8. Kanade, T., Rander, P., Narayanan, P.J.: Virtualized reality: Constructing virtual worlds from real scenes. IEEE MultiMedia 4(1), 34–47 (1997)
9. Wagg, D.K., Nixon, M.S.: On automated model-based extraction and analysis of gait. In: Proc. of the 6th IEEE International Conference on Automatic Face and Gesture Recognition, pp. 11–16. IEEE Computer Society Press, Los Alamitos (2004)
10. Lim, J., Kriegman, D.: Tracking humans using prior and learned representations of shape and appearance. In: Proc. of the 6th IEEE International Conference on Automatic Face and Gesture Recognition, pp. 869–874. IEEE Computer Society Press, Los Alamitos (2004)
11. Widrow, B., Hoff, M.E.: Adaptive switching circuits, 96–104 (1960)
View Planning for Cityscape Archiving and Visualization
Jiang Yu Zheng and Xiaolong Wang
Department of Computer Science, Indiana University Purdue University Indianapolis (IUPUI), USA
Abstract. This work explores the full registration of scenes in a large area, based purely on images, for city indexing and visualization. Ground-based images including route panoramas, scene tunnels, panoramic views, and spherical views are acquired in the area and are associated with geospatial information. In this paper, we plan distributed locations and paths in the urban area for image acquisition based on visibility, image properties, image coverage, and scene importance. The criterion is to use a small number of images to cover as many scenes as possible. LIDAR data are used in this view evaluation, and real data are acquired accordingly. The extended images realize a compact and complete visual data archive, which will enhance the perception of the spatial relations of scenes. Keywords: Multimedia, panoramas, visibility, urban space, LIDAR data, model based vision, heritage.
1 Introduction

This work explores a framework to represent spaces with images. With densely taken images in a space, one can establish a complete database for scene access. This image registration task is required at excavation sites, museums, markets, real estate properties, heritage districts, and large urban areas. Moreover, a full archive of scenes will allow navigation, location finding, guidance, crisis preparation, etc. Employing 3D models, either manually constructed or automatically extracted from LIDAR (Light Detection and Ranging) data, has been a long-standing trend in showcasing an urban area [6]. The process is usually laborious, particularly in the model construction and texture mapping. To obtain higher resolutions and more suitable viewing angles than overhead images (satellite or aerial) and maps, we emphasize ground-based views in this work. Although video clips can capture walkthrough views continuously [12], they have not been extended to large areas because of the vast data size. Mosaicing translating images is also difficult because of the disparity inconsistencies caused by depth variations [8]. A multi-perspective image has been composed only for scenes close to a planar surface [1]. On the other hand, slit-scanning approaches including push-broom [14] and X-slit [9][16] have been implemented to create route panoramas in order to avoid inter-frame matching, depth estimation, morphing, and interpolation.
Our work in this paper aims at a ground-based image database representing large-scale spaces. The goals are: (1) fewer images for rich spatial information, and (2) a pervasive view archive of an environment. Taking images at every location in a large area can be very inefficient due to data redundancy, and is sometimes impossible. We plan viewpoints that cover as large a space as possible to achieve data reduction. The view significance is evaluated according to the scene visibility and distribution. The sparse viewpoints and paths are then planned for acquiring cityscapes using route panoramas [14], scene tunnels [15][20], panoramic views, spherical views [4][5][11], and traditional digital images. As these images are compact in data size, as well as complete and continuous in scene coverage, we can significantly enlarge the area to be modeled. In the following sections, we propose a view significance measure using LIDAR data in Section 2 for various types of spaces. Section 3 describes view selection approaches based on the viewpoint evaluations, followed by a real process of view acquisition to form an image-based virtual environment.
2 Ground-Based Views and View Significance

2.1 View Categorization According to Virtual Actions

In archiving a space at the city level, image storage and transmission demand a scene selection scheme that covers more scenes with less data; taking images at every viewpoint in a space is not necessary. Extending normal digital images, we use local panoramas, route panoramas, and object panoramas to cover sites, paths, and architectures, respectively. In a real environment, a visitor takes different actions in spaces, such as looking around, walking, and examining things from various directions, and it is preferable that our images be categorized according to these actions. In addition, a map view (or satellite image) can provide global locations. Figure 1 shows a diagram for registering an area with these images. All these views can be classified more formally as follows.

A local panorama V(0) is taken from a single viewpoint outward with an FOV possibly up to 360 degrees in orientation. Examples include wide/fish-eye images and panoramic/spherical images, which employ perspective, central, cylindrical, and spherical projections, respectively.

A route panorama V(1) is a long image along a straight or mildly curved path for representing positions. Examples include slit-scanning route panoramas and scene tunnels, which employ parallel-perspective and parallel-central projections. Multi-perspective projection (mosaicing multiple images along a path) can also be used if scenes are close to an equal depth [1].

An object panorama V(2) is a set of inward images taken around an object or building (also called an object movie, similar to aspect views). They are useful in showing all building facades or the full appearance of an object.

Now, it is necessary to investigate where these views should be located in a large urban area if we cannot take images everywhere, i.e., how to cover large areas pervasively with a limited number of ground-based images. We use airborne LIDAR data as an elevation map for view prediction. LIDAR is an optical remote sensing
Fig. 1. A visualization model of an urban area. Real scenes are projected towards a grid of paths and positions. (a) Image framework containing V(0), V(1), and V(2). (b) An object panorama V(2). (c) A section of route panorama V(1).
technology that measures properties of scattered light to find the range and intensity of a remote target. The resolution of airborne LIDAR data, with height and texture measured by laser scanning, can be as fine as half a meter per dot.

2.2 Viewpoint Evaluation Based on Visibility

To take images in a large space, we have to plan the camera positions. We propose a view significance estimation based on how much of the 3D surfaces an image can cover. This view significance can be measured in an urban area using LIDAR data or examined on typical structure layouts. Our main interest is in walkthrough views taken roughly at eye level; compared to an overhead image, ground-based images capture more vertical surfaces and details. Intuitively, a view with a large portion of horizon is not as visually significant as a view full of objects in conveying spatial information. Similarly, a view with a large sight from an overlook is more significant than a view from a narrow valley of buildings in telling global locations. Figure 2 illustrates the idea of computing the view significance from a view shed, i.e., the view-covered region on 3D surfaces. Let P(X,Y,Z) denote a position in the space, and let a ray from P be defined by n(φ,ϕ) with orientation φ∈[0, 2π] and azimuth angle ϕ∈[-π/2, π/2]. If the ray hits a surface point at distance D(φ, ϕ), an indicator function λ(φ, ϕ) takes the value 1, and otherwise 0. The visible
surfaces from P thus form a view shed. We define the view significance σ(P) to be the area of the view shed (∑λ), which is calculated by

σ(P) = ∫∫_{φ,ϕ} w λ(φ,ϕ) · D(φ,ϕ) / (D(φ,ϕ) + D_0) dφ dϕ    (1)
where w is a weight of importance assigned to each surface, and D_0 is a large constant (e.g., 100 m). The denominator of Eq. 1 accounts for the image quality degradation of distant scenes due to atmospheric haze; it prevents a close-to-infinity scene from contributing largely to σ(P). The weight w, assigned building-wise in a space, takes a uniform value in this paper to show the influence of the geometric layout only, unless some important façades or landmarks need particular emphasis.
Fig. 2. Computing view significances from the view shed at a viewpoint for V(0)
We calculate a continuous distribution of σ(P) using an elevation map H(X,Z). For an urban area, the LIDAR data are first reduced in resolution to a hole-less map. At each small grid region (e.g., 1–5 m²) in the LIDAR data, non-zero elevation points are median-filtered to yield an integer value in the discrete map H(X,Z) (Fig. 4a). Second, all reachable points P(X,Y,Z) at eye level are marked (if Y>0). Third, we compute lines of sight in all discrete orientations originating from every viewpoint P(X,Y,Z). Each line is stretched out until it hits an obstacle, where the front tip of the line, Pl(Xl,Yl,Zl), satisfies Yl ≤ H(Xl,Zl).
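A discrete version of this evaluation could look like the following ray-marching sketch over the elevation map H(X,Z); the angular resolution, the marching step, the eye height, and the hit test are assumptions made for illustration, not the authors' exact procedure.

```python
import numpy as np

def view_significance(H, x, z, eye=1.6, d0=100.0, w=1.0,
                      n_phi=72, n_theta=18, max_range=300.0, step=1.0):
    """Discrete approximation of Eq. (1) at viewpoint (x, eye, z) over the
    elevation map H[X, Z] (one cell per metre assumed)."""
    sig = 0.0
    d_phi = 2 * np.pi / n_phi
    d_theta = np.pi / n_theta
    for phi in np.arange(0, 2 * np.pi, d_phi):
        for theta in np.arange(-np.pi / 2 + d_theta / 2, np.pi / 2, d_theta):
            for r in np.arange(step, max_range, step):
                px = x + r * np.cos(theta) * np.cos(phi)
                pz = z + r * np.cos(theta) * np.sin(phi)
                py = eye + r * np.sin(theta)
                ix, iz = int(round(px)), int(round(pz))
                if not (0 <= ix < H.shape[0] and 0 <= iz < H.shape[1]):
                    break                      # ray leaves the map: no surface hit
                if py <= H[ix, iz]:            # ray hits a building or ground surface
                    sig += w * (r / (r + d0)) * d_phi * d_theta
                    break
    return sig
```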
Fig. 3. Typical spatial structures and distributions of the view significances with spherical views. A crossing, square, square with roads, and two connected yards are enclosed with walls.
Figure 4 shows a city area with buildings of different heights. Three spherical views in Fig. 4a are evaluated in depth (Fig. 4b). The σ(P) of the entire reachable space is calculated with D_0 = 100 m in Fig. 4c,d. According to σ(P) evaluated for spherical views [11], the positions marked in red and yellow at a round square and an open crossing in Fig. 4a are more significant than the third position, marked in green in a parking lot. This can be confirmed in Fig. 4b, where a large sky area appears in Fig. 4(b.3) and the high buildings at a distance degrade in the significance estimation. For cylindrical panoramas [4] (FOV limited to ϕ∈[-45°,45°]), the distribution of σ(P) is shown in Fig. 4d. The significance value is now lower at positions close to high rises than at positions farther away, because the close positions miss the high surfaces of the buildings.
Fig. 4. View significance evaluations at all positions in an area. (a) LIDAR elevation map with intensity representing height. (b.1)(b.2)(b.3) 360°×180° spherical depths at the three positions marked in (a); FOVs of the cylindrical projection are in dashed frames. (c) Reachable positions and their view significances in gray levels. (d) View significance evaluated with cylindrical panoramas.
Fig. 5. Depth map from LIDAR data and corresponding views. (a) The mechanism to scan a scene tunnel. (b) Depths along a camera path calculated from LIDAR data (intensity inversely proportional to depth); the vertical axis indicates the azimuth angle of rays and the horizontal axis the route distance. (c) Blocks of the scene tunnel taken along the real street.
2.3 View Significance Evaluation for Paths and Buildings

The view significance can be defined similarly for a route panorama and an object panorama. Figure 5a depicts the geometry of acquiring a scene tunnel when a camera facing sideways on a vehicle moves along a street [15]. Denote a ray by n(ϕ) in the Plane of Scanning (PoS) at position l on the path (Fig. 5a). A surface point at distance D(l,ϕ) is projected to the scene tunnel. The view significance at position l is defined as

σ(P(l)) = ∫_ϕ w λ(l,ϕ) · D(l,ϕ) / (D(l,ϕ) + D_0) dϕ    (2)
where ϕ∈[-π/2, π/2] is the azimuth angle of the ray. The significance of an entire street is

σ(L) = (1/L) ∫_L σ(P(l)) dl    (3)
where L is the street length. Figure 5 shows the depth map D(l,ϕ) visible from a street, predicted using LIDAR data; the half-side scene tunnel along the real street is displayed for comparison as well. The σ(L) of this section is high. In contrast, we can predict that very open streets, such as suburban highways, may be insignificant because of their monotonic scenes and large sky area. An architecture can be displayed with its object panoramas V(2), i.e., side views adjacent to each other. Here, the object panorama does not strictly fall into the definition of aspect views, which are divided according to object surfaces. In evaluating significant viewpoints for buildings or landmarks, visibility is more meaningful than the shooting distance, because distance can be compensated for by using different lenses. We define the view significance σ(P,B) at a surrounding point P for watching B by
σ(P,B) = ∫∫_{φ,ϕ} w λ_2(φ,ϕ) · D(φ,ϕ) / (D(φ,ϕ) + D_0) dφ dϕ    (4)
where λ_2(φ,ϕ) is 1 if a ray from P reaches building B and 0 otherwise. As an example, we use a convex block, a building complex, and scattered towers in Fig. 6a to calculate their view significance distributions in Fig. 6b,c. The block in Fig. 6a can be captured with a normal lens (Fig. 6b.1), while the scattered architecture group is better captured in a wide image from a point in the central area (Fig. 6b.3). In general, the view significance is continuous in the reachable area of H(X,Z) and decreases as the viewing distance gets farther. The significance value also tends to be higher in an orientation from which multiple facades become visible, and it drops to zero when the target is completely occluded. Moreover, the smaller the vertical field of view of a camera, the farther the significant viewpoints tend to be, as can be seen by comparing individual pairs in Fig. 6b and 6c. In real situations, many buildings are aligned in street blocks, which are simpler to capture in V(1) images.
Fig. 6. View significances of buildings displayed in levels. (a) Elevation map of three groups of architecture. (b.1)(b.2)(b.3) Distributions of view significances in (a) when a cylindrical panorama (vertical FOV 90°) is used. (c.1)(c.2)(c.3) As references, the view significances counted with spherical views for each building group.
3 View Selections and Pervasive Image Acquisition

3.1 Street View Acquisition from a Moving Vehicle

We start with the scanning of route panoramas V(1), because imaging from a moving vehicle is the most efficient way to obtain cityscapes. Route panoramas and scene tunnels are typically planned along streets with rich scenes (where σ(L) is high). The route panoramas are scanned continuously with a pixel line in the video frame. This slit-scanning method is much more efficient for capturing translating scenes with depth changes than most mosaicing methods, which need to merge discrete images at consecutive positions, because it avoids image correspondence, depth estimation, and image integration.
According to the simulated depth map along a street, we select a lens with up to a 180° FOV for the street. The lens scope may exclude the tops of some high rises if the rest of the street consists of low architecture on average. The route panoramas also have some deformations: because of the parallel-perspective projection employed in composing a long image, the aspect ratios of objects at different depths from the path are not constant in the route panorama. The aspect ratio can be adjusted well at one depth but is hard to satisfy at other depths. At the adjusted depth, scenes are exposed sharply in the route panorama, while beyond that depth the stationary blur [10] appears along the horizontal direction. Therefore, we select a sampling rate and vehicle speed after the camera lens is determined from the height coverage of the side scenes. If we denote the curvature of the path by κ, the depth of the scene by z, the vehicle speed by v, the angle of the plane of sight with respect to the motion vector by α, and the focal length by f, the blurring rate relative to the original image (I_t/I_x) can be calculated as

I_t / I_x = f v (1/z − κ / sin²α)    (5)
where I_t and I_x are the contrasts in the route panorama and the video image, respectively. The detailed proof is in [21]. Setting the ratio to 1, we can inversely obtain v from the known κ, α, f and the average z. This reduces the distortion of the aspect ratio as well as the stationary blur on major scenes. Alternatively, we can keep the vehicle velocity as constant as possible and normalize the length of the route panorama/scene tunnel according to GPS output or satellite images.

3.2 Placing Local Panoramas in the Urban Area

After route panoramas V(1) cover the major streets, we place a number of local panoramas V(0) in the urban area. There are various strategies for viewpoint selection. To represent larger 3D surfaces, we plan V(0) at peaks of the σ(P) distribution. A viewpoint close to a selected peak may share a large portion of scenes with that peak; therefore, we avoid local maxima of σ(P) on the same hill as a selected peak. As an automatic procedure, we gradually decrease a threshold level over the σ(P) distribution in Fig. 4d in order to locate peaks and island regions, E(level)={Pe | σ(Pe)≥level}. Assuming two viewpoints should not be closer than a predefined distance r, we select an emerging local maximum of σ(P) as a viewpoint if it is not closer to any island region than r. We locate the peaks in σ(P) for panoramic views according to the following algorithm:

Set E to be an empty set
For level decreasing from max(σ) to min(σ)
    For every point P ∈ H(X,Z) satisfying σ(P) ≥ level and P ∉ E
        If P is a peak and is farther away from E than r
            then select P as a viewpoint
        Add point P to E   // mark P as a point in E
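A runnable version of the pseudocode above could look like the following sketch, where sigma is the σ(P) map, reachable is a boolean mask of reachable eye-level positions, and the 8-neighbourhood peak test, the number of threshold levels, and the spacing r in grid cells are illustrative assumptions.

```python
import numpy as np

def is_peak(sigma, i, j):
    """Local-maximum test on the 8-neighbourhood (an assumed definition of 'peak')."""
    patch = sigma[max(i - 1, 0):i + 2, max(j - 1, 0):j + 2]
    return sigma[i, j] >= patch.max()

def select_viewpoints(sigma, reachable, r=50.0, n_levels=100):
    """Sweep a threshold from the highest to the lowest significance; accept a
    peak as a V(0) viewpoint only if it is farther than r from every examined
    (island) point E, then add the point to E."""
    E = np.zeros_like(reachable, dtype=bool)        # examined "island" points
    selected = []
    vals = sigma[reachable]
    for level in np.linspace(vals.max(), vals.min(), n_levels):
        ii, jj = np.nonzero(reachable & (sigma >= level) & ~E)
        for i, j in zip(ii, jj):
            ei, ej = np.nonzero(E)
            far = ei.size == 0 or np.hypot(ei - i, ej - j).min() > r
            if far and is_peak(sigma, i, j):
                selected.append((i, j))
            E[i, j] = True
    return selected
```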
This algorithm selects peaks of σ(P) on major hills and ignores local maxima on the examined hills of the view significance distribution. Figure 7 gives an output of the algorithm for placing local panoramas. After selecting viewpoints, images are
Fig. 7. Planned positions for local panoramas, plotted as white spots in the elevation map of the area
taken with digital cameras [17], and spherical views are taken with a fish-eye lens if necessary. In real cases, the camera positions have to deviate from the planned ones due to busy traffic. Also, the V(0) images here are more suitable for representing spatial layouts than symbolic scenes with cultural or sightseeing value. Additional V(2) images can always be taken to emphasize meaningful scenes.

3.3 Locating Images Around Objects

Now we locate multiple discrete images around buildings of interest to generate an image group, namely an object panorama (in QTVR). After the calculation of σ(P,B) for a building or building block, picking viewpoints manually at positions with high σ(P,B) values is feasible. Alternatively, we can also calculate viewpoints automatically. First, the center of heights, P0(X0, Z0), of a building is obtained from H(X,Z) as

(X_0, Z_0) = Σ_{P∈B} (X,Z) H(X,Z) / Σ_{P∈B} H(X,Z)    (6)
In each orientation φ around P0, we find the distance d(φ) at which the viewpoint has the highest significance, i.e.,

d(φ) = argmax_d σ(P, B)    (7)
which results in a closed curve surrounding B. We then select several orientations φ1, φ2, φ3, … on the curve that are local maxima of σ(P,B) for imaging B inward. The images can be dense enough to share partial views (or facades) so that the orientations are more conceivable through the images. Various lenses can be used for these images as long as the scenes are located properly in the frame. Besides the above three types of images, discrete images can always be taken at spots of interest or of partial scenes on the buildings to highlight details. We conducted experiments in an urban area of 1.6×1 km², which includes 50 panoramic images, 34 route panoramas of both street sides, and 44 buildings. The images have been normalized to a fixed height for display, and their lengths are proportional to the real lengths [22].
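Eqs. (6) and (7) can be sketched as follows, assuming the building footprint is given as a boolean mask over H(X,Z) and that σ(P,B) of Eq. (4) has been wrapped in a callable; all names and step sizes are illustrative.

```python
import numpy as np

def building_center_of_heights(H, mask):
    """Eq. (6): height-weighted centre (X0, Z0) of building B, where mask
    selects the building's cells in the elevation map H."""
    xs, zs = np.nonzero(mask)
    w = H[xs, zs]
    return (xs * w).sum() / w.sum(), (zs * w).sum() / w.sum()

def best_viewing_distance(sig_pb, x0, z0, phi, d_max=200.0, step=1.0):
    """Eq. (7): along orientation phi from (X0, Z0), return the distance d at
    which the significance for watching B, sig_pb(x, z), is largest."""
    ds = np.arange(step, d_max, step)
    scores = [sig_pb(x0 + d * np.cos(phi), z0 + d * np.sin(phi)) for d in ds]
    return float(ds[int(np.argmax(scores))])
```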
4 Conclusion

This work creates a framework to pervasively record scenes in a large-scale area for image archiving and visualization. The compact nature of the employed images significantly enlarges the scale of cityscape modeling. We proposed an evaluation of viewpoints, namely the view significance, according to the visibility and effective image coverage of cityscapes. Based on the significance distribution estimated from LIDAR data, we located different types of images so that they include as many scenes as possible while avoiding data redundancy. The visual data can be displayed on the web and further loaded onto portable navigation devices for on-site urban area guidance.
References

[1] Agarwala, A., et al.: Photographing long scenes with multi-viewpoint panoramas. ACM Transactions on Graphics 25(3), 853–861 (2006)
[2] Aliaga, D.G., Carlbom, I.: Plenoptic Stitching: A scalable method for reconstructing 3D interactive walkthroughs. In: SIGGRAPH 2001, pp. 443–450 (2001)
[3] Aliaga, D.G., Funkhouser, T., Yanovsky, D., Carlbom, I.: Sea of Images. In: IEEE Conf. Visualization, pp. 331–338 (2002)
[4] Chen, S.E., Williams, L.: QuickTime VR – An image-based approach to virtual environment navigation. In: SIGGRAPH 1995, pp. 29–38 (1995)
[5] Coorg, S., Master, N., Teller, S.: Acquisition of a large pose-mosaic dataset. In: IEEE CVPR 1998, pp. 23–25 (1998)
[6] Frueh, C., Zakhor, A.: Constructing 3D city models by merging ground-based and airborne views. In: IEEE CVPR 2003, pp. 562–569 (2003)
[7] McMillan, L., Bishop, G.: Plenoptic modeling: an image based rendering system. In: ACM SIGGRAPH 1995 (1995)
[8] Peleg, S., Rousso, B., Rav-Acha, A., Zomet, A.: Mosaicing on adaptive manifolds. IEEE Trans. PAMI 22(10), 1144–1154 (2000)
[9] Roman, A., Garg, G., Levoy, M.: Interactive design of multi-perspective images for visualizing urban landscapes. In: IEEE Conf. Visualization 2004, pp. 537–544 (2004)
[10] Shi, M., Zheng, J.Y.: A slit scanning depth of route panorama from stationary blur. In: IEEE CVPR (2005)
[11] Szeliski, R., Shum, H.-Y.: Creating full view panoramic image mosaics and texture-mapped models. In: ACM SIGGRAPH 1997, pp. 251–258 (1997)
[12] Uyttendaele, M., et al.: Image-based interactive exploration of real-world environments. IEEE Computer Graphics and Applications 24(3) (2004)
[13] Zhao, H., Shibasaki, R.: A vehicle-borne urban 3D acquisition system using single-row laser range scanners. IEEE Trans. on SMC, B 33(4), 658–666 (2003)
[14] Zheng, J.Y.: Digital Route Panorama. IEEE Multimedia 10(3), 57–68 (2003)
[15] Zheng, J.Y., Zhou, Y., Mili, P.: Scanning Scene Tunnel for city traversing. IEEE Trans. Visualization and Computer Graphics 12(2), 155–167 (2006)
[16] Zomet, A., Feldman, D., Peleg, S., Weinshall, D.: Mosaicing New Views: The crossed-slits projection. IEEE Trans. on PAMI, 741–754 (2003)
[17] Li, S.: Full-View Spherical Image Camera. ICPR (4), 386–390 (2006)
[18] Li, S., Nakano, M., Chiba, N.: Acquisition of spherical image by fish-eye conversion lens. IEEE Virtual Reality, pp. 235–236 (2004)
[19] Zheng, J.Y., Tsuji, S.: Panoramic Representation for route recognition by a mobile robot. IJCV 9(1), 55–76 (1992)
[20] Zheng, J.Y., Li, S.: Employing a fish-eye camera in scanning scene tunnel. In: 7th ACCV, vol. 1, pp. 509–518 (2006)
[21] Zheng, J.Y., Shi, M.: Depth from stationary blur with adaptive filtering. In: 8th ACCV (2007)
[22] http://www.cs.iupui.edu/~jzheng/ACCV07
Synthesis of Exaggerative Caricature with Inter and Intra Correlations
Chien-Chung Tseng and Jenn-Jier James Lien
Robotics Laboratory, Dept. of Computer Science and Information Engineering, National Cheng Kung University, No. 1, Ta-Hsueh Road, Tainan, Taiwan
{ed,jjlien}@csie.ncku.edu.tw
http://robotics.csie.ncku.edu.tw
Abstract. We developed a novel system consisting of two modules, statistics-based synthesis and non-photorealistic rendering (NPR), to synthesize caricatures with exaggerated facial features and other particular characteristics, such as beards or nevi. The statistics-based synthesis module can exaggerate the shapes and positions of facial features based on non-linear exaggerative rates determined automatically. Instead of comparing only the inter relationship between features of different subjects, as in existing methods, our synthesis module applies both inter and intra (i.e., comparisons between facial features of the same subject) relationships to make the synthesized exaggerative shape more contrastive. Subsequently, the NPR module generates a line-drawing sketch of the original face, and the sketch is then warped to an exaggerative style with the synthesized shape points. The experimental results demonstrate that this system can automatically and effectively exaggerate facial features, thereby generating corresponding facial caricatures. Keywords: Exaggerative rate, Exaggerative caricature synthesis, Eigenspace, Non-photorealistic rendering (NPR).
1 Introduction

Caricatures often appear in newspapers and comics, and even at popular tourist spots where artists draw them for sightseers. Generally, a caricature is a kind of exaggerative representation: it exhibits the funny and extraordinary characteristics of a person, and it can also serve as a human-like agent because it contains many recognizable facial features. The basic definition of a facial caricature is the exaggeration of all the facial features that are found by comparing the impression features of the subject with the average face. The first caricature generation system was made by Brennan [4], who used an interactive algorithm to create sketches with exaggeration. Subsequent works are based on two approaches: with or without a training process. Works with a training process usually build standards which are compared with the testing data to find the differences between their features. Koshimizu et al. [11] proposed a template-based approach to create caricatures with exaggerative rates. However, the method is line-based drawing, and the result becomes unrecognizable when the exaggerative rates are too
large. The works of [12], [13], [17] developed example-based approaches using partial least squares or neural networks to learn the drawing style of an artist, but they cannot exhibit some particular characteristics, like beards, nevus, etc., without training on them in advance. Chiang et al. [7] analyzed facial features and warped a color caricature created by an artist to an exaggerative style with the analyzed result. However, the representation of the result is limited by the prototype drawn by the artist. Xu [18] exaggerated both the shapes and the positions of facial features using eigenspaces, but the parameters used to exaggerate the facial features are determined empirically. Apart from the approaches with a training process, some works on caricature creation have no training process. Gooch et al. [9] proposed an approach to create an exaggerative black-and-white drawing by using Blommaert and Martens' model and manual exaggeration [3], and Akleman [1] created a caricature by a forward triangle morphing. Those methods involve not only exaggerated facial features but also other particular characteristics. However, they also require a lot of manual work. Besides, there are other works [5], [6], [10], [19] creating good-quality cartoon faces without exaggeration, and our work can be viewed as an extension of their results. In order to solve the existing problems discussed above, we developed a novel system based on the advantages of processes with and without training, by implementing statistics-based synthesis (with a training process) and non-photorealistic rendering (without a training process). The exaggerative caricatures generated by our method contain exaggerative facial features and other particular characteristics, which are usually eliminated in order to simplify the training process. In the statistics-based synthesis module, we find the major principal components of the features by using the principal component analysis (PCA) process, and expand them to emphasize personal specialties. In addition, the exaggeration is based on two relationships proposed by Redman [16]: one is the relationship between different subjects' features, which we define as the inter process; the other is the relationship between facial features of the same subject, ignored by previous works, which we define as the intra process.
2 Statistics-Based Synthesis Module: Training Process
In the training process of the statistics-based synthesis module, we compiled 69 male and 55 female images from the AR face database [14]; 90 images are used for training and 34 for testing. All the facial images are taken at the same distance from the camera, and each training facial image is normalized to a canonical image (512×512 pixels) by an affine transformation including only translation and rotation factors, using the two inner corner points of the eyes. The reason for eliminating the scaling factor is that the eye positions of all training images would become identical if scaling were applied, and thus we could not exaggerate the position of the eyes by comparing the difference from the average eye position. For each canonical image, the feature shape points of the eyebrows (20 points), eyes (16 points), nose (12 points), mouth (20 points), and facial contour (19 points) are selected manually and denoted by Fi, where i = 1 to 7. Also, we choose 7 control points corresponding to the left eyebrow, right eyebrow, left eye, right eye, nose, mouth,
and face contour respectively, denoted by P, where P = (p1x, p1y, p2x … p7x, p7y). Each control point represents the position of the corresponding feature shape. Furthermore, all the feature shape and position vectors are separated into x and y coordinates, denoted by Fix, Fiy, Px, Py respectively, to avoid mutual effects of the x and y coordinates of the feature shapes or positions. Based on all feature shape vectors, Fix and Fiy, we can calculate their corresponding average vectors, MFix and MFiy, and by applying PCA, the corresponding eigenspaces, UFix and UFiy, can be obtained. Subsequently, the average position point of each feature is calculated instead of generating eigenspaces, because there is only one position point per facial feature; this information is not enough to build position eigenspaces. The framework of the training process is shown in Fig. 1.
Fig. 1. The training framework of statistics-based synthesis module
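As a rough sketch of this training step (not code from the paper; the array shapes, function names and number of retained eigenvectors are our own assumptions), the per-feature average shapes and eigenspaces can be computed with a standard PCA, shown here for the x coordinate only:

    import numpy as np

    def train_shape_eigenspace(shapes_x, n_components=10):
        # shapes_x: (num_images, num_points) x coordinates of one facial
        # feature (e.g. the 20 eyebrow points), one row per training image.
        MF_x = shapes_x.mean(axis=0)                 # average shape vector MF_ix
        centered = shapes_x - MF_x
        _, _, Vt = np.linalg.svd(centered, full_matrices=False)  # PCA via SVD
        UF_x = Vt[:n_components].T                   # eigenspace UF_ix (eigenvectors as columns)
        return MF_x, UF_x

    def train_positions(P_x):
        # P_x: (num_images, 7) x coordinates of the 7 control points.
        # Only the average MP_x is kept; no eigenspace is built for positions.
        return P_x.mean(axis=0)

The y coordinate is handled identically, and the procedure is repeated for each of the seven features.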
3 Statistics-Based Synthesis Module: Testing Process
To start the testing process of the statistics-based synthesis module, as shown in Fig. 2, we extract the facial feature points by applying an active appearance model (AAM) [8]. If the extraction result is not good, we label the feature points manually. Then, as in the training process, we obtain TFi where i = 1 to 7, TP = (tp1x, tp1y, tp2x … tp7x, tp7y), and TFix, TFiy, TPx, TPy. Subsequently, the eigenspaces and average position points generated in the training process are used to perform the statistics-based synthesis. The statistics-based synthesis module is divided into two parts: one is the exaggeration of the feature shape points on the x and y coordinates; the other is the exaggeration of the feature position points on the x and y coordinates. Both parts further apply inter and intra processes, by which we can automatically determine which features should be exaggerated. The exaggeration of the feature position points is performed after the exaggeration of the feature shape points. To simplify the explanation, we take the feature shape exaggeration on the x coordinate as an example to explain the inter and intra processes for the facial feature shape. The feature position exaggeration is discussed more fully in Section 3.4.
Fig. 2. The framework of exaggerative caricature creation (Ini: exaggerative rates initialization; Irf: inter exaggeration for feature shape; Iaf: intra exaggeration for feature shape; Irp: inter exaggeration for feature position; Iap: intra exaggeration for feature position)
3.1 The Initialization of Exaggerative Rates
In Fig. 3, there are two ellipses, a small one and a big one. For the same horizontal variation, the small ellipse is more distinctive than the big one, and the same holds for the variation in displacement. This example indicates that the degree of shape or position exaggeration is negatively related to the length or width of the feature itself [15]. Therefore, we develop the following equation to model this condition:

    Dix = exp( 1 − length(TFix) / Σ_{i=1..7} length(TFix) )                                    (1)

where Dix represents the ratio, which is negatively and exponentially related to the horizontal or vertical length of the feature, and length(TFix) is the horizontal length of the feature, with length(TFix) = max(TFix) − min(TFix). The reasons we use an exponential function are that it is always positive (it never violates the nature of the feature) and that its rate of growth is proportional to its value; namely, the larger the variation of the feature, the higher the degree of exaggeration it should have. Basically, the exaggeration of a feature is in fact the extension of its difference from the corresponding average feature. In order to implement the exaggeration process, the "exaggerative rate" is defined to be this scalar. There are two kinds of exaggerative rates: the shape exaggerative rates and the position exaggerative rates.
Fig. 3. (a) Ellipses having the same horizontal variances. (b) Ellipses having the same horizontal displacement.
The initial shape exaggerative rate of the x coordinate of the i-th feature is efix^(k=1) = 1 + c1·Dix, which satisfies the rule of exaggeration; c1 is a constant controlling the degree of exaggeration, and c1 = 0.2 in this study. The initialization of the position exaggerative rates will be discussed later.

3.2 Exaggerative Feature Shape Creation: Inter Exaggeration
After initializing the shape exaggerative rates, the unbiased shape vector of the i-th facial feature is calculated by subtracting the corresponding average shape MFix, and then projected onto the corresponding eigenspace UFix to obtain the weights of the eigenvectors. We then expand the weights by multiplying them by the shape exaggerative rates. Finally, the exaggerative feature shape is obtained by reconstruction with the expanded weights:
    Eix^k = Σ_{j=1..n} ( ρ·efix^k·wjx·ujx ) + MFix                                             (2)

    wjx = (TFix − MFix)^T · ujx                                                                (3)

where Eix^k is the result of the inter exaggeration at the k-th iteration, ρ is the proportion of each eigenvector (the higher the proportion of an eigenvector, the bigger the shape exaggerative rate it receives), wjx is the projection weight of the j-th eigenvector, and ujx is the j-th eigenvector of the eigenspace. The result of the inter process is shown in Fig. 5(d). After comparing with the average face, we further apply the intra exaggeration process, which considers the relationship with the other features of the same subject, to increase the contrast between all the facial features.
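A minimal sketch of this inter exaggeration step (the variable names are ours, and the eigenvector proportions ρ are assumed to be given, e.g. derived from the eigenvalue ratios):

    import numpy as np

    def inter_exaggerate(TF_x, MF_x, UF_x, rho, ef_x):
        # TF_x: test feature shape (x coordinates); MF_x: average shape;
        # UF_x: eigenspace with eigenvectors u_jx as columns;
        # rho: per-eigenvector proportions; ef_x: current shape exaggerative rate.
        w = UF_x.T @ (TF_x - MF_x)       # projection weights w_jx, Eq. (3)
        w_expanded = rho * ef_x * w      # expand the weights
        return UF_x @ w_expanded + MF_x  # reconstructed exaggerative shape E_ix, Eq. (2)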
3.3 Exaggerative Feature Shape Creation: Intra Exaggeration
As mentioned above, an artist considers not only the difference from the average face but also the neighboring features of the same subject. In other words, the major facial features of the subject are expected to be extracted, enhanced and then exaggerated in order to emphasize the personal style of the face. As for the other, non-major facial features, the degree of exaggeration will decrease in order to enhance the contrast between the major and non-major features. For example, suppose the subject's mouth is bigger than the average mouth, but his/her nose is much bigger and
rounder. Then the artist will draw the mouth smaller, but still bigger than the average mouth, to emphasize the variation of the nose. In order to increase the contrast between features, all the variances of the exaggerative features are calculated by the following equation:

    vix = ( (1/n)·Σ (Eix − TFix)² ) / length(TFix)                                             (4)
Here, vix represents the variance of the x coordinate of the i-th feature, normalized by the feature length, and n is the number of points of the feature. After calculating all the variances of the features, these variances are sorted to determine which shape exaggerative rates should be increased and which should be decreased. Then, we update the shape exaggerative rates by the following equations, based on the sorted variances. The update process should satisfy the condition mentioned in Section 3.1, and the power of the effect decreases with the distance between features according to a Gaussian weighting function.
    efix^(k+1) = efix^k − A · ( Σ_{j=1..7} rj·d(x, y) · Dix ) / ( Σ_{j=1..7} d(x, y) )          (5)

    rj = 1 if vjx > vix,  0 if vjx = vix,  −1 if vjx < vix                                      (6)

    d(x, y) = exp( −[ ((tpix − tpjx)/σx)² + ((tpiy − tpjy)/σy)² ] )                             (7)
where A is a constant controlling the power of the effect, and A = 0.2 in this study. rj is a switch that increases the shape exaggerative rate of a feature with a bigger variance and decreases that of a feature with a smaller variance. tpix and tpiy are the i-th elements of the vectors TPx and TPy, used to calculate the distance between facial features in the 2-D Gaussian weighting function d(x, y). After the intra exaggeration process, we use the new shape exaggerative rates and the original feature points TF, which consist of all facial feature points, to perform the inter exaggeration process again in order to increase the contrast of these features and to find the major features, as shown in Fig. 4. Thus, we can determine all of the shape exaggerative rates automatically instead of setting them one by one. As soon as any exaggerative rate reaches zero, the iteration process terminates, to prevent the feature from being exaggerated in the contrary direction; for example, drawing a nose smaller when it is in fact bigger than average would be an exaggeration in the contrary direction. Additionally, we set a maximum number of iterations in case none of the shape exaggerative rates reaches zero. When the termination condition is satisfied
after n − 1 iterations, the original feature shape TF and the last shape exaggerative rates ef^n are used in the inter process to generate the x coordinate of the exaggerative shape. As for the exaggeration of the feature shapes on the y coordinate, the inter and intra processes mentioned above are applied in the same way. The result of the iteration is shown in Fig. 5(e).
Fig. 4. The diagram of the entire iteration process for inter and intra shape exaggerations (ef^k: shape exaggerative rate at the k-th iteration; E^k: the intermediate result at the k-th iteration)
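The loop of Fig. 4 can be sketched as follows; this is an illustrative reading of Eqs. (4)-(7) rather than the authors' implementation, and the pairwise Gaussian weights d and the ratios D are assumed to be precomputed:

    import numpy as np

    def intra_update(ef, E_list, TF_list, D, d, A=0.2):
        # ef: (7,) current shape exaggerative rates; E_list/TF_list: per-feature
        # exaggerated and original shape vectors; D: (7,) ratios of Eq. (1);
        # d: (7, 7) Gaussian weights between feature positions, Eq. (7).
        lengths = np.array([tf.max() - tf.min() for tf in TF_list])
        v = np.array([np.mean((E - TF) ** 2) / L                   # Eq. (4)
                      for E, TF, L in zip(E_list, TF_list, lengths)])
        new_ef = ef.copy()
        for i in range(7):
            r = np.sign(v - v[i])                                  # switch r_j, Eq. (6)
            new_ef[i] = ef[i] - A * np.sum(r * d[i]) * D[i] / np.sum(d[i])  # Eq. (5)
        return new_ef

In use, the inter and intra steps would alternate until some rate reaches zero or a maximum number of iterations is hit, as described in Section 3.3.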
3.4 Exaggerative Feature Position Creation
Having finished the exaggeration of the feature shapes, the positions of the features are further exaggerated. For the position vectors TPx and TPy, the difference from the average feature position is calculated by subtracting MPx and MPy, and then expanded by the following equations with the position exaggerative rate, which also contains two parts: the inter position exaggerative rate ep^inter and the intra position exaggerative rate ep^intra:

    tp'ix = mpix + epix·(tpix − mpix)·Dix,   tp'iy = mpiy + epiy·(tpiy − mpiy)·Diy              (8)

    epix = epix^inter + epix^intra,   epiy = epiy^inter + epiy^intra                            (9)

    epix^intra = (length(TF'7x) − length(MF7x)) / length(MF7x),
    epiy^intra = (length(TF'7y) − length(MF7y)) / length(MF7y)                                  (10)

Here, the inter position exaggerative rate is given by the initialization process as epix^inter = 1 + c2·Dix, where c2 is a constant controlling the exaggerative degree and c2 = 0.3 in this study. After comparing the difference from the average feature positions, artists always adjust the results of the comparison based on the width or length of the face. If the face is wider or longer than the average face, the distance between two features will be increased horizontally or vertically, and vice versa. Thus, the intra position exaggerative rate depends only on the width and length of the face. Dix and Diy are included to ensure that the displacement of each feature position also satisfies the
condition discussed in Section 3.1. After the feature position exaggeration, we translate all the exaggerative feature shape points to the locations given by the exaggerative positions, and the result TS' is shown in Fig. 5(f).
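Under the same naming assumptions, the position exaggeration of Eqs. (8)-(10) reduces to a few lines (NumPy arrays assumed; shown for the x coordinate):

    def exaggerate_positions_x(tp_x, mp_x, D_x, face_len, avg_face_len, c2=0.3):
        # tp_x, mp_x: (7,) test and average control-point x coordinates;
        # D_x: (7,) ratios of Eq. (1); face_len / avg_face_len: horizontal lengths
        # of the exaggerated and the average facial contour (feature 7).
        ep_inter = 1.0 + c2 * D_x                                # initial inter rate
        ep_intra = (face_len - avg_face_len) / avg_face_len      # Eq. (10)
        ep = ep_inter + ep_intra                                 # Eq. (9)
        return mp_x + ep * (tp_x - mp_x) * D_x                   # Eq. (8)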
4 Non-photorealistic Rendering Module
Having exaggerated the shapes and positions of the facial features, we apply non-photorealistic rendering processes to generate a caricature in an exaggerative style. First, the facial contour is emphasized by modifying the method proposed by Gooch [9]. Instead of generating a binary image, the pixel values of the result are replaced with the gray values of the original image to make the result more lifelike. Then, after creating a line-drawing sketch T' without exaggeration, as shown in Fig. 5(g), the sketch is warped to an exaggerative style by using image metamorphosis [2] with the exaggerative feature shape. Finally, a caricature with exaggerative feature positions and shapes as well as other particular characteristics, as shown in Fig. 5(h), is generated.
Fig. 5. (a) Original input image. (b) Original feature shape. (c) Average feature shape. (d) The result after only inter exaggeration process for shape. (e) The result after inter and intra exaggeration processes only for shape. (f) The result after inter and intra exaggeration processes for both shape and position. (g) The line-drawing sketch of original face. (h) The exaggerative caricature of original face keeping the information of nevus.
5 Experimental Results
For the test image of a frontal face, shown in Fig. 5(a), the diagrams of the shape exaggerative rates are shown in Fig. 6. These two diagrams indicate that the contrast between the shape exaggerative rates increases over the iterations, and the major features can be extracted as those with high exaggerative rates. In other words, our system can automatically set the exaggerative rates and extract the major facial features. In addition, compared with Xu's method [18], shown in Fig. 7, our result keeps the contour of the original feature as the exaggerative rates increase, whereas the result of Xu's method exhibits distortion, which decreases the likeness of the features. Therefore, our result is subtler than Xu's. The experimental results demonstrate that our system can automatically and effectively exaggerate facial features and generate corresponding facial caricatures that retain other particular characteristics. More results with exaggerated sizes and positions of the facial features are shown in Fig. 8.
Fig. 6. The variation of shape exaggerative rates on (a) x coordinate (b) y coordinate
Fig. 7. (a) The results of our approach (b) The results of Xu’s approach [18]
Fig. 8. More results of (a) original face (b) line-drawing sketch (c) exaggerative caricature
6 Conclusion
We have developed a novel system that generates exaggerative caricatures involving not only exaggerated facial features but also other particular characteristics, such as beards or nevus. In addition, we model the drawing skills of artists to
simplify the determination of the exaggerative rates. However, our system can only handle image data taken at a fixed distance and under fixed lighting; variations in lighting and distance will affect our results. In future work, we plan to address these environmental factors by simulating images at an appropriate viewing distance and lighting to make our system more robust.
References 1. Akleman, E.: Making Caricature with Morphing. In: Proc. of ACM SIGGRPH, p. 145 (1997) 2. Beier, T., Neely, S.: Feature-based Image Metamorphosis. In: Proceedings of ACM SIGGRAPH, pp. 35–42 (July 1992) 3. Blommaert, F.J.J., Martens, J.-B.: An Object-Oriented Model for Brightness Perception. Spatial Vision 5(1), 15–41 (1990) 4. Brennan, S.: Caricature Generator. Master’s thesis, Cambridge, MIT (1982) 5. Chen, H., Xu, Y.Q., Shum, H.Y., Zhu, S.C., Zheng, N.N.: Example-based Facial Sketch Generation with Non-parametric Sampling. In: Proceedings of International Conference on Computer Vision, Vancouver, Canada (July 2001) 6. Chen, H., Zheng, N.N., Liang, L., Li, Y., Xu, Y.Q., Shum, H.Y.: PicToon: A Personalized Image-based Cartoon System. In: Proc. of ACM Int’l Conf. on Multimedia (2002) 7. Chiang, P.Y., Liao, W.H., Li, T.Y.: Automatic Caricature Generation by Analyzing Facial Features. In: Proceedings of Asia Conf. on Computer Vision (2004) 8. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active Appearance Models. IEEE Trans. on Pattern Analysis and Machine Intelligence 23(6) (2001) 9. Gooch, B., Reinhard, E., Gooch, A.: Human Facial Illustrations: Creation and Psychophysical Evaluation. ACM Trans. on Graphics 23(1), 27–44 (2004) 10. Hsu, R.L., Jain, A.K.: Generating Discriminating Cartoon Faces Using Interacting Snakes. IEEE Trans. on Pattern Analysis and Machine Intelligence 25, 1388–1398 (2003) 11. Koshimize, H., Tominaga, M., Fujiwara, T., Murakami, K.: On KANSEI Facial Processing for Computerized Facial Caricaturing System Picasso. In: Proceedings of IEEE International Conference on Systems, Man and Cybernetics, pp. 294–299 (1999) 12. Lai, K.H., Edirisinghe, E.A., Chung, P.W.H.: A Facial Component based Hybrid Approach to Caricature Generation using Neural Networks. In: Proceedings of ACTA on Computational Intelligence (2006) 13. Liang, L., Chen, H., Xu, Y.Q., Shum, H.Y.: Example-based Caricature Generation with Exaggeration. In: Proc. 10th Pacific Conf. on Computer Graphics and Applications (2002) 14. Martinez, A.M., Benavente, R.: The AR Face Database. CVC Technical Report #24 (June 1998) 15. Mo, Z., Lewis, J.P., Neumann, U.: Improved Automatic Caricature by Feature Normalization and Exaggeration. In: Proceedings of ACM SIGGRAPH Conf. on Abstracts and Applications, New York (2004) 16. Redman, L.: How to Draw Caricatures. Contemporary Books (1984) 17. Shet, R.N., Lai, K.H., Edirisinghe, E.A., Chung, P.W.H.: Use of Neural Networks in Automatic Caricature Generation: An Approach Based on Drawing Style Capture. In: IEE International Conference on VIE, UK, pp. 23–29 (2005) 18. Xu, G.Z., Kaneko, M., Kurematsu, A.: Synthesis of Facial Caricature Using Eigenspaces. In: Proceedings of Electronics and Communications in Japan. Part 3, vol. 87(8) (2004) 19. Xu, Z., Chen, H., Zhu, S.C.: A High Resolution Grammatical Model for Face Representation and Sketching. In: Proceedings of Computer Vision and Pattern Recognition, pp. 470–477 (2005)
Pose-Invariant Facial Expression Recognition Using Variable-Intensity Templates Shiro Kumano1 , Kazuhiro Otsuka2 , Junji Yamato2 , Eisaku Maeda2 , and Yoichi Sato1 Institute of Industrial Science, The University of Tokyo, 4–6–1 Komaba, Meguro-ku, Tokyo, 153–8505 Japan {kumano,ysato}@iis.u-tokyo.ac.jp 2 NTT Communication Science Laboratories, NTT 3–1 Morinosato-Wakamiya, Atsugi-shi, Kanagawa, 243–0198 Japan {otsuka,yamato}@eye.brl.ntt.co.jp, [email protected] 1
Abstract. In this paper, we propose a method for pose-invariant facial expression recognition from monocular video sequences. The advantage of our method is that, unlike existing methods, our method uses a very simple model, called the variable-intensity template, for describing different facial expressions, making it possible to prepare a model for each person with very little time and effort. Variable-intensity templates describe how the intensity of multiple points defined in the vicinity of facial parts varies for different facial expressions. By using this model in the framework of a particle filter, our method is capable of estimating facial poses and expressions simultaneously. Experiments demonstrate the effectiveness of our method. A recognition rate of over 90% was achieved for horizontal facial orientations on a range of ±40 degrees from the frontal view.
1 Introduction
Facial expression recognition is attracting a great deal of attention because of its usefulness in many applications such as human-computer interaction and the analysis of conversation structure [1]. Most existing methods for facial expression recognition assume that the person in the video sequence does not make any large movements and that the image shows a nearly frontal view of the face [2][3][4][5]. However, in situations such as multi-party conversations (e.g. meetings), people will often turn their faces to look at other participants. Hence, we must simultaneously handle the variations in facial pose as well as the facial expression changes. Facial expression recognition methods handling facial pose variations require a facial shape model of the user's neutral expression and a model of facial expression to treat the variations of facial pose and expression separately. The shape model and the facial expression model are together referred to as the face model in this paper. The face model expresses facial pose variations by globally translating and rotating the shape model in three-dimensional space, and facial
expression changes by locally deforming the shape model according to the facial expression model. Existing methods require an accurate face model, because image variations caused by facial expression change are often smaller than those caused by facial pose change. Accordingly, the use of inaccurate shape models degrades the accuracy of the facial pose and expression estimates because those two components cannot be separated reliably. Face models are divided broadly into two groups: person-dependent models and person-independent models. Previous methods generate a person-dependent face model for each user by using stereo cameras [6][7]. Accordingly, this approach cannot be applied to monocular video sequences. On the other hand, person-independent models can be applied to arbitrary users [8][9][10]. However, it has been reported that person-independent models cannot cover large interpersonal variations of face shape and facial expression with sufficient accuracy [11]. Motivated by these problems, we propose a novel method for facial expression recognition; based on variable-intensity templates, it offers the following advantages:
1. It supports monocular video capture systems.
2. It can easily generate a face model for each person.
3. Facial expressions can be estimated even with a large change in facial pose.
The variable-intensity template consists of three components: (1) a simple shape model, (2) a set of interest points, (3) an intensity distribution model. We use a cylinder as the shape model because it can be easily generated. However, to use such a simple shape model, we must handle the following problem. The interest points are defined on a face image in the frontal view (white points in the right figure of Fig. 3). Their image positions for arbitrary facial poses are then calculated by projecting them onto the shape model, translating and rotating the shape model according to the pose, and projecting the resulting three-dimensional positions onto the image plane. Hence, if there is an error in the shape model, the calculated image position of an interest point shifts from its actual position as the facial pose angle increases. This problem can be effectively avoided by employing pairs of points that straddle the edges of different facial parts (the left part of Fig. 1) [1][12]. Even if the interest points are shifted due to the error in the model, the change in the intensity of the points is small because the points are defined away from the edges, where the intensity changes significantly (the center part of Fig. 1). The intensity distribution model describes how interest point intensity varies for different expressions of a face. Our method prepares it to recognize facial expressions and to improve the robustness of facial pose estimation against the changes in facial expression that cause large changes in intensity (the right part of Fig. 1). Our contribution is that we propose a facial expression recognition method for varying facial poses based on the key idea that facial expressions can be correctly recognized by knowing how interest point intensity varies for different facial expressions. The main advantage of our method is that a face model for each person is created simply by capturing frontal face images of the person in
Fig. 1. Our method absorbs errors in shape models and recognizes facial expressions by treating the changes in intensity of multiple points defined around facial parts. Interest points are composed of pairs of points that straddle the edges of facial parts.
the target facial expressions, unlike existing methods. Furthermore, we implement a particle filter utilizing the variable-intensity template as a face model to simultaneously estimate facial poses and expressions. The remainder of this paper is organized as follows. First, our proposed method is described in Section 2. Then, in Section 3, experimental results are given. Finally, a summary and future works are given in Section 4.
2 Proposed Method
Our method consists of two stages: a calibration stage and a test stage (see Fig. 2). The calibration stage generates a variable-intensity template. In the test stage, our method estimates the facial pose and expression simultaneously within the framework of a particle filter.

2.1 Variable-Intensity Template
The variable-intensity template M consists of the following three components:

    M = {S, P, L}                                                                               (1)
Fig. 2. System flow chart
Fig. 3. Left: The method to extract interest points P. Right: The extraction result. Small white rectangles represent interest points; point pairs are indicated by the lines. The set of large rectangles represents the regions holding the interest points.
where S, P, and L denote a rigid facial shape model, a set of interest points, and an intensity distribution model, respectively. The intensity distribution model describes the intensity distribution of each interest point for different facial expressions. Shape Model S. A cylinder is used as the facial shape model because of its geometric simplicity. The radius of the cylinder is calculated as the width of the face region detected in a neutral expression image by the method of [13] multiplied by a fixed constant. Set of Interest Points P. The set of interest points P is described as follows: P = {p1 , · · · , pN }
(2)
where pi denotes the image coordinates of point i and N denotes the number of interest points in the same image used to generate shape model S. An interest point constitutes a pair of points. The pairs that satisfy the following conditions are extracted, from the region including four kinds of facial parts (eyebrows, eyes, nose, and mouth) in ascending order of the interpair difference in intensity until the number of pairs in each facial part reaches the limit [12]: (1) Each pair of interest points straddles and is centered on an edge. (2) Pairs are separated by at least a predefined distance (see Fig.3). The edges are obtained as zero-cross boundaries of the Laplacian-Gaussian filtered image. Intensity Distribution Model L. The intensities of interest points are assumed to follow independent normal distributions, because each point is apart from the other point and the normal distribution can adequately express the effect of the position shifts due to shape model error and imaging noise. The intensity distribution model L describes how the means and standard deviations of the distributions vary for discrete facial expressions. That is, the intensity distribution model of each interest point is the mixture distribution that consists of distributions for each facial expression.
328
S. Kumano et al. Interest point
(2) Eyebrow lowerer
Probability
Eyebrow
Face Expression Neutral
Angry
(3) Change in intensity of interest point
Intensity of interest point
Intensity distribution model (1) Change in facial expression
Fig. 4. Intensity distribution model L: The intensity distributions of interest points, described as normal distributions, change in facial expressions. The colors in the right part correspond to the interest points in the left part.
The intensity distribution model L is described as follows:

    L = {N1, · · · , NN},   Ni = N(μi(e), σi²(e))                                               (3)

    σi(e) = k·μi(e)                                                                             (4)
where N(μ, σ²) denotes a normal distribution with mean μ and standard deviation σ, e ∈ {0, · · · , Ne − 1} denotes the facial expression, Ne denotes the number of target expressions, e = 0 expresses the neutral expression, and μi(e) and σi(e) denote the mean and standard deviation of the intensity of point i for expression e, respectively. The standard deviation σi is assumed to be proportional to the mean μi with a constant of proportionality k. Changes in facial expression cause large changes in intensity around the facial parts (see Fig. 4). The intensity distribution model is used to represent these changes for the different facial expressions. Our method generates intensity distribution models for each person by using one frontal face image for each facial expression, without any head movement during the capture process. By using the neutral expression image to also generate the shape model S and extract the set of interest points P, the intensity means of the interest points in each expression can be set to the intensities of the pixels where these points were defined. For the standard deviation, our method employs a large k to reduce the effect of calibration errors and of changes in intensity of the face caused by changes in facial pose, which alter the illumination direction of the face.
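As an illustrative sketch of this calibration step (the image containers, point layout and the value of k are assumptions of ours, not values from the paper), the model of Eqs. (3)-(4) amounts to a table of per-point, per-expression means with proportional standard deviations:

    import numpy as np

    def build_intensity_model(calib_images, points, k):
        # calib_images: dict {expression id e: frontal grayscale image (H, W)},
        # all captured without head movement; points: (N, 2) integer (x, y)
        # coordinates of the interest points defined on the neutral image.
        mu, sigma = {}, {}
        for e, img in calib_images.items():
            vals = img[points[:, 1], points[:, 0]].astype(float)  # intensities at the points
            mu[e] = vals                                          # mu_i(e)
            sigma[e] = k * vals                                   # sigma_i(e) = k * mu_i(e), Eq. (4)
        return mu, sigma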
2.2 Simultaneous Estimation of Facial Pose and Expression by Using a Particle Filter
Our method simultaneously estimates the facial pose and expression by calculating the likelihood of the intensity of interest points for the intensity distribution model. The joint distribution of facial pose and expression at time t given all face images up to that time (z 1:t ) is recursively represented as follows:
    P(ht, et | z1:t) = α·P(zt | ht, et) ∫ P(ht | ht−1) [ Σ_{et−1} P(et | et−1)·P(ht−1, et−1 | z1:t−1) ] dht−1        (5)
where the facial pose state ht and expression state et follow first-order Markov processes; ht and et are assumed to be conditionally independent given image zt; Bayes' rule and conditional independence are used along with marginalization [14], and α = 1/P(zt) is a normalization constant. The facial pose state ht consists of the following six continuous variables: the coordinates of the center of the template on the image plane, three-dimensional rotation angles (roll, pitch, and yaw), and scale. We adopt a random walk model for each parameter of the facial pose, yielding the state transition model P(ht | ht−1), and set P(et | et−1) to be equal for all facial expression combinations. Equation (5), unfortunately, cannot be calculated exactly, because the parameters of the facial pose ht are continuous and their distributions are complex due to occlusion, etc. Hence, we use a particle filter, which calculates Equation (5) by approximating the posterior density as a set of weighted samples called particles. Each particle expresses a state and its weight. In our method, the state and weight of the l-th particle are expressed as [ht^(l), et^(l)] and ωt^(l), where ωt^(l) is proportional to P(zt | ht^(l), et^(l)) calculated using Equation (6) and satisfies Σ_l ωt^(l) = 1.
where P denotes the set of non-occluded interest points. Here, we consider that the interest point is not occluded if the surface normal of its corresponding point on the facial shape model is pointing toward the camera. We define the likelihood of point i in facial pose ht and expression et , P (zi,t |ht , et ), by adopting a robust estimation as follows:
zi,t − μi (et ) 1 1 exp − ρ P (zi,t |ht , et ) = √ , (7) 2 σi (et ) 2πσi (et ) 2 if x2 < x , (8) ρ(x) = , otherwise where μi (et ) and σi (et ) are the mean and standard deviation of the intensity of interest point i in facial expression et , respectively, and ρ(·) denotes a robust function. Intensity zi,t is the intensity of image z t at the image coordinate of point i at time t, q i,t . The image coordinate q i,t is calculated as q i,t = f (pi , S, ht ), where function f returns q i,t by projecting the image coordinate pi onto the shape model S, with translation and rotation of S according to pose ht , and performing weak perspective projection of the resulting threedimensional position onto the image plane.
Estimator of Facial Pose and Expression. The estimators of facial pose and expression at time t (ĥt and êt) are calculated as follows:

    ĥt = Σ_l ωt^(l)·ht^(l)                                                                      (9)

    êt = arg max_{et} Σ_{l: et^(l) = et} ωt^(l)                                                 (10)

where the estimated facial expression êt is defined as the expression with the maximum probability P(et | z1:t).
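Given normalized particle weights, the estimators of Eqs. (9)-(10) are a weighted mean over poses and a weighted vote over expression labels; a sketch under the same naming assumptions:

    import numpy as np

    def estimate(h_particles, e_particles, weights, num_expressions):
        # h_particles: (L, 6) pose states; e_particles: (L,) integer expression
        # labels; weights: (L,) normalized particle weights.
        h_hat = weights @ h_particles                             # Eq. (9)
        expr_prob = np.bincount(e_particles, weights=weights,
                                minlength=num_expressions)        # approx. P(e_t | z_1:t)
        e_hat = int(np.argmax(expr_prob))                         # Eq. (10)
        return h_hat, e_hat, expr_prob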
3 Experimental Results
To evaluate the usefulness of our method, we performed two types of tests on video sequences wherein subjects exhibited multiple facial expressions with the head fixed (Test 1) and with the head moving freely (Test 2). The target facial expressions were neutral, angry, sad, surprise and happy. Grayscale video sequences with a size of 512 × 384 pixels were captured by an IEEE1394 camera at 15 fps for the subjects. The number of particles was set to 1,500, and the processing time was about 80 ms/frame on a Pentium D processor at 3.73 GHz with 3.0 GB RAM.

3.1 Details of the Experiment
Five male subjects each participated in Test 1 once, and one male subject participated in Test 2 once. In Test 1, the subject showed the five facial expressions one by one with the head fixed in a horizontal direction, each for a duration of 60 frames followed by a 60-frame interval, according to instructions displayed on a monitor. We targeted the following five yaw angles of the face relative to the camera direction: −40, −20, 0, 20, and 40 degrees. In Test 2, the subject freely showed the five facial expressions one by one while shaking the head left and right.

3.2 Facial Expression Recognition Results
The recognition rates of facial expression were calculated for Test 1. The ground truth of the facial expression at every frame was defined to be the expression indicated by the instruction. In consideration of the time lag between the instruction and the exhibition of the facial expression, we excluded the first 20 frames of each expression just after the instruction was displayed. Figure 5 shows some estimation results of facial poses and expressions in Test 1, where facial poses and expressions were correctly estimated in the images of all subjects. The recognition rate was calculated as the ratio between the number of frames wherein the estimated expression equaled the ground truth and the total number of target frames. Table 1 shows the average facial expression recognition rates of the five tests for each target yaw angle. The average rate over the range of ±40 degree yaw angles exceeded 90%, although the recognition rate decreased as the yaw
Fig. 5. Some estimation results of facial poses and expressions in Test 1: White grids and small points on each face denote the shape model and interest points in the estimated facial pose. The width of each bar in the upper right part of each image denotes the estimated probability of each facial expression, P (et |z1:t ).
Table 1. Average recognition rates of facial expressions for each yaw angle: Test 1

    Yaw angle (degree)     total   -40    -20     0     20    40
    Recognition rate (%)    90.7   91.4  100.0  99.9  84.3  78.1
Table 2. Confusion matrix of average recognition rates of facial expressions in Test 1: the expressions in the rows are the ground truths and the expressions in the columns are the recognition results

    Expression   Neutral  Angry   Sad  Surprise  Happy
    Neutral         97.0    0.0   2.8       0.1    0.1
    Angry            1.9   80.1  12.3       1.0    4.7
    Sad              9.2    0.6  77.7      12.2    0.3
    Surprise         0.3    0.1   0.3      99.0    0.3
    Happy            0.1    0.0   0.0       0.0   99.9
Fig. 6. Images and estimation results in Test 2. (a) Input video sequence (from left to right, frame numbers 100, 290, 400, 560, 660). (b) Ground truth (top) and recognition results (below) of facial expression: the probability of the correct expression is remarkably higher than that of the other expressions. (c) Estimation results of facial pose (roll, pitch and yaw; horizontal axis as in (b)): facial poses are estimated with enough accuracy to detect three cycles of head shake movement.
angle increased. It seems that the recognition rates decrease more with positive yaw angles than with negative yaw angles because of lighting asymmetry. That is, if the lighting condition is horizontally symmetric against the face, it
is expected that the recognition rates with positive yaw angles will increase to a similar extent as those with negative yaw angles. Table 2 shows the confusion matrix for the recognition rates. It seems that the angry and sad expressions were the most similar, because they were sometimes confused with each other. Five frames of the video sequence and the estimated results of facial expression and pose in each frame in Test 2 are shown in (a), (b) and (c) of Fig. 6. The ground truth of the facial expression at every frame was determined manually. Figure 6(b) shows that facial expressions were recognized correctly in almost all frames. In addition, the correct expressions were assigned remarkably higher probabilities than the other expressions. Fig. 6(c) also shows that facial poses were estimated with enough accuracy to detect three cycles of head shake movement. Video sequences of the results in Tests 1 and 2 are available from [15].
4 Summary and Future Works
In this paper, we presented a particle filter-based method for estimating facial pose and expression simultaneously by using a novel face model called the variable-intensity template. Our method has the distinct advantage that a face model for each person can be prepared very easily with a simple calibration step. With our method, five facial expressions were recognized with 90.7% accuracy for horizontal facial orientations over the range of ±40 degrees from the frontal view. In the future, we would like to conduct more experiments with additional subjects to complete the statistical evaluation. In the current framework, our method cannot correctly estimate facial poses and expressions under the large variations in intensity caused by large head movements, especially in the vertical direction, or by lighting variations. Hence, we are planning to handle such intensity variations by updating the intensity of each interest point (e.g. [16]). Another goal is to achieve a fully automatic system by applying an online clustering technique for extracting target facial expressions, instead of calibrating the intensities for the facial expressions in advance. In addition, we would also like to improve the way interest points are defined, by optimizing point selection using a projection matrix that maximizes the ratio of between-class to within-class scatter of the facial expression images (i.e. Fisherfaces [17]).
References 1. Otsuka, K., Yamato, J., Takemae, Y., Murase, H.: Conversation scene analysis with Dynamic Bayesian Network based on visual head tracking. In: Proc. of the IEEE International Conference on Multimedia and Expo, pp. 949–952. IEEE Computer Society Press, Los Alamitos (2006) 2. Cohen, I., Sebe, N., Chen, L., Garg, A., Huang, T.: Facial expression recognition from video sequences: Temporal and static modeling. Computer Vision and Image Understanding 91, 160–187 (2003)
3. Kaliouby, R., Robinson, P.: Generalization of a vision-based computational model of mind-reading. In: Proc. of the First International Conference on Affective Computing and Intelligent Interatction, pp. 582–589 (2005) 4. Chang, Y., Hu, C., Feris, R., Turk, M.: Manifold based analysis of facial expression. Image and Vision Computing 24, 605–614 (2006) 5. Bartlett, M., Littlewort, G., Frank, M., Lainscsek, C., Fasel, I., Movellan, J.: Automatic recognition of facial actions in spontaneous expressions. Journal of Multimedia 1, 22–35 (2006) 6. Gokturk, S.B., Tomasi, C., Girod, B., Bouguet, J.Y.: Model-based face tracking for view-independent facial expression recognition. In: Proc. of the Fifth IEEE International Conference on Automatic Face and Gesture Recognition, pp. 287–293. IEEE Computer Society Press, Los Alamitos (2002) 7. Oka, K., Sato, Y.: Real-time modeling of face deformation for 3D head pose estimation. In: Proc. of the IEEE International Workshop on Analysis and Modeling of Faces and Gestures, pp. 308–320. IEEE Computer Society Press, Los Alamitos (2005) 8. Dornaika, F., Davoine, F.: Simultaneous facial action tracking and expression recognition using a particle filter. In: Proc. of the Tenth IEEE International Conference on Computer Vision, vol. 2, pp. 1733–1738 (2005) 9. Zhu, Z., Ji, Q.: Robust real-time face pose and facial expression recovery. In: Proc. of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 681–688. IEEE Computer Society Press, Los Alamitos (2006) 10. Lucey, S., Matthews, I., Hu, C., Ambadar, Z., Torre, F., Cohn, J.: AAM derived face representations for robust facial action recognition. In: Proc. of the 7th International Conference on Automatic Face and Gesture Recognition, pp. 155–160 (2006) 11. Gross, R., Matthews, I., Baker, S.: Generic vs. person specific Active Appearance Models. Image and Vision Computing 23, 1080–1093 (2005) 12. Matsubara, Y., Shakunaga, T.: Sparse template matching and its application to real-time object tracking. IPSJ Transactions on Computer Vision and Image Media 46(9), 17–40 (2005) 13. Viola, P.A., Jones, M.J.: Rapid object detection using a boosted cascade of simple features. In: Proc. of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 511–518. IEEE Computer Society Press, Los Alamitos (2001) 14. Rich, E., Knight, K.: Artificial intelligence, pp. 537–583. McGraw-Hill Book Company, New York (1991) 15. http://www.hci.iis.utokyo.ac.jp∼ kumano/papers/accv2007.html 16. Jepson, A.D., Fleet, D.J., El-Maraghi, T.F.: Robust online appearance models for visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 1296–1311 (2003) 17. Belhumeur, P.N., Hespanha, J., Kriegman, D.J.: Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence 19, 711–720 (1997)
Gesture Recognition Under Small Sample Size Tae-Kyun Kim1 and Roberto Cipolla2 2
1 Sidney Sussex College, University of Cambridge, Cambridge, CB2 3HU, UK Department of Engineering, University of Cambridge, Cambridge, CB2 1PZ, UK
Abstract. This paper addresses gesture recognition under small sample size, where direct use of traditional classifiers is difficult due to the high dimensionality of the input space. We propose a pairwise feature extraction method of video volumes for classification. The method of Canonical Correlation Analysis is combined with discriminant functions and the Scale-Invariant Feature Transform (SIFT) to obtain discriminative spatiotemporal features for robust gesture recognition. The proposed method is practically favorable as it works well with a small number of training samples, involves few parameters, and is computationally efficient. In experiments using 900 videos of 9 hand gesture classes, the proposed method notably outperformed classifiers such as the Support Vector Machine/Relevance Vector Machine, achieving 85% accuracy.
1 Introduction
Gesture Recognition Review. Gesture recognition is an important topic in computer vision because of its wide range of applications, such as human-computer interfaces, sign language interpretation and visual surveillance. Not only spatial variation but also temporal variation among gesture samples makes this recognition problem difficult. For instance, different subjects have different hand appearances and may sign gestures at different paces. Recent work in this area tends to handle the above variations separately and therefore splits into two smaller areas, namely posture recognition (static) and hand motion or action recognition (dynamic). In posture recognition, the pose or configuration of the hands is recognised using silhouette [5] and texture [6]. By contrast, hand motion or action recognition interprets the meaning of the movement using the full trajectory [9], optical flow [4] and motion gradient [11]. Compared with hand motion recognition, posture recognition is easier in the sense that state-of-the-art classifiers, e.g. the Support Vector Machine, the Relevance Vector Machine [11] or Adaboost [6], can be directly applied to it. Gesture recognition, on the other hand, has adopted rather different approaches, e.g. Hidden Markov Models [9] or Dynamic Time Warping [3], to discriminate dynamic or temporal information which is typically highly non-linear in the data space. These methods, especially the Hidden Markov Models, have many parameters to set, require a large number of training examples, and are difficult to extend to a large vocabulary [2]. Besides, these traditional methods have not integrated the posture and
temporal information and thus have difficulty differentiating gestures with similar movements signed by different hand shapes. Some recent works [8] directly operate on the full spatiotemporal volume, considering both posture and temporal information of gestures to a certain degree, but are still unreliable in cases of motion discontinuities and motion aliasing. Also, the method [8] requires the manual setting of important parameters such as the positions and scales of local space-time patches. Another important line of methods exploits visual code words (for representation) with either a Support Vector Machine (SVM) or a probabilistic generative model [12,13]. Again, for their good performance, it is critical to properly set the parameters associated with the representation, e.g. the space-time interest points and the code book size. Motivation and Summary of This Study. To avoid the empirical setting of parameters in the existing methods, it seems obvious to seek a more generic and simpler learnable approach for gesture recognition. Note that many of the critical parameters in the previous methods are incurred in the step of representing gesture videos prior to using classifiers. In that case, it might seem better to apply learnable classifiers directly to the videos, which can simply be converted into column vectors. Unfortunately, this is not a good way either. Vectorization of a video by concatenating all pixels in the three-dimensional video volume causes a high dimension of N³, which is much larger than the N² of an image. Also, it may be more difficult to collect a sufficient number of video samples for classifiers than images (note that a single video consists of multiple images). The so-called small sample size problem is more serious in learning classifiers with videos than with images. Getting back to the representation issue, this work focuses on how to learn useful features from videos for classification, discussing its benefits over directly using classifiers. With the given discriminative features, even a simple Nearest Neighbor (NN) classifier achieved a very good accuracy. An extension of Canonical Correlation Analysis (CCA) [1,15] (a standard tool for inspecting linear relationships between two sets of vectors) is proposed to yield robust pairwise features of any two gesture videos. The proposed method is closely related to our previous framework of Tensor Canonical Correlation Analysis [14], which extends the classical CCA to multidimensional data arrays by sharing either a single axis or two axes. The method of sharing two axes, i.e. planes, between two video data is updated and combined with the discriminative functions and the Scale-Invariant Feature Transform for further improvements. The proposed method does not require any significant meta-parameters to be adjusted and can learn both posture and temporal information for gesture classification. The rest of the paper is organized as follows: the next section explains the proposed method with the discriminant functions, discussing the benefit of the method over traditional classifiers. The SIFT representation for video data is combined with the method for improvements in Section 3. Section 4 shows the experimental results and Section 5 draws the conclusion.
2 Discriminative Spatiotemporal Canonical Correlations
Canonical Correlation Analysis (CCA) has been a standard tool for inspecting linear relationships between two random variables, or two sets of vectors. This was recently extended to two multidimensional data arrays in [14]. The method of spatiotemporal canonical correlations (which is related to the previous work in exploiting planes rather than scan vectors of two videos) is explained as follows. A gesture video is represented by first decomposing an input video clip (i.e. a spatiotemporal volume) into three sets of orthogonal planes, namely XY-, YT- and XT-planes, as shown in Figure 1. This captures posture information in the XY-planes and joint posture/dynamic information in the YT- and XT-planes. Three kinds of subspaces are learnt from the three sets of planes (which are converted into vectors by raster-scanning). Then, gesture recognition is done by comparing these subspaces with the corresponding subspaces of the models by classical canonical correlation analysis, which measures principal angles between subspaces¹. By comparing the subspaces of an input and a model, robust gesture recognition can be achieved up to pattern variations on the subspaces. The similarity of any model Dm and query spatiotemporal data Dq is defined as the weighted sum of the normalized canonical correlations of the three subspaces by

    F(Dm, Dq) = Σ_{k=1..3} wk · N^k(Pm^k, Pq^k)                                                 (2)

    N^k(Pm^k, Pq^k) = (G(Pm^k, Pq^k) − m^k) / σ^k                                               (3)

where P^1, P^2, P^3 denote the matrices containing the first few eigenvectors (in their columns) of the XY-planes, XT-planes and YT-planes respectively, and G(Pm, Pq) is the sum of the canonical correlations computed from Pm, Pq. The normalization parameters with index k are the mean and standard deviation of the matching scores, i.e. of G over all pairwise videos in a validation set for the corresponding planes. The discriminative spatiotemporal canonical correlation is defined by applying the discriminative transformation [10] learnt from each of the three data domains as

    H(Dm, Dq) = Σ_{k=1..3} wk · N^k(h(Q^kT Pm^k), h(Q^kT Pq^k))                                  (4)
¹ Canonical correlations between two d-dimensional linear subspaces L1 and L2 are uniquely defined as the maximal correlations between any two vectors of the subspaces [1]:

    ρi = cos θi = max_{ui∈L1} max_{vi∈L2} ui^T vi                                               (1)

subject to ui^T ui = vi^T vi = 1 and ui^T uj = vi^T vj = 0 for j = 1, ..., i−1. We will refer to ui and vi as the i-th pair of canonical vectors. Multiple canonical correlations are defined by having the next pairs of canonical vectors orthogonal to the previous ones. The solution is given by the SVD of P1^T P2 as

    P1^T P2 = L Λ R^T,   where Λ = diag{ρ1, ..., ρd},

where P1, P2 are the eigen-basis matrices and L, Λ, R are the outputs of the SVD.
where h is a vector orthonormalization function and Q^k is the discriminative transformation matrix learnt over the corresponding set of planes. The discriminative matrix is found so as to maximize the canonical correlations of within-class sets and minimize the canonical correlations of between-class sets, by analogy to the optimization concept of Linear Discriminant Analysis (LDA) (see [10] for details). In the transformed space, gesture video classes are more discriminative in terms of canonical correlations. In this paper, this concept has been validated not only for the spatial domain (XY-subspaces) but also for the spatiotemporal domains (XT-, YT-subspaces).

Discussions. The proposed method is essentially a divide-and-conquer approach: it partitions the original input space into the three different data domains, learns the canonical correlations on each domain, and then aggregates them with proper weights. In this way, the original data dimension N³, where N is the size of each axis, is reduced to 3 × N², so that the data can be conveniently modelled. As shown in Figure 2a-c, each data domain is well characterized by the corresponding low-dimensional subspace (e.g. hand shapes in the XY-planes, joint spatial and temporal information in the YT- and XT-planes).
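The canonical correlations of the footnote (Eq. (1)) and the similarities of Eqs. (2)-(4) can be sketched directly with an SVD; orthonormal basis matrices are assumed, and all names below are ours:

    import numpy as np

    def canonical_correlations(P1, P2):
        # P1, P2: (D, d) orthonormal basis matrices of two subspaces.
        # The singular values of P1^T P2 are the canonical correlations, Eq. (1).
        return np.linalg.svd(P1.T @ P2, compute_uv=False)

    def similarity(planes_m, planes_q, w, m, s):
        # planes_m, planes_q: lists of the three basis matrices (XY, XT, YT);
        # w, m, s: the weights w_k and normalization parameters m_k, sigma_k.
        F = 0.0
        for k in range(3):
            G = canonical_correlations(planes_m[k], planes_q[k]).sum()
            F += w[k] * (G - m[k]) / s[k]                         # Eqs. (2)-(3)
        return F

For the discriminative version of Eq. (4), the basis matrices would first be multiplied by Q^kT and re-orthonormalized (e.g. with a QR decomposition) before computing the canonical correlations.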
Fig. 1. Spatiotemporal Data Representation (a spatiotemporal volume with axes X, Y, T decomposed into XY-, XT- and YT-planes)
Fig. 2. Principal Components and Canonical Vectors: The first few principal components of the (a) XY, (b) XT, (c) YT subspaces of two differently illuminated sequences of the same gesture class (see Figure 5) are shown in the top and bottom rows respectively. The corresponding pairwise canonical vectors are visualized in (d)-(f). Despite the different lighting conditions of the two input sequences, the canonical vectors in each pair (top and bottom) are very much alike, capturing common modes.
Moreover, the method is robust because it uses the mutual (or canonically correlated) components of the pairwise subspaces. By finding the mutual components of maximum correlation, which are the canonical correlations, some information that is undesirable for classification can be filtered out. See Figure 2 for the principal components and canonical vectors of two sequences of the same gesture class captured under different lighting conditions. Whereas the first few principal components mainly corresponded to the different lighting conditions (Figure 2a-c), the canonical vectors (Figure 2d-f) captured the common modes of the two sequences well, being visually the same in each pair. In other words, the lighting variations across the two sets were removed in the process of CCA, as it is invariant to any variations on the subspaces. Many previous studies have reported that lighting variations are often confined to a low-dimensional subspace. In summary, the proposed method has a benefit over directly learning classifiers under small sample size, as illustrated in Figure 3. A high-dimensional input space and a small training set often cause classifiers to overfit the training data and to generalize poorly to new test data. The distribution of test samples taken under different conditions can deviate largely from that of the training set, so that the majority of the test samples of class 1 are misclassified in Figure 3. Nevertheless, the two intersection sets of the train and test sets are still placed in the correct decision regions learnt from the training sets. As discussed above, canonical correlation analysis can be conceptually seen as a process of finding the mutual information (or an intersection set) of any two sets.

Fig. 3. Canonical Correlation Based Classification
3 SIFT Descriptor for Spatiotemporal Volume Data
Edge-based description of each plane of the videos can help the method achieve more robust gesture recognition. In this section we propose a simple and effective SIFT (Scale-Invariant Feature Transform) [7] representation of spatiotemporal data using a fixed grid. As explained, the spatiotemporal volume is broken down into three sets of orthogonal planes (XY-, YT- and XT-planes) in the method. Along each data domain, there is a finite number of planes, which can be regarded as images. Each of these images is further partitioned into M × N patches on a predefined fixed grid, and a SIFT descriptor is obtained from each patch (see Figure 4a). For each image, the feature descriptor is obtained by concatenating the SIFT descriptors of the patches in a predefined order. The SIFT representation of the three sets of planes is directly integrated into the proposed method of Section 2 by replacing the sets of image vectors with the sets of SIFT descriptors prior to canonical correlation analysis. The experimental results show that the edge-based representation generally improves on the intensity-based representation in both the joint space-time domain (YT-, XT-planes) and the spatial domain (XY-planes).
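As a rough illustration of the fixed-grid SIFT representation of a single plane, the sketch below uses OpenCV's SIFT implementation with keypoints placed at the centres of an M × N grid; the grid size and patch scale are assumptions, not values taken from the paper.

import cv2
import numpy as np

def grid_sift_descriptor(plane, M=4, N=4, patch_size=16.0):
    # plane: an 8-bit grayscale XY-, XT- or YT-image.
    h, w = plane.shape[:2]
    sift = cv2.SIFT_create()
    keypoints = []
    for i in range(M):
        for j in range(N):
            # One keypoint at the centre of each grid patch.
            x = (j + 0.5) * w / N
            y = (i + 0.5) * h / M
            keypoints.append(cv2.KeyPoint(float(x), float(y), patch_size))
    _, desc = sift.compute(plane, keypoints)   # (M*N) x 128 descriptors
    return desc.reshape(-1)                    # concatenated in grid order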
SIFT obtained from 3D blocks. This section presents a general 3D extension of SIFT features. Traditional classifiers such as Support Vector Machine (SVM)/ Relevance Vector Machine (RVM) are applied to the video data represented by the 3D SIFT so that they can be compared with the proposed method (with SIFT) in the same data domain. Given a spatiotemporal volume representing a gesture sequence, the volume is firstly partitioned into M × N × T tiny blocks. Within each tiny block, further analysis is done along XY-planes and YT-planes (see Figure 4b). For analysis on a certain plane, say XY-planes, derivatives along X- and Y- dimensions are obtained and accumulated to form several regional orientation histograms (under a 3D Gaussian weighting scheme). For each tiny block, the resultant orientation histograms of both planes are then concatenated to form the final SIFT descriptor of dimension 256. The descriptor for the whole spatiotemporal volume can then be formed by concatenating the SIFT descriptors of all tiny blocks in a predefined order. The spatiotemporal volume is eventually represented as a single long concatenated vector.
Fig. 4. SIFT Representation: (a) SIFT used in [7]. (b) SIFT from 3D blocks (refer to text).
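The following sketch is one plausible reading of the 3D-SIFT block descriptor: it accumulates gradient orientation histograms over the XY- and YT-slices of each block on a 4 × 4 spatial grid with 8 orientation bins (128-D per plane, 256-D per block) and omits the 3D Gaussian weighting; these simplifications are our assumptions, not the authors' implementation.

import numpy as np

def plane_orientation_histogram(patch, grid=4, bins=8):
    # Gradient orientation histogram on a 2-D slice, pooled over a grid x grid layout.
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2.0 * np.pi)
    H, W = patch.shape
    hist = np.zeros((grid, grid, bins))
    rows = np.arange(H) * grid // H
    cols = np.arange(W) * grid // W
    b = (ang * bins / (2.0 * np.pi)).astype(int) % bins
    for i in range(H):
        for j in range(W):
            hist[rows[i], cols[j], b[i, j]] += mag[i, j]
    return hist.ravel()                              # grid*grid*bins = 128-D

def sift3d_block(block, grid=4, bins=8):
    # Accumulate histograms over all XY-slices and all YT-slices of one tiny block.
    xy = sum(plane_orientation_histogram(block[:, :, t], grid, bins)
             for t in range(block.shape[2]))
    yt = sum(plane_orientation_histogram(block[x, :, :], grid, bins)
             for x in range(block.shape[0]))
    return np.concatenate([xy, yt])                  # 256-D per block

def sift3d_descriptor(volume, M=4, N=4, T=1):
    # volume: spatiotemporal array indexed as (x, y, t); output: one long vector.
    xs, ys, ts = (np.array_split(np.arange(s), n)
                  for s, n in zip(volume.shape, (M, N, T)))
    descs = [sift3d_block(volume[np.ix_(bx, by, bt)])
             for bx in xs for by in ys for bt in ts]
    return np.concatenate(descs)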
4 Empirical Evaluation

4.1 Cambridge Hand Gesture Data Set and Experimental Protocol
We have acquired a hand-gesture database² consisting of 900 image sequences of 9 gesture classes. Each class has 100 image sequences (5 different illuminations × 10 arbitrary motions of 2 subjects). Each sequence was recorded at a frame rate of 30 fps and a resolution of 320×240. The 9 classes are defined by 3 primitive hand shapes and 3 primitive motions (see Figure 5). See Figure 5c for example images captured under the 5 different illumination settings.
² The database is available upon request. Contact e-mail: [email protected]
Fig. 5. Hand-Gesture Database: (a) 9 gesture classes formed by 3 shapes and 3 motions: Flat/Leftward (class 1), Flat/Rightward (class 2), Flat/Contract (class 3), Spread/Leftward (class 4), Spread/Rightward (class 5), Spread/Contract (class 6), V-shape/Leftward (class 7), V-shape/Rightward (class 8), V-shape/Contract (class 9); (b) 5 illumination settings.
The data set contains temporally isolated gesture sequences which exhibit variations in initial position, hand posture and speed of movement across sequences. All training was performed on data acquired in a single illumination setting, while testing was done on data acquired in the remaining settings. The 20 sequences in the training set were randomly partitioned into 10 sequences for training and 10 for validation.

4.2 Results and Discussions
We compared the accuracy of 9 different methods:
– Support Vector Machine (SVM) or Relevance Vector Machine (RVM) applied to Motion Gradient Orientation images [11] (MGO SVM or MGO RVM),
– RVM applied to the 3D SIFT vectors described in Section 3 (3DSIFT RVM),
– the canonical correlations (CC) (i.e. the method using G(P^1_m, P^1_q) in (2)), spatiotemporal canonical correlations (ST-CC), and discriminative ST-CC (ST-DCC),
– the canonical correlations of the SIFT descriptors (SIFT CC), spatiotemporal canonical correlations of the SIFT vectors (SIFT ST-CC), and SIFT ST-CC with the discriminative transformations (SIFT ST-DCC).
Fig. 6. Recognition Accuracy: The identification rates (in percent) of all comparative methods are shown for the plain lighting set used for training and all the others for testing
In the proposed method, the weights w_k were set proportionally to the accuracy of the three subspaces on the validation set, and Nearest Neighbor (NN) classification was done with the defined similarity functions. Figure 6 shows the recognition rates of the 9 methods when the plain lighting set (the leftmost in Figure 5c) was used for training and all the others for testing. The approaches using SVM/RVM on the motion gradient orientation images are the worst. As observed in [11], using RVM improved the accuracy of SVM by about 10% for MGO images. However, we obtained much poorer accuracy than in the previous study [11], mainly for the following reasons. The gesture classes in this study are defined by hand shapes as well as motions, and both methods often failed to discriminate gestures which exhibit the same motion with different shapes, as they are mainly based on motion information. The much smaller number of training sequences (of a single lighting condition) is another reason for the performance degradation. The accuracy of the RVM on the 3D-SIFT vectors was also poor. The high dimension of the 3D-SIFT vectors and the small sample size might prevent the classifier from learning properly, as discussed. We measured the accuracy of the RVM classifier for different numbers of blocks in the 3D-SIFT representation (2-2-1, 3-3-1, 4-4-1, 4-4-2 for X-Y-T) and obtained the best accuracy for the 2-2-1 case, which yields the lowest dimension of the 3D-SIFT vectors. Canonical correlation based methods significantly outperformed the previous approaches. The proposed spatiotemporal canonical correlation method (ST-CC) improved on the simple canonical correlation method by about 15%. The proposed discriminative method (ST-DCC) unexpectedly decreased the accuracy of ST-CC, possibly due to overfitting of the discriminative transformation: the training set did not reflect the lighting conditions of the test set. However, note that the discriminative method improved the accuracy when it was applied to the SIFT representations rather than to intensity images (see SIFT ST-CC and SIFT
Table 1. Evaluation of the individual subspaces

                    CC                          SIFT CC
(%)      XY     XT     YT     ST       XY     XT     YT     ST
mean    64.5   40.2   56.2   78.9     70.3   61.8   58.3   80.4
std      1.3    5.9    5.3    2.4      2.1    3.3    4.0    3.2
Table 2. Evaluation for different numbers of blocks in the SIFT representation. E.g. 2-2-1 indicates the SIFT representation where the X, Y, and T axes are divided into 2, 2, 1 segments respectively.

               2-2-1            3-3-1            4-4-1            4-4-2
(%)      ST-CC  ST-DCC    ST-CC  ST-DCC    ST-CC  ST-DCC    ST-CC  ST-DCC
mean      80.3   80.0      78.9   83.8      80.4   85.1      75.9   83.4
std        1.9    2.5       3.6    2.7       3.2    2.8       2.4    0.7
ST-DCC in Figure 6). The three proposed methods using the SIFT representations are better than the respective three methods using intensity images. The best accuracy, 85%, was achieved by SIFT ST-DCC. Table 1 and Table 2 show further results for the proposed method, where all 5 experimental results (corresponding to each illumination set used for training) are averaged. As shown in Table 1, the canonical correlations of the XY subspace obtained better accuracy with smaller standard deviations than the other two subspaces, but all three are relatively good compared with the traditional methods, MGO SVM/RVM and 3DSIFT RVM. Using the SIFT representation considerably improved the accuracy over intensity images for each individual subspace, whereas the improvement for the joint representation was relatively small. Table 2 shows the accuracy of ST-CC and ST-DCC for different numbers of blocks in the SIFT representation. The best accuracy was obtained for the 4-4-1 case for XYT (each number indicates the number of divisions along one axis). Generally, using the discriminative transformation improved the accuracy of ST-CC for the SIFT representation. Note that the accuracy of the method is not sensitive to the number of blocks, which is practically important. Also, the proposed approach based on canonical correlations is computationally cheap, requiring O(3 × d³) operations, where d is the dimension of each subspace (which was 10), and thus facilitates efficient gesture recognition on a large data set.
5 Conclusion
A new subspace-based method has been proposed for gesture recognition under small sample size. Unlike typical classification approaches that operate directly on the input space, the proposed method reduces the input dimension using the three sets of orthogonal planes. The method provides robust spatiotemporal volume
matching by analyzing the mutual information (or canonical correlations) between any two gesture sequences. Experiments on the 900 gesture sequences showed that the proposed method significantly outperformed the traditional classifiers and yielded the best classification results when the discriminative transformations and SIFT descriptors were used jointly. The method is also practically attractive as it does not involve significant parameter tuning and is computationally efficient.
References
1. Björck, Å., Golub, G.H.: Numerical methods for computing angles between linear subspaces. Mathematics of Computation 27(123), 579–594 (1973)
2. Bowden, R., Windridge, D., Kadir, T., Zisserman, A., Brady, M.: A linguistic feature vector for the visual interpretation of sign language. In: ECCV, pp. 390–401 (2004)
3. Darrell, T., Pentland, A.: Space-time gestures. In: Proc. of CVPR, pp. 335–340 (1993)
4. Efros, A.A., Berg, A.C., Mori, G., Malik, J.: Recognizing action at a distance. In: Proc. of ICCV, pp. 726–733 (2003)
5. Freeman, W., Roth, M.: Orientation histogram for hand gesture recognition. In: Int'l Conf. on Automatic Face and Gesture Recognition (1995)
6. Just, A., Rodriguez, Y., Marcel, S.: Hand posture classification and recognition using the modified census transform. In: Int'l Conf. on Automatic Face and Gesture Recognition (2006)
7. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60(2), 91–110 (2004)
8. Shechtman, E., Irani, M.: Space-time behavior based correlation. In: Proc. of CVPR 2005, pp. 405–412 (2005)
9. Starner, T., Pentland, A., Weaver, J.: Real-time American sign language recognition using desk and wearable computer based video. IEEE Trans. Pattern Anal. Mach. Intell. 20(12), 1371–1375 (1998)
10. Kim, T., Kittler, J., Cipolla, R.: Discriminative learning and recognition of image set classes using canonical correlations. IEEE Trans. on PAMI 29(6), 1005–1018 (2007)
11. Wong, S., Cipolla, R.: Real-time interpretation of hand motions using a sparse Bayesian classifier on motion gradient orientation images. In: Proc. of BMVC 2005, pp. 379–388 (2005)
12. Niebles, J.C., Wang, H., Fei-Fei, L.: Unsupervised learning of human action categories using spatial-temporal words. In: BMVC (2006)
13. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: A local SVM approach. In: ICPR 2004, pp. 32–36 (2004)
14. Kim, T., Wong, S., Cipolla, R.: Tensor Canonical Correlation Analysis for Action Classification. In: CVPR (2007)
15. Hardoon, D., Szedmak, S., Taylor, J.S.: Canonical correlation analysis: An overview with application to learning methods. Neural Computation 16(12), 2639–2664 (2004)
Motion Observability Analysis of the Simplified Color Correlogram for Visual Tracking

Qi Zhao and Hai Tao
Department of Computer Engineering, University of California at Santa Cruz, Santa Cruz, CA 95064
{zhaoqi,tao}@soe.ucsc.edu
Abstract. Compared with the color histogram, where the position information of each pixel is ignored, a simplified color correlogram (SCC) representation encodes the spatial information explicitly and enables an estimation algorithm to recover the object orientation. This paper analyzes the capability of the SCC (in a kernel based framework) in detecting and estimating object motion and presents a principled way to obtain motion observable SCCs as object representations to achieve more reliable tracking. Extensive experimental results demonstrate the reliability of the tracking procedure using the proposed algorithm.
1 Introduction
The computer vision community has witnessed the development of several excellent tracking algorithms that use color statistics, e.g. the color histogram, as representations. These statistical features can be convolved with an isotropic kernel to allow gradient estimation of the representation [1]. One inherent limitation of such kernel based methods is the singularity problem, where the representation is blind to certain motion. Most existing kernel based tracking algorithms are concerned only with tracking object locations [1], or object locations and scales [2]. Since isotropic kernels are often used [1,3,4], rotational motion cannot be estimated with these methods. Recently, Zhao and Tao [5] proposed the simplified color correlogram (SCC) representation to efficiently track location and orientation simultaneously. Although the kernel is also rotationally symmetric, the underlying SCC representation is sensitive to orientation changes. This property makes the representation capable of tracking rotational as well as translational motion. As in most kernel based algorithms, the assumption is that the statistics of the SCC feature are sufficient to determine the motion of the object [4]. However, this assumption needs to be validated. This paper shows that in certain degenerate cases the problem may become ill-conditioned, i.e., translational and/or rotational motion may not cause changes to the SCC, so that the motion is not observable. In this study, we derive a criterion to evaluate the numerical stability of the tracking solution, according to which schemes for SCC selection are designed. The paper is organized as follows. Section 2 reviews the simplified color correlogram (SCC). Section 3 analyzes the properties of the SCC in a kernel based
framework, and proposes the solution to obtain the motion observable SCCs. Section 4 discusses implementation details. Section 5 shows experimental results and demonstrates the improvement over the standard mean shift algorithm, and section 6 concludes the paper.
2 Introduction to Simplified Color Correlogram (SCC)
The color correlogram expresses the correlation between color pairs in an image and has been commonly used in the image retrieval/indexing literature [6]. Zhao and Tao [5] have recently proposed a simplified version of the color correlogram (SCC) for tracking purposes. Instead of including pixel pairs along all directions and with a set of distances, the SCC only counts pairs along one or several axes, i.e., pre-selected directions, with predefined distances. Formally, the SCC with L axes is defined as

S_{u,v} = Pr(I(p_1) = u ∧ I(p_2) = v | f(p_1 − p_2) = (θ, d)).    (1)

Here, f is a function that returns the direction and the distance of a pixel pair, representing the spatial relationship of the two pixels p_1 and p_2; θ ∈ {θ_l, l = 1, ..., L} and d ∈ {d_l, l = 1, ..., L}, where L is the number of axes, θ_l is the direction of axis l and d_l is the pair distance along axis l. We use the L2 norm to measure the distance between pixels. Though the conclusions about the singularity problem in kernel based representations made in this study are not restricted to the SCC representation, we focus our discussion on the SCC for the following reasons [5]:
1. The SCC achieves a natural integration of color and spatial information.
2. Since certain directions are emphasized, the SCC is effective in manifesting rotational variations.
3. The SCC is computationally inexpensive.
4. Being a middle ground between template based methods and pure statistics based representations, the SCC can inherit desirable properties from both sides.
Similar to the conventional color histogram, the SCC is integrated into a kernel framework to allow efficient localization, and thereby the singularity problem arises. This paper presents two methods to approach this issue in the SCC context, where both translation and rotation are concerned. One is to select more than one axis to form a multi-axis SCC, so that motion ignored by pairs along one axis can be recovered along other axes, as will be justified later. This strategy suffices in most cases. However, efficiency considerations suggest an alternative: obtaining one optimal axis and its corresponding pair distance, so that the resulting SCC is the most sensitive to all different motions. Details of the two approaches are provided in the next section.
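A minimal sketch of a single-axis SCC follows, assuming the colors have already been quantized into n_colors bins; the function name and the rounding of the offset to the nearest pixel are our own choices, not taken from the paper.

import numpy as np

def simplified_color_correlogram(labels, theta, d, n_colors):
    # labels: 2-D array of quantized color indices in [0, n_colors).
    dy = int(round(d * np.sin(theta)))
    dx = int(round(d * np.cos(theta)))
    H, W = labels.shape
    S = np.zeros((n_colors, n_colors))
    for y in range(H):
        for x in range(W):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < H and 0 <= x2 < W:      # both pixels inside the window
                S[labels[y, x], labels[y2, x2]] += 1
    total = S.sum()
    return S / total if total > 0 else S         # Pr(u, v | offset), Eqn. (1)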
3 Motion Observability Analysis
The SCC based tracking method enables the detection of both translational and rotational motion, therefore reliable tracking in this context requires that both types of motion be distinctly observed and reliably recovered.
3.1 Representing Objects Using SCC
In this subsection, we first introduce the notation needed for the later analysis. Each pair of pixels counted in the SCC representation can be parameterized by a 3-dimensional vector Φ = [cx, cy, θ]^T, where (cx, cy) are the image coordinates of the midpoint of the pair and θ is the angle between the axis of the object and the object coordinate system. Consider for a moment a target model/candidate for the SCC with one axis l; the index l is omitted to keep the notation simple.

Target Model. Similar to [4], we define the matrix form of the target model as

M = α U_M^T K(0).    (2)

In Eqn. (2), U_M = [u_11, u_12, ..., u_1m, u_21, ..., u_2m, ..., u_m1, u_m2, ..., u_mm], where u_rs = [δ(I(Φ_111) − I_rs), δ(I(Φ_112) − I_rs), ..., δ(I(Φ_WHO) − I_rs)]^T, r, s = 1, ..., m. I(Φ_ijk) represents the colors of the pixel pair Φ_ijk in the image I. W, H and O are the numbers of cx, cy and θ values considered in the SCC. If the colors of the pixels in the pair Φ_ijk are r and s, then the corresponding element of the vector is set to 1, otherwise it is 0. The subscript M stands for "model" and α normalizes the representation. K(0) = [K(Φ_111/h), K(Φ_112/h), ..., K(Φ_WHO/h)]^T, where K is a kernel function that assigns a smaller weight to locations and orientations farther from the center of the object, h is the kernel radius, and the kernel is centered at 0. By definition, L different axes yield L different target models M^1, ..., M^L.

Target Candidate. Similarly, the target candidate is defined as

C(Φ_0) = β U_C^T K(Φ_0),    (3)

where Φ_0 is the initialized location and orientation in the current frame. β and U_C are defined in the same way as α and U_M for the target model, and the subscript C denotes "candidate". K(Φ_0) = [K((Φ_111 − Φ_0)/h), K((Φ_112 − Φ_0)/h), ..., K((Φ_WHO − Φ_0)/h)]^T. L target candidates C^1, ..., C^L can be defined for the L axes.
3.2 The Objective Function and Solution for Reliable Tracking
We focus on the single-axis case in this subsection; the multi-axis case is analyzed in the next subsection. For mean shift based tracking algorithms [1,5], the objective is to seek the maximum of the Bhattacharyya coefficient [7]. Its well-known connection with the Matusita metric makes it possible to analyze the Matusita metric instead of the Bhattacharyya coefficient, which better illustrates the inherent problem of kernel based tracking [4]. Using the notation of Section 3.1, the tracking objective under the Matusita metric is

min_Φ D_M(Φ) = min_Φ ||√M − √C(Φ)||² = min_Φ Σ_{u,v} (√M_{u,v} − √C_{u,v}(Φ))²,    (4)

where Φ is the object location and orientation in the current frame.

A Newton-style iterative procedure is applied to convert this optimization problem to a more explicit form (derivations are provided in the Appendix):

d(C(Φ_0))^{−1/2} U_C^T J_K(Φ_0) ΔΦ = 2(√M − √C(Φ_0)),    (5)

where Φ_0 is the initialized object location and orientation for the current frame and d(C(Φ_0)) denotes the matrix with C(Φ_0) on its diagonal. In Eqn. (5), J_K(Φ_0) = [∇_Φ K((Φ_111 − Φ_0)/h), ∇_Φ K((Φ_112 − Φ_0)/h), ..., ∇_Φ K((Φ_WHO − Φ_0)/h)]^T, where ∇_Φ K((Φ_ijk − Φ_0)/h) = (2/h²)(Φ_ijk − Φ_0) g(||(Φ_ijk − Φ_0)/h||²), g(x) = −k′(x), and k(||x||²) = K(x).

Denoting A = d(C(Φ_0))^{−1/2} U_C^T J_K(Φ_0) and converting the matrix before ΔΦ to a square matrix for further analysis, we obtain

A^T A ΔΦ = 2 A^T (√M − √C(Φ_0)).    (6)

Therefore the solution to the optimization problem in this 3-dimensional case is unique if and only if the 3 × 3 matrix A^T A is of full rank. Additionally, the stability of the solution depends on the magnitude of its condition number. In the single-axis case, the SCC with the parameters (θ, d) corresponding to the smallest condition number of A^T A is the optimal SCC.
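A minimal sketch of the resulting stability test: given a routine (here the assumed helper build_A) that assembles A for a candidate axis, the single-axis SCC is chosen by minimizing the condition number of A^T A.

import numpy as np

def solution_condition_number(A):
    return np.linalg.cond(A.T @ A)           # large value: motion poorly observable

def select_optimal_axis(build_A, thetas, distances):
    # build_A(theta, d) -> A is an assumed helper assembling d(C)^(-1/2) U_C^T J_K.
    return min(((t, d) for t in thetas for d in distances),
               key=lambda td: solution_condition_number(build_A(*td)))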
3.3 SCC with Multiple Axes
If multi-axis correlograms are used, denote

A_L = [ d(C^1(Φ_0))^{−1/2} (U_C^1)^T J_K^1(Φ_0) ; ... ; d(C^L(Φ_0))^{−1/2} (U_C^L)^T J_K^L(Φ_0) ]  (row blocks stacked vertically),

and

A^l = d(C^l(Φ_0))^{−1/2} (U_C^l)^T J_K^l(Φ_0),   l = 1, ..., L,

then we have A_L^T A_L = Σ_{l=1}^{L} (A^l)^T A^l.

In this paper, we explore this multi-axis problem further, considering the simple yet effective two-axis case. A useful property of the positive semi-definite matrices (A^1)^T A^1 and (A^2)^T A^2 is

min(cond((A^1)^T A^1), cond((A^2)^T A^2)) ≤ cond(A_2^T A_2) ≤ max(cond((A^1)^T A^1), cond((A^2)^T A^2)),    (7)

where A_2^T A_2 = (A^1)^T A^1 + (A^2)^T A^2. These inequalities indicate that the condition number of a two-axis SCC lies between the two condition numbers of the corresponding single-axis SCCs. A consequence is that an SCC defined with two axes is less likely to produce unfavorable condition numbers, since this would require both corresponding single-axis SCCs to have sufficiently large condition numbers. To make full use of this point, the two axes should be as independent as possible; in our work, two orthogonal axes are used.
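A corresponding sketch for the two-orthogonal-axis case simply sums the single-axis normal matrices before taking the condition number; build_A is the same assumed helper as above.

import numpy as np

def two_axis_condition_number(build_A, theta, d):
    A1 = build_A(theta, d)                       # first axis
    A2 = build_A(theta + np.pi / 2.0, d)         # orthogonal axis
    M = A1.T @ A1 + A2.T @ A2                    # A_2^T A_2 of Eqn. (7)
    return np.linalg.cond(M)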
3.4 Visual Interpretations and Verifications on Example Patterns
Using the SCC as an object representation, an image patch can have different SCCs, with some being more favorable than others in terms of motion observability. To provide a visual interpretation of the SCC representation, we analyze the matrix A further. Its rows, indexed by the color pairs (1,1), ..., (1,m), ..., (m,m), are

d(C(Φ_0))^{−1/2} U_C^T J_K(Φ_0) = [ (2/h²) C_{1,1}^{−1/2} Σ_{I(Φ_ijk)=I_11} (Φ_ijk − Φ_0) g(||(Φ_ijk − Φ_0)/h||²) ; ... ; (2/h²) C_{1,m}^{−1/2} Σ_{I(Φ_ijk)=I_1m} (Φ_ijk − Φ_0) g(||(Φ_ijk − Φ_0)/h||²) ; ... ; (2/h²) C_{m,m}^{−1/2} Σ_{I(Φ_ijk)=I_mm} (Φ_ijk − Φ_0) g(||(Φ_ijk − Φ_0)/h||²) ].    (8)

To obtain a unique solution for ΔΦ in Eqn. (5), at least three of the row vectors of the above matrix need to be linearly independent. In this paper, we analyze two typical image patterns that are inherently unobservable to certain motion.
Fig. 1. Illustration for visual interpretation of SCC choice
Concentric Circles (Fig. 1(a)): In color histogram based kernel methods, concentric circles are regarded as a degenerate case [4], where translation cannot be detected. In the SCC based kernel method, due to the spatial information encoded in the pixel pairs, translation along the SCC axis can now be observed. Without loss of generality, we set the SCC axis along the y direction, as shown in Fig. 1(a), and validate the translation observability through Eqn. (8) by examining the weighted distance vectors (Φ_ijk − Φ_0) g(||(Φ_ijk − Φ_0)/h||²) for pixel pairs of two given distinct colors. The cx component (recall that Φ = [cx, cy, θ]^T) of this term is cancelled out by every two corresponding pixel pairs, i.e., two pairs symmetric w.r.t. the SCC axis, like pairs a and b in Fig. 1(a). However, the cy component cannot always be cancelled out by corresponding pairs, i.e., pairs symmetric w.r.t. the x direction. For example, pair c in Fig. 1(a) is of color (i, j), while its corresponding pair (pair d) is of color (j, i), therefore they
can cancel out each other neither for C_ij nor for C_ji. Since the representation is rotationally symmetric, the θ components are all cancelled out. As a result, the row vectors in Eqn. (8) are of the form κ[0, 1, 0]^T. The intuition behind this is that among the three degrees of motion, only translation along the SCC axis can cause sufficient changes to the SCC. Although one may use two axes to detect translation in both dimensions, blindness to rotation is the inherent limitation of concentric circles.

Parallel Stripes (Fig. 1(b)): Independent of the axis choice in the SCC, the parallel stripes pattern is sensitive to motion along the x direction, while blind to motion along the y direction. However, its observability to rotation depends on the direction of the SCC axis. If the axis is defined to be along the x direction, then the elements for the orientation dimension in (Φ_ijk − Φ_0) g(||(Φ_ijk − Φ_0)/h||²) cancel out. Intuitively, this means that a slight rotation does not cause enough change to the SCC. On the other hand, if the axis is some degrees away from the x direction, then rotation does make a difference in the SCC by causing some pixels at the boundary between two stripes to take the other color.
Fig. 2. Quantitative relationship between condition numbers and parameters (θ, d): (a) Image patch (96 × 96), (b)-(d) Condition numbers w.r.t. axis directions with different pair distances: (b) d = 5 (Symbols at the top line represent overflow values), (c) d = 10, (d) d = 40
We evaluated the quantitative relationship between condition numbers and the parameters (θ, d) for the image patch with the Parallel Stripes pattern shown in Fig. 2(a). The axis direction θ is defined in the object coordinate system, i.e., the x direction corresponds to 0 degrees, and θ increases counterclockwise. We observe the condition numbers of the single-axis SCCs with different axis directions and those of the two-orthogonal-axis SCCs, where the directions of the axes are θ and θ + 90. Figs. 2(b-d) show the relationship of the condition numbers w.r.t.
different axis directions with pair distances of 5, 10, and 40, respectively. From the illustrated outputs, the following conclusions can be made:
1. The pair distance cannot be too small. This is due to the discrete nature of the image and the fact that the colors for the SCC are assigned in a nearest-neighbor manner. For short pixel pairs, a small image rotation may not cause any change to the SCC. As shown in Fig. 2, the results with a pair distance of 5 pixels (Fig. 2(b)) are much poorer than those with longer distances (Fig. 2(c-d)).
2. Some axis directions are more favorable than others in terms of stability. For the patch shown in Fig. 2(a), the most favorable axis direction for single-axis SCCs is along the y direction (θ = 90 in Fig. 2(c-d)).
3. When two axes are used, due to the inequalities given in Eqn. (7), no matter what the directions of the axes are, the condition number is between the two condition numbers generated independently by the two single-axis SCCs. Figs. 2(c-d) show that the average condition numbers provided by the two-orthogonal-axis SCCs are significantly smaller than those generated by the single-axis ones.
4 Implementation Details
For most real objects, textures are irregular enough to avoid the extreme cases shown in Section 3.4, so a two-orthogonal-axis SCC suffices in most cases. However, in applications where speed is an important factor, a single-axis SCC is greatly favored for efficiency. In this case, we search for the optimal axis direction in the orientation space to obtain the SCC with the smallest condition number. The pair distance is an important parameter in that it influences not only the SCC's sensitivity to rotation, but also the stability of the solution. On one hand, the larger the pair distance, the more observable the orientation changes, as discussed in Section 3.4; on the other hand, an SCC with a large pair distance tends to have too few pairs counted (both pixels should lie in the tracking window), which decreases the stability of the tracking. By trial and error, we set the default distance to max((l + w)/8, 10), where l and w are the length and width of the kernel. The lower bound of 10 pixels ensures the stability of the tracker when the object is small. In this paper, similar to [5], the mean shift algorithm is extended to a joint translation-rotation domain to locate the object position and orientation simultaneously, in a gradient descent manner. However, the proposed idea can easily be incorporated into any tracking framework other than the mean shift one.
5 Experiments
The usefulness of the proposed schemes for ensuring reliable tracking has been demonstrated on vehicle and pedestrian sequences under various environmental conditions. In the following, the first two real-time tracking tasks compute and use optimal single-axis SCCs as representations, and two-orthogonal-axis SCCs are used for the other sequences.
Fig. 3. Car-Chasing Sequence (panels a-1, a-2, b-1, b-2, c-1, c-2)
Car-Chasing: The Car-Chasing sequence is a live video of 2250 frames. It has been tested with the standard mean shift (MS) based tracking algorithm and with the single-axis SCC based tracking algorithms, both with and without optimal SCC selection. The possible problems in vehicle tracking with the MS tracker are revealed in Fig. 3: (1) loss of track tends to occur when the car makes turns (Fig. 3(a-1)); (2) the fixed orientation of the tracking window makes scale adaptation difficult to realize, and the mismatch of the window to the object makes the tracker sensitive to background clutter (Fig. 3(b-1)). In Fig. 3(c-1), although the window direction adapts to the object, using a non-optimal single-axis SCC makes the tracker less stable. In Fig. 3(a-2), (b-2), (c-2), we show results of the optimal single-axis SCC based tracker, which tracks the car throughout the entire sequence. The results demonstrate that challenging issues such as object rotation, heavy occlusion, background clutter, scale changes and motion blur are handled elegantly.
PETS 2001 Data: The proposed schemes are further evaluated on the PETS 2001 dataset [8]. Compared with color histogram based methods, the SCC based tracker successfully removes the restrictions imposed by certain object shapes and/or camera viewpoints. The proposed optimal single-axis SCC selection scheme further ensures both the reliability and the efficiency of the tracker. Sample frames of the experimental outputs are shown in Fig. 4.
Multiple Human Parts: Other experiments are shown in Fig. 5. The two-orthogonal-axis SCCs are used to track multiple human parts. Although only
Fig. 4. PETS 2001 Sequences
Fig. 5. Stretching and Walking Sequences
color information is extracted, the reliable tracking results indicate the algorithm's potential as a useful module in human tracking or behavior analysis tasks.
6 Conclusion
This paper analyzes the capability of the SCC (in a kernel based framework) to recover both translation and rotation. A criterion to evaluate the SCC in terms of motion estimation is provided to guide the SCC selection. Two-orthogonal-axis SCCs are shown to be sufficient in practice, while in tasks with high speed requirements, optimal single-axis SCCs are desirable. The discussion is focused on, but not limited to, the SCC representation. The SCC in an extended mean shift tracking framework is not computationally expensive: the tracker runs comfortably at 30 fps on a PIV 3.20 GHz PC.
References
1. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. PAMI 25(5), 564–577 (2003)
2. Collins, R.: Mean-shift blob tracking through scale space. In: CVPR, pp. 234–240 (2003)
3. Fan, Z., Wu, Y.: Multiple collaborative kernel tracking. In: CVPR II, pp. 502–509 (2005)
4. Hager, G., Dewan, M., Stewart, C.: Multiple kernel tracking with SSD. In: CVPR I, pp. 790–797 (2004)
5. Zhao, Q., Tao, H.: Object tracking using color correlogram. In: VS-PETS, pp. 263–270 (2005)
6. Huang, J.: Color-spatial image indexing and applications. PhD thesis, Cornell University (1998)
7. Kailath, T.: The divergence and Bhattacharyya distance measures in signal selection. IEEE Trans. on Comm. Tech. 15(1), 52–60 (1967)
8. http://visualsurveillance.org/PETS2001/
Appendix: Derivation of Eqn. (5)
Optimization of the objective function min ||√M − √C(Φ)||² results in the following equation: d(C(Φ_0))^{−1/2} U_C^T J_K(Φ_0) ΔΦ = 2(√M − √C(Φ_0)), where ΔΦ is the motion vector in terms of both translation and rotation, and Φ_0 is the initialized object center in the current frame. Applying the Taylor expansion to √C(Φ) and dropping higher-order terms yields

√C(Φ) = √C(Φ_0) + (d√C(Φ)/dΦ)|_{Φ=Φ_0} ΔΦ.    (9)

Since C(Φ_0) = U_C^T K(Φ_0), we have dC(Φ)/dΦ|_{Φ=Φ_0} = U_C^T ∇K(Φ_0). Introducing this into Eqn. (9), we have

√C(Φ) = √C(Φ_0) + (1/2) d(C(Φ_0))^{−1/2} U_C^T ∇K(Φ_0) ΔΦ,    (10)

where d(C(Φ_0)) is the matrix with C(Φ_0) on its diagonal. Rewriting the objective function in terms of the motion vector ΔΦ, we obtain

argmin_{ΔΦ} ||√M − √C(Φ_0 + ΔΦ)||.    (11)

Substituting Eqn. (10) into Eqn. (11), the resulting objective function is

argmin_{ΔΦ} ||√M − √C(Φ_0) − (1/2) d(C(Φ_0))^{−1/2} U_C^T ∇K(Φ_0) ΔΦ||,    (12)

the solution of which equates to the solution of the linear system

(1/2) d(C(Φ_0))^{−1/2} U_C^T ∇K(Φ_0) ΔΦ = √M − √C(Φ_0).    (13)

Denoting ∇K(Φ_0) as J_K(Φ_0) and scaling both sides by a factor of 2 results in

d(C(Φ_0))^{−1/2} U_C^T J_K(Φ_0) ΔΦ = 2(√M − √C(Φ_0)).    (14)
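A minimal sketch of one update step implied by Eqn. (14), solved in the least-squares sense; the variable names and the small regularization constant are our own assumptions.

import numpy as np

def motion_update(C0, U_C, J_K, M):
    # C0, M: flattened SCC vectors of candidate and model; U_C, J_K: matrices of Eqn. (5).
    A = np.diag(1.0 / np.sqrt(C0 + 1e-12)) @ U_C.T @ J_K    # d(C)^(-1/2) U_C^T J_K
    b = 2.0 * (np.sqrt(M) - np.sqrt(C0))
    dPhi, *_ = np.linalg.lstsq(A, b, rcond=None)            # solves A dPhi ~ b
    return dPhi                                             # [d cx, d cy, d theta]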
On-Line Ensemble SVM for Robust Object Tracking

Min Tian¹, Weiwei Zhang², and Fuqiang Liu¹
¹ Broadband Wireless Communication and Multimedia Laboratory, Tongji University, Shanghai, China
² Microsoft Research Asia, Beijing, China
[email protected], [email protected], [email protected]
Abstract. In this paper, we present a novel visual object tracking algorithm based on an ensemble of linear SVM classifiers. There are two main contributions. First, we propose a simple yet effective way of updating a linear SVM classifier on-line, in which useful "Key Frames" of the target are automatically selected as support vectors. Second, we propose an on-line ensemble SVM tracker that can effectively handle target appearance variation. The proposed algorithm makes better use of history information, which leads to better discrimination of the target from the surrounding background. The proposed algorithm is tested on many video clips, including some publicly available ones. Experimental results show the robustness of the proposed algorithm, especially under large appearance changes during tracking.
1 Introduction

Visual tracking is an important subject in computer vision with a variety of applications. One of the main challenges that limit the performance of a tracker is appearance change caused by variations in pose, illumination or viewpoint. In order to develop a robust tracker, much former work has addressed these problems; however, robust object tracking still remains a big challenge.
Object tracking can be considered as an optimization problem: the tracking algorithm searches for the region with the locally maximal similarity score. In [1], the similarity is defined as the SSD between the observation and a fixed template. In [6], mean shift is proposed as a nonparametric density gradient estimator to search for the most similar region by computing the similarity between the color histograms of the target and the search window. Object tracking can also be considered as a state estimation problem. In early work, the Kalman filter or its variants were frequently used. However, the Kalman filter cannot handle non-Gaussian and non-linear cases well. To address these cases, sequential Monte Carlo methods have been applied to tracking, among which the Particle Filter (PF) [3,8,12,4] is the most popular. Object tracking can also be regarded as a template updating problem. A classical subspace tracking method was proposed by Black et al. [2]. Ross and Lim [13] extended Eigen-tracking by on-line incremental subspace updating. In another direction, Jepson [9] models the target as a mixture of a stable component, outliers, and a two-frame transient component, and an on-line EM algorithm is used to estimate the parameters of each
component. Considering the tracker as a binary classifier has become very popular recently. [18] proposed a transductive learning method for tracking, in which a D-EM algorithm is used for transducing color classifiers and selecting a good color space to determine whether each pixel belongs to the foreground or the background. The limitations of these color trackers are clear, because they can only work on color image sequences. In [10] an off-line SVM is used to distinguish the target vehicle from the background. Similar to [10], in [12] an Adaboost classifier is trained off-line to detect hockey players, providing a proposal distribution that improves the robustness of the tracker. Since they need a large amount of training data, it is not easy to extend those approaches to general object tracking. In order to obtain a general tracker, [16] proposed on-line learning of an ensemble of pixel based weak classifiers, where tracking is done by classifying pixels into foreground and background. However, in that method the features are limited to an 11-D low-dimensional vector including pixel colors and a local orientation histogram. Helmut Grabner proposed an on-line version of the Adaboost algorithm in [17], which can select features on-line to obtain a strong classifier. That work is roughly similar in spirit to [16]; however, it can choose on-line the features with the most discriminative ability.
Among all the topics discussed in existing classifier based trackers, we found that history information, which is very important for tracking, has not received much attention. In order to make better use of this important information, we propose an ensemble SVM classifier based tracking algorithm. In our algorithm, the linear SVM can automatically select the "Key Frames" of the target as support vectors. Moreover, by combining several linear SVM classifiers into an ensemble, history information can be used more sensibly and the risk of drift can be decreased more effectively. Finally, because the ensemble method can automatically adjust each SVM classifier's weight on-line, some off-line classifiers can also be trained and added into the framework to form a more robust tracker.
The paper is organized as follows. Section 2 presents the proposed on-line updating SVM based tracker. Section 3 presents the ensemble SVM based tracker. Experiments and conclusions are given in Section 4 and Section 5, respectively.
2 On-Line SVM Tracker

In this paper, we treat tracking as a binary classification problem and choose the SVM as our basic classifier. In the following sections, we will show that the SVM can automatically select "Key Frames" of the target from the historic frames, which is one of the most important factors of the proposed algorithm. First of all, let us introduce our on-line SVM classifier based tracker.

2.1 SVM Classifier Based Tracking

Within the context of object tracking, we define the target object region and its surroundings as the positive and negative data sources respectively, as shown in Fig. 1(b). Our goal is to learn an SVM classifier which can separate the positive and negative data in the new frame. Starting from the first frame, the positive and negative samples are used to train the SVM classifier. Then the search region (Fig. 1(b)) can be estimated in the next frame. Finally, the target region in the next frame is located at the local maximum score within the search region.
Fig. 1. (a) The confidence map of the search region. (b) The object region, search region and the context region. (c) Demonstration of a linear SVM. The filled circles and rectangles are the “support vectors”.
2.2 On-Line Updating Linear SVM

One of the most difficult tasks for a tracker is how to update itself on-line so that it adapts to appearance changes of the target. Some former methods use a rapid update model, i.e. x_t = x̂_{t−1}, but this is dangerous and may cause the drift problem [11]. Some off-line trackers, such as [7], benefit from off-line selected "Key Frames", which make the tracker more robust against drift. Our idea is inspired by both: the difference is that our tracker not only records the "Key Frames" of the target as history information, but also updates on-line to decrease the risk of drift. We consider the updating of the classifier as an on-line learning process, and propose a simple yet effective way of on-line updating the linear SVM classifier. The details are described in Algorithm 1.

Algorithm 1. On-line Linear SVM Tracking & Updating
Input: video frames I_n for processing (n = 1, ..., L); rectangular region R of the target.
Output: rectangles R_n of the target object's region (n = 1, ..., L).
Initialization for the first frame I_n (n = 1):
– Extract positive and negative samples S_1 = {x_i, y_i}_{i=1}^{N}, where y_i ∈ {−1, +1}, corresponding to the target region R.
– Train a linear SVM to get f_1(x) = w_1 x + b_1 and its support vectors V_1 = {x_i, y_i}_{i=1}^{M}.
For each new frame I_n (n > 1):
– Find the region R_n with the local maximum score given by f_{n−1}(x). Here x denotes the search window's feature vector: R_n = arg max_x f_{n−1}(x) = arg max_x (w_{n−1} x + b_{n−1}).
– If f_{n−1}(x_{R_n}) > 0, go to the next step to get a new SVM. Else stop updating, take R_n as the target region, and go to the next frame.
– Refresh the positive samples P_n = V^+_{n−1} ∪ S^+_n and the negative samples N_n = V^−_{n−1} ∪ S^−_n. Here V^+_{n−1} and V^−_{n−1} are the positive and negative support vectors of f_{n−1}(x), and S^+_n and S^−_n are the positive and negative samples of the current frame.
– Retrain the SVM on the new samples to obtain the updated classifier f_n(x) = w_n x + b_n.
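A minimal sketch of the retraining step of Algorithm 1, assuming scikit-learn's linear-kernel SVC; carrying the previous support vectors into the new training set is what retains the "Key Frames".

import numpy as np
from sklearn.svm import SVC

def update_linear_svm(prev_svm, X_prev, y_prev, X_new, y_new):
    # Keep only the support vectors of f_{n-1} (the retained "Key Frames") and
    # retrain together with the current frame's positive/negative samples.
    sv = prev_svm.support_                    # indices of support vectors in X_prev
    X = np.vstack([X_prev[sv], X_new])
    y = np.concatenate([y_prev[sv], y_new])
    new_svm = SVC(kernel='linear').fit(X, y)
    return new_svm, X, y                      # (X, y) become X_prev, y_prev next time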
By on-line updating, the SVM tracker can adjust its hyperplane to maximize the margin between the new positive and negative samples. The support vectors transferred from frame to frame contain important "Key Frames" of the target object from the previous tracking process (see Fig. 2(b)). Figure 2 shows the tracking results and part of the "Key Frames" selected by the SVM at the final frame; the video is provided by Jepson [9]. The proposed tracker can adapt to the face under appearance variation and distinguish it from the cluttered background (Fig. 2(a)).
Fig. 2. (a) Tracking results on frames 1, 206, 366, 588, 709, 761, 973 and 1131. (b) At the end of the tracking task, 182 positive support vectors contain enough history information; the images displayed at the bottom are some of these support vectors.
3 Ensemble SVM Classifier Based Tracking

Although the proposed SVM based tracker is powerful thanks to on-line updating, we found that several issues still need to be addressed in real-world video clips: (a) the variation of the target is very large; (b) the tracker is disturbed by scale variation, partial occlusion or motion blur in certain frames. These two issues may lead the tracking algorithm to drift and finally fail. In order to address them, we propose an ensemble SVM algorithm in this section.

3.1 On-Line Building the Ensemble of SVMs

Our algorithm starts with an SVM trained with labeled data in the first frame. After that, in each frame a new SVM may be added, depending on how well the current tracking result is matched by the previous SVM classifiers. The match ratio r_m is defined in equation (1), where U(x) is a step function that equals 1 when x is above zero and 0 otherwise. Here {x_k}_{k=1}^{N^+} are the positive samples and N^+ is their number. The larger r_m is, the better the current component matches the positive samples. If r_m is below a ratio threshold, a new SVM should be added.

r_m = Σ_{k=0}^{N^+} U(f_m(x_k) − 1) / (N^+ − Σ_{k=0}^{N^+} U(f_m(x_k) − 1))    (1)

So after several frames, many SVM classifiers have been generated and updated during different periods, as illustrated in Fig. 3.
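A minimal sketch of the match ratio r_m of Eqn. (1); the spawning policy shown (add a new SVM when no existing component matches well enough) is our reading of the rule above, not a verbatim transcription.

import numpy as np

def match_ratio(svm_m, X_pos):
    scores = svm_m.decision_function(X_pos)      # f_m(x_k)
    matched = np.sum(scores >= 1.0)              # U(f_m(x_k) - 1)
    unmatched = len(X_pos) - matched
    return matched / unmatched if unmatched > 0 else np.inf

def needs_new_svm(svms, X_pos, ratio_threshold=1.0):
    # Hypothetical policy: spawn a new component if no existing SVM matches well enough.
    return all(match_ratio(m, X_pos) < ratio_threshold for m in svms)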
Fig. 3. Flowchart of our on-line ensemble SVM tracking framework
Once the number of SVM classifiers is larger than one, we combine the linear SVM classifiers in the pool to get a better classification result. Each SVM classifier is assigned a coefficient α_m, defined as follows:

α_m = (1/2) log ( Σ_{i=1}^{N} U(P_m(x_i, y_i) · ω_i) / Σ_{i=1}^{N} U(−P_m(x_i, y_i) · ω_i) )    (2)

Here ω_i is the weight of sample i, and P_m(x_i, y_i) is the output of each SVM classifier used to evaluate its discriminative ability on every sample. We define P_m(x_i, y_i) as follows:

P_m(x_i, y_i) = 1,  if y_i · f_m(x) ≥ 1;
P_m(x_i, y_i) = (f_m(x) − T) · |f_m(x) − T| / T²,  if 1 > y_i · f_m(x) > 0;
P_m(x_i, y_i) = −1,  if y_i · f_m(x) ≤ 0.    (3)

When P_m(x_i, y_i) is positive, it expresses the probability of a correct classification; when it is negative, it expresses the probability of a wrong classification. Here T ∈ (0, 1) is a threshold in the decision rule, and we set it to 0.5 in our method. The details of the on-line ensemble SVM tracker are described in Algorithm 2.
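A minimal sketch of Eqns. (2)-(3) with T = 0.5; the absolute-value form of the middle case of P_m is our reading of the definition above, and the function names are our own.

import numpy as np

def P_m(score, label, T=0.5):
    # score = f_m(x); piecewise confidence of Eqn. (3).
    margin = label * score
    if margin >= 1.0:
        return 1.0
    if margin > 0.0:
        return (score - T) * abs(score - T) / (T * T)
    return -1.0

def alpha_m(scores, labels, weights, T=0.5):
    P = np.array([P_m(s, y, T) for s, y in zip(scores, labels)])
    right = np.sum((P * weights) > 0)            # U(P_m * omega_i)
    wrong = np.sum((-P * weights) > 0)           # U(-P_m * omega_i)
    return 0.5 * np.log(right / wrong) if right > 0 and wrong > 0 else 0.0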
Algorithm 2. On-line Ensemble SVM Classifiers for Tracking
Input: video frames I_n for processing (n = 1, ..., L); rectangular region R of the target.
Output: rectangles R_n of the target object's region (n = 1, ..., L).
Initialization for the first frame I_n (n = 1):
– Use the target and other randomly chosen regions that do not overlap with the target in the first frame to form a ground-truth classifier f_0(x) = w_0 x + b_0, which will not be updated during tracking. This classifier can also be replaced by another off-line trained classifier.
– Extract the positive and negative samples as in Algorithm 1.
– Train an SVM classifier f_1(x) = w_1 x + b_1 using the extracted samples.
– Initialize the sample weights ω_i = 1/N, i = 1, 2, ..., N.
– For m = 0 to 1:
  a) Make {ω_i}_{i=1}^{N} a distribution.
  b) Choose the strongest SVM, i.e. the one with the largest α_m according to equation (2).
  c) If α_m < 0, set α_m = 0 and break.
  d) Remove the chosen SVM.
  e) Update the sample weights ω_i = ω_i exp[−α_m · y_i P_m(x_i, y_i)], i = 1, 2, ..., N.
– Normalize the α_i so that Σ_i α_i = 1. The output of the ensemble is F(x) = α_0 f_0(x) + α_1 f_1(x).
For each new frame I_n (n > 1):
– Use F(x) to search for the target region and extract samples S = S^+ ∪ S^−.
– Push some of S^+ into the positive sample queue by random sampling.
– Check whether a new SVM should be built using r_m in (1).
– Choose the last K SVMs to update as in Algorithm 1. Here we set K = 5.
– Randomly choose M samples S′ from the sample history queue, where M equals the number of samples in S^+. The new group of samples is S″ = S ∪ S′.
– Initialize the sample weights ω_i = 1/N, i = 1, 2, ..., N.
– For m = 0 to K_max (K_max = 10 in our implementation):
  a) Make {ω_i}_{i=1}^{N} a distribution.
  b) Choose the strongest SVM, i.e. the one with the largest α_m according to equation (2).
  c) If α_m < 0, set α_m = 0 and break.
  d) Remove the chosen SVM from the SVM queue.
  e) Update the sample weights ω_i = ω_i exp[−α_m · y_i P_m(x_i, y_i)], i = 1, 2, ..., N.
– Normalize the α_i so that Σ_i α_i = 1. The output of the ensemble is F(x) = Σ_{m=0}^{K} α_m f_m(x).
Compared with a single on-line SVM, the ensemble tracker can obtain a more reliable result, especially when the appearance of the target changes frequently (see Figs. 4 and 5). From Fig. 4 and Fig. 5, we can clearly see that a single on-line SVM is useful. However, it records all the history information as its support vectors to achieve a global optimum, which makes it difficult to handle large appearance variation within a short period. This phenomenon also appears in the incremental
Fig. 4. (a) Tracking results on frames 100, 152, 171, 183 and 366 using the single on-line linear SVM tracker. (b) Tracking results of the on-line ensemble SVM tracker; the confidence map of the search region is shown at the bottom right of each frame. (c) The ensemble weight of each SVM in the mixture model. (d) The updating period of each SVM (from being generated to the end of updating). Note that SVM No. 1 is the ground-truth SVM and is not updated during tracking.

Fig. 5. Sequences provided by Lim and Ross (frames 1, 121, 207, 282, 345, 384, 409, 429, 440, 481). (a) Tracking results of the incremental subspace learning tracker; the tracker failed after frame 345. (b) Tracking results of our single on-line linear SVM tracker; the tracker almost failed on frame 345, and then drifted away from the target. (c) Tracking results of the on-line ensemble SVM tracker; the tracker finished the whole video with accurate results.
subspace learning tracker, as shown in Fig. 5(a). The ensemble SVM tracker proposed here can choose the SVM classifiers with the best discriminative ability on the chosen samples and combine them on-line by adjusting their weights. Using this method, the ensemble classifier can use the history information more sensibly, and at the same time the final tracker has an especially strong discriminative ability, which makes the tracking results more reliable.
4 Experiments

In this section, several experiments are carried out with our algorithm. The region patterns used here are common features: histograms of oriented gradients (HOG) [14] and local binary patterns (LBP) [5]. Integral histograms [15] are built to extract region features efficiently. Similar to the method used in [14], we construct a 9-bin HOG histogram for each cell; each block contains four cells, giving a 36-D HOG feature vector that is normalized to unit L2 norm, plus a 59-D LBP feature vector. Different from [14], the number of pixels included in a cell is not constant, because the object region is scalable. The positive samples are captured by scaling the target region from 0.8 to 1.5 and rotating it from -8 to +8 degrees. The negative samples are captured within the context region, and the negatives can have some overlap with the positive ones (below 1/3 of the area of a positive sample). The sampling rate between neighboring negative regions is set to 5 pixels per step in our method. The scale problem is handled in a naive way: a suitable scale is obtained by searching over different scales around the centre of the local maximum region.
First of all, we captured some videos ourselves to demonstrate the robustness of our framework (Fig. 6(a)). Then we ran our method on some frequently used, publicly available sequences (Fig. 6(b,c)). Compared with some other popular methods,
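A minimal sketch, assuming OpenCV, of the positive-sample generation described above (scaling 0.8-1.5 and rotating -8 to +8 degrees about the window centre); the specific scale/angle grid is our own choice.

import cv2
import numpy as np

def positive_samples(frame, rect, scales=(0.8, 1.0, 1.2, 1.5), angles=(-8, 0, 8)):
    x, y, w, h = rect                              # target region in the frame (ints)
    cx, cy = x + w / 2.0, y + h / 2.0
    samples = []
    for s in scales:
        for a in angles:
            M = cv2.getRotationMatrix2D((cx, cy), a, s)   # rotate and scale about centre
            warped = cv2.warpAffine(frame, M, (frame.shape[1], frame.shape[0]))
            samples.append(warped[y:y + h, x:x + w])      # re-crop the same window
    return samples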
Fig. 6. (a) A glass bottle with illumination and appearance changes while moving against a cluttered background. Beside the target there is another, similar-looking bottle, yet the tracker discriminates them without confusion. (b) A moving doll with large pose and illumination changes, frames 1, 454, 728, 1162 and 1343. (c) A moving vehicle disturbed by the surrounding lights, frames 1, 129, 245, 295 and 391.
Fig. 7. Results on a boy with large head pose changes. (a) Results of the incremental subspace learning tracker; the tracker failed after frame 96. (b) Results of the ensemble tracker; it runs well for 203 frames but fails later, which may be caused by a size problem. (c) Results of our method.
the incremental subspace learning tracker [13] of Ross and Lim and the ensemble tracker [16] of Shai Avidan, we obtain the results in Fig. 7. The incremental learning tracker is based on updating the sample mean and the eigenbasis over time. However, when the variation is very large, the updating cannot adapt to the change quickly enough and an imprecise position may be obtained. Then, after several frames of updating, the target may drift away very quickly because of error accumulation. The ensemble tracker is powerful; however, it is a pixel based tracker (as is [18]): the information carried by a single pixel is limited, the feature vector may have a large variation when the target is colorful, and the tracker may get confused when the colors of the target and the background are similar. Our method, as mentioned before, is based on region patterns, which are more stable during tracking. Meanwhile, it retains and chooses the most useful "key frames" of the target through the ensemble of SVMs, which has the strongest discriminative ability. Because of that, the performance of the tracker is especially good on some challenging videos with large appearance variation.
5 Conclusion

In this paper, we build a novel framework to track general objects. The ensemble SVM tracker proposed here is made up of several SVM classifiers, which prove especially strong in selecting and recording the "Key Frames" of the object. These classifiers are generated and updated during different periods with different historical
information. By adjusting each SVM's weight on-line, the ensemble classifier can distinguish the target from the background better than any single component. With the selected useful historical information and its strong discriminative ability, the tracker performs especially well on some difficult videos with large appearance variation.
Acknowledgments
This work was done while the author visited the visual computing group at Microsoft Research Asia. The author would like to thank all the researchers in that group for their support. Thanks to Lim and Ross for the image sequences and the Matlab code of the incremental subspace learning tracker provided on their website. Furthermore, thanks to Shai Avidan for his help and for providing the results of the ensemble tracker.
References
1. Hager, G.D., Belhumeur, P.N.: Efficient region tracking with parametric models of geometry and illumination. PAMI 20(10), 1025-1039 (1998)
2. Black, M.J., Jepson, A.: EigenTracking: Robust matching and tracking of articulated objects using a view-based representation. International Journal of Computer Vision 26(1), 63-84 (1998)
3. Isard, M., Blake, A.: Condensation - Conditional Density Propagation for Visual Tracking. International Journal of Computer Vision 29(1), 5-28 (1998)
4. Perez, P., et al.: Color-Based Probabilistic Tracking. In: ECCV, pp. 661-675 (2002)
5. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. PAMI 24, 971-987 (2002)
6. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. PAMI 25(5), 564-577 (2003)
7. Vacchetti, L., Lepetit, V., Fua, P.: Fusing online and offline information for stable 3D tracking in real-time. In: CVPR 2003, vol. 2, pp. 241-248 (2003)
8. Nummiaro, K., Koller-Meier, E., Gool, L.V.: An Adaptive Color-Based Particle Filter. Image and Vision Computing, 99-110 (2003)
9. Jepson, A.D., Fleet, D.J., El-Maraghi, T.F.: Robust online appearance models for visual tracking. PAMI 25(10), 1296-1311 (2003)
10. Avidan, S.: Support Vector Tracking. PAMI 26(8), 1064-1072 (2004)
11. Matthews, I., Ishikawa, T., Baker, S.: The Template Update Problem. PAMI 26, 810-815 (2004)
12. Okuma, K., Taleghani, A.: A Boosted Particle Filter: Multitarget Detection and Tracking. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3024, pp. 28-39. Springer, Heidelberg (2004)
13. Ross, D., Lim, J., Yang, M.H.: Probabilistic visual tracking with incremental subspace update. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3022, pp. 470-482. Springer, Heidelberg (2004)
14. Dalal, N., Triggs, B.: Histograms of Oriented Gradients for Human Detection. In: CVPR 2005, vol. 1, pp. 886-893 (2005)
15. Porikli, F.: Integral histogram: a fast way to extract histograms in Cartesian spaces. In: CVPR 2005, vol. 1, pp. 829-836 (2005)
16. Avidan, S.: Ensemble tracking. In: CVPR 2005, vol. 2, pp. 494-501 (2005)
17. Grabner, H., Bischof, H.: On-line Boosting and Vision. In: CVPR 2006, vol. 1, pp. 260-267 (2006)
18. Wu, Y., Huang, T.S.: Color Tracking by Transductive Learning. In: CVPR 2000, vol. 1, pp. 133-138 (2000)
Multi-camera People Tracking by Collaborative Particle Filters and Principal Axis-Based Integration
Wei Du and Justus Piater
University of Liège, Department of Electrical Engineering and Computer Science, Montefiore Institute, B28, Sart Tilman Campus, B-4000 Liège, Belgium
weidu@montefiore.ulg.ac.be,
[email protected]
Abstract. This paper presents a novel approach to tracking people in multiple cameras. A target is tracked not only in each camera but also in the ground plane by individual particle filters. These particle filters collaborate in two different ways. First, the particle filters in each camera pass messages to those in the ground plane where the multi-camera information is integrated by intersecting the targets’ principal axes. This largely relaxes the dependence on precise foot positions when mapping targets from images to the ground plane using homographies. Secondly, the fusion results in the ground plane are then incorporated by each camera as boosted proposal functions. A mixture proposal function is composed for each tracker in a camera by combining an independent transition kernel and the boosted proposal function. Experiments show that our approach achieves more reliable results using less computational resources than conventional methods.
1 Introduction
Tracking people in multiple cameras is a basic task in many applications such as video surveillance and sports analysis. A commonly-used fusion strategy is to detect people in each camera with bottom-up approaches such as background subtraction and color segmentation, and then to calculate the correspondences between cameras using the camera calibrations, or more often, the ground homographies. In order to reason about occlusions between targets, this fusion strategy usually requires all targets to be correctly detected and tracked [8,4,2,6,5]. However, sometimes we may be interested in the trajectories of only a few key targets, for instance, the star players in a soccer game or a few suspects in a surveillance scenario. Top-down approaches are preferable in such situations. In this paper, we present a novel top-down approach to people tracking by multiple cameras. The approach is based on collaborative particle filters, i.e., we track a target not only in each camera but also in the ground plane by individual particle filters. These particle filters collaborate in two different ways. First, the particle filters in each camera pass messages to those in the ground plane where the multi-camera information is integrated using the homographies
of each camera. Such a fusion framework usually relies on precise foot positions of the targets, which are often not provided by the particle filters in the cameras. To overcome the imprecise foot positions as well as the uncertainties of the camera calibrations, we exploit the principal axes of the targets during integration, which greatly improves the precision of the fusion results. These fusion results are then incorporated by the trackers in each camera as boosted proposal functions. A mixture proposal function is composed for each tracker in a camera by combining an independent transition kernel and the boosted proposal function, from which new particles are generated for the next time instant. Our approach has several distinct features. First, it doesn’t require all targets to be tracked simultaneously. Instead of having different target trackers interact, we compute the consensus between cameras by having different camera trackers communicate. Second, it has a fully distributed architecture. All the computations are performed locally and only the filter estimates are exchanged between the cameras and the fusion module. Third, the fusion of the multi-camera information is done by intersecting the targets’ principal axes. Experiments on both surveillance and soccer scenarios show that our approach achieves more reliable results using less computational resources than conventional methods. Particle filters are conventional in multi-camera tracking. Most previous work performed particle filtering in 3D so that precise camera calibration is required to project particles into the image plane of each camera [9,7]. The multi-camera information is often integrated by either the product of the likelihoods in all cameras [7] or a selection of the best cameras that contain the most distinctive information [9]. In our previous work, we proposed a different approach to fusion that combined particle filters and belief propagation, where particle filters collaborated with each other via a message passing procedure [1]. To match ground-plane target positions using homographies, the foot positions of the tracked people have to be detected. This, however, is a difficult and errorprone task if done separately for each camera. In this paper, we address the precision and computational issues. We relax the dependence on precise foot positions by exploiting the principal axes of the targets, the intersections of which give better ground positions. At the same time, we improve the speed over our previous system [1] by incorporating the fusion results from the ground plane as proposal functions into each camera. The rest of the paper is organized as follows. Section 2 formulates the multicamera tracking problem. Section 3 introduces the collaborative particle filters, including the principal axis-based integration and the boosted proposal functions. Experiments on sequences of video surveillance and soccer games are shown in Section 4.
2 Problem Formulation
Suppose L cameras are used and each camera collects one observation for each target at each time instant. Denote the target state on the ground plane by xt,0 and its states in different cameras by xt,j , j = 1, . . . , L. Let zt,j denote
(a) tree-structured graphical model
(b) dynamic Markov model
Fig. 1. Graphical models for modeling the dependencies at time t and for modeling the evolution of the system in time
the observation in camera j at time t, Zt = {zt,1, . . . , zt,L} the multi-camera observation at time t, and Z t = {Z1, . . . , Zt} the multi-camera observations up to time t. Fig. 1(a) shows the graphical model that models the dependencies between target states in the ground plane and at different cameras at time t. We assume that the xt,j, j = 1, . . . , L, are independent given xt,0 so that a tree-structured model is formed. Note that xt,0 is associated with no observation. Connecting the graphical models at different times results in a dynamic Markov model, shown in Fig. 1(b), that describes the evolution of the system over time. As all the xt,j depend on xt,0, we add temporal links from xt−1,0 to xt,j. The addition of these temporal links is beneficial to the design of the proposal functions, shown in the next section. In both models in Fig. 1, each directed link from xt,0 to xt,j, j = 1, . . . , L, represents a message passing process and is associated with a potential function ψt0,j(xt,0, xt,j). The directed link from xt,j to zt,j, j = 1, . . . , L, represents the observation process and is associated with a likelihood function pj(zt,j|xt,j). In Fig. 1(b), the directed links from xt−1,i to xt,i, i = 0, . . . , L, and from xt−1,0 to xt,j, j = 1, . . . , L, represent the state transition processes and are associated with motion models p(xt,i|xt−1,i) and p(xt,j|xt−1,0) respectively. Thus, we infer each xt,i, i = 0, . . . , L, based on all Z t. A message passing scheme, the same as is used in belief propagation, is adopted to pass messages from each camera to the ground plane. The message from camera j is defined as

m_{0j}(x_{t,0}) \leftarrow \iint p_j(z_{t,j}|x_{t,j}) \, \psi^t_{0,j}(x_{t,0}, x_{t,j}) \, p(x_{t,j}|x_{t-1,j}) \, p(x_{t-1,j}|Z^{t-1}) \, dx_{t-1,j} \, dx_{t,j} .   (1)
The belief p(xt,0|Z t) is computed recursively by the message product and the propagation of the previous posterior,

p(x_{t,0}|Z^t) \propto \prod_{j=1,\ldots,L} m_{0j}(x_{t,0}) \times \int p(x_{t,0}|x_{t-1,0}) \, p(x_{t-1,0}|Z^{t-1}) \, dx_{t-1,0} .   (2)
Note that the same message and belief update equations are used in our previous work [1].
The inference of xt,j, j = 1, . . . , L, is done by nearly standard particle filters, except that the fusion results at t − 1 are taken into consideration. The belief p(xt,j|Z t) is computed as

p(x_{t,j}|Z^t) \propto p(x_j|z_j) \times \iint p(x_{t,j}|x_{t-1,j}) \, p(x_{t-1,j}|Z^{t-1}) \, p(x_{t,j}|x_{t-1,0}) \, p(x_{t-1,0}|Z^{t-1}) \, dx_{t-1,0} \, dx_{t-1,j} .   (3)

The factors p(x_{t,j}|x_{t-1,0}) \, p(x_{t-1,0}|Z^{t-1}) incorporate the fusion results as a boosted proposal function. In other words, the fusion module is used by each camera as a coupled process.
3 Collaborative Particle Filters
All the inference processes formulated above, in the ground plane and for each camera, are performed by individual but collaborative particle filters. Details are given below.
3.1 Principal Axis-Based Integration
The ground-plane particle filter integrates the multi-camera information according to Eqs. 1 and 2. For tracking ground targets, homographies are often used to map the foot positions from each camera to the ground plane. However, a large number of particles are required to estimate precise foot positions, which significantly slows down the tracking system. With a small number of particles, usually the sizes of the targets cannot be estimated precisely. We overcome this problem by exploiting the principal axes of the targets. The principal axis of a target is defined as the vertical line from the head of the target to the feet. It has been shown that the principal axes of a target in different cameras intersect in the ground plane, and computing the intersection point yields very robust fusion results [4,6], illustrated in Fig. 3. We exploit this effect in our multi-camera integration. The idea is to sample particles in the ground plane by importance sampling, and to evaluate these particles by passing messages from each camera. Here, p(xt,0|xt−1,0) is used as the proposal function from which new particles for xt,0 are sampled. Each of these ground-plane particles receives messages from each camera, and a message weight is computed using Eq. 1. The principal axes are incorporated in the potential function ψt0,j(xt,0, xt,j) in Eq. 1. In general, the principal axes of the particles in a camera are projected to the ground plane using the homographies. The potential function measures the distances of the ground particles to these projected principal axes and converts them to probability densities, given by

\psi^t_{0,j}(x^n_{t,0}, x^m_{t,j}) \propto \exp\left(-\mathrm{dist}^2\big(x^n_{t,0}, \mathrm{project}(H_j, x^m_{t,j})\big)\right),   (4)
where xnt,0 and xm t,j are the nth ground-plane particle and mth particle in camera j, Hj is the homography from camera j to the ground plane, dist() computes
Fig. 2. The particle distributions in four cameras at a time instant. It can be seen that the foot positions are not precise although all the particles are placed at the right location.
(a) Mapping particles to the ground.
(b) Mapping principal axes to the ground.
Fig. 3. Comparison between homography-based integration and principal axis-based integration. In (a), the projections of the particles (the red stars) from the images in Fig 2 to the ground have a large variance, making the integration imprecise. In contrast, in (b), the intersection of the principal axes (the red lines) of four selected particles yields a more precise foot position (the white square).
the distance between a point and a line, and project() maps the principal axis to the ground. The message and belief weights are then computed by

w^{j,n}_{t,0} \propto \sum_{m=0}^{N} \pi^m_{t,j} \, \psi^t_{0,j}(x^n_{t,0}, x^m_{t,j}), \qquad \pi^n_{t,0} \propto \prod_{j=1}^{L} w^{j,n}_{t,0},   (5)
where w^{j,n}_{t,0} is the message weight of x^n_{t,0} from camera j, and \pi^n_{t,0} and \pi^m_{t,j} are the belief weights of x^n_{t,0} and x^m_{t,j}. Intuitively, the closer a ground-plane particle is to all the principal axes, the larger its weight, as illustrated in Fig. 4.
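The following sketch evaluates Eqs. (4)-(5) for one camera. It assumes that each camera particle is summarised by its foot and head image points (so the principal axis is the line through them), and the helper names project_axis and point_line_dist are illustrative, not part of the paper.

```python
import numpy as np

def project_axis(H, foot_xy, head_xy):
    """Projects a particle's principal axis (the image line through its head
    and foot points) to the ground plane with the camera-to-ground
    homography H; returns the two projected ground points."""
    pts = np.array([[foot_xy[0], head_xy[0]],
                    [foot_xy[1], head_xy[1]],
                    [1.0, 1.0]])
    g = H @ pts
    return (g[:2] / g[2]).T                # shape (2, 2): two ground points

def point_line_dist(p, a, b):
    """Distance of ground point p from the line through a and b."""
    d = b - a
    cross = d[0] * (p[1] - a[1]) - d[1] * (p[0] - a[0])
    return abs(cross) / (np.linalg.norm(d) + 1e-12)

def message_weights(ground_particles, cam_particles, cam_weights, H):
    """Eqs. (4)-(5): message weight of every ground-plane particle from one
    camera, accumulated over that camera's weighted particles."""
    w = np.zeros(len(ground_particles))
    for (foot, head), pi_m in zip(cam_particles, cam_weights):
        a, b = project_axis(H, foot, head)
        for n, x0 in enumerate(ground_particles):
            psi = np.exp(-point_line_dist(np.asarray(x0, float), a, b) ** 2)
            w[n] += pi_m * psi             # sum over m of pi^m * psi
    return w

# The belief weight of ground particle n (Eq. 5) is the product of its
# message weights over all cameras, renormalised to sum to one.
```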
3.2 Boosted Proposal Functions
A target is tracked in each camera by a particle filter. Due to the occlusions or other image noise, feedback from the fusion module is expected to improve
(a) Camera 1 passes messages
(b) Camera 2 passes messages
Fig. 4. An illustration of evaluating ground-plane particles using two cameras. The ground-plane particles are evaluated according to the distances to the projected principal axes. (a) After the first camera passes messages to the ground plane, all the particles along the principal axes (red dots) have larger weights than those further away (blue dots). The weights of the camera particles are shown at one end of the corresponding principal axes. (b) After the second camera passes messages, only those ground-plane particles that are close to the intersections have large weights.
the tracking performance in a camera. A similar message passing procedure was adopted in our previous work to pass messages from the ground plane to each camera, which proved computationally expensive. We propose here a different method to incorporate this feedback. Note that in the dynamic Markov model in Fig. 1(b), for each xt,j, j = 1, . . . , L, there is an extra temporal link from xt−1,0 besides that from xt−1,j. This enables us to design a mixture proposal function for importance sampling,

p(x_{t,j}|x_{t-1,j}, x_{t-1,0}) \propto \alpha \, p(x_{t,j}|x_{t-1,j}) + (1-\alpha) \, p(x_{t,j}|x_{t-1,0}).   (6)
Thus, we sample particles from both p(xt,j |xt−1,j ) and p(xt,j |xt−1,0 ), i.e., αN particles are sampled from p(xt,j |xt−1,j ) and the other (1 − α)N from p(xt,j |xt−1,0 ). Parameter α specifies a trade-off between two proposal functions and is set to 0.5 in our experiments. To sample from p(xt,j |xt−1,0 ), we fit a Gaussian distribution to xt−1,0 and propagate it to each camera using the homographies. In a sense, the fusion results at t− 1 are used as boosted proposal functions by each camera. This is beneficial not only in maintaining consistency between the particle filters at different nodes but also in speeding up the tracking algorithm. The sampled particles are evaluated using the image likelihood as is done in standard particle filters.
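A hedged sketch of this sampling step is given below. It assumes the per-camera state is a 2-D image position, a random-walk transition kernel with an arbitrary motion_std, and alpha = 0.5 as in the text; only the idea of splitting the particle budget between the two proposals is taken from the paper.

```python
import numpy as np

def sample_mixture_proposal(prev_particles, ground_mean, ground_cov,
                            H_ground_to_image, N=100, alpha=0.5,
                            motion_std=5.0, rng=None):
    """Eq. (6): draw alpha*N particles from the independent transition kernel
    p(x_t,j | x_t-1,j) and (1-alpha)*N from the boosted proposal
    p(x_t,j | x_t-1,0), here a Gaussian fitted to the ground-plane estimate
    and mapped into the image through the homography."""
    rng = rng or np.random.default_rng()
    n_trans = int(round(alpha * N))

    # Independent transition kernel: random walk around resampled particles
    # from the previous time step (the walk model is an assumption).
    idx = rng.integers(0, len(prev_particles), size=n_trans)
    from_transition = prev_particles[idx] + rng.normal(0.0, motion_std, (n_trans, 2))

    # Boosted proposal: sample on the ground plane, then project to the image.
    g = rng.multivariate_normal(ground_mean, ground_cov, size=N - n_trans)
    g_h = np.hstack([g, np.ones((N - n_trans, 1))])     # homogeneous coords
    im = (H_ground_to_image @ g_h.T).T
    from_fusion = im[:, :2] / im[:, 2:3]

    return np.vstack([from_transition, from_fusion])
```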
4 Results
We tested our method on both video surveillance and soccer game sequences. We manually initialized the targets of interest in the first frames of the sequences and sampled 100 particles for each filter. Figure 5 shows the results of tracking a pedestrian in PETS sequences and a comparison with a reference method [7], which tracks a target in 3D by a particle
Fig. 5. The results of tracking a pedestrian in PETS sequences with our approach (top rows) and with a reference method [7] (bottom rows). In the latter method [7], we initialize a tracker in one camera and project the particles to another camera using the homographies. Here, due to imprecise foot positions, the estimates are projected to wrong positions.
filter and evaluates the particles by the product of the likelihoods in all cameras. In this experiment, we adopted a classic color observation model and evaluate each particle by matching the color histogram to a reference model [11]. The figure shows that particle filters do not estimate precise foot positions; thus, mapping the particles between cameras or between cameras and the ground plane using homographies is imprecise. As a result, using this method [7], most particles in one camera are projected to wrong positions in another camera so that only one camera contributes to the tracking. On the other hand, due to the use of the principal axes, our method integrates information from both cameras and achieves more reliable results. Figure 6 shows the results of tracking two selected people in an indoor environment with four cameras. In this experiment, we adopted a hierarchical multi-cue observation model and evaluated each particle first by a color likelihood function and then by a background-subtraction likelihood function [10]. We also assumed that the sizes of the people were fixed and could be inferred from their ground positions [2]. Thus, the only parameters of interest were the positions in the images and in the ground plane. A comparison with our previous work [1] shows that the new approach achieved similar results but was approximately twice faster.
Fig. 6. The results of tracking two people in an indoor environment with four cameras. Each row shows four simultaneous views. In this experiment, both the head and the ground homographies of each camera are available. The fixed-size assumption significantly improved the robustness of the algorithm.
Figure 7 shows the results of tracking several soccer players in three cameras. Due to the interactions between the players, the feedback from the fusion module to each camera becomes critical, without which the trackers in different cameras fail one by one. In this experiment, the same observation model as in the PETS experiment was used and the homographies of each camera were obtained on-line by using a field model and by accumulating motion estimates between consecutive frames [3]. Note in the figure that the estimated foot positions do not coincide with the bottom of the bounding boxes, but are more precise than these thanks to the multiple-camera fusion using principal axes. At one point, due to a heavy occlusion that occurs in all cameras, a tracker jumps from one
Fig. 7. The results of tracking several soccer players in the last frames of the three sequences. The ellipses under the rectangles are the fusion results in the ground plane.
Fig. 8. The particle distributions at the time when the tracker is about to jump to a different player, which happens here because the players involved are very close both in space and in appearance in all three views. The green rectangles are the sampled particles, the blue are the estimates, and the red are the predictions of the fusion results at the previous time.
target to another. In such situations, multi-camera systems without feedback between cameras are susceptible to mismatched targets. In our system, thanks to the feedback from the ground-plane tracker, the trackers at each camera remain consistent, even if they collectively follow the wrong target. Figure 8 shows the particle distributions at the time instant when the jump begins. This problem can be partially solved by tracking multiple targets simultaneously.
5 Conclusion and Future Work
This paper presents a novel approach to ground-plane tracking of targets in multiple cameras. Different from previous work, our approach is not based on bottom-up detection or segmentation methods. Instead, we infer target states in each camera and in the ground plane by collaborative particle filters. Message passing and boosted proposal functions are incorporated in the collaboration between the trackers in each camera and the fusion module. Principal axes are exploited in the multi-camera integration, which enables us to handle the imprecise foot positions and some calibration uncertainties. In doing so, we achieve robust results using relatively little computational resources. We are currently adapting this approach to multi-target, multi-camera tracking, which involves the modeling of the target interactions and data association across cameras.
Acknowledgement The authors wish to thank J. Berclaz and F. Fleuret for sharing their data.
References
1. Du, W., Piater, J.: Multi-view object tracking using sequential belief propagation. In: Asian Conference on Computer Vision, Hyderabad, India (2006)
2. Fleuret, F., Berclaz, J., Lengagne, R., Fua, P.: Multi-camera people tracking with a probabilistic occupancy map. IEEE Transactions on Pattern Analysis and Machine Intelligence (2007)
3. Hayet, J.-B., Piater, J., Verly, J.: Robust incremental rectification of sports video sequences. In: British Machine Vision Conference, Kingston, UK, pp. 687-696 (2004)
4. Hu, W.-M., Hu, M., Zhou, X., Tan, T.-N., Lou, J., Maybank, S.J.: Principal axis-based correspondence between multiple cameras for people tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(4), 663-671 (2006)
5. Khan, S.M., Shah, M.: A multiview approach to tracking people in crowded scenes using a planar homography constraint. In: ECCV, pp. 98-109 (2006)
6. Kim, K., Davis, L.S.: Multi-camera tracking and segmentation of occluded people on ground plane using search-guided particle filtering. In: ECCV, pp. 98-109 (2006)
7. Kobayashi, Y., Sugimura, D., Sato, Y.: 3D head tracking using the particle filter with cascaded classifiers. In: BMVC (2006)
8. Mittal, A., Davis, L.S.: M2tracker: A multi-view approach to segmenting and tracking people in a cluttered scene. International Journal of Computer Vision 51(3), 189-203 (2003)
9. Nummiaro, K., Koller-Meier, E., Svoboda, T., Roth, D., van Gool, L.: Color-based object tracking in multi-camera environment. In: Michaelis, B., Krell, G. (eds.) Pattern Recognition. LNCS, vol. 2781, Springer, Heidelberg (2003)
10. Pérez, P., Vermaak, J., Blake, A.: Data fusion for visual tracking with particles. Proceedings of the IEEE 92(3), 495-513 (2004)
11. Pérez, P., Hue, C., Vermaak, J., Gangnet, M.: Color-based probabilistic tracking. In: European Conference on Computer Vision, Copenhagen, Denmark, vol. 1, pp. 661-675 (2002)
Finding Camera Overlap in Large Surveillance Networks Anton van den Hengel, Anthony Dick, Henry Detmold, Alex Cichowski, and Rhys Hill School of Computer Science University of Adelaide Adelaide, 5005, Australia {anton,ard,henry,alex,rhys}@cs.adelaide.edu.au
Abstract. Recent research on video surveillance across multiple cameras has typically focused on camera networks of the order of 10 cameras. In this paper we argue that existing systems do not scale to a network of hundreds, or thousands, of cameras. We describe the design and deployment of an algorithm called exclusion that is specifically aimed at finding correspondence between regions in cameras for large camera networks. The information recovered by exclusion can be used as the basis for other surveillance tasks such as tracking people through the network, or as an aid to human inspection. We have run this algorithm on a campus network of over 100 cameras, and report on its performance and accuracy over this network.
1 Introduction
Manual inspection is an inefficient and unreliable way to monitor large surveillance networks (see Figure 1 for example), particularly when coordination across observations from multiple cameras is required. In response to this, several systems have been developed to automate inspection tasks that span multiple cameras, such as following a moving target, or grouping together related cameras. A key part of any multi-camera surveillance system is to understand the spatial relationships between cameras in the network. In early surveillance systems, this information was manually specified or derived from camera calibration, but recent systems at least partly automate the process by analysing video from the cameras. These systems are demonstrated on networks containing of the order of 10 cameras, but have requirements that mean they do not scale well to networks an order of magnitude larger. For example: [1] requires manually marked correspondences between images; [2] requires a training stage where only one object is observed; and [3,4,5] require many correct detections of objects as they appear and disappear from cameras over a long period of time. An important step towards recovering spatial camera layout is to determine where cameras overlap. The approach taken in [6] is to estimate motion trajectories for people walking on a plane, and then match trajectories between cameras. However, this assumes planar motion, and accurate tracking over long periods
Fig. 1. Snapshot of video feeds from the network. Some cameras are offline.
of time. It also does not scale well, since track matching complexity increases as O(n2 ) with the number of cameras n. In [7] evidence for overlap is accumulated by estimating the boundary of each camera’s field of view in all other cameras. Again, this does not scale well to large numbers of cameras, and assumes that all cameras overlap. Because they start with an assumption of non-connectedness, and gradually accumulate evidence for connections, most methods for determining spatial layout rely on accurately detecting and/or tracking objects over a long time period. They also require comparisons to be made between every pair of cameras in a network. The number of pairs of cameras grows with the square of the number of cameras in the network, rendering exhaustive comparisons infeasible. This paper describes the implementation of a method called exclusion for determining camera overlap that is designed to quickly home in on cameras that may overlap. The method is computationally fast, and does not rely on accurate tracking of objects within each camera view. In contrast to most existing methods, it does not attempt to build up evidence for camera overlap over time. Instead, it starts by assuming all cameras are connected and uses observed activity to rule out connections over time. This is an easier decision to make, especially when a limited amount of data is available. It is also based on the observation that it is impossible to prove a positive connection between cameras—any correlation of events could be coincidence—whereas it is possible to prove a negative connection by observing an object in one camera while not observing it at all in another.
2 The Exclusion Algorithm
Consider a set of c cameras that generates c images at time t. By applying foreground detection [8] to all images we obtain a set of foreground blobs, each of which can be summarised by an image position and camera index. Each image is partitioned into a grid of windows, and each window can be labelled "occupied" or "unoccupied" depending on whether it contains a foreground object. Exclusion is based on the observation that a window which is occupied at time t cannot be an image of the same area as any other window that is simultaneously unoccupied. Given that windows tend to be unoccupied more often than they are occupied, this observation can be used to eliminate a large number of window pairs as potentially viewing the same area. The process of elimination can be repeated for each frame of video to rapidly reduce the number of pairs of image windows that could possibly be connected. This is the opposite of most previous approaches: rather than accumulate positive information over time about links between windows, we seek negative information allowing the instant elimination of impossible connections. Such connections are referred to as having been excluded [9].
2.1 Exclusion over Multiple Timesteps
Rather than calculate exclusion separately at each timestep, it is more efficient to gather occupancy information over multiple frames and then calculate exclusion over all of them at once. Let the set of windows over all cameras be W = {w1 . . . wn}. Corresponding to each window wi is an occupancy vector oi = (oi1, . . . , oiT) with oit set to 1 if window wi is occupied at time t, and 0 if not. If two windows are images of exactly the same region in the world, we would expect their corresponding occupancy vectors to match exactly. This can be tested by applying the exclusive-or operator ⊕ to elements of the occupancy vectors:

a \oplus b = \max_{k=1}^{K} \, a_k \oplus b_k .
It can be inferred that two windows wi and wj do not overlap if oi ⊕ oj = 1. This comparison is very fast to compute, even for long vectors.
2.2 Exclusion with Tolerance
Exclusion as described so far assumes that:
1. corresponding windows in overlapping cameras cover exactly the same visible area in the scene,
2. all cameras are synchronised, so they capture frames at exactly the same time, and
3. the foreground detection module never produces false positives or false negatives.
In reality none of these assumptions is likely to hold completely. It is thus possible that two overlapping windows might simultaneously register as occupied and vacant and therefore that the exclusive-or of the corresponding occupancy vectors might incorrectly indicate that they do not overlap. Assumptions 1 and 2 can be relaxed by including the neighbours of a particular window when registering its occupancy. We use a padded occupancy vector pi which has element pit set to 1 when window wi or any of its neighbours is occupied at time t. A more robust mechanism for determining whether two windows wi and wj overlap is thus to calculate oi ⊖ pj on the basis of the occupancy vector oi and the padded occupancy vector pj. The operator ⊖ is a uni-directional version of the exclusive-or, defined such that

a \ominus b = \max_{k=1}^{K} \, a_k \ominus b_k ,   (1)
where ak ⊖ bk is 1 if and only if ak is 1 and bk is 0. Note that this means exclusion calculation is no longer symmetric. To account for detection errors (assumption 3), we calculate exclusion based on accumulated results over multiple tests, rather than relying on a single contradictory observation. Assuming that the detector has a constant failure rate, the evidence for exclusion is directly related to the number of contradictory observations in a fixed time period t = 1 . . . T [9], which we call the exclusion count:

E_{ij} = \sum_{t=1}^{T} o_{it} \ominus p_{jt} .   (2)
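A minimal sketch of Eqs. (1)-(2) follows, assuming boolean occupancy arrays per camera and a rectangular window grid. The wrap-around behaviour of np.roll at the grid border is a simplification; a real implementation would clip instead of wrapping.

```python
import numpy as np

def pad_occupancy(occ, grid_shape):
    """Padded occupancy p_i: a window counts as occupied if it or any of its
    8 neighbours in the window grid is occupied.
    occ: (T, rows*cols) boolean occupancy for one camera."""
    T = occ.shape[0]
    o = occ.reshape(T, *grid_shape)
    p = o.copy()
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            # np.roll wraps around the border; acceptable for a sketch.
            p |= np.roll(np.roll(o, dy, axis=1), dx, axis=2)
    return p.reshape(T, -1)

def exclusion_counts(occ_i, pad_j):
    """Eq. (2): E_ij = sum_t (o_it (-) p_jt), where a (-) b is 1 iff a = 1 and b = 0.
    occ_i: (T, Wi) occupancy of the candidate source windows,
    pad_j: (T, Wj) padded occupancy of the candidate target windows."""
    contradiction = occ_i[:, :, None] & ~pad_j[:, None, :]   # (T, Wi, Wj)
    return contradiction.sum(axis=0).astype(np.uint16)       # E_ij
```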
2.3 Normalised Exclusion
The exclusion count has two main shortcomings as a measure for deciding window overlap/non-overlap:
– As the operator a ⊖ b will only return true when a is true, the exclusion count Eij between windows wi and wj is bounded by the number of detections in wi, and is likely to be higher for windows wi that register more detections.
– In a large network, it will frequently occur that data sent from a camera will be lost, or not arrive in time to be included in the exclusion calculation, or that a camera will go offline. Thus the maximum value of Eij also depends on how often data from wj is available.
To address these problems we define a padded availability vector v for each window that is set to 1 when occupancy data for the window and its neighbours is available, and 0 otherwise. We can then define an exclusion opportunity count between each pair of windows:

O_{ij} = \sum_{t=1}^{T} o_{it} \, v_{jt}   (3)
Based on this we define an overlap certainty measure from each window with opportunity count at least 1 to every other window:

C_{ij} = \frac{O_{ij} - E_{ij}}{O_{ij}}   (4)
which measures the number of times that an exclusion was not found between wi and wj as a proportion of the number of times an exclusion could possibly have been found given the available data. In general, exclusion estimates for windows that are only occupied a small number of times are dominated by noise such as erroneous detection. We therefore include a penalty term for such windows:

\hat{C}_{ij} = C_{ij} \times \min\left(1, \frac{\log(O_{ij})}{\log(O_{\mathrm{ref}})}\right)   (5)

where Oref is a number of detections empirically determined to result in reliable exclusion calculation. We set this to 20 in our experiments.
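The normalisation step of Eqs. (3)-(5) can be sketched directly on the count matrices; only O_ref = 20 is taken from the text, the array-based formulation is an implementation choice.

```python
import numpy as np

def opportunity_counts(occ_i, avail_j):
    """Eq. (3): O_ij = sum_t o_it * v_jt, with v the padded availability.
    occ_i: (T, Wi) boolean occupancy, avail_j: (T, Wj) boolean availability."""
    return (occ_i[:, :, None] & avail_j[:, None, :]).sum(axis=0)

def overlap_certainty(E, O, O_ref=20):
    """Eqs. (4)-(5): overlap certainty for every window pair, with the log
    penalty for windows that were rarely occupied."""
    O = O.astype(float)
    C = np.zeros_like(O)
    seen = O >= 1                               # opportunity count at least 1
    C[seen] = (O[seen] - E[seen]) / O[seen]     # Eq. (4)
    penalty = np.minimum(1.0, np.log(np.maximum(O, 1.0)) / np.log(O_ref))
    return C * penalty                          # Eq. (5)
```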
3 Implementing Exclusion
In this section we describe how the exclusion algorithm is implemented in order to find overlap in a large network of cameras. This is done in two steps:
– Object detection (Section 3.1): after each frame is captured, it is processed to detect objects within it. These detections are then converted to occupancy data for each window and sent to a central server. The main challenge for large camera networks is to detect objects quickly and reliably.
– Exclusion calculation/update (Sections 3.2, 3.3): at regular intervals of the order of several seconds, the stored occupancy data is used to calculate exclusion between each window pair. This exclusion result is then merged with exclusion results from earlier time periods, resulting in an updated estimate of camera overlap. The main challenge here is to synchronise data from different cameras, and to mitigate the memory requirements of exclusion data.
3.1 Distributed Foreground Detection
We detect foreground objects within each camera image using the Stauffer and Grimson background subtraction method [8]. To derive a single position from a foreground blob, we use connected components and take the midpoint of the low edge of the bounding box of each blob. This corresponds approximately to the lowest visible extent of the object in the image, assuming that the camera is approximately upright. Foreground detection is the most computationally intensive part of exclusion, but is also the stage that is easiest to parallelise. Presently, cameras are assigned to one of several processors that perform background subtraction on each image they capture. Eventually, though, we aim to implement detection on the cameras themselves.
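The per-camera detection step might look roughly as follows. OpenCV's MOG2 subtractor is used here only as a readily available descendant of the Stauffer-Grimson model; the morphological clean-up, the minimum blob area, and the grid dimensions (taken from Section 4) are assumptions rather than details given in the paper.

```python
import cv2
import numpy as np

GRID_ROWS, GRID_COLS = 9, 12          # window grid used in the experiments
subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=False)

def frame_occupancy(frame, min_area=50):
    """Returns a (GRID_ROWS, GRID_COLS) boolean occupancy map for one frame.
    Each blob is summarised by the midpoint of the low edge of its bounding
    box, i.e. the approximate lowest visible extent of an upright object."""
    fg = subtractor.apply(frame)
    fg = cv2.morphologyEx(fg, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))
    n, labels, stats, _ = cv2.connectedComponentsWithStats(fg, connectivity=8)
    occ = np.zeros((GRID_ROWS, GRID_COLS), dtype=bool)
    h, w = fg.shape
    for i in range(1, n):                         # label 0 is the background
        x, y, bw, bh, area = stats[i]
        if area < min_area:
            continue
        foot_x, foot_y = x + bw // 2, y + bh - 1  # midpoint of the low edge
        occ[min(foot_y * GRID_ROWS // h, GRID_ROWS - 1),
            min(foot_x * GRID_COLS // w, GRID_COLS - 1)] = True
    return occ
```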
3.2 Calculating Exclusion
Each occupancy result is tagged with the timestamp of the frame of video on which it is based and sent to a central server. After a fixed time interval, typically several seconds, these results are assembled to form an occupancy vector oi for each window wi. Each element of oi is indexed by a time offset t within the time interval, and can be one of three values:
– oit = 2 if no occupancy data is available for wi within the time interval [t − t̂, t + t̂)
– oit = 1 if wi is occupied within the time interval [t − t̂, t + t̂)
– oit = 0 if wi is not occupied within the time interval [t − t̂, t + t̂)
where t̂ is a tolerance to account for inaccuracies in camera synchronisation. These occupancy vectors are then used to calculate exclusion and opportunity counts as described in Sections 2.2 and 2.3 for each window pair within the time interval. These counts are then added to counts from previous time intervals, giving an updated estimate of exclusion confidence for each window pair.
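A possible assembly of the three-valued occupancy vector is sketched below. The report format (timestamp, occupied flag) and the rule that an occupancy report overrides a vacancy report within the tolerance window are assumptions made for illustration.

```python
import numpy as np

NO_DATA, VACANT, OCCUPIED = 2, 0, 1

def assemble_occupancy(reports, t0, T, t_hat):
    """Builds the three-valued occupancy vector o_i for one window over the
    interval [t0, t0 + T). `reports` is a list of (timestamp, occupied)
    tuples received from that window's camera; t_hat is the synchronisation
    tolerance, all in frame-time units."""
    o = np.full(T, NO_DATA, dtype=np.uint8)
    for ts, occupied in reports:
        lo = max(0, int(np.floor(ts - t_hat - t0)))
        hi = min(T, int(np.ceil(ts + t_hat - t0)))
        for t in range(lo, hi):
            if occupied:
                o[t] = OCCUPIED            # occupancy wins over vacancy
            elif o[t] == NO_DATA:
                o[t] = VACANT
    return o
```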
3.3 Exclusion Data Compression
The central server stores both an exclusion count Eij and an exclusion opportunity count Oij for each pair of windows. Both counts are stored as a byte. This means that for a network of 100 cameras, each containing a 10 × 10 window grid, the counts require approximately 2 × 10^8 bytes of storage. Initially, Eij = 0 for all i and j. Consider how the exclusion counts are affected when a single person is observed in one window wD, and no other person is detected across the network. This will result in the exclusion count EDj being incremented for all windows j ≠ D in the network. If the exclusion counts are stored in a matrix whose ij-th element is the exclusion count between wi and wj, this results in an entire row of the matrix being incremented. Situations similar to this are quite common and suggest that a run length encoding scheme could effectively compress the matrix. Similarly, exclusion opportunity counts Oij are initially 0 for all i and j. Like exclusion counts, neighbouring opportunity counts are likely to be incremented at identical times, since an increment to Oij requires that wi is occupied and all data in the neighbourhood of wj is available. Again, this suggests the use of a run length encoding scheme to store exclusion opportunity data.
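The exact on-disk layout is not specified in the paper; the following sketch only illustrates the row-wise run-length idea, including the cheap whole-row increment described above.

```python
import numpy as np

def rle_encode(row):
    """Run-length encode one row of the count matrix as (value, run) pairs;
    long runs of identical counts collapse to a single pair."""
    change = np.flatnonzero(np.diff(row)) + 1
    starts = np.concatenate(([0], change))
    runs = np.diff(np.concatenate((starts, [len(row)])))
    return list(zip(row[starts].tolist(), runs.tolist()))

def rle_decode(pairs, dtype=np.uint8):
    return np.concatenate([np.full(run, value, dtype=dtype) for value, run in pairs])

def rle_increment(pairs, amount=1):
    """Adding a constant to an entire row (the common case described above)
    touches only the stored values, not the run structure."""
    return [(value + amount, run) for value, run in pairs]
```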
4 Testing Exclusion
We tested our exclusion implementation on a network containing 100 Axis IP cameras, distributed across a university campus. Frames are captured from each camera as JPEG compressed 320×240 images using the Axis API. Each frame is divided into a 9×12 grid of windows, for a total of 10800 windows. As previously mentioned, the computational cost of foreground detection over a large number of
cameras far outweighs that of exclusion. This coarse level of foreground detection is well suited to implementation on board a camera, but for the purposes of testing a cluster of 16 dual core Opteron PCs has been used to process the footage from the 100 cameras in real time. By contrast, the central server, where occupancy results are assembled and exclusion is calculated, is a single desktop PC (Dell Dimension 4700, 3.2GHz Pentium 4, 1GB memory).
4.1 Performance Testing
We first test how the performance of exclusion scales, both over long time periods and large numbers of cameras. It was found that due to the optimisations described previously, the performance of exclusion does not depend strongly on the number of cameras on the network. Rather, it depends on the amount of activity in the network. Thus we observed the performance of exclusion during high and low activity periods, over a period of one hour. The memory required by exclusion increases over time, as shown in Figure 2. This is largely due to the decreased effectiveness of RLE compression of the exclusion counts (EC) as more activity is observed. The opportunity counts (EOC) are still well compressed by RLE after one hour, as camera availability changes rarely during this time. However, notice that the increase in EC elements, and corresponding increase in memory usage, is less than linear. Even after an hour of observation, only 29.56MB of memory is being used, compared to over 200MB that would be required to store the uncompressed data. Figure 3 shows the time taken to calculate exclusion at intervals over the one hour period. Notice that the time to compute exclusion remains fairly constant over the time period, and is consistently faster than real time, even using a standard desktop PC. In fact the exclusion is calculated for the hour's footage
Fig. 2. Memory usage over one hour of processing. The exclusion element count (EC) shows how the RLE compression becomes less effective over time.
Fig. 3. Timing information for one hour of processing. The time required to process each frame remains approximately constant over time, although it increases slightly during periods of higher activity. Exclusion for 100 cameras is consistently calculated at over 4 times real time on a desktop PC, and an hour’s video takes under 13 minutes to process.
using less than 13 minutes of processor time. It is also evident that the time taken to calculate each exclusion count does depend on the amount of activity, measured by the number of occupancies detected per available camera. This can be seen by the slight increase in "Avg Occupancy Count per Camera" between about 30 and 50 minutes, and the corresponding decrease in "Speed relative to real time".
4.2 Ground Truth Verification
It is difficult to verify that exclusion captures all overlap in a large camera network, and excludes all non-overlap. For example, Figure 1 shows a set of images captured from across the network at one moment. After some exclusion processing, the grid is rearranged to group together related cameras as shown in Figure 4. Connections are drawn between a window pair when the overlap certainty measure (Equation 5) exceeds a threshold C*. The link must pass the threshold in both directions for the connection to be established, i.e. a link is drawn between wi and wj if and only if Ĉij > C* and Ĉji > C*. In our experiments we set C* = 0.8. To verify the exclusion results we manually inspected the groups that were found. Close-up views of some groups can be seen in Figure 5. It can be seen that overlap has been detected correctly in a variety of cases despite widely differing viewpoints and lighting conditions. These are correspondences that would be
Fig. 4. Video feeds from Figure 1 after running exclusion on one hour of footage. The cameras are arranged on screen so that related cameras are near each other, to aid human inspection.
Fig. 5. Overlapping groups detected by exclusion
very difficult to detect by tracking people, and attempting to build up correlations between tracks. The lighting conditions are often very poor, and the size of people in each camera varies greatly. Figure 4 also includes four camera groups that have been erroneously linked. Each of these groups has only one or two links between windows in each image, and views low-traffic areas. These errors would thus disappear as more traffic is viewed. To correct these groups until enough traffic has been seen, a filter can be implemented that only links cameras when more than one window in each camera is linked, as in the sketch below. Alternatively, a human operator can sever the links manually.
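A possible implementation of the mutual-threshold linking and of the "more than one window" filter is given below. C* = 0.8 comes from the text; min_links = 2 realises the filter, and the window-to-camera index array is an assumed input.

```python
import numpy as np
from collections import defaultdict

def camera_groups(C_hat, window_camera, C_star=0.8, min_links=2):
    """Links window pairs whose overlap certainty exceeds C* in both
    directions, then groups cameras connected by at least `min_links`
    window links (the filter for low-traffic errors suggested above)."""
    linked = (C_hat > C_star) & (C_hat.T > C_star)
    np.fill_diagonal(linked, False)

    # Count window links per camera pair.
    pair_links = defaultdict(int)
    for i, j in zip(*np.nonzero(np.triu(linked, 1))):
        ci, cj = window_camera[i], window_camera[j]
        if ci != cj:
            pair_links[(min(ci, cj), max(ci, cj))] += 1

    # Union-find over cameras that share enough window links.
    parent = {c: c for c in set(window_camera)}
    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c
    for (ci, cj), n in pair_links.items():
        if n >= min_links:
            parent[find(ci)] = find(cj)

    groups = defaultdict(list)
    for c in parent:
        groups[find(c)].append(c)
    return list(groups.values())
```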
Some camera overlap was not detected because of low traffic during the hour that footage was captured. However, all overlap between cameras monitoring areas with enough detections (relative to Oref ) to calculate exclusion was correctly determined. This leads us to believe that remaining overlap can be detected when the system is run over a longer time period.
5 Conclusion
This paper describes a method for automatically determining camera overlap in large surveillance networks. The method is based on the process of eliminating impossible connections rather than the slower process of building up positive evidence of activity. We describe our implementation of the method, and show that it runs faster than real time on an hour of footage from a 100 camera network, using a single desktop PC. Future work includes testing the system over a period of several days, adding more cameras to the network, and implementing a more efficient foreground detector.
References
1. Javed, O., Rasheed, Z., Shafique, K., Shah, M.: Tracking across multiple cameras with disjoint views. In: ICCV 2003, pp. 952-957 (2003)
2. Dick, A.R., Brooks, M.J.: A stochastic approach to tracking objects across multiple cameras. In: Webb, G.I., Yu, X. (eds.) AI 2004. LNCS (LNAI), vol. 3339, pp. 160-170. Springer, Heidelberg (2004)
3. Ellis, T.J., Makris, D., Black, J.K.: Learning a multi-camera topology. In: Joint IEEE Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS), pp. 165-171. IEEE Computer Society Press, Los Alamitos (2003)
4. Stauffer, C.: Learning to track objects through unobserved regions. In: IEEE Computer Society Workshop on Motion and Video Computing, pp. 96-102. IEEE Computer Society Press, Los Alamitos (2005)
5. Tieu, K., Dalley, G., Grimson, W.E.L.: Inference of non-overlapping camera network topology by measuring statistical dependence. In: ICCV 2005, pp. 1842-1849 (2005)
6. Lee, L., Romano, R., Stein, G.: Monitoring activities from multiple video streams: Establishing a common coordinate frame. IEEE Trans. Pattern Analysis and Machine Intelligence 22(8), 758-767 (2000)
7. Khan, S., Javed, O., Rasheed, Z., Shah, M.: Human tracking in multiple cameras. In: IEEE International Conference on Computer Vision, pp. 331-336 (2001)
8. Stauffer, C., Grimson, W.E.L.: Learning patterns of activity using real-time tracking. IEEE Trans. Pattern Analysis and Machine Intelligence 22(8), 747-757 (2000)
9. van den Hengel, A., Dick, A., Hill, R.: Activity topology estimation for large networks of cameras. In: AVSS 2006. Proc. IEEE International Conference on Video and Signal Based Surveillance, pp. 44-49. IEEE Computer Society Press, Los Alamitos (2006)
Information Fusion for Multi-camera and Multi-body Structure and Motion Alexander Andreopoulos and John K. Tsotsos York University, Dept. of Computer Science & Engineering, Toronto, Ontario, M3J 1P3, Canada {alekos,tsotsos}@cse.yorku.ca Abstract. Information fusion algorithms have been successful in many vision tasks such as stereo, motion estimation, registration and robot localization. Stereo and motion image analysis are intimately connected and can provide complementary information to obtain robust estimates of scene structure and motion. We present an information fusion based approach for multi-camera and multi-body structure and motion that combines bottom-up and top-down knowledge on scene structure and motion. The only assumption we make is that all scene motion consists of rigid motion. We present experimental results on synthetic and nonsynthetic data sets, demonstrating excellent performance compared to binocular based state-of-the-art approaches for structure and motion.
1 Introduction
Multi-body and multi-camera structure and motion establishes the structure and motion of a scene that consists of multiple moving rigid objects that are observed from multiple views [1], [2]. Stereo vision analysis and image motion analysis provide information with complementary uncertainties which can depend on the motion of the camera platform, the scene structure and the spatio-temporal baselines. There are four fundamental problems with the extractable information from motion data or from stereo data [3]: (i) Image motion and disparity, with an unknown camera translation, allow us to infer object range only up to a scale ambiguity, since image motion and disparity depend on the ratio of camera translation to object range. (ii) Image motion and disparity tend towards zero near the focus of expansion (FOE). Since object range is inversely proportional to image motion and disparity, scene structure estimation is ill-conditioned near the FOE. (iii) The more closely aligned the local image structure is with the epipolar directions – i.e., directions pointing towards the FOE – the more ill-conditioned scene structure estimation becomes in those regions. (iv) Whereas large spatio-temporal baselines give better depth estimates for distant objects, the greater disparity and occlusion makes such cameras unsuitable for nearby objects. The severity of these problems is reversed when dealing with small baselines. The spatio-temporal baselines might be defined with respect to a monocular camera in motion (structure from motion), a static stereo camera, or some other combination of static and non-static cameras.
A method for fusing the structure and motion estimates of different cameras by preserving the accurate estimates and diminishing the effect of inaccurate estimates is highly desirable. For example, whereas in one camera the optical flow near the FOE might be poorly estimated, from another camera’s viewpoint the optical flow of the same scene region might not be as ill-conditioned, since the FOE will likely have changed. We present an information fusion based approach for dealing with all these problems in a unified framework. We model the above mentioned errors as originating from ambiguities in the estimation of stereo image correspondences and in the optical flow across all cameras. The only assumption we make as to the scene motion is that we are dealing with rigidly moving objects. The rest of the paper is organized as follows. Section 2 presents some related work. Section 3 introduces an approach for representing the motion and stereo data from a network of cameras. Section 4 describes how to combine this data in a single reference frame. Section 5 outlines a simple extension of the approach to camera rigs with arbitrary intrinsic and extrinsic parameters. Section 6 presents experimental results demonstrating the robustness of the approach. Section 7 concludes the paper.
2 Related Work
Richards [4] shows how the integration of changing disparity and object velocity can solve many of the ambiguitites inherent in stereopsis and motion under orthographic projection. Waxman [5] demonstrates the importance of the ratio of the rate of change of disparity over disparity, by using this quantity to unify stereo and motion analysis. As it is elaborated in [6], the importance of this ratio has been demonstrated numerous other times. Hanna and Okamoto [3] demonstrate how motion and stereo could be combined in a multi-camera system for egomotion and scene structure estimation. Their work is further expanded upon by Mandelbaum et al. [7]. Zhang and Kambhamettu [8] present a system which integrates 3D scene flow and structure recovery in order to complement the performance of each other, using a number of calibrated cameras. Singh and Allen [9] employ the Best Linear Unbiased Estimator (BLUE) to fuse local motion. Comaniciu [10], [11] developed a method for motion estimation under multiple source models. Neumann et al. [12] present a method for establishing a hierarchy of cameras based upon the stability and complexity of structure and motion estimation. To the best of our knowledge, the work we present is the first approach using information fusion for multi-camera and multi-body structure and motion.
3 Fusing Multiple Cameras
Assume we have a multi-camera rig composed of N monocular cameras. A maximum of \binom{N}{2} camera pairs exist. The coordinate system of camera C0 is referred
Fig. 1. (a) Diagram of a hypothetical nine camera rig. (b) A five camera rig mounted on a mobile robotic platform. (c) A planar textured region we used in some of the experiments for structure and motion estimation at a depth of 300cm. (d) The region after a 20 degree rotation around the camera's optical axis.
to as the basis coordinate system. By convention a vector's superscript will denote the coordinate system with respect to which we are expressing the vector. The camera rig is calibrated and therefore, for each pair of cameras Ci, Cj, we know a rotation matrix Rij and translation vector Tij = (Tij^x, Tij^y, Tij^z)^T that describes the rotation and translation that aligns camera Ci's coordinate axes with camera Cj's coordinate axes. See Fig. 1(a),(b) for examples of camera rigs where Rij = I (the identity matrix) ∀i, j. For each pixel p0 in camera C0, and for each camera pair (Cj, Ci) such that i ≠ 0, j ≠ i, we can use a stereo correspondence algorithm, such as [13], to obtain estimates of the pixels pj, pi in cameras Cj, Ci respectively, corresponding to pixel p0 in basis camera C0. Similarly, we can obtain motion flow estimates for each pixel pj, pi in Cj, Ci. With each such pair of image pixels pj, pi, we can associate a 6D vector V(p0, Cj, Ci), containing the 3D coordinates X1^{C0} = (X1^{C0}, Y1^{C0}, Z1^{C0}) of a point P that is imaged by pj, pi in camera pair (Cj, Ci). We can also associate with V(p0, Cj, Ci) a 3D vector u^{C0} corresponding to the 3D displacement vector of P that was extracted using the camera pair (Cj, Ci). The displacement vector might be due to camera movement, an independent motion of scene point P or a combination of both. As we have indicated above, the superscript C0 in X1^{C0}, u^{C0} indicates that the vectors are expressed with respect to the coordinate system of C0. Let X2^{C0} = (X2^{C0}, Y2^{C0}, Z2^{C0}) denote the coordinates of P with respect to camera C0's coordinate system, obtained after an arbitrary camera rig or scene motion. The context will always make it clear with respect to which camera pair (Cj, Ci) we estimated X1^{C0}, X2^{C0}. We can then obtain the 3D motion estimate u^{C0} for point P by u^{C0} = X2^{C0} − X1^{C0}. Then V(p0, Cj, Ci) ≜ (X1^{C0}, u^{C0})^T. Given a small neighborhood Δp0 of pixels around a pixel p0 in C0 – we use 3×3 pixel neighborhoods in this paper – the set ⋃_{p∈Δp0} ⋃_{j=0}^{N} ⋃_{i=1, i>j}^{N} V(p, Cj, Ci) contains estimates of scene structure and motion over all camera pairs. If we need to enforce a hard real-time constraint, we can select to process a subset of the camera pairs. For each camera pair Cj, Ci and each pixel p0 in C0 that we process, we assign the covariance matrix Cov(V(p0, Cj, Ci)). In the next section we will show how to estimate this covariance matrix and how to use it to assign a weight of importance to each one of those vectors. We will also show how to use
information fusion techniques to get a robust estimate of the true scene structure and motion. Notice that in the above mentioned set, mainly due to occlusions, V(p0 , Cj , Ci ) will not always contribute a vector for all p0 , Cj , Ci .
4 Fusing the Camera Data
We need to model the uncertainty in each of the 6D vectors V(p0, Cj, Ci) in order to obtain each vector's 6 × 6 covariance matrix. These covariance matrices are used by the BLUE estimator to obtain a reliable estimate of the scene structure and motion. For example, an image pixel that is near the focus of expansion in one monocular camera needs to assign a high uncertainty to its motion elements and assign a 3D structure uncertainty that depends on the scene depth relative to the camera pair used. From a different stereo camera's point of view, these uncertainties will differ. By combining bottom-up and top-down information related to the scene uncertainty we obtain the noise model used by our BLUE estimator. For notational simplicity, we initially assume a perspective projection camera model where all cameras have the same focal length f, the aspect ratio is 1, the skew is 0, and the principal point is set to (0,0). The camera setup is similar to Fig. 1(a),(b), where Rij = I and Tij^z = 0 ∀i, j. The extension to arbitrary camera setups is presented in Section 5. Every pair of cameras (Cj, Ci) can be viewed as a stereo camera with a focal length f, such that the projection of a point X1^{C0} = (X1^{C0}, Y1^{C0}, Z1^{C0}) in camera Ci is given by

x_r = \frac{(X_1^{C_0} - T_{0i}^x) f}{Z_1^{C_0}}, \qquad y_r = \frac{(Y_1^{C_0} - T_{0i}^y) f}{Z_1^{C_0}},   (1)

and the projection of the same point in camera Cj is

x_l = \frac{(X_1^{C_0} - T_{0j}^x) f}{Z_1^{C_0}}, \qquad y_l = \frac{(Y_1^{C_0} - T_{0j}^y) f}{Z_1^{C_0}}.   (2)
y y x x If | − T0j + T0i | ≥ | − T0j + T0i |, we have: x x x x (−T0j + T0i ) (xr + xl ) T0j + T0i + 2 xl − xr 2 yr y x x = (−T0j + T0i ) + T0i xl − xr x x + T0i )f (−T0j = . xl − xr
X1C0 =
(3)
Y1C0
(4)
Z1C0
(5)
y y x x Conversely, if | − T0j + T0i | < | − T0j + T0i |, we have:
xr x + T0i yl − yr y y y y + T0i ) (yr + yl ) T0j + T0i (−T0j = + 2 yl − yr 2 y y + T0i )f (−T0j = . yl − yr
y y X1C0 = (−T0j + T0i )
(6)
Y1C0
(7)
Z1C0
(8)
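To make the use of Eqs. (3)-(8) concrete, the following is a minimal sketch (not part of the paper; the function name, the NumPy dependency, and the argument layout are our own) of how a point could be triangulated from one camera pair under the simplified rig assumptions above:

```python
import numpy as np

def triangulate(pl, pr, T0j, T0i, f):
    """Triangulate a scene point in C0's frame from corresponding pixels.

    pl = (xl, yl): pixel in camera Cj,  pr = (xr, yr): pixel in camera Ci.
    T0j, T0i: translations of Cj, Ci relative to the basis camera C0
    (rotations assumed to be the identity, as in Eqs. (1)-(8)).
    Returns X1 = (X, Y, Z) expressed in C0's coordinate system.
    """
    (xl, yl), (xr, yr) = pl, pr
    bx = -T0j[0] + T0i[0]          # baseline component along x
    by = -T0j[1] + T0i[1]          # baseline component along y
    if abs(bx) >= abs(by):         # Eqs. (3)-(5): use the x-disparity
        d = xl - xr
        X = 0.5 * bx * (xr + xl) / d + 0.5 * (T0j[0] + T0i[0])
        Y = bx * yr / d + T0i[1]
        Z = bx * f / d
    else:                          # Eqs. (6)-(8): use the y-disparity
        d = yl - yr
        X = by * xr / d + T0i[0]
        Y = 0.5 * by * (yr + yl) / d + 0.5 * (T0j[1] + T0i[1])
        Z = by * f / d
    return np.array([X, Y, Z])
```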
Notice that in Eqs. (3)-(5) and Eqs. (6)-(8), y_l and x_l respectively, are not used. This provides a simple approximation for X_1^{C0} when, due to small errors, (x_l, y_l), (x_r, y_r) are not corresponding pixels. The corresponding image coordinates in the next frame are given by (x_l', y_l') = (x_l, y_l) + (v_x^{Cj}, v_y^{Cj}), (x_r', y_r') = (x_r, y_r) + (v_x^{Ci}, v_y^{Ci}), where (v_x^{Cj}, v_y^{Cj}), (v_x^{Ci}, v_y^{Ci}) denote the motion flow vectors in cameras C_j, C_i respectively. We can use (x_l', y_l'), (x_r', y_r'), in conjunction with Eqs. (3)-(8), to estimate X_2^{C0} = (X_2^{C0}, Y_2^{C0}, Z_2^{C0}) and calculate V(p_0, C_j, C_i).
We now show how Eqs. (3)-(8) can be used to define a covariance matrix for V(p_0, C_j, C_i). We only describe the covariance matrix derivation for |−T_{0j}^x + T_{0i}^x| ≥ |−T_{0j}^y + T_{0i}^y|, since the case |−T_{0j}^x + T_{0i}^x| < |−T_{0j}^y + T_{0i}^y| is similar. We model the error in the correspondences of the image points as (x_r + n_{x_r}, y_r + n_{y_r}), (x_l + n_{x_l}, y_l + n_{y_l}), where n_{x_r}, n_{y_r}, n_{x_l}, n_{y_l} are zero mean Gaussian random variables. Their standard deviation can depend on how noisy the images are and on prior knowledge regarding the accuracy of the correspondences – e.g., the sample variance of the correspondences within Δ_{p_0}. In this paper we assume a variance of 1/2 pixel for each of the four random variables. We also assume that the random variables are independent. Furthermore, we notice that in Eqs. (3)-(8) we can view X_1^{C0}, Y_1^{C0}, Z_1^{C0} as functions in terms of n_{x_r}, n_{y_r}, n_{x_l}, n_{y_l}. We obtain first order Taylor expansions of X_1^{C0}, Y_1^{C0}, Z_1^{C0} and we use these Taylor expansions to obtain variance/covariance measures for vector X_1^{C0}. It can be shown that within first order:

Var(X_1^{C0}) ≈ ((−T_{0j}^x + T_{0i}^x) x̂_l)^2 / (x̂_l − x̂_r)^4 Var(x_r) + ((T_{0j}^x − T_{0i}^x) x̂_r)^2 / (x̂_l − x̂_r)^4 Var(x_l)   (9)
Var(Y_1^{C0}) ≈ (−T_{0j}^x + T_{0i}^x)^2 / (x̂_l − x̂_r)^2 Var(y_r) + ((−T_{0j}^x + T_{0i}^x) ŷ_r)^2 / (x̂_l − x̂_r)^4 Var(x_r) + ((T_{0j}^x − T_{0i}^x) ŷ_r)^2 / (x̂_l − x̂_r)^4 Var(x_l)   (10)
Var(Z_1^{C0}) ≈ ((−T_{0j}^x + T_{0i}^x) f)^2 / (x̂_l − x̂_r)^4 Var(x_r) + ((T_{0j}^x − T_{0i}^x) f)^2 / (x̂_l − x̂_r)^4 Var(x_l)   (11)
where x̂_l, x̂_r, ŷ_l and ŷ_r are estimated using a trimmed mean estimator, with the top and bottom 25% of the samples being rejected before calculating the mean. The samples used to estimate x̂_l, x̂_r, ŷ_l and ŷ_r are the pixels in C_i, C_j corresponding to the neighborhood Δ_{p_0} in C_0. For example, to estimate x̂_r we use the stereo matching algorithm to find the pixels in C_i that correspond to the pixels Δ_{p_0} in C_0, and then apply the trimmed mean estimator to get x̂_r.
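As an illustration, a rough sketch of the trimmed-mean step and the first-order variance propagation of Eqs. (9)-(11) might look as follows (the helper names are ours; the 25% trimming fraction is from the text above, and the default pixel variance of 1/2 follows our reading of the noise assumption):

```python
import numpy as np

def trimmed_mean(samples, trim=0.25):
    """Mean after discarding the top and bottom `trim` fraction of samples."""
    s = np.sort(np.asarray(samples, dtype=float))
    k = int(len(s) * trim)
    return s[k:len(s) - k].mean() if len(s) > 2 * k else s.mean()

def structure_variances(xl_samples, xr_samples, yr_samples,
                        T0j, T0i, f, var_pix=0.5):
    """First-order variances of (X1, Y1, Z1) per Eqs. (9)-(11),
    for the case where the baseline is dominant along x."""
    xl_h = trimmed_mean(xl_samples)
    xr_h = trimmed_mean(xr_samples)
    yr_h = trimmed_mean(yr_samples)
    bx = -T0j[0] + T0i[0]                  # baseline along x
    d2 = (xl_h - xr_h) ** 2
    d4 = d2 ** 2
    var_X = ((bx * xl_h) ** 2 / d4) * var_pix + ((bx * xr_h) ** 2 / d4) * var_pix
    var_Y = (bx ** 2 / d2) * var_pix \
          + ((bx * yr_h) ** 2 / d4) * var_pix + ((bx * yr_h) ** 2 / d4) * var_pix
    var_Z = ((bx * f) ** 2 / d4) * var_pix + ((bx * f) ** 2 / d4) * var_pix
    return var_X, var_Y, var_Z
```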
Fig. 2. The covariance matrix encoding the uncertainties:
Cov(V(p_0, C_j, C_i)) = (1 − a) D + a M,
where D = diag(Var(X_1^{C0}), Var(Y_1^{C0}), Var(Z_1^{C0}), Var(X_2^{C0} − X_1^{C0}), Var(Y_2^{C0} − Y_1^{C0}), Var(Z_2^{C0} − Z_1^{C0})) and M has the same diagonal as D, with the additional off-diagonal entries −Var(X_1^{C0}), −Var(Y_1^{C0}), −Var(Z_1^{C0}) coupling each structure component with the corresponding motion component (entries (1,4), (2,5), (3,6) and their transposes).
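A small sketch of assembling this 6×6 covariance, under our reading of Fig. 2 (the blending weight a and the exact block structure are as reconstructed above, so treat the precise form as an assumption):

```python
import numpy as np

def covariance_of_V(var_X1, var_Y1, var_Z1, var_dX, var_dY, var_dZ, a=0.5):
    """6x6 covariance for V(p0, Cj, Ci) = (X1, u) as a blend (1-a)*D + a*M."""
    diag = np.array([var_X1, var_Y1, var_Z1, var_dX, var_dY, var_dZ], dtype=float)
    D = np.diag(diag)
    M = np.diag(diag)
    for k, v in enumerate((var_X1, var_Y1, var_Z1)):
        M[k, k + 3] = M[k + 3, k] = -v   # structure/motion coupling terms
    return (1.0 - a) * D + a * M
```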
In order to obtain a covariance matrix for V(p_0, C_j, C_i), we also need to obtain an estimate of the variance of the elements of u^{C0}. We know that for a physical point P^{Cj} that is moving with velocity S^{Cj} with respect to camera C_j and its coordinate frame, we can decompose the velocity as S^{Cj} = −T^{Cj} − Ω^{Cj} × P^{Cj}, where T^{Cj} = (T_x^{Cj}, T_y^{Cj}, T_z^{Cj})^T and Ω^{Cj} = (Ω_x^{Cj}, Ω_y^{Cj}, Ω_z^{Cj})^T denote the translational and angular velocity vectors of camera C_j that would cause the same apparent motion of the particle P^{Cj} with respect to camera C_j's coordinate frame. Then, the image velocity of the projection (x_l, y_l) of P^{Cj} in camera C_j is given by

(v_x^{Cj}, v_y^{Cj})^T = B^{Cj} Ω^{Cj} + d^{Cj} A^{Cj} T^{Cj}   (12)

where d^{Cj} is the inverse of the scene depth with respect to camera C_j's coordinate system – it is estimated using Eqs. (3)-(8) and the current camera pair – and

B^{Cj} = [ x_l y_l / f    −(f + x_l^2 / f)    y_l ;
           f + y_l^2 / f   −x_l y_l / f       −x_l ],
A^{Cj} = [ −f   0   x_l ;
            0  −f   y_l ].   (13)
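As a worked illustration of Eq. (12), the sketch below (ours, not the paper's code) builds B and A for a set of pixels in one camera and recovers a least-squares estimate of (Ω, T) from observed flow vectors; this is essentially the pseudo-inverse step that the next paragraph applies to random subsets of displacement vectors:

```python
import numpy as np

def flow_matrices(x, y, f):
    """B (rotational) and A (translational) flow matrices of Eq. (13) for pixel (x, y)."""
    B = np.array([[x * y / f, -(f + x * x / f),  y],
                  [f + y * y / f, -x * y / f,   -x]])
    A = np.array([[-f, 0.0, x],
                  [0.0, -f, y]])
    return B, A

def estimate_motion(pixels, flows, inv_depths, f):
    """Least-squares (Omega, T): stack v = B*Omega + d*A*T over all pixels."""
    rows, rhs = [], []
    for (x, y), v, d in zip(pixels, flows, inv_depths):
        B, A = flow_matrices(x, y, f)
        rows.append(np.hstack([B, d * A]))   # unknowns ordered as (Omega, T)
        rhs.append(np.asarray(v, dtype=float))
    J = np.vstack(rows)
    b = np.concatenate(rhs)
    sol, *_ = np.linalg.lstsq(J, b, rcond=None)
    return sol[:3], sol[3:]                  # Omega_hat, T_hat
```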
Similar conditions hold for camera C_i. We use Eq. (12) to model the noise sensitivity of X_2^{C0}, as we did for X_1^{C0} in Eqs. (9)-(11). This allows us to weigh the suitability of each camera for tracking a particular object. In the case of multibody structure and motion and due to the reasons mentioned in the introduction, it is quite feasible to end up with degenerate situations of objects whose motion estimation is ill-conditioned from a particular viewpoint. If we model T^{Cj}, Ω^{Cj} as being corrupted by Gaussian noise, we can view X_2^{C0} as a function of n_{x_r}, n_{y_r}, n_{x_l}, n_{y_l}, n_{T^{Cj}}, n_{Ω^{Cj}}, n_{T^{Ci}}, n_{Ω^{Ci}}, where n_{T^{Cj}}, n_{Ω^{Cj}}, n_{T^{Ci}}, n_{Ω^{Ci}} denote zero mean Gaussian noise vectors. For each camera pair (C_j, C_i) and for each corresponding pixel pair p_j, p_i in the two cameras, we obtain approximations for T^{Cj}, Ω^{Cj}, T^{Ci}, Ω^{Ci} – denoted T̂^{Cj}, Ω̂^{Cj}, T̂^{Ci}, Ω̂^{Ci} – and the variances of n_{T^{Cj}}, n_{Ω^{Cj}}, n_{T^{Ci}}, n_{Ω^{Ci}} as follows: For each local image region centered at p_α, α ∈ {i, j}, or for each image region containing p_α and undergoing independent rigid motion – estimated using any popular motion segmentation algorithm – we estimate T̂^{Cα}, Ω̂^{Cα}, the approximation of the camera's translational and rotational velocity that would lead to the motion flow observed in that particular image region using camera pair (C_j, C_i). For each such image region, we use a least squares pseudo-inverse based approach on a random subset of the estimated displacement vectors to approximate the translational and rotational velocity. We repeat this approach a number of times; the mean of the results is used as T̂^{Cα}, Ω̂^{Cα}, and their variance provides an estimate for the variance used in the noise model described above. If we take the partial derivatives of X_2^{C0}, Y_2^{C0}, Z_2^{C0} with respect to the above mentioned random variables and expand around x̂_l, x̂_r, ŷ_l, ŷ_r, T̂^{Cj}, Ω̂^{Cj}, T̂^{Ci}, Ω̂^{Ci}, we obtain the desired expressions for Var(X_2^{C0}), Var(Y_2^{C0}) and Var(Z_2^{C0}). In the appendix we list the derived expressions for these variances. The above mentioned variances are referred to
as the “top-down” information. Note that in our experiments, when modeling Var(W_2^{C0} − W_1^{C0}), we make the assumption of independence between W_1^{C0} and W_2^{C0} for all W ∈ {X, Y, Z}. Notice that Var(X_2^{C0}), Var(Y_2^{C0}) and Var(Z_2^{C0}) are calculated using the derivatives of velocities v_x^{Cj}, v_y^{Cj} and might have very different magnitudes from Var(X_1^{C0}), Var(Y_1^{C0}), Var(Z_1^{C0}). To guarantee the numerical stability of the covariance matrices, we perform two simple modifications to the top-down variances. We first set an upper bound maxvar on each of the variances by setting Var(W_1^{C0}) ← min(Var(W_1^{C0}), maxvar), Var(W_2^{C0} − W_1^{C0}) ← min(Var(W_2^{C0} − W_1^{C0}), maxvar). Secondly, for each W ∈ {X, Y, Z} and each pixel in C_0, we scale the variances Var(W_2^{C0}) acquired across all camera pairs by c · min_{all pairs}(Var(W_1^{C0})) / min_{all pairs}(Var(W_2^{C0})) for some constant c (we set c = 2 in our experiments). For each pair (C_j, C_i), we also estimate the sample variances Var(X_1^{C0}), Var(Y_1^{C0}), Var(Z_1^{C0}), Var(X_2^{C0} − X_1^{C0}), Var(Y_2^{C0} − Y_1^{C0}) and Var(Z_2^{C0} − Z_1^{C0}) by using the samples in ∪_{p∈Δ_{p_0}} V(p, C_j, C_i) and using the mean of the vectors in ∪_{p∈Δ_{p_0}} ∪_{j=0}^{N} ∪_{i=1,i>j}^{N} V(p, C_j, C_i) as the sample mean. We refer to these sample variances as the “bottom-up” information. We define the final covariance matrix corresponding to each vector V(p_0, C_j, C_i) as a linear combination of their corresponding top-down and bottom-up variances. For each point p_0 in camera C_0 and by using the two cameras C_j, C_i for depth estimation, we use the set ∪_{p∈Δ_{p_0}} V(p, C_j, C_i), in conjunction with the variances defined above, to model the covariance matrix of V(p_0, C_j, C_i) as given by Fig. 2, where 0 ≤ a ≤ 1. Assume we have n vectors V_{i(1),j(1)}, ..., V_{i(n),j(n)}, where for each k ∈ {1, ..., n}, V_{i(k),j(k)} is the average of all the vectors in ∪_{p∈Δ_{p_0}} V(p, C_{j(k)}, C_{i(k)}). Also, with each of the vectors V_{i(k),j(k)} we associate a covariance matrix N_k indicating our confidence in this measure, as described in this section. If we ignore any potential cross-correlation between the n vectors, the Best Linear Unbiased Estimator (BLUE) [9] is the vector X that minimizes the sum of the Mahalanobis distances Σ_{k=1}^{n} D(X, V_{i(k),j(k)}, N_k). It can be shown that X̂^T = (V_{i(1),j(1)}^T N_1^{−1} + ... + V_{i(n),j(n)}^T N_n^{−1})(N_1^{−1} + ... + N_n^{−1})^{−1}. In the next section we extend our approach to camera rigs with arbitrary intrinsic and extrinsic parameters.
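The BLUE combination above is an inverse-covariance weighted average; a minimal sketch (ours, assuming symmetric, invertible covariances) is:

```python
import numpy as np

def blue_fuse(vectors, covariances):
    """Best Linear Unbiased Estimate of a 6D state from per-camera-pair
    measurements V_k with covariances N_k (cross-correlations ignored)."""
    info_sum = np.zeros((6, 6))
    weighted_sum = np.zeros(6)
    for V, N in zip(vectors, covariances):
        N_inv = np.linalg.inv(N)
        info_sum += N_inv
        weighted_sum += N_inv @ np.asarray(V, dtype=float)
    return np.linalg.solve(info_sum, weighted_sum)
```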
5 Arbitrary Camera Rig Setup
Let us suppose that for a camera pair (C_j, C_i) with intrinsic camera parameters (K_j, K_i) and for pixels p_j = (x_l, y_l)^T, p_i = (x_r, y_r)^T imaging a common scene point P = (X_1^{C0}, Y_1^{C0}, Z_1^{C0})^T, the following equations hold:

x_l = [K_j^{1,1} R_{j,0}^{1,1} (X_1^{C0} − T_{0,j}^x) + K_j^{1,2} R_{j,0}^{1,2} (Y_1^{C0} − T_{0,j}^y) + K_j^{1,3} R_{j,0}^{1,3} (Z_1^{C0} − T_{0,j}^z)] / [K_j^{3,3} R_{j,0}^{3,3} (Z_1^{C0} − T_{0,j}^z)]
y_l = [K_j^{2,1} R_{j,0}^{2,1} (X_1^{C0} − T_{0,j}^x) + K_j^{2,2} R_{j,0}^{2,2} (Y_1^{C0} − T_{0,j}^y) + K_j^{2,3} R_{j,0}^{2,3} (Z_1^{C0} − T_{0,j}^z)] / [K_j^{3,3} R_{j,0}^{3,3} (Z_1^{C0} − T_{0,j}^z)]   (14)

x_r = [K_i^{1,1} R_{i,0}^{1,1} (X_1^{C0} − T_{0,i}^x) + K_i^{1,2} R_{i,0}^{1,2} (Y_1^{C0} − T_{0,i}^y) + K_i^{1,3} R_{i,0}^{1,3} (Z_1^{C0} − T_{0,i}^z)] / [K_i^{3,3} R_{i,0}^{3,3} (Z_1^{C0} − T_{0,i}^z)]
y_r = [K_i^{2,1} R_{i,0}^{2,1} (X_1^{C0} − T_{0,i}^x) + K_i^{2,2} R_{i,0}^{2,2} (Y_1^{C0} − T_{0,i}^y) + K_i^{2,3} R_{i,0}^{2,3} (Z_1^{C0} − T_{0,i}^z)] / [K_i^{3,3} R_{i,0}^{3,3} (Z_1^{C0} − T_{0,i}^z)]   (15)
where K_j^{m,n} / R_{j,0}^{m,n} denote the (m, n)-th entry of K_j / R_{j,0}. As we did in Eqs. (3)-(8), if |−T_{0j}^x + T_{0i}^x| ≥ |−T_{0j}^y + T_{0i}^y|, we can express X_1^{C0} in terms of x_l, x_r, y_r. Conversely, if |−T_{0j}^x + T_{0i}^x| < |−T_{0j}^y + T_{0i}^y|, we can express X_1^{C0} in terms of y_l, y_r, x_r. Thus, we can define a function g(p_j, p_i) = X_1^{C0} with respect to camera C_0's coordinate system. By using g(·) and the approach described in Section 4, we can obtain the desired variance approximations. We also need to redefine Eqs. (12)-(13) in order to obtain variance estimates for the motion error. We will only deal with the case of camera C_j, as the case of camera C_i is similar. As indicated in Section 4, S^{Cj} = −T^{Cj} − Ω^{Cj} × P^{Cj}. Then:

(v_x^{Cj}, v_y^{Cj})^T = ( d/dt [K_j^{1,1} X^{Cj}/Z^{Cj} + K_j^{1,2} Y^{Cj}/Z^{Cj} + K_j^{1,3}],  d/dt [K_j^{2,2} Y^{Cj}/Z^{Cj} + K_j^{2,3}] )^T   (16)

assuming K_j^{2,1} = 0, K_j^{3,3} = 1. The derivatives are taken with respect to time t, and by using the expression for S^{Cj} we can express Eq. (16) in terms of T^{Cj} and Ω^{Cj}. Then the variance derivation proceeds as described in Section 4. The derivatives can be determined analytically, or via common numerical methods such as finite differences.
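Since the derivatives are left to either analytic or numerical evaluation, one hedged way to implement the numerical route is a central-difference Jacobian followed by first-order propagation; the sketch below is our own and assumes independent inputs with a diagonal noise model:

```python
import numpy as np

def numerical_jacobian(g, x0, eps=1e-5):
    """Central-difference Jacobian of a vector-valued function g at x0."""
    x0 = np.asarray(x0, dtype=float)
    f0 = np.asarray(g(x0), dtype=float)
    J = np.zeros((f0.size, x0.size))
    for k in range(x0.size):
        dx = np.zeros_like(x0)
        dx[k] = eps
        J[:, k] = (np.asarray(g(x0 + dx)) - np.asarray(g(x0 - dx))) / (2 * eps)
    return J

def propagate_variance(g, x0, input_variances):
    """First-order output covariance J diag(var) J^T for independent inputs."""
    J = numerical_jacobian(g, x0)
    return J @ np.diag(input_variances) @ J.T
```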
6 Experiments
We present our camera setup and results in Figs. 1, 3, 4. We test our approach on a number of synthetic and non-synthetic datasets. Synthetic dataset (i) consists of a 30cm × 30cm planar surface on a black background (Fig. 1(c),(d)) centered at camera C_0, moving in depth, along the optical axis, by 15cm per frame. Synthetic dataset (ii) consists of the planar surface, rotated by 4 degrees around the optical axis between each frame. Synthetic dataset (iii) consists of a (2cm, 2cm) translation of the planar surface, parallel to the image plane, between each frame. The camera setup is similar to that of Fig. 1(a). All cameras C_i, i > 0 are radially distributed around camera C_0 at a radius of 12cm, have a focal length of 4mm, and have corrupting Gaussian noise added to their images. We fuse all (C_0, C_i) camera pairs and demonstrate the performance of the algorithm with an increasing object range from 300cm to 800cm by setting a = 0.5 in Fig. 2. The stereo correspondence and optical flow algorithm used is described in [13] and is available online from its authors.¹ Our results are illustrated in Fig. 3(a)-(f). We also test our algorithm using a five camera rig, as shown in Fig. 1(b). The corresponding results are presented in Fig. 4(a)-(h). In the synthetic data set we used the entire planar surface to estimate each T̂^{Cα}, Ω̂^{Cα}
¹ http://www.cs.umd.edu/users/ogale/download/code.html
[Fig. 3, panels (a)-(f): plots of the RMS error of 3D reconstruction and of 3D motion (in cm) versus the distance of the object from the camera (300-800 cm), for the object shift [0;0;15], the 4-degree rotation around the optical axis, and the object shift [2;2;0], each at SNR 4 dB with no calibration error. See the caption below.]
Fig. 3. (a)-(f): The results of our tests on the synthetic dataset. The x-axes represent the depth of the object in cm, and the y-axes represent the RMS error of the stereo reconstructed coordinates and the 3D motion vector (in cm). The RMS error for a particular camera pair is calculated by estimating the error across all pixels in the base camera C_0 that fall within the textured region. The solid/dashed lines correspond to the errors of our information fusion based approach using the BLUE/mean estimator, and the boxplots represent the distribution of the errors across each of the camera pairs used. The red crosses represent outliers. Note that in some figures the outliers are not displayed as they fall outside the vertical range of our error axes. (a),(b) correspond to the stereo reconstruction and 3D motion error respectively, when the planar object was translated by 15cm in depth along the optical axis. (c),(d) correspond to the stereo reconstruction and 3D motion error respectively, when the planar object was rotated by 4 degrees around the optical axis between frames. (e),(f) correspond to the stereo reconstruction and 3D motion error respectively when the translation occurred parallel to the image plane. The object was translated by 2cm along the x and y axes of the world coordinate system. Notice how, even though gross outliers exist in most of the figures, the effect of those outliers on the estimated scene structure and motion is minimal in general. We also performed a number of experiments with modest errors in the external parameters' calibration and similar observations were made. The mean RMS error of the stereo reconstruction using the information fusion/mean approach for all instances of the reconstructed planar surface is 2.05 ± 1.71/3.71 ± 3.87 cm respectively. The respective values for the motion data are 2.35 ± 1.43/2.56 ± 1.41 cm. In both cases the improvement compared to the mean approach is statistically significant using a paired-samples t-test (p ≈ 0.01).
Fig. 4. (a)-(h):Experimental results from an image sequence showing a robotic wheelchair that is equipped with a 6-d.o.f. robotic arm. The robotic arm is moving diagonally towards the top left image corner. (a)-(b): Adjacent frames from the respective sequence (before correcting for radial/tangential distortions). (c): The reconstructed scene depth using a single pair of cameras to reconstruct each scene. Image regions in black denote pixels where the left-right consistency constraint could not be enforced. (d): The reconstructed scene depth of frames (a),(b) using the five camera rig setup shown in Fig. 1 in conjunction with our information fusion based algorithm. The colorbar depths of (c),(d) represent mm. Notice the significant decrease in occlusions. (e)-(f): Image motion of the sequence after projecting the estimated 3D motion on the image plane using a single camera pair in conjunction with our information fusion based algorithm. Image motion is represented in pixel units. (g)-(h): Image motion of the respective image sequences after using the five camera rig setup shown in Fig. 1 in conjunction with our information fusion based algorithm. (e),(g): The image motion component parallel to the horizontal axis and (f),(h): The image motion component parallel to the vertical axis.
(simulating perfect motion segmentation) and in the non-synthetic data set we used 21 × 21 pixel regions centered at the current pixel of interest. From Fig. 3, we observe that the multi-camera approach provides a significant decrease of the RMS error in both structure and motion estimation compared to the errors achieved using the stereo camera pairs. In almost all cases the quality of the results surpasses that obtained by any one of the camera pairs. As indicated in the caption of Fig. 3 the BLUE estimator provides better results than the results obtained by the mean vector across all cameras and their neighborhoods. For both the structure and motion data the improvement is judged statistically significant. In Fig. 3(a),(b) where the plane is moving along the z-axis and we are dealing with ill-conditioned motion estimation near the focus of expansion, we observe significant improvements. In Fig. 3(c)-(d) we present results after a pure rotation of the plane around the optical axis. It is interesting
to notice, however, that for depths 700cm, 800cm the optical flow estimation algorithm we used performs poorly on about half of our camera pairs, thus resulting in a large RMS error, as the boxplots show. Our algorithm is capable of ignoring the erroneous data and gives us a relatively robust estimate of the 3D motion. This indicates that if we are using a multi-camera rig with cameras that break down quite often and provide gross outliers, our algorithm remains reliable. We observe that the mean estimator is severely affected by outliers at various depths, while the information fusion based algorithm is more robust in the presence of outliers. In Fig. 4(a)-(h) we compare the performance of our algorithm using a two camera rig versus a five camera rig (Fig. 1(b)). The five camera rig consists of two Point Grey Research Bumblebee stereo cameras and a Point Grey Research Flea camera. The coordinate system of the Flea camera is used as our basis coordinate system and represents camera C_0. The two camera rig is represented using the Flea camera and one of the four Bumblebee monocular cameras. The robotic wheelchair presented in Fig. 4 is equipped with a 6-d.o.f. robotic arm providing a number of independent rigid motions to test our algorithm. We used the algorithm described in [13] to determine the correspondences. We notice a dramatic increase in the number of pixels satisfying the left-right consistency constraint as the number of cameras in our rig increases.
7 Conclusion
We presented an algorithm for multi-camera and multi-body structure and motion. The algorithm combines top-down and bottom-up knowledge on scene structure and motion to model the respective uncertainties. An information fusion based algorithm uses these uncertainties to obtain competitive results demonstrating that our algorithm performs robustly in situations where a number of camera pairs provide severely degraded results. Such situations arise in practice due to hardware failures and poor environmental conditions. We are currently investigating the use of other information fusion algorithms for solving this problem [10]. Some potential application areas in future research are dynamic scene interpretation, vision based simultaneous localization and mapping (SLAM) and dynamic rendering. Acknowledgments. JKT holds the Canada Research Chair in Computational Vision and gratefully acknowledges its financial support. AA would also like to thank the Natural Sciences and Engineering Research Council of Canada (NSERC) for its financial support through the PGS-D scholarship program.
References 1. Schindler, K., Suter, D.: Two-view multibody structure and motion. In: Proc. Conf. Computer Vision and Pattern Recognition (2005) 2. Zhang, W., Kosecka, J.: Nonparametric estimation of multiple structures with outliers. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, Springer, Heidelberg (2006)
3. Hanna, K.J., Okamoto, N.E.: Combining stereo and motion analysis for direct estimation of scene structure. In: Proc. Int. Conf. on Computer Vision (1993) 4. Richards, W.: Structure from stereo and motion. Journal of the Optical Society of America A. 2(2), 343–349 (1985) 5. Waxman, A., Duncan, J.: Binocular image flows. IEEE Trans. Patt. Anal. Mach. Intell. 8(6), 715–729 (1986) 6. Grosso, E., Tistarelli, M.: Active dynamic stereo vision. IEEE Trans. Patt. Anal. Mach. Intell. 17(11), 1117–1128 (1995) 7. Mandelbaum, R., Salgian, G., Sawhney, H.: Correlation-based estimation of egomotion and structure from motion and stereo. In: Proc. Int. Conf. on Computer Vision (1999) 8. Zhang, Y., Kambhamettu, C.: Integrated 3D scene flow and structure recovery from multiview image sequences. In: IEEE Conf. Computer Vision and Pattern Recognition, IEEE Computer Society Press, Los Alamitos (2000) 9. Singh, A., Allen, P.: Image-flow computation: An estimation-theoretic framework and a unified perspective. CVGIP: Image Understanding 56(2), 152–177 (1992) 10. Comaniciu, D.: Nonparametric information fusion for motion estimation. In: IEEE Conf. Computer Vision and Pattern Recognition, IEEE Computer Society Press, Los Alamitos (2003) 11. Comaniciu, D., Meer, P.: Mean shift: A robust approach toward feature space analysis. IEEE Trans. Patt. Anal. Mach. Intell. 24, 603–619 (2002) 12. Neumann, J., Fermuller, C., Aloimonos, Y.: A hierarchy of cameras for 3D photography. Computer Vision and Image Understanding 96, 274–293 (2004) 13. Ogale, A.S., Aloimonos, Y.: A roadmap to the integration of early visual modules. International Journal of Computer Vision: Special Issue on Early Cognitive Vision 72(1), 9–25 (2007)
Appendix

In this section we derive expressions for Var(X_2^{C0}), Var(Y_2^{C0}), Var(Z_2^{C0}) for the case |−T_{0j}^x + T_{0i}^x| ≥ |−T_{0j}^y + T_{0i}^y|. From Eqs. (9)-(11) we can derive the corresponding expressions:

Var(X_2^{C0}) ≈ ((−T_{0j}^x + T_{0i}^x) x̂_l')^2 / (x̂_l' − x̂_r')^4 Var(x_r') + ((T_{0j}^x − T_{0i}^x) x̂_r')^2 / (x̂_l' − x̂_r')^4 Var(x_l'),
Var(Y_2^{C0}) ≈ (−T_{0j}^x + T_{0i}^x)^2 / (x̂_l' − x̂_r')^2 Var(y_r') + ((−T_{0j}^x + T_{0i}^x) ŷ_r')^2 / (x̂_l' − x̂_r')^4 Var(x_r') + ((T_{0j}^x − T_{0i}^x) ŷ_r')^2 / (x̂_l' − x̂_r')^4 Var(x_l'),
Var(Z_2^{C0}) ≈ ((−T_{0j}^x + T_{0i}^x) f)^2 / (x̂_l' − x̂_r')^4 Var(x_r') + ((T_{0j}^x − T_{0i}^x) f)^2 / (x̂_l' − x̂_r')^4 Var(x_l').

We have previously noted that (x_r', y_r') = (x_r, y_r) + (v_x^{Ci}, v_y^{Ci}), (x_l', y_l') = (x_l, y_l) + (v_x^{Cj}, v_y^{Cj}). If we let a ∈ {x, y}, b ∈ {r, l} and k = i/j if b = r/l, we obtain the following approximations for Var(x_r'), Var(x_l'), Var(y_r'):

Var(a_b') ≈ (∂a_b'/∂x_r)^2 Var(x_r) + (∂a_b'/∂x_l)^2 Var(x_l) + (∂a_b'/∂y_r)^2 Var(y_r) + (∂a_b'/∂Ω_x^{Ck})^2 Var(Ω_x^{Ck}) + (∂a_b'/∂Ω_y^{Ck})^2 Var(Ω_y^{Ck}) + (∂a_b'/∂Ω_z^{Ck})^2 Var(Ω_z^{Ck}) + (∂a_b'/∂T_a^{Ck})^2 Var(T_a^{Ck}) + (∂a_b'/∂T_z^{Ck})^2 Var(T_z^{Ck}).

By using Eqs. (12)-(13) we can derive the expressions for the partial derivatives. By expanding these expressions for the partial derivatives around x̂_l, x̂_r, ŷ_l, ŷ_r, T̂^{Ci}, Ω̂^{Ci}, T̂^{Cj}, Ω̂^{Cj}, as appropriate, we can obtain the desired expressions.
Task Scheduling in Large Camera Networks

Ser-Nam Lim1, Larry Davis2, and Anurag Mittal3

1 Cognex Corp., Natick, MA, USA
2 CS Dept., University of Maryland, College Park, Maryland, USA
3 CSE Dept., IIT, Madras, India
Abstract. Camera networks are increasingly being deployed for security. In most of these camera networks, video sequences are captured, transmitted and archived continuously from all cameras, creating enormous stress on available transmission bandwidth, storage space and computing facilities. We describe an intelligent control system for scheduling Pan-Tilt-Zoom cameras to capture video only when task-specific requirements can be satisfied. These videos are collected in real time during predicted temporal “windows of opportunity”. We present a scalable algorithm that constructs schedules in which multiple tasks can possibly be satisfied simultaneously by a given camera. We describe two scheduling algorithms: a greedy algorithm and another based on Dynamic Programming (DP). We analyze their approximation factors and present simulations that show that the DP method is advantageous for large camera networks in terms of task coverage. Results from a prototype real time active camera system however reveal that the greedy algorithm performs faster than the DP algorithm, making it more suitable for a real time system. The prototype system, built using existing low-level vision algorithms, also illustrates the applicability of our algorithms.
1 Introduction

Large scale camera network systems are being increasingly deployed for purposes that include security, traffic monitoring, etc. These systems typically consist of a large number of cameras, which can either be active (specifically, Pan-Tilt-Zoom or PTZ cameras) or static, transmitting in real time video streams to processing and/or storage systems. Our interest is in controlling these cameras to acquire video segments that satisfy task-specific constraints. For example, one may wish to acquire at least a few images of each person who enters a given region, capture video segments lasting k seconds and containing well-magnified facial images for facial recognition, or capture k second long video segments of the side view of a person for gait modeling and recognition. By intelligently transmitting and storing only video segments satisfying task requirements, we can reduce the bandwidth requirements and storage space significantly and increase the efficiency and effectiveness with which the collected video segments can be processed. The control of the cameras to collect these video segments is a challenging problem. The system must detect and track moving objects both within and between cameras in a sensing stage, a problem which is not fully solved yet. Papers such as [1,2,3] deal
This research was funded in part by the U.S. Government VACE program.
with tracking under occlusions, and other papers such as [4,5] describe algorithms for tracking across non-overlapping views. A second challenge is to predict, given a set of tracked targets, the time intervals during which video segments meeting the requirements of the tasks can be collected from available cameras. These requirements include ensuring (1) that the associated object is unobstructed by other objects, (2) that it is moving in a direction suitable for the task, (3) that it can be captured in a field of view of the camera (PTZ) assigned to collect its video segments, and (4) that the collected video segments must satisfy task-specific minimum resolution and duration. For example, if the task is to collect facial images, then we must ensure that the video segments are collected only during time intervals when the person is predicted to be walking towards the camera and unobstructed by other moving objects. This can be done using the observed tracks of the person and other moving objects, predicting their trajectories into the future, and then identifying periods of crossings between the predicted trajectories with respect to each of the available cameras. The complements of these periods of crossings are visibility time intervals during which the person is unobstructed, and camera settings can be determined within these temporal visibility predictions to capture the person in a well-magnified frontal image or video sequence satisfying the four requirements above. This problem has been addressed in earlier work. [6] described the construction of so-called “Task Visibility Intervals” (TVIs) and “Multiple Task Visibility Intervals” (MTVIs), that represent time-varying camera setting ranges that can be used to collect video segments satisfying one (TVI) or multiple tasks simultaneously (MTVIs). A TVI is a 6-tuple:

(c, (T, o), [r, d], Valid_{ψ,φ,f}(t)),   (1)

where c represents a PTZ camera, (T, o) is a (task, object) pair – T is the index of a task to be accomplished and o is the index of the object to which the task is to be applied – and [r, d] is a future time interval during which task requirements can be satisfied using camera c. Then, for each time instance t ∈ [r, d], Valid_{ψ,φ,f}(t) is the range of valid combinations of the pan angle (ψ), tilt angle (φ) and focal length (f) settings that camera c can employ to capture object o at time t. The tasks themselves are 3-tuples:

(p, α, β),   (2)

where p is the required duration of the task, α is the orientation of the object relative to the optical axis of the camera used to accomplish the task, and β is the minimum image resolution needed to accomplish the task. [6] also described the composition of TVIs into MTVIs, time intervals during which collections of tasks can be satisfied simultaneously by one camera. A set of n TVIs, each represented in the form (c, (T_i, o_i), [r_i, d_i], Valid_{ψ_i,φ_i,f_i}(t)) for TVI i [Eqn. 1], can be combined into a valid MTVI, represented as:

(c, ∪_{i=1...n} (T_i, o_i), ∩_{i=1...n} [r_i, d_i], ∩_{i=1...n} Valid_{ψ_i,φ_i,f_i}(t)),   (3)
such that:

∩_{i=1...n} [r_i, d_i] ≠ ∅,   (4)

i.e., there is some common time interval during which they can be scheduled, and:

|∩_{i=1...n} [r_i, d_i]| ≥ p_max,

where p_max is the largest processing time among the tasks, and for all t ∈ ∩_{i=1...n} [r_i, d_i],

∩_{i=1...n} Valid_{ψ_i,φ_i,f_i}(t) ≠ ∅,   (5)
i.e., the tasks can be captured with common PTZ settings. Besides [6], other work that has focused on temporal analysis and planning for camera scheduling includes [7,8], which discuss a dynamic sensor planning system, called the MVP system. They were concerned with determining occlusion-free viewpoints of a target. This involves handling occlusions between the target and the different moving objects in a scene, each of which generates a swept volume in temporal space. Using a temporal interval search, they divide the temporal intervals into halves while searching for a viewpoint that is not occluded in time by these sweep volumes. This is then integrated with other constraints such as focus and field of view in [8]. The culmination of this work is found in [7], where the algorithms are applied to an active robot work cell. In this paper, we will address the problem of job scheduling given the TVIs and MTVIs generated as in [6]. In general, job scheduling problems are NP-hard, and approximation algorithms have to be employed. We first analyze the approximation factor of a greedy scheduling algorithm (as a function of the number of cameras), which reveals that its performance deteriorates significantly as the number of cameras increases. We then describe a Dynamic Programming (DP) approximation algorithm with an approximation factor that is much better than the greedy approach. The performance advantage of the DP algorithm is confirmed by simulations. Finally, we describe a prototype real time active camera system. A scheduler controls PTZ cameras in real time to capture video segments based on automatically constructed TVIs and MTVIs. While the prototype system includes only a small number of cameras due to limited resources, the results illustrate the applicability of the algorithms for large scale camera networks.
2 Single-Camera Scheduling

We first study the scheduling problem when only a single camera is used. This will be extended to the problem of multiple cameras in the next section. Also, we will limit our analysis to non-preemptive schedules in this paper. We introduce the following theorems that make the single-camera scheduling problem tractable:

Theorem 1. Let the slack for the i-th task be δ_i = [t_{δ_i}^−, t_{δ_i}^+], and define δ_max = max(|δ_i|) and p_min as the smallest processing time among all (M)TVIs for some camera. Then, if |δ_max| < p_min, any feasible schedule for the camera is ordered by the slacks' start times.
Proof. Consider that the slack δ_1 = [t_{δ_1}^−, t_{δ_1}^+] precedes δ_2 = [t_{δ_2}^−, t_{δ_2}^+] in a schedule and t_{δ_1}^− > t_{δ_2}^−. Let the processing time corresponding to δ_1 be p_1. Then t_{δ_1}^− + p_1 > t_{δ_2}^− + p_1. We know that if t_{δ_1}^− + p_1 > t_{δ_2}^+, then the schedule is infeasible. This happens if t_{δ_2}^+ ≤ t_{δ_2}^− + p_1, i.e., t_{δ_2}^+ − t_{δ_2}^− ≤ p_1. Given that |δ_max| < p_min, t_{δ_2}^+ − t_{δ_2}^− ≤ p_1 is true.
Theorem 1 implies that if |δ_max| < p_min, we can limit our attention to feasible schedules that are ordered by the slacks' start times. This is a reasonably close assumption in many cases, since the time to move the cameras and capture an object is generally quite large compared to the slack times in crowded scenes, where such scheduling matters most. This assumption allows us to construct a Directed Acyclic Graph (DAG), where each (M)TVI is a node with an incoming edge from a common source node and an outgoing edge to a common sink node, with the weights of the outgoing edges initialized to zero. An outgoing edge from one (M)TVI node to another exists iff the slack's start time of the first node precedes that of the second (Theorem 1), which can however be removed if it makes the schedule infeasible. Consider the following theorem and corollary:

Theorem 2. A schedule – a sequence of n (M)TVIs each with slack δ_i = [t_{δ_i}^−, t_{δ_i}^+], where i = 1...n represents the order of execution – is feasible if t_{δ_n}^+ − t_{δ_1}^− ≥ (Σ_{i=1...n−1} p_i) − (Σ_{i=1...n−1} |δ_i|), p_i being the processing time of the i-th (M)TVI in the schedule.

Proof. For the schedule to be feasible the following must be true: t_{δ_1}^− + p_1 ≤ t_{δ_2}^+, t_{δ_2}^− + p_2 ≤ t_{δ_3}^+, ..., t_{δ_{n−1}}^− + p_{n−1} ≤ t_{δ_n}^+. Summing them up gives t_{δ_1}^− + t_{δ_2}^− + ... + t_{δ_{n−1}}^− + Σ_{i=1...n−1} p_i ≤ t_{δ_2}^+ + t_{δ_3}^+ + ... + t_{δ_n}^+, which can then be simplified as t_{δ_n}^+ − t_{δ_1}^− ≥ (Σ_{i=1...n−1} p_i) − (Σ_{i=1...n−1} |δ_i|). The condition t_{δ_1}^− + p_1 ≤ t_{δ_2}^+, t_{δ_2}^− + p_2 ≤ t_{δ_3}^+, ..., t_{δ_{n−1}}^− + p_{n−1} ≤ t_{δ_n}^+ is, however, only a sufficient condition for a feasible schedule.

Corollary 1. Define a new operator ≺, such that if δ_1 (= [t_{δ_1}^−, t_{δ_1}^+]) ≺ δ_2 (= [t_{δ_2}^−, t_{δ_2}^+]), then t_{δ_1}^− + p_1 ≤ t_{δ_2}^+. Consider a schedule of (M)TVIs with slacks δ_{1...n}. The condition δ_1 ≺ δ_2, δ_2 ≺ δ_3, ..., δ_{n−1} ≺ δ_n is necessary for the schedule to be feasible. Conversely, if a schedule is feasible, then δ_1 ≺ δ_2, δ_2 ≺ δ_3, ..., δ_{n−1} ≺ δ_n. Proof is omitted since it follows easily from Theorem 2.
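To make Theorem 2 and Corollary 1 operational, a small sketch of the pairwise feasibility test could look as follows (our own code; the function names are ours, and the operator symbol above is a stand-in for the one used in the paper):

```python
def precedes(slack1, p1, slack2):
    """Pairwise relation of Corollary 1: slack1 'precedes' slack2 when the
    start of slack1 plus its processing time fits before slack2 ends."""
    start1, _end1 = slack1
    _start2, end2 = slack2
    return start1 + p1 <= end2

def chain_is_feasible_candidate(slacks, proc_times):
    """Necessary condition for feasibility: every consecutive pair satisfies
    the relation above (slacks assumed sorted by start time, Theorem 1)."""
    return all(precedes(slacks[i], proc_times[i], slacks[i + 1])
               for i in range(len(slacks) - 1))
```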
Due to Corollary 1, an edge between two (M)TVI nodes can be removed if it violates the ≺ relationship, since it can never be part of a feasible schedule. Using such a DAG, a Dynamic Programming (DP) algorithm can be used to solve the single-camera scheduling problem. Consider the following set of (M)TVIs that have been constructed for a given camera, represented by the tasks (T_{1...6}) they satisfy and sorted in order of their slacks' start times: {node_1 = {T_1, T_2}, node_2 = {T_2, T_3}, node_3 = {T_3, T_4}, node_4 = {T_5, T_6}}, where the set of nodes in the DAG in Figure 1 is given as node_{i=1...4}. DP is run by first initializing paths of length 1 starting from each of the (M)TVI nodes to the sink, all with “merit” 0. At each subsequent path length, the next node node_next chosen for a given node node_curr in the current iteration is:

node_next = argmax_{n ∈ S_curr2next} |S_n ∪ Tasks(node_curr)|,   (6)
Fig. 1. Single-camera scheduling: DAG formed from the set {node_1 = {T_1, T_2}, node_2 = {T_2, T_3}, node_3 = {T_3, T_4}, node_4 = {T_5, T_6}}. The weights between (M)TVI nodes are determined on the fly during DP. Assume that, in this example, the ≺ relationship is satisfied for the edges between the (M)TVI nodes.
where S_curr2next is the set of nodes that have valid paths starting from them in the previous iteration and to which node_curr has an outgoing edge. S_n is defined as the set of tasks covered by the path (in the previous iteration) starting from n, and Tasks() gives the set of tasks covered by the (M)TVI associated with node_curr. So, for example, from node_1, paths of length 2 exist by moving on to either one of node_{2...4}, with the move to node_2, node_3 and node_4 covering {T_1, T_2, T_3} (merit=3), {T_1, T_2, T_3, T_4} (merit=4) and {T_1, T_2, T_5, T_6} (merit=4) respectively. We choose the path of length 2 from node_1 to node_3. Iterations are terminated when there is only one path left that starts at the source node or a path starting at the source node covers all the tasks. In our example, the optimal path becomes node_1 → node_3 → node_4, terminated at paths of length 4 from the sink when all the tasks are covered.
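A compact sketch of this procedure (ours; a simplified variant of the pass described above, with node merits measured as the number of distinct tasks covered and the DAG given as explicit successor lists) is:

```python
def single_camera_dp(nodes, tasks_of, edges):
    """DP over a DAG of (M)TVI nodes (ordered by slack start time).

    nodes:    list of node ids, ordered by slack start time
    tasks_of: dict node -> set of tasks covered by that (M)TVI
    edges:    dict node -> list of feasible successor nodes (Corollary 1)
    Returns (covered_tasks, path) for the best path found.
    """
    # best[n] = (tasks covered by the best path starting at n, the path itself)
    best = {n: (set(tasks_of[n]), [n]) for n in nodes}     # paths of length 1
    for n in reversed(nodes):                              # extend paths backwards
        candidates = [best[m] for m in edges.get(n, []) if m in best]
        if candidates:
            # rule of Eq. (6): maximize |S_n  union  Tasks(node_curr)|
            cov, tail = max(candidates,
                            key=lambda c: len(c[0] | set(tasks_of[n])))
            best[n] = (cov | set(tasks_of[n]), [n] + tail)
    # the source implicitly connects to every node; take the best overall path
    return max(best.values(), key=lambda c: len(c[0]))
```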
3 Multi-camera Scheduling

While single-camera scheduling using DP is optimal and has polynomial running time, the multi-camera scheduling problem is unfortunately NP-hard. Consequently, computationally feasible solutions can only be obtained with approximation algorithms. We consider both a simple greedy algorithm and a branch and bound-like algorithm.

3.1 Greedy Algorithm

The greedy algorithm iteratively picks the (M)TVI that covers the maximum number of uncovered tasks, subject to schedule feasibility as given by Theorem 2. Under such a greedy scheme, the following is true:

Theorem 3. Given k cameras, the approximation factor for multi-camera scheduling using the greedy algorithm is 2 + kλμ, where the definitions of λ and μ are given in the proof.
Proof. Let G = ∪_{i=1...k} G_i, where G_i is the set of (M)TVIs scheduled on camera i by the greedy algorithm, and let OPT = ∪_{i=1...k} OPT_i, where OPT_i is the set of (M)TVIs assigned to camera i in the optimal schedule. We further define (1) H_1 = ∪_{i=1...k} H_{1,i}, where H_{1,i} is the set of (M)TVIs for camera i that have been chosen by the optimal schedule but not the greedy algorithm, and each of these (M)TVIs contains tasks that are not covered by the greedy algorithm in any of the cameras, (2) H_2 = ∪_{i=1...k} H_{2,i}, where H_{2,i} is the set of (M)TVIs for camera i that have been chosen by the optimal schedule but not the greedy algorithm, and each of these (M)TVIs contains tasks that are also covered by the greedy algorithm, and finally (3) OG = OPT ∩ G. Clearly, OPT = H_1 ∪ H_2 ∪ OG. Then, for h_{j=1...n_i} ∈ H_{1,i}, where n_i is the number of (M)TVIs in H_{1,i}, ∃ g_{j=1...n_i} ∈ G_i such that h_j and g_j cannot be scheduled together based on the requirement given in Theorem 2, else h_j should have been included by G. If Tasks(h_j) ∩ Tasks(g_j) = ∅, then h_j contains only tasks that are not covered by G. In this case, |h_j| ≤ |g_j|, else G would have chosen h_j instead of g_j. Note that the cardinality is defined as the number of unique tasks covered. In the same manner, even if Tasks(h_j) ∩ Tasks(g_j) ≠ ∅, h_j could have replaced g_j unless |h_j| ≤ |g_j|. Consequently, |H_{1,i}| = |h_1 ∪ h_2 ∪ ... ∪ h_{n_i}| ≤ |h_1| + |h_2| + ... + |h_{n_i}| ≤ |g_1| + |g_2| + ... + |g_{n_i}|. Let β_j = |g_j| / |G_i| and λ_i = max(β_j · n_i). This gives |H_{1,i}| ≤ β_1 |G_i| + ... + β_{n_i} |G_i| ≤ λ_i |G_i|. Similarly, we know |H_1| ≤ λ_1 |G_1| + ... + λ_k |G_k| ≤ λ(|G_1| + ... + |G_k|), where λ = max(λ_i). Introducing a new term, γ_i = |G_i| / |G|, and letting μ = max(γ_i), we get |H_1| ≤ kλμ|G|. Since |H_2| ≤ |G| and |OG| ≤ |G|, |OPT| ≤ (2 + kλμ)|G|.

3.2 Branch and Bound Algorithm

The branch and bound approach runs DP in a similar manner as single-camera scheduling, but on a DAG that consists of multiple source-sink pairs (one pair per camera), with the node of one camera's sink node linked to another camera's source node. An example is shown in Figure 2. Then, for a source node s, we define its “upper bounding set” S_s as:

S_s = ∪_{c ∈ S_link} S_c,   (7)
where Slink is the set of cameras for which paths starting from the corresponding sink nodes to s exist in the DAG, and Sc is the set of all tasks that are covered by some (M)TVIs belonging to camera c. Intuitively, such an approach aims to overcome the “shortsightedness” of the greedy algorithm by “looking forward” in addition to backtracking and using the tasks that can be covered by other cameras to influence the (M)TVI nodes chosen for a particular camera. Admittedly, better performance is possibly achievable if “better” upper bounding sets are used, as opposed to blindly using all the tasks that other cameras can cover without taking scheduling feasibility into consideration. The algorithm can be illustrated with the example shown in Figure 2, which shows two cameras, c1 and c2 , and the following sets of (M)TVIs that have been constructed for them, again ordered by the slacks’ start times and shown here by the tasks (T1...4 )
Fig. 2. Multi-camera scheduling: DAG formed from the set {node1 = {T1 , T2 , T3 }, node2 = {T3 , T4 }} for the first camera, and the set {node3 = {T1 , T2 , T3 }} for the second camera
they satisfy. For c_1, the set is {node_1 = {T_1, T_2, T_3}, node_2 = {T_3, T_4}} and for c_2, {node_3 = {T_1, T_2, T_3}}. The DAG that is constructed has two source-sink pairs, one for each camera – (Source_1, Sink_1) belongs to c_1 and (Source_2, Sink_2) to c_2. The camera sinks are connected to a final sink node as shown, with the weights of the edges initialized to zero. Weights between nodes in the constructed DAG are similarly determined on the fly, as in the single-camera scheduling. Directed edges from Sink_2 to Source_1 connect c_1 to c_2. The DP algorithm is run in almost the same manner as single-camera scheduling, except that at paths of length 3 from the final sink node, the link from Source_1 to node_2 is chosen because the upper bounding set indicates that choosing the link potentially covers a larger number of tasks (i.e., the upper bounding set of Source_1, {T_1, T_2, T_3}, combines with the tasks covered by node_2 to form {T_1, T_2, T_3, T_4}). The branch and bound algorithm can be viewed as applying the single-camera DP algorithm, camera by camera in the order given in the corresponding DAG, with the schedule of one camera depending on its upper bounding set. This allows us to derive a potentially better approximation factor than the greedy algorithm, as follows:

Theorem 4. For k cameras, the approximation factor of the branch and bound algorithm is (1 + kμ(1 + u))^k / ((1 + kμ(1 + u))^k − (kμ(1 + u))^k). μ and u are defined as follows. Let G^* = ∪_{i=1...k} G_i^*, where G_i^* is the set of (M)TVIs assigned to camera i by the branch and bound algorithm. Then, μ = max(|G_i^*| / |G^*|) and u = max(u_i), where u_i is the ratio of the cardinality of the upper bounding set of camera i to |G_i^*|.

Proof. Let α be the approximation factor of the branch and bound algorithm. Then, assuming that schedules for G_1^*, ..., G_{i−1}^* have been determined, |G_i^*| ≥ (1/α)(|OPT| − Σ_{j=1}^{i−1} |G_j^*|). Adding Σ_{j=1}^{i−1} |G_j^*| to both sides gives:

Σ_{j=1}^{i} |G_j^*| ≥ (1/α)|OPT| + ((α − 1)/α) Σ_{j=1}^{i−1} |G_j^*|.

A proof by induction shows, after some manipulation:

(α^k / (α^k − (α − 1)^k)) Σ_{j=1}^{k} |G_j^*| ≥ |OPT|.
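For completeness, the manipulation can be sketched as a geometric-series unrolling (our own derivation, not spelled out in the paper); with s_i = Σ_{j=1}^{i} |G_j^*| and s_0 = 0:

```latex
% Unrolling  s_i \ge \frac{1}{\alpha}|OPT| + \frac{\alpha-1}{\alpha}\, s_{i-1}  with  s_0 = 0:
s_k \;\ge\; \frac{|OPT|}{\alpha}\sum_{m=0}^{k-1}\left(\frac{\alpha-1}{\alpha}\right)^{m}
      \;=\; |OPT|\left(1-\left(\frac{\alpha-1}{\alpha}\right)^{k}\right)
      \;=\; |OPT|\,\frac{\alpha^{k}-(\alpha-1)^{k}}{\alpha^{k}},
\qquad\text{so}\qquad
\frac{\alpha^{k}}{\alpha^{k}-(\alpha-1)^{k}}\; s_k \;\ge\; |OPT|.
```

Substituting α = 1 + kμ(1 + u), which the remainder of the proof establishes, then recovers the factor stated in Theorem 4.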
Fig. 3. (a) The approximation factor for the greedy algorithm using 10, 50 and 100 cameras respectively. λ and μ here are as defined in Theorem 3. (b) The same plots for the branch and bound algorithm. Here, the approximation factor depends only on the distribution parameters and not on the number of cameras. u and μ are as defined in Theorem 4.
Fig. 4. The DP algorithm consistently covers more tasks than the greedy algorithm (percentage of tasks covered, plotted against the number of tasks and the number of cameras).
Let H = ∪_{i=1...k} H_i, H_i being the set of (M)TVIs chosen by the optimal schedule on camera i but not by the branch and bound algorithm. The condition |H_i| ≤ |G_i^*| + u_i |G_i^*| is true; otherwise, H_i would have been added to G^* instead. Consequently, |H| ≤ (|G_1^*| + ... + |G_k^*|) + (u_1 |G_1^*| + ... + u_k |G_k^*|) ≤ kμ|G^*| + kuμ|G^*| ≤ kμ(1 + u)|G^*|. Since OPT = OG ∪ H (Theorem 3), we get |OPT| ≤ (1 + kμ(1 + u))|G^*|. Thus, α = 1 + kμ(1 + u).
By expressing the approximation factors of the greedy and branch and bound algorithms as a function of the number of cameras, we see that the branch and bound algorithm theoretically outperforms the greedy algorithm substantially in terms of task coverage. This is illustrated in Figure 3, whereby the approximation factors of the greedy and branch and bound algorithms are plotted as the “distribution” parameters vary when different numbers of cameras are used. These distribution parameters refer to λ and μ in Theorem 3, and μ and u in Theorem 4. They represent how well the tasks are distributed among the cameras and (M)TVIs. The plots show that the greedy algorithm is highly sensitive to the number of cameras, with the approximation factor becoming prohibitively high when the tasks are unevenly distributed. On the other hand, the performance of the branch and bound algorithm depends only on the distribution parameters and is not affected by the number of cameras. Both the single-camera and the branch and bound DP-based multi-camera algorithms have a computational complexity of O(N³), N being the average number of (M)TVIs constructed for a given camera and used in the resulting DAG. On the other hand, the greedy algorithm takes only O(N²) time, which could outweigh the benefits of better scheduling for very large camera networks.
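For reference, the greedy rule of Section 3.1 amounts to roughly the following loop (our own sketch; `is_feasible` stands in for the Theorem 2 check on a camera's current schedule and is not defined here):

```python
def greedy_schedule(mtvis, tasks_of, camera_of, is_feasible):
    """Repeatedly pick the (M)TVI covering the most still-uncovered tasks,
    subject to per-camera schedule feasibility (Theorem 2)."""
    covered, schedules = set(), {}
    candidates = set(mtvis)
    while candidates:
        best = max(candidates,
                   key=lambda m: len(set(tasks_of[m]) - covered))
        candidates.discard(best)
        gain = set(tasks_of[best]) - covered
        cam = camera_of[best]
        if gain and is_feasible(schedules.get(cam, []) + [best]):
            schedules.setdefault(cam, []).append(best)
            covered |= gain
    return schedules, covered
```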
4 Implementation and Experiments

Although we have theoretically found the approximation factors for the scheduling algorithms, it would be interesting for practical purposes to investigate the performance of the greedy algorithm relative to the DP algorithm under “normal” circumstances where we would expect “reasonable” task distribution. For this purpose, we conducted simulations using a scene of size 200m × 200m, and generated moving objects in the scene by randomly assigning to them different starting positions in the scene, sizes and velocities. Cameras are also simulated with calibration data from real cameras. The objects are assumed to be moving in straight lines at constant speeds, and the (M)TVIs for each camera are then constructed and utilized by the scheduler. We conducted simulations for 20, 40, 60, and 80 cameras and 100, 120, 140, 160, 180, and 200 objects, and plot the percentage of the total number of tasks that were captured by both the greedy and DP algorithm. For each object, the task is to capture video segments in which the full body of the object is visible. Since there is only one task for each object, the total number of tasks equals the number of objects. The results are shown in Figure 4. The DP algorithm schedules more tasks than the greedy algorithm by a minimum of 13.55 percent and a maximum of 33.78 percent. Finally, we test our algorithms in a small-scale real time image analysis system. Due to limited resources, building a system with a large number of cameras was not possible. We developed a prototype multi-camera system consisting of four PTZ cameras synchronized by a Matrox four-channel card. For running the experiments, one camera is kept static, so that it can be used for background subtraction and tracking in the sensing stage [9,1]. From the detection and tracking, the system recovers an approximate 3D size estimate of each detected object from the ground plane and camera calibration. This is followed by the planning stage, during which the observed tracks allow the system to predict the future locations of
Fig. 5. (a) The robots are tracked (left and middle image), and the predicted tracks are used to construct the TVIs and MTVIs, which are then used by the scheduler to assign cameras to the (M)TVIs (annotated in the right image). Next, (b) camera 0 captures robot 3, and (c) camera 1 captures robots 0, 1 and 2 simultaneously.
the objects, and to use them for constructing (M)TVIs, which are then scheduled for capture. The predicted position of each detected object on the ground plane is mapped to the PTZ cameras, after which the 3D size estimate of the object is used to construct a rough 3D model of the object for the corresponding PTZ camera. Such a 3D model is utilized to determine valid ranges of PTZ settings during the construction of TVIs. The experiments confirm that the greedy algorithm performs faster than the DP algorithm. This makes the greedy algorithm more suitable for our prototype system. A real time experiment using the greedy scheduler is illustrated in Figure 5. Four remote-controllable 12×14 inch robots move through the scene. Two PTZ cameras were needed to capture the four robots using a (one task) TVI and a three-task MTVI.
5 Conclusion

This paper addressed scheduling algorithms for smart video capture in large camera networks. We developed approximation algorithms for scheduling using a greedy and a DP based approach. While the DP algorithm gives very good results both theoretically and experimentally, it is computationally more expensive than the greedy algorithm. A suitable algorithm can thus be chosen depending on the application scenario and the computational resources available.
References 1. Isard, M., Blake, A.: Condensation - conditional density propagation for visual tracking. International journal of computer vision 29, 5–28 (1998) 2. Mittal, A., Davis, L.: M2tracker: A multi-view approach to segmenting and tracking people in a cluttered scene. In: European Conference on Computer Vision, Copenhagen, Denmark (2002) 3. Zhao, T., Nevatia, R.: Bayesian human segmentation in crowded situation. In: IEEE Conference on Computer Vision and Pattern Recognition, IEEE Computer Society Press, Los Alamitos (2003) 4. Kaucic, R., Perera, A.A., Brooksby, G., Kaufhold, J., Hoogs, A.: A unified framework for tracking through occlusions and across sensor gaps. In: IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, IEEE Computer Society Press, Los Alamitos (2005) 5. Rahimi, A., Dunagan, B., Darrell, T.: Simultaneous calibration and tracking with a network of non-overlapping sensors. In: IEEE Conference on Computer Vision and Pattern Recognition, Washington DC, IEEE Computer Society Press, Los Alamitos (2004) 6. Lim, S.N., Davis, L.S., Mittal, A.: Constructing task visibility intervals for video surveillance. ACM Multimedia Systems (2006) 7. Abrams, S., Allen, P.K., Tarabanis, K.: Computing camera viewpoints in an active robot work cell. International Journal of Robotics Research 18 (1999) 8. Tarabanis, K., Tsai, R., Allen, P.: The mvp sensor planning system for robotic vision tasks. IEEE Transactions on Robotics and Automation 11, 72–85 (1995) 9. Grimson, W.E.L., Stauffer, C.: Adaptive background mixture models for real-time tracking. In: IEEE Conference on Computer Vision and Pattern Recognition, IEEE Computer Society Press, Los Alamitos (1999)
Constrained Optimization for Human Pose Estimation from Depth Sequences

Youding Zhu1 and Kikuo Fujimura2

1 Computer Science and Engineering, The Ohio State University
2 Honda Research Institute USA
[email protected],
[email protected]
Abstract. A new 2-step method is presented for human upper-body pose estimation from depth sequences, in which coarse human part labeling takes place first, followed by more precise joint position estimation as the second phase. In the first step, a number of constraints are extracted from notable image features such as the head and torso. The problem of pose estimation is cast as that of label assignment with these constraints. Major parts of the human upper body are labeled by this process. The second step estimates joint positions optimally based on kinematic constraints using dense correspondences between depth profile and human model parts. The proposed framework is shown to overcome some issues of existing approaches for human pose tracking using similar types of data streams. Performance comparison with motion capture data is presented to demonstrate the accuracy of our approach.
1 Introduction
Markerless human motion capture is a research field concerned with obtaining large scale human motion data, such as head, torso, limbs, from image observation of human subjects. For the past decades, markerless human motion capture has been an active research field motivated by various applications such as action recognition, surveillance, and man-machine interaction. Despite substantial advances in related aspects including tracking, pose estimation, and recognition, challenging problems still remain due to the high degrees of freedom coming from the dynamic range of poses during human activities, the diversity of visual appearance caused by clothing, visual ambiguities from self-occlusion of non-rigid 3D objects, and background clutter. There have been many attempts at solving this problem using various modalities including a single image, a sequence of images, and multiple streams (using multiple cameras) of images. In this paper, we propose a method for upper-body pose estimation from a stream of depth images. The region occupied by a human subject is easier to capture in a depth image and it usually contains a stronger cue to distinguish a human subject from other objects. By taking advantage of this characteristic, an optimization approach is presented to generate the most plausible pose from various cues provided by depth image analysis.
More concretely, coarse body part labeling is cast as a linear programming problem, where various small segments of the human body are labeled with constraints arising from body part detection and tracking. Furthermore, the 3D human pose is optimally estimated from dense correspondences. Our algorithm is structured to work on a stream of point clouds from depth sensors. It is also configured to function when a color stream is provided in addition to a depth stream. Our implementation of the algorithm runs at 5∼9 frames per second in online mode on a 3.00GHz HP desktop. The rest of the paper is organized as follows. After a brief review of related work in Section 2, our algorithm is described in Section 3. Experimental results are presented in Section 4 and Section 5 concludes the paper.
2 Related Work
A large number of pose estimation methods have been proposed in the literature. A thorough discussion of various pose estimation approaches is beyond the scope of this paper and the reader is referred to a recent survey [1] for a comprehensive comparison between various approaches. Lately, there have been approaches making use of depth sequences due to their advantages over a single color image. For example, depth measurement provides necessary information to resolve depth ambiguity, which is an issue with approaches using a single color image [2]. Grest et al. [4] adapt an Iterative Closest Point (ICP) approach to the articulated human model, where pose parameters are updated using inverse kinematics based on dense correspondences between sampled depth observations and model vertices that are found based on nearest neighbor association. Knoop et al. [3] also use ICP to update pose parameters by incorporating multiple input data from different sensors such as stereo depth, hands/face tracking from a color camera, etc. Ziegler et al. [11] use an unscented Kalman filter based on a set of correspondences between model vertices and the observed stereo point cloud. ICP is often used as a method of choice when a 3D model is to be fitted to 3-dimensional data. A common issue with ICP approaches for human pose tracking is that the model may drift away from the data or get stuck in local minima. An initial configuration is critical for ICP to converge correctly. Our framework also uses the idea of closest point correspondence as a part of the solution, but it is less susceptible to the problem of local minima due to the coarse body part identification. We also use a grid acceleration data structure so as to achieve pose estimation at a high frame rate without loss of accuracy.
3 Algorithm
Our algorithm takes a depth image sequence representing human motion and outputs pose vectors of the upper body. Depth data is usually obtained using stereo cameras, structured light sensors, or time-of-flight sensors. If another image modality, e.g., a color image sequence corresponding to the depth sequence, is
also available, our framework allows such data to be integrated to strengthen the result. The algorithm consists of two major modules, namely, (i) coarse body part labeling and (ii) model fitting (Fig. 1). In the first module, the region within the given image corresponding to the human body is partitioned into small homogeneous segments. The segments are formed such that, within each segment, the depth of each pixel is similar, and the segments are of a small and similar size. Each segment is then assigned a body part label (e.g., head, left arm) by a label assignment framework. At this point, coarse body part identification within each image is completed and passed to the second module. In the second module, a polygonal human upper body model attached to the underlying kinematic skeleton structure is fitted to the depth observation using ICP for each body part.

Fig. 1. Flow of the Algorithm. The left half of the figure illustrates Step 1, in which coarse body part labeling is determined. The right half illustrates the process of determining joint positions by fitting.

3.1 Body Constraints
A few body constraints are extracted from the depth images.
1. Head and torso constraints: The head and torso are tracked by specialized modules. The head is tracked by circle fitting with head contour points predicted from depth, while the torso is tracked by box fitting, where a box with 5 degrees of freedom (x, y, height, width, and orientation) is positioned so as to minimize the number of background pixels within the box (a brute-force sketch of this box fitting is given below).
2. Depth constraint: For certain frames, an arm shape is clearly separable in depth when it is in front of the torso.
These cues are further used by body part labeling, as described in the next section.
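A minimal sketch of the torso box fitting, assuming a fixed box size and a coarse local search around the previous frame's box; this is our own illustration, not the authors' implementation, and the search ranges, sampling step, and scoring are hypothetical choices.

import numpy as np

def background_in_box(fg_mask, cx, cy, w, h, theta, step=4):
    # Sample the rotated box on a coarse grid and count background samples inside it.
    # fg_mask: binary image, 1 = human foreground, 0 = background.
    us, vs = np.meshgrid(np.arange(-w / 2, w / 2, step),
                         np.arange(-h / 2, h / 2, step))
    c, s = np.cos(theta), np.sin(theta)
    xs = (cx + c * us - s * vs).astype(int)
    ys = (cy + s * us + c * vs).astype(int)
    inside = (xs >= 0) & (xs < fg_mask.shape[1]) & (ys >= 0) & (ys < fg_mask.shape[0])
    return np.count_nonzero(fg_mask[ys[inside], xs[inside]] == 0)

def fit_torso_box(fg_mask, prev):
    # prev = (x, y, w, h, theta) from the previous frame; width/height search omitted here.
    best, best_score = prev, np.inf
    x0, y0, w0, h0, t0 = prev
    for dx in (-8, 0, 8):
        for dy in (-8, 0, 8):
            for dt in (-0.1, 0.0, 0.1):
                score = background_in_box(fg_mask, x0 + dx, y0 + dy, w0, h0, t0 + dt)
                if score < best_score:
                    best, best_score = (x0 + dx, y0 + dy, w0, h0, t0 + dt), score
    return best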
3.2 Coarse Part Labeling
For the next step, we form an adjacency graph G where a node represents a small cluster of pixels within the human body (see Fig. 1, top left) and an edge represents an adjacency relationship between two pixel clusters. A recursive subdivision strategy partitions the image into pixel clusters based on homogeneity of depth values and spatial positions. Starting from the root node representing all pixels, k-means clustering (with k = 2) is used to subdivide clusters until each cluster is sufficiently small in size and has small depth variance. All the leaf nodes form the segment set S to be labeled. Segments s_i (i = 1, 2, ..., N) are to be classified into major body parts p_1, p_2, ..., p_M. This is a labeling problem and is formulated as the following optimization problem. A segment s_i in S is to be assigned a label by a function f. Each s_i has an estimate of its likelihood of having label f(s_i). This comes from a heuristic in pose estimation; for example, if s is near the bottom of the image, its likelihood of being part of the head is low. For this purpose, a non-negative cost function c(s, f(s)) is introduced to represent this likelihood. Further, we consider two neighboring segments s_i and s_j to be related, in the sense that we would like s_i and s_j to have the same label. Each edge e in graph G has a non-negative weight w_e indicating the strength of the relation. Moreover, certain pairs of labels are more similar than others, so we impose a distance d(\cdot) on the label set, where larger distance values indicate less similarity. The total cost of a labeling f is given by

Q(f) = \sum_{s \in S} c(s, f(s)) + \sum_{e=(s_i, s_j) \in E} w_e \, d(f(s_i), f(s_j))
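The recursive subdivision step above can be sketched as follows. This is our own minimal illustration (not the authors' code); the feature vector (x, y, depth) per pixel, the size limit, and the depth-variance threshold are assumed choices.

import numpy as np

def two_means(feats, iters=10):
    # Simple 2-means on row-wise (x, y, depth) feature vectors.
    centers = feats[np.random.choice(len(feats), 2, replace=False)].astype(float)
    for _ in range(iters):
        d = np.linalg.norm(feats[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = feats[labels == k].mean(axis=0)
    return labels

def subdivide(feats, max_size=200, max_depth_var=4.0):
    # Recursively split a pixel cluster until it is small and depth-homogeneous.
    if len(feats) <= max_size and feats[:, 2].var() <= max_depth_var:
        return [feats]                                   # leaf node: one segment of S
    labels = two_means(feats)
    parts = [feats[labels == k] for k in range(2)]
    if any(len(p) == 0 for p in parts):
        return [feats]                                   # degenerate split; keep as one segment
    return subdivide(parts[0], max_size, max_depth_var) + \
           subdivide(parts[1], max_size, max_depth_var)

# segments = subdivide(body_pixels)   # body_pixels: (n, 3) array over the human region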
In our problem, the following table (Fig. 2) is to be completed, where the binary variable A_ij indicates whether segment s_i belongs to body part p_j. For c(i, j), the Euclidean distance from the segment to the model body part (from the previous frame) is used. More specifically, the Euclidean distance from a segment s_i to a model body part p_j is estimated as the sum of squared distances from a number of sampled pixels to their nearest model vertices in that body part.

              Head (p1)  Torso (p2)  LeftArm (p3)  RightArm (p4)  ...  (pM)
segment s1    A11        A12         A13           A14            ...  A1M
segment s2    A21        A22         A23           A24            ...  A2M
...
segment sN    AN1        AN2         AN3           AN4            ...  ANM

Label Assignment Problem: A_ij = 1 if s_i belongs to p_j, A_ij = 0 otherwise; \sum_{j=1}^{M} A_ij = 1; nearby pixels have similar labels.

Fig. 2. Labeling problem
Since each segment belongs to only one body part, \sum_{j=1}^{M} A_{ij} = 1 holds. In addition to this constraint, a number of related constraints are considered.
1. Neighboring segments should have a similar label.
2. Head and torso constraint.
3. Depth slicing constraint.
4. Color constraint.
It turns out that this is an instance of the Uniform Labeling Problem, which can be expressed as the following integer program by introducing an auxiliary variable Z_e for each edge e to express the distance between the labels, and Z_{ej} to express the absolute value |A_{pj} - A_{qj}|. Following Kleinberg and Tardos [6], we can rewrite our optimization problem as follows:

\min \Big\{ \sum_{i=1}^{N} \sum_{j=1}^{M} c(i,j) A_{ij} + \sum_{e \in E} w_e Z_e \Big\}    (1)

subject to

\sum_{j=1}^{M} A_{ij} = 1, \quad i = 1, 2, \ldots, N    (2)

Z_e = \frac{1}{2} \sum_{j=1}^{M} Z_{ej}, \quad e \in E    (3)

Z_{ej} \ge A_{pj} - A_{qj}, \quad e = (p, q); \; j = 1, 2, \ldots, M    (4)

Z_{ej} \ge A_{qj} - A_{pj}, \quad e = (p, q); \; j = 1, 2, \ldots, M    (5)

A_{ij} \in \{0, 1\}, \quad i = 1, 2, \ldots, N; \; j = 1, 2, \ldots, M    (6)
Here, the terms involving Z_e and Z_{ej} come from constraint (1). The weight w_e is given by w_e = e^{-\alpha d_e}, where d_e is the depth difference between two adjacent segments and \alpha is selected based on experiments. In our study, we let M = 4 (head, torso, left arm, right arm). For constraint (2), if segment s_i is outside the tracked head circle, an additional constraint A_{i,1} = 0 is added; if segment s_i is outside the tracked torso box, the constraint A_{i,2} = 0 is added. To apply constraint (3) to the detected arm segments, the constraint A_{i,3} + A_{i,4} = 1 is added (as it is not clear whether the segment belongs to the right or the left arm). Finally, if there are tracked hand positions based on skin color information, we can add the constraint A_{i,3} + A_{i,4} = 1 for the corresponding segments. In general, solving an integer program optimally is NP-hard. However, we can relax the above problem to a linear program with A_{ij} \ge 0, which can be solved efficiently using a publicly available library, e.g., [7]. Kleinberg and Tardos [6] describe a method for rounding fractional solutions so that the expected objective function Q(f) is within a factor of 2 of the optimal solution. In our experiments we find that the relaxed linear program always returns an integer solution (see an observation due to Anguelov et al. [10]). Figure 1 shows one example of this body part labeling result.
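As an illustration, the relaxed linear program of Eqs. (1)-(6) can be assembled as below. This is a sketch of ours, not the paper's implementation: the variable layout and the use of SciPy's linprog (instead of the library cited in [7]) are assumptions. Hard constraints such as A_{i,1} = 0 are handled here through variable bounds; the arm constraints A_{i,3} + A_{i,4} = 1 could be added analogously as extra equality rows.

import numpy as np
from scipy.optimize import linprog

def label_segments(cost, edges, fixed_zero=()):
    # cost: (N, M) matrix of c(i, j); edges: list of (p, q, w_e) over adjacent segments;
    # fixed_zero: optional (i, j) pairs forced to A_ij = 0 (e.g. "outside the head circle").
    N, M = cost.shape
    E = len(edges)
    nA, nZ = N * M, E * M                       # A_ij variables first, then Z_ej variables
    obj = np.concatenate([cost.ravel(),
                          np.repeat([0.5 * w for (_, _, w) in edges], M)])  # w_e Z_e = (w_e/2) sum_j Z_ej
    A_eq = np.zeros((N, nA + nZ)); b_eq = np.ones(N)
    for i in range(N):
        A_eq[i, i * M:(i + 1) * M] = 1.0        # sum_j A_ij = 1
    A_ub = np.zeros((2 * E * M, nA + nZ)); b_ub = np.zeros(2 * E * M)
    row = 0
    for e, (p, q, _) in enumerate(edges):       # Z_ej >= |A_pj - A_qj|
        for j in range(M):
            z = nA + e * M + j
            for sgn in (+1, -1):
                A_ub[row, p * M + j] = sgn
                A_ub[row, q * M + j] = -sgn
                A_ub[row, z] = -1.0
                row += 1
    bounds = [(0.0, 1.0)] * (nA + nZ)
    for (i, j) in fixed_zero:
        bounds[i * M + j] = (0.0, 0.0)
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[:nA].reshape(N, M)             # in practice this comes back integral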
3.3 Model Fitting
The human body model is represented as a hierarchy of joint-link models with a skin mesh attached to it, as in Lewis et al. [8]. For the upper body model used in this paper, the skin mesh and hierarchical skeleton structure are illustrated in Fig. 1. Given a set of 3D data points P = {p_1, p_2, ..., p_m} as targets and their corresponding model vertices V = {v_1, v_2, ..., v_m}, the model pose vector q is estimated as

\hat{q} = \arg\min_q \|P - V(q)\|^2    (7)
where q = (\theta_0, \ldots, \theta_n)^T is the pose parameter vector and the v_i are visible vertices of the polygonal model. To solve this minimization problem efficiently and robustly, we use a variant of inverse kinematics known as damped least squares, inspired by the well-known ICP algorithm [9]. The formulation (see Fig. 7) minimizes \|J\Delta q - \Delta E\|^2 + \lambda \|\Delta q\|^2, where J is the Jacobian. Inverse kinematics with damped least squares [5] has the benefit of avoiding singularities, making the process numerically stable. We use \lambda = 0.1 based on our experiments. For articulated body pose estimation, the algorithm depends on the accuracy of the correspondence pairs of data points and model vertices. Most recent works apply a nearest neighbor search between two point clouds: one containing all the observed 3D points from the depth or other sensor, the other containing all the model vertices. Since the iteration may be attracted to local minima, we apply the aforementioned body part labeling to limit the nearest neighbor search to a subset of observed 3D points and a subset of visible model vertices for each body part. This not only speeds up the nearest neighbor search but, more importantly, achieves robust pose estimation even for long sequences containing large motions between consecutive frames. In our implementation, the OpenGL depth buffer is used to decide model vertex visibility. For faster computation, we use a grid-based spatial index to speed up the nearest neighbor search between point clouds. We partition the working volume into a 3D grid; because only scene profiles are used, we partition the xy plane of the working volume. Depth points and visible model vertices are indexed into the corresponding grid cells. To perform the nearest neighbor search for a model vertex, we first find the cell in which it is located and find the nearest depth point in this cell. Then, we recursively propagate the search to neighboring cells until the minimal distance from the cell corners to the model vertex is greater than the current minimal distance from the model vertex to the depth points. As illustrated in Fig. 8, this yields a speed-up by a factor of 6. When capturing a pose sequence, the subject is initially requested to take an open-arm posture (the so-called "T-pose"), in which the arms and torso do not
overlap. At this initialization stage, body dimensions are measured and further used to scale the kinematic skeleton and polygonal human body model.
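The grid-accelerated nearest-neighbor search described in Section 3.3 might look as follows. This is our own simplified sketch under the stated idea (a 2D hash over the xy plane with ring expansion); the cell size and the safety guard are hypothetical parameters.

import numpy as np
from collections import defaultdict

def build_grid(points, cell=0.05):
    # Index depth points (an (n, 3) numpy array) by their xy cell.
    grid = defaultdict(list)
    for idx, p in enumerate(points):
        grid[(int(np.floor(p[0] / cell)), int(np.floor(p[1] / cell)))].append(idx)
    return grid

def nearest_depth_point(vertex, points, grid, cell=0.05):
    cx, cy = int(np.floor(vertex[0] / cell)), int(np.floor(vertex[1] / cell))
    best, best_d, ring = -1, np.inf, 0
    while True:
        # Visit cells on the ring at Chebyshev distance `ring` from the start cell.
        for dx in range(-ring, ring + 1):
            for dy in range(-ring, ring + 1):
                if max(abs(dx), abs(dy)) != ring:
                    continue
                for idx in grid.get((cx + dx, cy + dy), ()):
                    d = np.linalg.norm(points[idx] - vertex)
                    if d < best_d:
                        best, best_d = idx, d
        # Unvisited cells lie at xy distance >= ring * cell, so the current best
        # cannot be beaten once best_d falls below that bound.
        if best >= 0 and best_d <= ring * cell:
            break
        ring += 1
        if ring > 1000:          # safety guard for empty or sparse grids
            break
    return best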
4 Experimental Results
The proposed pose estimation algorithm has been tested on many sequences collected from a few human subjects. Depth and color sequences were obtained synchronously using a calibrated hybrid camera pair (a SwissRanger SR3000 for depth and a Sony DFW-V500 for color). Furthermore, a motion capture system by PhaseSpace Inc. with 8 camera units was run synchronously with the hybrid camera system, recording the coordinates of eight major joints of the subject for ground-truth reference. The subject wears markers for motion capture purposes only; these markers are not used by the main algorithm. Test motion sequences include a complete semaphore flag signaling motion (A to Z), simple exercise movements, and TaiChi movements. The total number of frames collected is 4800 and each test sequence is about 400 frames long. All sequences were tracked successfully at a frame rate of 5∼9 Hz on a 3.00 GHz HP desktop. Figure 3 contains tracked frames taken from full-length sequences, where the subject performs (a) a TaiChi motion and (b) an exercise motion, respectively. Pose estimation precision has also been compared against joint position data captured by the marker-based motion capture system. Figure 4 lists the errors in various joint positions for the TaiChi motion sequence. As seen there, the overall tracking error is approximately 5 cm (with the subject standing 1.5 m to 2 m from the camera). Similar tracking results have been obtained for the other sequences, such as out-of-plane rotation (Fig. 6), which is usually difficult to capture with single-color-camera-based pose estimation methods. To compare tracking stability, we tested an ICP-based approach on exactly the same sequences. ICP and our method have similar performance (in terms of error from the ground truth) when tracking runs successfully. A significant difference, however, is that for some frames (which our method processes successfully), the ICP-based tracking fails and never recovers (as shown in Fig. 5). Our method works in all cases where the ICP method works. The ICP-based method is also slower, because it needs more iterations to converge. This illustrates the advantage of the part labeling step in our framework. At this point, let us contrast our approach with a few other approaches using depth sequences. Ziegler et al. [11] use depth sequences obtained by four stereo cameras. Point correspondences are based on spatial proximity, which may result in wrong correspondences when body parts are close to each other; our method is less susceptible to this problem due to the use of inverse kinematics. Demirdjian et al. [13] use efficient example-based matching to improve the tracking of a set of likelihood modes. Large errors can still occur when the test example is not close to the training examples. Our coarse labeling step has a similar function and finds the likelihood mode with constraints from bottom-up observations. Grest et al. [4] introduce an ICP method for articulated body pose estimation,
Fig. 3. Snapshots of the algorithm output (a) TaiChi sequence, (b) Simple exercise sequence
Model Joints      ΔX (μ, σ)    ΔY (μ, σ)    ΔZ (μ, σ)    [errors in millimeters]
Right Hand        (-15, 49)    (-39, 58)    (23, 44)
Right Elbow       (-23, 34)    (-70, 42)    (-48, 59)
Right Shoulder    (21, 57)     (-43, 19)    (1, 25)
Waist             (-24, 26)    (-12, 15)    (-19, 14)
Left Hand         (16, 61)     (-6, 86)     (44, 45)
Left Elbow        (30, 35)     (-74, 39)    (71, 66)
Left Shoulder     (-23, 53)    (-36, 30)    (27, 30)
Head              (-15, 26)    (-18, 15)    (-22, 15)
Overall           (-4, 49)     (-37, 50)    (22, 52)

Fig. 4. Comparison table
Fig. 5. Stability comparison between our method and standard ICP (TaiChi sequence, frame 178: color image, our method, ICP only)
Fig. 6. Examples of out-of-plane rotation up to 50 degrees
while they do not address the robustness of ICP-based pose estimation. Knoop et al. [3] utilize skin color segmentation (hence face and hand feature trackers) to improve ICP-based pose estimation; it is not clear how this handles a temporarily invisible face or hands. Some other approaches use multiple sensors to obtain more surface data: Cheung et al. [12] use a visual hull, while Anguelov et al. [10] use 3D range scan data to reconstruct human skeletal structures. Accurate pose estimation might be obtained with these methods since body parts are visible in multiple views.
5 Concluding Remarks
We have presented a method for estimating human pose from depth sequences. Our method consists of two major components that cooperate to estimate and track human motion. The first module is body component identification, which is solved by reducing it to linear programming. If an application requires only rough body labeling in the image, the first module alone provides such a solution. If more accurate positions of the major joints are required, the second module is used: model fitting by inverse kinematics based on dense correspondences between the image data and the human kinematic model. The result of the second component is, in turn, used to initialize the first component for the next frame. The algorithm tracks human upper-body movements over several minutes of pose sequences at a speed of a few Hz on a laptop PC (up to 10 Hz when a 3 GHz desktop is used).
model fitting(P: 3D points, V: model vertices)
1. Form

\Delta e_i = \begin{bmatrix} p_i^x - v_i^x \\ p_i^y - v_i^y \\ p_i^z - v_i^z \end{bmatrix}, \quad \Delta E = \begin{bmatrix} \Delta e_1 \\ \Delta e_2 \\ \vdots \\ \Delta e_m \end{bmatrix}

2. Solve J\Delta q = \Delta E by damped least squares, where J is the Jacobian of the model vertices:

\Delta q = (J^T J + \lambda I)^{-1} J^T \Delta E, \quad q = q + \Delta q

3. Repeat until \Delta q is sufficiently small

Fig. 7. Model fitting procedure
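One damped-least-squares update from Fig. 7 can be sketched in a few lines. This is our own illustration; jacobian() is a placeholder for the reader's articulated-model Jacobian routine, and lam = 0.1 follows the text.

import numpy as np

def dls_step(q, target_pts, model_pts, jacobian, lam=0.1):
    # q: current pose vector; target_pts/model_pts: (m, 3) corresponding points;
    # jacobian(q): (3m, len(q)) Jacobian of the stacked model-vertex positions.
    dE = (target_pts - model_pts).reshape(-1)            # stacked residuals, as in Fig. 7
    J = jacobian(q)
    JtJ = J.T @ J + lam * np.eye(J.shape[1])             # damping avoids singularities
    dq = np.linalg.solve(JtJ, J.T @ dE)
    return q + dq, np.linalg.norm(dq)

# Outer loop: re-associate correspondences per labeled body part, call dls_step,
# and repeat until the returned step norm is sufficiently small.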
Fig. 8. Performance comparison with and without grid acceleration (on a 2.13GHz IBM Laptop)
We have also made a comparative study of markerless pose tracking against a commercial marker-based tracking system and shown that our joint positions are accurate to several centimeters. The algorithm has a number of possible extensions, such as handling severe occlusions caused by environmental objects. We leave these for future work.
References
1. Moeslund, T.B., Hilton, A., Kruger, V.: A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding 104(2-3), 90–126 (2005)
2. Sminchisescu, C., Triggs, B.: Kinematic jump processes for monocular 3D human tracking. In: Int. Conf. on Computer Vision and Pattern Recognition, pp. 18–20 (2003) 3. Knoop, S., Vacek, S., Dillmann, R.: Sensor fusion for 3D human body tracking with an articulated 3D body model. In: Int. Conf. on Robotics and Automation, pp. 1686–1691 (2006) 4. Grest, D., et al.: Nonlinear body pose estimation from depth images. In: Kropatsch, W.G., Sablatnig, R., Hanbury, A. (eds.) DAGM 2005. LNCS, vol. 3663, Springer, Heidelberg (2005) 5. Buss, S., Kim, J.: Selectively damped least squares for inverse kinematics. Journal of Graphics Tools 10(3), 37–49 (2005) 6. Kleinberg, J., Tardos, E.: Approximation algorithms for classification problems with pairwise relationships: Metric partitioning and Markov random fields. Journal of the ACM 49(5), 616–639 (2002) 7. LP solve reference guide, http://lpsolve.sourceforge.net/5/5 8. Lewis, J.P., Cordner, M., Fong, N.: Pose space deformations: A unified approach to shape interpolation and skeleton-driven deformation. In: SIGGRAPH, pp. 165–172 (2000) 9. Besl, P., McKay, N.: A method for registration of 3-D shapes. IEEE Trans. on Pattern Analysis and Machine Intelligence 14(2), 239–256 (1992) 10. Anguelov, D., Koller, D., Pang, H., Srinivasan, P., Thrun, S.: Recovering articulated object models from 3D range data. In: Proc. of Uncertainty in Artificial Intelligence Conference, pp. 18–26 (2004) 11. Ziegler, J., Nickel, K., Stiefelhagen, R.: Tracking of the articulated upper body on multi-view stereo image sequences. In: Int. Conf. on Computer Vision and Pattern Recognition, pp. 774–781 (2006) 12. Cheung, K.M., Baker, S., Kanade, T.: Shape-from-silhouette of articulated objects and its use for human body kinematics estimation and motion capture. In: Int. Conf. on Computer Vision and Pattern Recognition, pp. 77–84 (2003) 13. Demirdjian, D., Taycher, L., Shakhnarovich, G., Grauman, K., Darrell, T.: Avoiding the Streetlight Effect: tracking by exploring likelihood modes. In: Int. Conf. on Computer Vision, pp. 357–364 (2005)
Generative Estimation of 3D Human Pose Using Shape Contexts Matching
Xu Zhao and Yuncai Liu
Institute of Image Processing & Pattern Recognition, Shanghai Jiao Tong University, 200240, Shanghai, China
Abstract. We present a method for 3D pose estimation of human motion in a generative framework. For generality of the application scenario, the observation information we use comes from monocular silhouettes. We distill prior information on human motion by performing conventional PCA on a single motion capture data sequence. In doing so, the aims of both reducing dimensionality and extracting the prior knowledge of human motion are achieved simultaneously. We adopt the shape contexts descriptor to construct the matching function, by which the validity and robustness of the matching between image features and synthesized model features can be ensured. To explore the solution space efficiently, we design the Annealed Genetic Algorithm (AGA) and the Hierarchical Annealed Genetic Algorithm (HAGA), which search for optimal solutions effectively by utilizing the characteristics of the state space. Results of pose estimation on different motion sequences demonstrate that the novel generative method can achieve viewpoint-invariant 3D pose estimation.
1 Introduction

Capturing 3D human motion from visual cues has received increasing attention in recent years, driven by a wide spectrum of potential applications such as behavior understanding, content-based image retrieval and visual surveillance. Although it has been attacked by many researchers, this challenging problem is still long standing because of the difficulties caused mainly by the complicated nature of 3D human motion and the incomplete information 2D images provide for 3D human motion analysis. In the context of graphical models, the state-of-the-art approaches to 3D human motion estimation can be classified as generative and discriminative [1]. Generative methods [2,3,4,5,6,7] follow the bottom-up Bayes' rule and model the state posterior density using an observation likelihood or cost function. Given an image observation and a prior state distribution, the posterior likelihood is usually evaluated using Bayes' rule. This approach has a sound framework of probabilistic support and can achieve significant success in recovering complex unknown motions by utilizing well-defined state constraints. However, it is generally computationally expensive because one has to perform a complex search over the state space in order to locate the peaks of the observation likelihood. Moreover, the prediction model and initialization are also bottlenecks of the approach, especially in tracking situations.
In this paper, we propose a novel generative approach in the framework of evolutionary computation, by which we try to widen the bottlenecks mentioned above with an effective search strategy embedded in the extracted state subspace. Considering the generality of the application scenario, the observation information we use comes from an uncalibrated monocular camera. This makes the state estimation a severely ill-conditioned problem, and we have to confront the curse of dimensionality, because there are more than forty degrees of freedom (DOFs) of full body joints in our 3D human model. Therefore, the search for optimal solutions should be performed in a compact state space by search algorithms suited to the characteristics of this space. In doing so, infeasible solutions, namely absurd poses, can be avoided naturally. To this end, we reduce the dimensionality of the state space by principal component analysis (PCA) of motion capture data. The motion capture data embody prior knowledge about human motion; by PCA, the aims of both reducing dimensionality and extracting this prior knowledge are achieved simultaneously. From a theoretical view, PCA is optimal in the sense of reconstruction because it allows minimal information loss in the course of state transformation from the subspace back to the original state space. Different from previous works [8,9], we perform the lengthways PCA, by which the subspace can be extracted from only a single sequence of motion capture data. To explore the solution space efficiently, we design the Annealed Genetic Algorithm (AGA), combining the ideas of simulated annealing and genetic algorithms [10]. As a promoted version of AGA, the Hierarchical Annealed Genetic Algorithm (HAGA) searches for optimal solutions more effectively than AGA by utilizing the characteristics of the state space. According to the theory of PCA, in our problem, the first principal component captures the most important part of human motion and the remaining principal components capture the detailed parts of the motion. In the monocular uncalibrated camera situation, the fitness function (observation likelihood function) is very sensitive to changes of the global motion. HAGA performs a hierarchical search automatically in the extracted state subspace by first localizing the state variables, such as the global motion and the coordinate of the first principal component, that dominate the topology of the state space. We adopt the shape contexts descriptor [11] to construct the fitness function, by which valid and robust matching between image features and synthesized model features can be achieved.

1.1 Related Work

There has been considerable previous work on capturing human motion from image information. The earlier work on this research topic has been reviewed comprehensively in the survey papers [12,13,14]. Generally speaking, to recover a 3D human pose configuration, more information is required than an image can provide, especially in the monocular situation. Therefore, much work focuses on using prior knowledge and experiential data in order to alleviate the ill-conditioning of this problem. An explicit body model embodies the most important prior knowledge about pose configuration and is thus widely used in human motion analysis. Another class of important prior knowledge comes from experiential data such as motion capture data acquired by a commercial
motion capture system and other hand-labeled data. The combination of both kinds of prior information can produce favorable techniques for solving this problem. Agarwal et al. [5] distill prior information (the motion model) of human motion from hand-labeled training sequences using PCA and clustering on the basis of a simple 2D human body model. This method presents a good autoregressive tracking scheme but gives no description of pose initialization. In the framework of generative approaches, the prior information is usually employed to constrain or reduce the search space. Urtasun et al. [15,9] construct a differentiable objective function based on the PCA of motion capture data and then find the poses of all frames simultaneously by optimizing a function in a low-dimensional space. Sidenbladh et al. [3,8] present similar methods in the framework of stochastic optimization. For a specific activity, such methods need many example sequences of motion capture to perform PCA, and all of these sequences must have the same length and phase, achieved by interpolation and alignment. Ning et al. [6] learn a motion model from semi-automatically acquired training examples which are aligned with a correlation function, and then some motion constraints are introduced to cut the search space. Unlike these methods, we extract the state subspace from only one example sequence of a specific activity using the lengthways PCA and thus require no interpolation or alignment. In addition, useful motion constraints are included naturally in the low-dimensional subspace. In recent years, particle filter [16] (also known as the condensation algorithm) based optimization methods have been used widely for recovering human pose in a generative framework [2,3,4,5,6,7]. However, as a stochastic search algorithm, the particle filter is essentially similar to an evolutionary algorithm (EA) if it has no explicit temporal dynamic model. The EA can provide more flexible evolutionary mechanisms, such as the crossover operator. This is an important motivation for us to solve this problem in the framework of EA. A noticeable example showing the relationship between particle filters and EA is the work of Deutscher et al. [17]: by introducing the crossover operator, the annealed particle filter proposed in their earlier work [2] gets a remarkable improvement. Compared with previous generative methods, extracting the common characteristics of a specific type of motion from prior information and representing them in a compact form is of particular interest to us. At the same time, we preserve the motion individuality of the input sequences with an effective evolutionary search strategy suited to the characteristics of the state subspace.
2 State Space Analysis

The potential special interests that motivate us to analyze the characteristics and structure of the state space mainly involve modeling human activities effectively in the extracted state subspace and eliminating the curse of dimensionality.

2.1 Pose Representation

We use an explicit model that represents the articulated structure of the human body. Our fundamental 3D skeleton model (see Figure 1.a) is composed of 34 articulated rigid sticks. The pose is described by a 44-dimensional vector x = (x_g, x_j), where the 3D vector
Fig. 1. (a) The 3D human skeleton model. (b) The 3D human convolution surface model. (c) The 2D convolution curves.
x_g represents the global rotation of the human body and the 41D vector x_j represents the joint angles. Figure 1.b shows the 3D convolution surface [18] human model, which is an isosurface in a scalar field defined by convolving the 3D body skeleton with a kernel function. Similarly, the 2D convolution curves of the human body, as shown in Figure 1.c, are the isocurves generated by convolving the 2D projected skeleton. As the synthetic model features, the curves are used to match the edges of image silhouettes when constructing the likelihood function.

2.2 Subspace Extraction

All of the 3D poses are distributed in the state space X. The set of poses belonging to a specific activity, such as walking, running, or handshaking, generally crowds into a subspace of X. We extract the subspace X_s from motion capture data obtained from the CMU database (http://mocap.cs.cmu.edu). Assuming {x_t | x_t \in X} is a given data sequence of motion capture corresponding to one motion type, where t is the time tag, the subspace X_s is extracted by PCA as follows:
1. Center the state vectors and assemble them into a matrix (by rows): X = [(x_1 - c); (x_2 - c); ...; (x_T - c)], where c is the mean vector.
2. Perform a singular value decomposition of the matrix to project out the dominant directions: X = U D V^T.
3. Project the state vectors into the dominant subspace: each state vector is represented as a reduced vector x_s = (x - c) U_m, where U_m is the matrix consisting of the first m columns of U, by which the m-D subspace X_s is spanned.
Therefore, the original state vector x can be reconstructed by

x = c + x_s U_m^T    (1)
The dimensionality m of the subspace X_s is determined according to the cumulative sum of the principal component variance percentages. In our experience, this cumulative percentage should be set no smaller than 0.95; accordingly, the value of m is generally no greater than 6.
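A minimal numpy sketch of the subspace extraction and the reconstruction in Eq. (1); this is our own illustration, and with samples stacked as rows the dominant directions are taken here from the right singular vectors, which play the role of U_m above. The 0.95 cumulative-variance threshold follows the text.

import numpy as np

def extract_subspace(X, var_ratio=0.95):
    # X: (T, D) matrix of pose vectors, one mocap frame per row.
    c = X.mean(axis=0)
    Xc = X - c
    U, D, Vt = np.linalg.svd(Xc, full_matrices=False)
    cum = np.cumsum(D ** 2) / np.sum(D ** 2)      # cumulative variance percentage
    m = int(np.searchsorted(cum, var_ratio)) + 1
    Um = Vt[:m].T                                 # (D, m) basis spanning the subspace X_s
    return c, Um

def project(x, c, Um):
    return (x - c) @ Um                           # x_s = (x - c) U_m

def reconstruct(x_s, c, Um):
    return c + x_s @ Um.T                         # Eq. (1): x = c + x_s U_m^T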
3 Fitness Function

In a generative framework, pose capturing can be formulated as Bayesian posterior inference:

p(x_s | y) \propto p(x_s) \, p(y | x_s)    (2)

The function p(y | x_s) represents the likelihood of observing image y conditioned on a pose candidate x_s. It is used to evaluate every pose candidate generated from p(x_s). In the context of an evolutionary algorithm, the likelihood function is just the fitness function. We propose a fitness function based on shape contexts matching [11]. We choose the image silhouette of the subject as the observed image feature, extracted using statistical background subtraction. The shape context descriptor is used to describe the shape of the image silhouette and of the convolution curves generated by the pose candidate (see Figure 1). Figure 2 illustrates the shape contexts [11] (histograms of local edge pixels in log-polar bins) of the human shape. Our shape contexts contain 12 angular and five radial bins, giving rise to 60-dimensional histograms, as shown in Figure 2.b. In the matching process, regularly spaced points on the edge of the silhouette are sampled as the query shape. The point set sampled from the convolution curves is viewed as the candidate shape. Before matching, the image shape and the candidate shape are normalized to the same scale. We represent the query shape and the candidate shape as S_query(y) and S_m(x_s), respectively. The matching cost function is then formulated as

F(S_{query}(y), S_m(x_s)) = \sum_{j=1}^{r} \chi^2(SC_{query}^j(y), SC_m^j(x_s))    (3)

where SC is the shape context, r is the number of sample points on the edge of the image silhouette, and SC_m^j(x_s) = \arg\min_u \chi^2(SC_{query}^j(y), SC_m^u(x_s)). Here, we use the \chi^2 distance as the similarity measure. In AGA, the optimization mechanism is designed to search for the maximal value of the objective function. Therefore, according to Eq. (3), the fitness function can be reformulated as

\varphi(S_{query}(y), S_m(x_s)) = C \exp(-F(S_{query}(y), S_m(x_s)))    (4)

where C is a constant for adjusting the value range of the fitness function.
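The matching cost of Eq. (3) and the fitness of Eq. (4) can be sketched as follows. This is our own illustration; the 60-bin shape-context histograms are assumed to be precomputed as row-wise arrays, and the constant C = 100 is a hypothetical choice.

import numpy as np

def chi2(h1, h2, eps=1e-9):
    # Chi-square distance between two shape-context histograms.
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def matching_cost(sc_query, sc_model):
    # sc_query: (r, 60) histograms of silhouette edge samples;
    # sc_model: (k, 60) histograms of convolution-curve samples.
    total = 0.0
    for hq in sc_query:
        # best-matching model descriptor for this query point (the argmin in Eq. (3))
        total += min(chi2(hq, hm) for hm in sc_model)
    return total

def fitness(sc_query, sc_model, C=100.0):
    return C * np.exp(-matching_cost(sc_query, sc_model))   # Eq. (4)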
4 Pose Estimation Using HAGA

In this section, we describe the key algorithms of the generative framework, namely AGA and HAGA, and their adaptation for pose estimation from monocular silhouettes.

4.1 Hierarchical Annealed Genetic Algorithm

Combining simulated annealing (SA) and genetic algorithms (GA), we design the annealed genetic algorithm, which is in fact a hybrid (1+1) evolutionary strategy. In our algorithm, local optima are avoided by introducing several genetic evolutionary principles. We represent a chromosome by a state vector x = [x_1, x_2, ..., x_n],
Fig. 2. (a) The shape contexts computed on edge points of the image silhouette (right) and sampled points of the convolution curves (left). (b) Example shape contexts for the reference samples shown in (a): image silhouette (bottom) and convolution curves (top).
where the genes x_i, i = 1, 2, ..., n, are random numbers uniformly distributed in the interval (0, 1). We use real encodings. The search for optimal solutions with AGA proceeds as follows:

Parameter initialization: set values for the evolution control parameters: S_t – stop criterion; N_t – termination condition; E_t – number of trials for reaching an equilibrium state;
for st = 1 to S_t do:
  NonImproveNum = 0;
  Generate the genes of x uniformly at random in the interval (0, 1);
  Evaluate the fitness function \varphi(x) by mapping x onto the problem domain;
  while (NonImproveNum < N_t) do
    for et = 1 to E_t do:
      Evolve x by the genetic operators (see Table 1);
      Evaluate \varphi(x);
    end for
    If the value of the fitness function is improved, NonImproveNum = 0; else NonImproveNum = NonImproveNum + 1;
  end while
  Record the optimal x;
end for

We design five genetic operators, which are executed in order in AGA. The operators are introduced by evolving an example chromosome x = [x_1, x_2, x_3, x_4, x_5, x_6]; the new chromosome generated by an operator is denoted x'. Assuming the randomly generated positions are number 2 and number 6 (or number 3 for the point mutation operator), the five operators are illustrated in Table 1 (new gene values are marked with primes). On the basis of AGA, we develop HAGA by utilizing the characteristics of the state space X. In HAGA, the state space is decomposed automatically by computing the variances of the state components generated in each annealing run. According to these variances, the state space is partitioned by localizing the important components down to a small area of their range.
Table 1. The genetic operators in AGA

Operator            Example
Exchange            x = [x1, x2, x3, x4, x5, x6]  ->  x' = [x1, x6, x3, x4, x5, x2]
Segment reversion   x = [x1, x2, x3, x4, x5, x6]  ->  x' = [x1, x6, x5, x4, x3, x2]
Segment shift       x = [x1, x2, x3, x4, x5, x6]  ->  x' = [x1, x6, x2, x3, x4, x5]
Point mutation      x = [x1, x2, x3, x4, x5, x6]  ->  x' = [x1, x2, x3', x4, x5, x6]
Segment mutation    x = [x1, x2, x3, x4, x5, x6]  ->  x' = [x1, x2, x3', x4', x5', x6']
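The five operators of Table 1 can be sketched as follows. This is a minimal illustration of ours with positions i < j playing the role of the randomly chosen positions 2 and 6; the exact segment boundaries for segment mutation follow our reading of Table 1 and are therefore an assumption.

import random

def exchange(x, i, j):
    y = x[:]; y[i], y[j] = y[j], y[i]; return y            # swap genes at i and j

def segment_reversion(x, i, j):
    return x[:i] + x[i:j + 1][::-1] + x[j + 1:]            # reverse genes i..j

def segment_shift(x, i, j):
    y = x[:]; g = y.pop(j); y.insert(i, g); return y       # move gene j to position i

def point_mutation(x, k):
    y = x[:]; y[k] = random.random(); return y             # resample one gene in (0, 1)

def segment_mutation(x, i, j):
    y = x[:]
    for k in range(i + 1, j + 1):
        y[k] = random.random()                             # resample genes after i, up to j
    return y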
This partition is explainable in theory: the important state components dominate the topology of the state space, and small changes in their values produce a great effect, whereas the values of the other state components have little influence on whether they are selected or not. This is illustrated in Figure 3. Focusing on one annealing run of state evolution (st -> st + 1), the detailed HAGA is as follows.

1. Generate an initial chromosome x = [x_1, x_2, ..., x_n] at random, where the x_i, i = 1, 2, ..., n, are random numbers uniformly distributed in the interval (0, 1). Map it linearly into the variance domain:

x \mapsto x_t \in (\min x_t, \max x_t)    (5)

In the first round of state evolution, (\min x_1, \max x_1) = (0, 1). Evaluate the fitness function \varphi(x).
2. Evolve the chromosome according to the state evolution mechanism of AGA. Before evaluating the fitness function, every new chromosome is mapped onto the variance domain as formulated in Eq. (5).
3. Store the N best states (chromosomes) and compute the covariance matrix

V_{t+1} = \frac{1}{N} \sum_{i=1}^{N} (x_i^{t+1} - x_c^{t+1})^T (x_i^{t+1} - x_c^{t+1})    (6)

where x_c^{t+1} is the mean vector, and the covariance matrix V_{t+1} is a diagonal matrix under the assumption that the state components are independent of each other. The variance domain can then be formulated as

(\min x_{t+1}, \max x_{t+1}) = (x_c^{t+1} - c^{t+1} V_{t+1}, \; x_c^{t+1} + c^{t+1} V_{t+1})    (7)

where c^{t+1} = [c^{t+1}, c^{t+1}, ..., c^{t+1}] is used to adjust the variance domain and c^{t+1} is a positive constant.
4. The variance domain (\min x_{t+1}, \max x_{t+1}) is used to cut down the state space in the next state evolution.

4.2 Experiments

We demonstrate our method by extracting subspaces for different classes of human motion and using them to estimate 3D body pose in unseen video sequences.
Fig. 3. Variance reduction contrast between principal state components and other state components. Graph (a) shows the variances of state set in which the chromosomes have not been evolved, displaying almost equal variances for each components. Graph (b) shows the variances of state set which have come through one round of state evolution, noticing that the variances of first four principal state components have been greatly reduced whereas the variances of other components have been reduced with a slighter extent. In graph (c), the variances of the principal components have been reduced to very small scopes indicating advanced localization after coming through two rounds of state evolution.
Walking motion: straight walk and turning walk. To extract the motion subspace of walking, a data set consisting of motion capture data of a single subject was used. The total number of frames is 316. It was found that different subjects and different frame counts produce generally identical subspaces. To keep the ratio of information loss lower than 0.05, the dimensionality of the subspace was chosen to be 5. For the sequence of one subject walking in a straight line, the parameters of HAGA are set as S_t = 2, N_t = 2, E_t = 5. The results are shown in Figure 4. It can be seen that the estimator is successful in determining the correct global motion as well as the 3D pose of the subject. The occlusion problem is tackled by searching for the optimal pose in the extracted subspace, because the prior knowledge about walking motion is contained in
Fig. 4. Results of recovering the poses of a subject walking straight (the images are part of a sequence from www.nada.kth.se/~hedvig/data.html). The second pose demonstrates the left-right confusion in the silhouette.
Fig. 5. Results of recovering the poses of a subject performing a turning walking motion
this space. The left-right confusion is mostly disambiguated, however, in few frames, the left-right confusion conduced by silhouette ambiguity still exist. This can be seen from Figure. 4. We test the generalization capability of our method in a turning walk sequence. In this sequence [19], a subject is performing continuing turning walking motion around a circle therefore the global motion is changed in a wide range. The parameters of HAGA are set as S t 2 Nt 2 Et 5. The results can be seen in Figure.5. Running motion. The subspace of running motion is extracted from motion capture data that consisted of 130 frames. This subspace is more compact than that of walking motion. Figure.6 shows the estimation results of 3D poses.
Fig. 6. Results of recovering the poses of a subject performing a running motion. The images are extracted from a video taken from the web site http://mocap.cs.cmu.edu.
5 Conclusion

We have discussed a novel generative approach to estimating 3D human pose from a single camera. Our approach is a step towards describing the motion characteristics of high-dimensional data spaces by extracting their subspaces. From motion capture data, we not only distilled the prior knowledge about human motion but also reduced the dimensionality of the problem. In the compact subspace, we perform an effective search for the optimal poses. To explore the solution space efficiently, we designed AGA and HAGA, by which optimal solutions can be found effectively by utilizing the characteristics of the state subspace. The robust shape contexts descriptor allows us to use silhouettes as image features. The approach was tested on different human motion sequences with good results, and allows the estimation of complex unseen motions in the presence of image ambiguities. In terms of future work, more interior edge information needs to be added to disambiguate some challenging sequences. Including a wider range of motion capture data would allow the estimator to cover more types of human motion.
Acknowledgements
This research is supported by the National Basic Research Program (973 Program) of China (No. 2006CB303103) and the National Natural Science Foundation of China (No. 60675017).
References
1. Sminchisescu, C., Kanaujia, A., Li, Z., Metaxas, D.: Discriminative density propagation for 3D human motion estimation. In: Proc. Conf. Computer Vision and Pattern Recognition, pp. 217–323 (2005) 2. Deutscher, J., Blake, A., Reid, I.: Articulated body motion capture by annealed particle filtering. In: Proceedings of the 2000 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 126–133 (2000) 3. Sidenbladh, H., Black, M., Fleet, D.: Stochastic tracking of 3D human figures using 2D image motion. In: European Conference on Computer Vision, vol. 2, pp. 702–718 (2000) 4. Sminchisescu, C., Triggs, B.: Covariance scaled sampling for monocular 3D body tracking. In: IEEE International Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 447–454 (2001) 5. Agarwal, A., Triggs, B.: Tracking articulated motion using a mixture of autoregressive models. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3023, pp. 54–65. Springer, Heidelberg (2004) 6. Ning, H., Tan, T., Wang, L., Hu, W.: People tracking based on motion model and motion constraints with automatic initialization. Pattern Recognition 37(7), 1423–1440 (2004) 7. Mori, G., Malik, J.: Recovering 3D Human Body Configurations Using Shape Contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(7), 1052–1062 (2006) 8. Sidenbladh, H., Black, M., Sigal, L.: Implicit Probabilistic Models of Human Motion for Synthesis and Tracking. In: European Conference on Computer Vision, vol. 1, pp. 784–800 (2002)
9. Urtasun, R., Fleet, D., Fua, P.: Monocular 3-D Tracking of the Golf Swing. In: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, IEEE Computer Society Press, Los Alamitos (2005) 10. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs. Springer, Heidelberg (1996) 11. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(4), 509–522 (2002) 12. Aggarwal, J., Cai, Q.: Human motion analysis: A review. Computer Vision and Image Understanding 73(3), 428–440 (1999) 13. Gavrila, D.: Visual analysis of human movement: A survey. Computer Vision and Image Understanding 73(1), 82–98 (1999) 14. Moeslund, T., Granum, E.: A survey of computer vision-based human motion capture. Computer Vision and Image Understanding 81(3), 231–268 (2001) 15. Urtasun, R., Fua, P.: 3D Human Body Tracking using Deterministic Temporal Motion Models. In: European Conference on Computer Vision, vol. 3, pp. 92–106 (2004) 16. Arulampalam, M., Maskell, S., Gordon, N., Clapp, T.: A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing 50(2), 174–188 (2002) 17. Deutscher, J., Davison, A., Reid, I.: Automatic partitioning of high dimensional search spaces associated with articulated body motion capture. In: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE Computer Society Press, Los Alamitos (2001) 18. Jin, X., Tai, C.: Convolution surfaces for arcs and quadratic curves with a varying kernel. The Visual Computer 18(8), 530–546 (2002) 19. Sigal, L., Black, M.J.: HumanEva: Synchronized video and motion capture dataset for evaluation of articulated human motion. Technical Report CS-06-08, Brown University (2006)
An Active Multi-camera Motion Capture for Face, Fingers and Whole Body Eng Hui Loke and Masanobu Yamamoto Graduate School of Science & Technology, and Department of Information Engineering, Niigata University, Ikarashi 2-nocho 8050, Nishi-ku, Niigata-city 950-2181, Japan
[email protected]
Abstract. This paper explores a novel endeavor: deploying only four active-tracking cameras and fundamental vision-based technologies for 3D motion capture of a full human body figure, including facial expression, the motion of the fingers of both hands, and the whole body. The proposed methods suggest alternatives for extracting the motion parameters of the mentioned body parts from four single-view image sequences. The proposed ellipsoidal-model- and flow-based facial expression motion capture solution tackles both 3D head pose and non-rigid facial motion effectively, and we observe that a set of 22 self-defined feature points suffices for the expression representation. Body figure and finger motion capture is solved with a combination of articulated-model and flow-based methods.
1 Introduction
Human-based character animation technology has been growing at an incredible speed. Indirectly, this induces growth in, and demand for, motion capture. An abundance of work has been proposed suggesting better and more robust solutions for human figure [8], finger, or facial [10] motion capture. Despite these efforts, no one has yet proposed a solution for full human motion processing that includes the motion of the body figure, the facial expression and the fingers simultaneously. In this paper, we propose a novel motion capture idea for implementing an image-based full human motion capture system, where the simultaneous motion data estimated with our framework can easily be used to reconstruct a 3D character animation. This paper presents the idea that, by simply adopting four cameras with an auto-tracking function and employing standard computer vision techniques, the 3D motion of a human figure, the fingers and the facial expression can be estimated from single-view image sequences recorded against a cluttered background. This first attempt at using only four cameras for image acquisition of whole-body motion eliminates cost and capture-area issues. Besides, in one system we tackle the facial feature recognition problem and 3D rigid head and non-rigid expression motion tracking with a mixture of vision techniques, while treating the human figure and fingers as articulated-model problems for 3D motion estimation.
Taking into account the fact that the characteristics of the facial expression and of the other body parts are dissimilar, we treat them differently. To be precise, facial expression is caused by muscle motion and is thus changeable, inducing variation in the appearance and size of the features, so it is posed as a non-rigid-motion problem [1]. Meanwhile, body and finger motions are produced by bones and joints with limited D.O.F. [9]; therefore their sizes are invariant, and they are appropriately represented as articulated models with rigid motion. In brief, our framework consists of three main modules: image acquisition (Section 2), vision-based motion estimation (Sections 3 and 4), and full model reconstruction and animation (Sections 5 and 6).
2 Image Acquisition
Most image acquisition in past work is done with VGA cameras, dealing either solely with the movement of one small body part or with whole-figure motion without involving the smaller body parts. In our context, however, a VGA image of 640x480 pixels has limited resolution, which affects the image analysis quality for smaller body parts such as the face and fingers. A high-vision camera, with its high resolution of 1920x1080 pixels, would suit our requirement; however, a single non-compressive high-vision camera with a full set of image capture devices can easily reach a cost that makes it infeasible in most cases. We therefore adopt an active multi-camera system consisting of four VGA color cameras (SONY EVI-D30) connected in parallel to the computer, where each automatically tracks one of four different body parts: the face, the whole body, and the left and right hands. These cameras are able to auto-pan, tilt and zoom to track the movement of the target by color- and brightness-matching the current view with a user-predefined area [11]. Once the start trigger is issued, the computer with a multi-image capture board (Micro Vision MV-34) starts fetching image data from the four cameras in parallel. This avoids the start-up delay that is common in a manual starting set-up and achieves truly simultaneous motion tracking. The image sequences are captured at a frame rate of 30 fps, sized 320x240 pixels, within a limited designated range. Each camera is targeted at tracking the movement of a different body part; the resulting images are shown in Figure 1. The hand and face images keep sufficiently high resolution to capture the targets in motion, while each camera keeps track of its target within the field of view.
3 Facial Expression Solutions
This section introduces a mixed rigid and non-rigid solution framework with which both the 3D head pose and the facial feature motion can be solved concurrently. We define a set of 22 feature points to represent the facial features, as shown in Figure 2: one per nostril, three per eyebrow, four per eye and six
Fig. 1. The 4 image sequences recorded using the active multi-camera system
Fig. 2. Facial feature points location. The numbering is the detection sequence.
for the mouth. These feature points are assumed to adhere to the surface of the ellipsoidal head model yet are tracked independently in the image. The initial facial pose is required to be a front view with no occlusion of the facial features. The following subsections detail the facial feature localization, followed by the facial expression motion capture adopted for both rigid and non-rigid motion recovery.

3.1 Skin Color Region Detection and Facial Region Recognition
Chromatic color space is the traditional approach for skin color analysis, since its two useful color components are the result of normalizing the color channels by intensity. However, this paper also considers the HSV space of skin color, based on two facts: (i) the evaluation in [12] shows that a Gaussian mixture is more accurate than a single Gaussian model, and (ii) the HSV color space exhibits the same clustering characteristic as the chromatic space, with a broader range. A Gaussian model can approximate these clustering characteristics. Applying this Gaussian model to the chromatic and HSV color spaces of the initial facial image, two grey-level images are generated in which the intensity of each pixel represents its probability of being skin. We average the intensity values of these two images and apply an empirical threshold to obtain the binary skin-likelihood segments. Spurious pixels induced by noise or color defects in the image are removed with a morphological closing operator followed by a median filter. The biggest skin-color blob is selected as the facial candidate region, while the other regions are discarded. Under the assumption that a frontal face view is assured in the initial image, we use a grey-scale eye-nose template as a detector identifying the exact location
of the facial features, eyes and nose. This is based on the conclusion of Brunelli et al. [2] that the eyes give the best matching result, followed by the nose. A full frontal-face template would be too time-consuming and matching against images with different mouth openings is error-prone, so we settle on this eye-nose template; it is rescaled only once, to the updated width of the facial-feature area image. Template matching is carried out on the grey-level image of the updated facial feature area. The Normalized Correlation Coefficients (NCC) at different y locations are computed vertically. Normalizing the correlation coefficients subdues ambiguous matching results induced by illumination and brightness. The highest correlation value in the upper facial region reports the presence of the eyes and nose.
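A simplified sketch of the skin-likelihood step of Section 3.1 is given below. This is our own illustration, not the authors' implementation: the Gaussian parameters are assumed to have been fitted to skin training pixels, the 0.4 threshold is a hypothetical empirical value, and OpenCV is only one possible choice for the color conversion and filtering.

import numpy as np
import cv2

def skin_likelihood(bgr, mean_rg, cov_rg, mean_hs, cov_hs):
    # Gaussian skin models in chromatic (r, g) and HSV (h, s) spaces.
    def gauss(x, mean, cov):
        d = x - mean
        m = np.einsum('...i,ij,...j->...', d, np.linalg.inv(cov), d)  # squared Mahalanobis
        return np.exp(-0.5 * m)
    b, g, r = cv2.split(bgr.astype(np.float32))
    s = r + g + b + 1e-6
    chrom = np.dstack([r / s, g / s])                                  # intensity-normalized r, g
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    p = 0.5 * (gauss(chrom, mean_rg, cov_rg) + gauss(hsv[..., :2], mean_hs, cov_hs))
    mask = (p > 0.4).astype(np.uint8) * 255                            # empirical threshold
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, np.ones((5, 5), np.uint8))
    mask = cv2.medianBlur(mask, 5)
    return mask                                # the largest blob becomes the face candidate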
3.2 Extraction of Facial Feature Points
There are 22 feature points in a set to be localized after the positions of the eyes and nose are confirmed. Various approaches have been proposed in different works, to name a few: edge detection, integral projection, and snake localization. The facial feature area recognized in the previous section is segmented into different blocks on the color image according to the scaled size of the eyes and nose on the template. The approaches for each feature point's localization are detailed as follows.

Detecting feature points on Eyes: In [4], eyes (pupils) are located using template matching, but in [3,5] the eye feature points are detected using integral projection or, in other works, manually. Before the two corners and two opening points of each eye are detected, we have to eliminate the hair-likelihood region with the following steps:
– Calculate the color histogram within the eye feature block and extract the highest value as the threshold; pixels with lower luminance are extracted
– Convert the extracted pixel block into a binary image
– Get the vertical/horizontal integral projection values of the binary image to locate the hair region
– Identify the upper rows and side columns at which the projection value is above the threshold; these regions are recognized as hair and thus eliminated
– The remaining binary region is segmented as the eye region

The two corners of an eye are located at the left and right bounds of this region. The Harris corner detector is further carried out within a fixed area centered at these two bounds to find good corner values. Following that, the y values of both the upper and lower bounds of the eye region are taken as the opening values of the eye, although their x values are not yet determined. An initial attempt computed the center of the two corner points, but this did not yield a stable or good solution. The adopted approach computes the horizontal integral projection of the eye region on the binary image; the x value with the highest integral projection value is taken as the x coordinate of the opening points, as plotted in Figure 3. The detection results show that this method works well, as it indirectly identifies the gazing direction of the person. In addition, this approach works stably even when the captured eye is closed, because a dark line forms at the edge of the eyelid.

Fig. 3. Eyes feature points detection
Fig. 4. Lips opening feature points detection
Fig. 5. Eyebrow mid point detection

Detecting feature points on Nose: The approach taken in detecting nostrils is straightforward, as nostrils carry a special characteristic: they are the darkest regions in the middle area of the face (between the vertical positions of the eyes and mouth). The system searches for one dark region in each half of the nose feature block. The center point of each region is taken as the nostril feature point.

Detecting feature points on Mouth: In this system, we define six feature points on the mouth, where two are the corners and the rest represent the upper and lower lips. To calculate each of their locations, we have to estimate the searching area on the image. This is done by fixing the searching boundary based
on the known eye and nostril coordinates, shown as the black segment in Figure 4. The luminance of the lips is the most easily recognized characteristic for mouth detection. Thus, this informative color can be considered the best choice for such a task on a color image, rather than the edge extraction method adopted in [4], which is susceptible to noise that can lead to fallacious detection results. Our method applies the HSV or RGB color space within the mentioned bounding box to extract the segment where lips-likelihood pixels exist. The two feature points on the upper and lower lips can then easily be derived at the mid x value between the nostrils; the y value of each point is extracted from the lips-likelihood segment, as illustrated in Figure 4. Similar to the search method for detecting the corners of the eyes, the two corner points of the lips are determined using the Harris corner detector, one in each half of the mouth feature block, and they must satisfy certain geometric constraints. Detecting feature points on Eyebrows: The eyebrow is one of the more difficult facial objects to detect and track, owing to its uneven shape and thickness, besides the problem of being occluded by or misrecognized as forelock hair. Few facial feature detection studies give a clear solution for it. We represent one eyebrow with three points: one at the middle and two at its ends. Figures 5 and 6 illustrate the detection mechanism explained below. First, the middle position of the eyebrow has to be determined. The searching boundary is estimated between the beginning y value of the skin-likelihood region, y_start, and the beginning of the eyes' bounding box, y_end. This bounding area is denoised with a Gaussian convolution. Starting from y_end towards y_start, a searching-for-edge function is run at the midpoint between the two corners of the eye. The first point encountered with a significant intensity change (from bright to dark) is taken as the lower edge of the eyebrow. Once this point is detected, we search for the upper edge of the eyebrow, which is the first point encountered with a significant intensity change (from dark to bright). Averaging the y values of these two points, we mark the result as the mid feature point of the eyebrow. Both ends of an eyebrow are detected by running the detection function from the known mid eyebrow point towards both ends horizontally. This
Fig. 6. Eyebrow corner points localization
This bounding area is converted to an edge image. At every incremented or decremented x location, the upper and lower edges (y values) of the eyebrow are determined. The search continues, taking the previous average y value as the starting y location at the next x value, until one of two conditions is met: (i) the upper and lower edges come closer to each other than a certain threshold, or (ii) the edges drift farther apart than the thickness of the eyebrow at its center point. The head is represented as a 3D ellipsoidal model fitted onto the facial region in the image. The depth value of each facial feature point is calculated from its x and y position so that the point lies on the surface of the ellipsoidal head.
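To make the eyebrow edge search concrete, the following sketch scans a single smoothed image column upward from the eye toward the hairline, as described above. It is only an illustration of the idea, not the authors' implementation; the function name, the gradient threshold, and the coordinate convention (row 0 at the top, so y_start < y_end) are our own assumptions.

```python
import numpy as np

def find_eyebrow_mid_point(gray, x_mid, y_start, y_end, grad_thresh=25):
    """Scan one Gaussian-smoothed grayscale column upward from the eye box
    (row y_end) toward the skin region top (row y_start) and return the
    vertical midpoint of the eyebrow, or None if no edge pair is found."""
    column = gray[:, x_mid].astype(np.int32)
    lower_edge = upper_edge = None
    # Move upward (decreasing y) looking for a bright-to-dark transition (lower edge).
    for y in range(y_end, y_start, -1):
        if column[y - 1] - column[y] < -grad_thresh:   # pixel above is darker
            lower_edge = y
            break
    if lower_edge is None:
        return None
    # Continue upward looking for a dark-to-bright transition (upper edge).
    for y in range(lower_edge - 1, y_start, -1):
        if column[y - 1] - column[y] > grad_thresh:    # pixel above is brighter
            upper_edge = y
            break
    if upper_edge is None:
        return None
    return (lower_edge + upper_edge) // 2              # y value of the mid feature point
```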
3.3 Facial Features Motion Capture
Facial expression, i.e., the motion of the facial features, has been treated as a non-rigid motion problem only very recently; most earlier works either model it with a two-dimensional affine approximation or leave it untouched. The proposed solutions include, but are not limited to, physically based head models (polygonal or mesh) of varying complexity, local parametric flow models, and optical flow computation combined with model-based constraints. Aiming for a robust yet less computation-intensive approach, this system takes a middle path between the simple template-based affine approximation and intricate physical or muscle-structure model-based tracking. Adopting an idea similar to the one proposed in [1], we tackle rigid head motion and non-rigid facial expression tracking separately. Rigid Head Motion Estimation: The 2D image flow field in the head region between two successive frames (excluding the sparse motion field in the facial feature regions of the eyebrows, eyes and mouth) is calculated from the constant-brightness constraint over time coupled with the Lucas-Kanade technique [6]. This 2D flow is then interpreted as the 3D rigid motion of the head using the depth constraint given by the current pose of the ellipsoidal head model. The 3D rigid motion updates the current pose of the head at every frame. Non-Rigid Facial Features Motion Estimation: This module inherits the 3D motion parameters from the rigid tracking module to predict the facial feature points in the next frame. The facial feature points are also tracked individually using the NCC method. The displacement between the predicted position and the tracked one denotes the relative motion with respect to the head as the reference coordinate frame; if the displacement is large, expression deformation has occurred.
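The sketch below illustrates how these two modules could be realized with standard tools: sparse Lucas-Kanade flow for the rigid part and normalized cross-correlation template matching for the individual feature points. It is not the authors' code; the interpretation of the 2D flow as 3D rigid motion on the ellipsoid is omitted, and the function names, window sizes, and the use of OpenCV are assumptions.

```python
import cv2
import numpy as np

def rigid_flow(prev_gray, curr_gray, head_points):
    """Sparse Lucas-Kanade flow at sample points inside the head region
    (the eyebrow/eye/mouth blocks are assumed to be excluded beforehand)."""
    pts = head_points.astype(np.float32).reshape(-1, 1, 2)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)
    good = status.ravel() == 1
    return (nxt[good] - pts[good]).reshape(-1, 2)      # 2D flow vectors

def track_feature_ncc(prev_gray, curr_gray, pt, templ_size=11, search=15):
    """Track one facial feature point by normalized cross-correlation.
    Assumes the point lies far enough from the image border."""
    x, y = int(pt[0]), int(pt[1])
    h = templ_size // 2
    templ = prev_gray[y - h:y + h + 1, x - h:x + h + 1]
    win = curr_gray[y - h - search:y + h + search + 1,
                    x - h - search:x + h + search + 1]
    res = cv2.matchTemplate(win, templ, cv2.TM_CCOEFF_NORMED)
    _, _, _, max_loc = cv2.minMaxLoc(res)
    # Convert the best-match location back to image coordinates.
    return (x - search + max_loc[0], y - search + max_loc[1])
```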
4
Articulated Model Solutions
The use of articulated models to represent the human body and hand is common in model-based tracking systems [8]. Fitting the model to an image sequence produces a pose sequence, i.e., the motion, of the body and hand. In this paper, we
Fig. 7. Left: body wireframe model and the tree structure of the body parts; right: fingers wireframe model and its tree structure
adopt the motion capture method proposed in [9]. We briefly explain the method here; the details can be found in [9]. Our body model consists of 16 parts arranged in a tree structure with the waist part as the root of the model, as shown in Figure 7, left. The same concept used for modeling the body figure is extended to the fingers: we define a similar structure with 16 parts, as illustrated in Figure 7, right, with the root of this model placed at the palm. The tree structure denotes the connection from parent to child by the arrow direction. Each part has its own coordinate system, in which one axis is aligned along the body axis and the origin is located at the joint with its parent part (at the center of gravity for the waist and chest). We manually adjust the pose of the articulated model to fit it onto the initial frame of each body and finger sequence. Automatic model fitting remains a further issue; it should be possible, since existing methods, e.g. [7], can be utilized for detecting the human body in an image and determining its 3D pose. A pose displacement of the human body can be estimated from the difference between successive frames. Therefore, after obtaining a pose at the initial frame by model fitting, the pose at any frame can be obtained by accumulating the successive pose displacements onto the initial pose. The estimation of the pose displacement is based on the facts that (1) optical flow is constrained by a spatio-temporal linear equation, (2) a 3D translation vector on the model is approximated as a sum of pose displacements weighted by the Jacobian matrix, and (3) the depth of the human body can be obtained from the model fitted to the body. Chain substitutions based on (1), (2) and (3) produce a system of linear equations with the pose displacements as unknowns. Solving this system and accumulating the obtained pose displacements onto the initial pose yields the human motion. However, this approach suffers from pose drift as an inherent drawback. To cope with this drift, additional
model fitting is performed manually at several key-frames, and the accumulated pose at each in-between frame is corrected by propagating the poses given at the key-frames [9].
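A minimal sketch of the solving and accumulation steps is given below, assuming the per-pixel Jacobians and spatio-temporal gradients have already been computed from the fitted model and its depth; the exact formulation in [9] differs in detail, and all names here are illustrative.

```python
import numpy as np

def estimate_pose_displacement(J_list, flow_constraints):
    """Assemble and solve the linear system in the pose displacements dq.

    J_list           : per-pixel 2 x n_dof Jacobians mapping a pose displacement
                       to image motion (from the fitted model and its depth).
    flow_constraints : per-pixel (gx, gy, gt) spatio-temporal gradients giving
                       the optical-flow constraint gx*u + gy*v + gt = 0.
    """
    rows, rhs = [], []
    for J, (gx, gy, gt) in zip(J_list, flow_constraints):
        rows.append(np.array([gx, gy]) @ J)   # one row of the linear system
        rhs.append(-gt)
    A = np.vstack(rows)
    b = np.asarray(rhs)
    dq, *_ = np.linalg.lstsq(A, b, rcond=None)
    return dq

def accumulate_poses(initial_pose, displacements):
    """Accumulate per-frame pose displacements onto the initial pose."""
    poses = [np.asarray(initial_pose, dtype=float)]
    for dq in displacements:
        poses.append(poses[-1] + dq)           # drift accumulates here, hence key-frames
    return poses
```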
5
Animation of the Simultaneous Motion
A general humanoid polygonal model is deformed automatically according to the initial pose acquired from the earlier initialization module. Since this animation step focuses on reconstructing a humanoid model and on restoring and animating the simultaneous motion captured from the actor, no human motion kinematics is applied. The finger and facial feature models are installed on the body figure model to reconstruct a full humanoid model. However, the initial sizes of the body parts estimated from the different image sequences are inconsistent. Therefore, several steps are taken for the integration: (i) The system identifies the parent or replacement part for each model: the head is the parent part for the facial features, as the palm is for the finger model. The system locates the center of the ellipsoidal head on the body figure model and makes it the origin of the coordinate system for the facial feature points. For both hands, the finger models simply replace the hand parts and are installed at the origins of their respective local coordinate systems, Σ_LeftHand and Σ_RightHand, on the body figure model. (ii) Model size scaling: letting the three-axis sizes of the initial ellipsoidal head in the body figure and facial image sequences be H_body = (x, y, z) and H_face = (x, y, z) respectively, we rescale each feature position by H_body/H_face along the x, y and z axes. The same calculation is applied to the finger models.
5.1 Reconstruction of the Full Human Model
By utilizing the camera-referenced motion data of all the targeted subjects acquired in the preceding motion capture stage, the system automatically constructs a full human model and realizes the simultaneous motions of each body part on it. However, the facial feature, finger and body figure models each have their own structure, and their poses are expressed in different camera coordinate systems. Thus, to assemble these different body parts into one complete human model, they need to be referenced in the same coordinate system, namely the world coordinate system. In this paper, the camera coordinate system of the human body sequence is taken as the world coordinate system. The hand model shares a palm with the body model, and the face model shares a head with the body model. Therefore, if the parent of the fingers is changed from the palm of the hand model to the palm of the body model, and the parent of the facial features is changed from the head of the face model to the head of the body model, a full body model can be established without further camera calibration. We illustrate the transformation from local camera coordinates to world coordinates with the example of the little fingertip.
Fig. 8. The tracking result of each face, body figure, right and left hand image sequence, at frame 56, 101 and 197
To assemble the hand models on the body model, let ${}^{j}T_{i}$ denote the transformation from the coordinate system $\Sigma_i$ of part $i$ to $\Sigma_j$ of part $j$. Let parts $1, 2, \dots, 8$ be the waist, chest, upper arm, forearm, hand (alias palm), finger seg. 1, finger seg. 2, and finger seg. 3 (alias fingertip), respectively, and let $\Sigma_0$ denote the world coordinate system. The body camera can capture the motion up to the palm. The transformation from the palm to the world coordinate system is given by

${}^{0}T_{1}\,{}^{1}T_{2}\,{}^{2}T_{3}\,{}^{3}T_{4}\,{}^{4}T_{5}.$   (1)

Meanwhile, the hand camera captures the hand motion from the palm to the fingertip. The transformation from the fingertip to the hand camera coordinate system is given by

${}^{h}T_{5}\,{}^{5}T_{6}\,{}^{6}T_{7}\,{}^{7}T_{8},$   (2)

where $h$ denotes the hand camera coordinate system. To represent the pose of the fingertip in the world coordinate system, appending ${}^{5}T_{6}\,{}^{6}T_{7}\,{}^{7}T_{8}$ from eq. (2) to eq. (1) gives the transformation

${}^{0}T_{1}\,{}^{1}T_{2}\,{}^{2}T_{3}\,{}^{3}T_{4}\,{}^{4}T_{5}\,{}^{5}T_{6}\,{}^{6}T_{7}\,{}^{7}T_{8}.$   (3)

Similarly, the face model is embedded into the full human model.
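Assembling eq. (3) amounts to composing homogeneous transforms, as in the following sketch; the identity matrices are placeholders for the per-frame transforms estimated by the tracker, and the variable names are ours.

```python
import numpy as np

def compose(*transforms):
    """Compose 4x4 homogeneous transforms left to right: T1 @ T2 @ ... @ Tn."""
    T = np.eye(4)
    for t in transforms:
        T = T @ t
    return T

# Placeholder transforms (identity); in practice each iTj comes from the tracker.
T01, T12, T23, T34, T45 = (np.eye(4) for _ in range(5))   # waist ... palm chain
T56, T67, T78 = (np.eye(4) for _ in range(3))             # palm ... fingertip chain

T_world_palm = compose(T01, T12, T23, T34, T45)           # eq. (1)
T_world_tip = compose(T_world_palm, T56, T67, T78)        # eq. (3): append the finger chain
```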
6
Experimental Results
To prepare the motion image sequences for the experiments, we carried out the image acquisition in the laboratory without constraints on the background scene or the actor's costume. The actor is required to start in a front-view position but is free to move throughout the 200-frame recording, under the conditions that the facial features and fingers remain visible and the motion is smooth. Two sets of motion were recorded with this multi-camera system as 320×240-pixel bitmap image sequences, in which the actor (i) performed simple hand and face movements, and (ii) played air guitar while singing. The tracking results at different frames of the four image sequences (facial features, whole body figure, and both hands) are shown in Figure 8, while Figure 9 illustrates the reconstructed model's appearance and the motion recovered at the same frames. The average processing time is about 30 seconds per frame, with which the 200-frame motion capture and animation process takes less than 15 minutes to complete.
Fig. 9. Animation reconstructed on the humanoid model at frame 56, 101 and 197
7
Conclusion
In this paper, we have presented a novel approach to full human motion capture that simultaneously covers the motion of the whole body figure, the fingers of both hands and the facial features, and that requires only four cameras for recording. Our major contributions lie in several aspects. The main one is the novel use of only four cameras to capture the full motion of an actor. We also built a foundation platform for the concurrent motion estimation of the body figure, fingers and facial expression solely from single-view image sequences, together with motion reconstruction on a full humanoid model. In addition, we proposed several alternatives for facial feature detection and for 3D rigid head pose and non-rigid feature motion estimation. Through the animation results, we have demonstrated that our approaches provide a concise description of the human motion of different body parts and are feasible for constructing humanoid model animations. Acknowledgments. We thank Hideaki Sasagawa for the hand tracking work.
References 1. Black, M.J., Yacoob, Y.: Recognizing Facial Expressions In Image Sequences Using Local Parameterized Models Of Image Motion. IJCV 25(1), 23–48 (1997) 2. Brunelli, R., Poggio, T.: Face Recognition: Features versus Templates. IEEE PAMI 15(10), 1042–1052 (1993) 3. Chuang, M.M., Chang, R.F., Huang, Y.L.: Automatic Facial Feature Extraction In Model-Based Coding. Journal of Information Science And Engineering 16, 447–458 (2000) 4. Feris, R.S., De Campos, T.E., Junior, R.M.C.: Detection and Tracking Of Facial Features In Video Sequences. In: Cair´ o, O., Cant´ u, F.J. (eds.) MICAI 2000. LNCS, vol. 1793, pp. 197–206. Springer, Heidelberg (2000) 5. Gu, H., Su, G.: Feature Points Extraction From Faces. Image and Vision Computing NZ (2003) 6. Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: Proc. Imaging Understanding Workshop, pp. 121–130 (1981) 7. Mori, G., Malik, J.: Recovering 3D human body configurations using shape contexts. IEEE PAMI 7(28), 1052–1062 (2006) 8. Wang, J.J., Singh, S.: Video analysis of human dynamics - a survey. Real-Time Imaging 9(5), 321–346 (2003) 9. Yamamoto, M., Ohta, Y., Yamagiwa, T., Yagishita, K., Yamanaka, H., Ohkubo, H.: Human Action Tracking Guided by Key-Frames. FG2000, 354–361 (2000) 10. Zhao, W., Chellappa, R., Phillips, P.J., Rosenfeld, A.: Face recognition: A literature survey. ACM Computiong Surveys 35(4), 299–458 (2003) 11. http://www.Sony.Co.Jp/Products/ISP/Products/Model/Ptz/EVID30.Html 12. Evaluation of RGB and HSV models in Human Faces Detection, [Online] (2004) Available, http://www.cescg.org/CESCG-2004,/web/Sedlacek-Marian/
Tracking and Classifying of Human Motions with Gaussian Process Annealed Particle Filter Leonid Raskin, Michael Rudzsky, and Ehud Rivlin Computer Science Department, Technion—Israel Institute of Technology, Haifa, Israel, 32000 {raskinl,rudzsky,ehudr}@cs.technion.ac.il
Abstract. This paper presents a framework for 3D articulated human body tracking and action classification. The method is based on nonlinear dimensionality reduction of the high-dimensional data space to a low-dimensional latent space. The motion of the human body is described by a concatenation of low-dimensional manifolds that characterize different motion types. We introduce a body pose tracker that uses the learned mapping function from the low-dimensional latent space to the high-dimensional body pose space. The trajectories in the latent space provide low-dimensional representations of the body poses performed during motion and are used to classify human actions. The approach was evaluated on the HumanEva dataset as well as on our own dataset. The results and a comparison to other methods are presented.
1
Introduction
Human body pose estimation and tracking is a challenging task for several reasons. The main problem that has to be solved in order to achieve satisfactory pose tracking and understanding is the high dimensionality of the human pose model, which complicates the examination of the entire subject and makes it harder to detect each body part separately. Despite the high dimensionality of the problem, many poses can be represented in a low-dimensional space obtained by dimensionality reduction, and human body motions can be displayed as curves in this space. This space can be obtained by learning from different motion types [1,2]. This paper presents an approach to 3D people tracking and motion analysis in which we apply nonlinear dimensionality reduction using the Gaussian Process Dynamical Model (GPDM) [3,4] and the annealed particle filter [5]. GPDM is better able to capture the properties of high-dimensional motion data than linear methods such as PCA. It generates a mapping function from the low-dimensional latent space to the full data space, learned from previously observed poses of different motion types. For tracking we separate the model state into two independent parts: one contains information about the 3D location and orientation of the body, and the second describes the pose. We learn a latent space that describes poses only. The tracking algorithm consists of two stages. First, particles are generated in the latent space and are transformed into the data space using the a priori learned mapping function.
Second, we add rotation and translation parameters to obtain valid poses. The likelihood function is calculated in order to evaluate how well a pose matches the visual data. The resulting tracker estimates the locations in the latent space that represent the poses with the highest likelihood. As the latent space is learned from pose sequences of different motion types, each action is represented by a curve in the latent space. The classification of the motion is based on comparing the sequence of latent coordinates produced by the tracker with the sequences that represent poses of the different motion types. We use a modified Fréchet distance [6] to compare the pose sequences. This approach also allows actions different from those used for learning to be introduced, by exploiting the curves that represent them. We show that our tracking algorithm provides good results even at low frame rates. An additional advantage of our tracking algorithm is its capability to recover after a temporary loss of the target. We also show that the task of action classification, when performed in the latent space, is robust.
2
Related Works
One of the common approaches to tracking is the particle filter. This method uses multiple predictions, obtained by drawing samples from a pose and location prior and propagating them using a dynamic model, which are then refined by comparing them with the local image data through a likelihood computation [7]. The prior is typically quite diffuse (because motion can be fast), but the likelihood function may be very peaky, containing multiple local maxima which are hard to account for in detail [8]. Annealed particle filtering [5,19] or local searches are ways to attack this difficulty. An alternative is to apply a strong model of dynamics [9]. There exist several possible strategies for reducing the dimensionality of the configuration space. First, it is possible to restrict the range of movement of the subject [10]; because of the restrictive assumptions, the resulting trackers are not capable of tracking general human poses. Another way to cope with a high-dimensional data space is to learn low-dimensional latent variable models [11]. However, methods such as Isomap [12] and locally linear embedding (LLE) [13] do not provide a mapping between the latent space and the data space, and therefore Urtasun et al. [14] proposed using a form of probabilistic dimensionality reduction by GPDM [3,4] to formulate tracking as a nonlinear least-squares optimization problem. During the last decade many different methods for behavior recognition and classification of human actions have been proposed. Popular methods are based on Hidden Markov Models (HMM), Finite State Automata (FSA), stochastic context-free grammars (SCFG), etc. Sato et al. [15] presented a method that extracts human trajectory patterns to identify interactions. Park et al. [16] proposed a method using a nearest-neighbor classifier for the recognition of two-person interactions such as hand-shaking, pointing, and standing hand-in-hand. Hongeng et al. [17] proposed probabilistic finite state automata for recognition
of a sequential occurrence of several scenarios. Park et al. [18] presented a recognition method that combines model-based tracking and deterministic finite state automata. This paper is organized as follows. Section 3 describes the tracking algorithm. Section 4 describes the classification algorithm. Section 5 shows the experimental results for tracking and action classification of different data sets and motion types.
3 Tracking
3.1 GPAPF Tracker
The drawback of the annealed particle filter tracker [5] is that the high dimensionality of the state space increases the number of particles that need to be generated in order to preserve the same particle density. This makes the algorithm computationally ineffective for low frame rate videos (30 fps and lower). The other problem is that once the target is lost (i.e., the body pose is wrongly estimated, which can happen for fast and non-smooth movements), it becomes highly unlikely that the pose will be estimated correctly in the following frames. In order to reduce the dimension of the space, we propose the Gaussian Process Annealed Particle Filter (GPAPF). Using the Gaussian Process Dynamical Model (GPDM) [3,4], we embed several types of poses into a low-dimensional space. We used two- and three-dimensional spaces, which proved sufficient for robust tracking and classification. The poses are taken from different sequences, such as walking, running, punching and kicking. We divide the state into two independent parts. The first part contains the global 3D body rotation and translation, which are independent of the actual pose. The second part contains only information about the pose (26 DoF). We use GPDM to reduce the dimensionality of this second part and thereby construct a latent space (Fig. 1) of significantly lower dimensionality (for example 2 or 3 DoF). The latent space includes solely pose information and is therefore rotation and translation invariant. For the tracking task we use a modified annealed particle filter [5] with a two-stage algorithm. The first stage, which is the main modification of the tracking algorithm, generates new particles in the latent space. We then apply the learned mapping function that transforms latent coordinates into the data space. As a result, after adding the translation and rotation information, we construct 31-dimensional vectors that describe a valid data state, including location and pose information, in the data space. In order to estimate how well a pose matches the images, the likelihood function is calculated [19]. Suppose we have M annealing layers. The state is defined as a pair Γ = {Λ, Ω}, where Λ is the location information and Ω is the pose information. We also define ω as the latent coordinates corresponding to the data vector Ω: Ω = ℘(ω), where ℘ is the mapping function learned by the GPDM. Λn,m, Ωn,m and ωn,m denote the location, pose vector and corresponding latent coordinates at frame n and annealing layer m.
Fig. 1. The latent space that is learned from different motion types. (a) 2D latent space from 3 different motions: lifting an object (red), kicking with the left (green) and the right (magenta) legs. (b) 3D latent space from 3 different motions: hand waving (red), lifting an object (magenta), kicking (blue), sitting down (black), and punching (green).
For each 1 ≤ m ≤ M − 1, Λn,m and ωn,m are generated by adding a multi-dimensional Gaussian random variable to Λn,m+1 and ωn,m+1, respectively. Then Ωn,m is calculated from ωn,m. The full body state Γn,m = {Λn,m, Ωn,m} is projected onto the cameras, and the likelihood πn,m is calculated using the likelihood function. The main difficulty is that the latent space is not uniformly distributed, and sequential poses may not be close to each other in the latent space. Therefore we use a dynamic model, as proposed by Wang et al. [4], in order to achieve smooth transitions between sequential poses in the latent space. However, some irregularities and discontinuities remain. Moreover, in the latent space each pose has a certain probability of occurring, and the probability of drawing it as a hypothesis should depend on it. For each location in the latent space, a variance can be estimated and used for generating the new particles. In Fig. 1(a) the lighter pixels represent lower variance, which depicts the regions of the latent space that correspond to more likely poses. An additional modification concerns the way the optimal configuration is calculated. In the original annealed particle filter, the optimal configuration is obtained by averaging over the particles in the last layer. However, as the latent space is not Euclidean, applying this method to ω produces poor results. We propose to calculate the optimal configuration in the data space and then project it back to the latent space: first we apply ℘ to all the particles to generate vectors in the data space, then we calculate the weighted average of these vectors in the data space and project it back to the latent space. This can be written as $\omega_n = \wp^{-1}\left(\sum_{i=1}^{N} \pi_{n,0}^{(i)}\, \wp\bigl(\omega_{n,0}^{(i)}\bigr)\right)$.
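The following sketch outlines one annealing layer of this two-stage scheme, including the back-projection of the weighted data-space mean. The GPDM mapping, its inverse projection, and the likelihood function are assumed to be supplied externally, and the structure is a simplification of the actual GPAPF tracker rather than the authors' implementation.

```python
import numpy as np

def gpapf_layer(latent_particles, loc_particles, weights, sigma_latent, sigma_loc,
                gpdm_map, gpdm_inverse_map, likelihood):
    """One annealing layer of a GPAPF-style tracker (sketch).

    gpdm_map         : latent coordinates -> 26-DoF pose vector (learned GPDM mapping).
    gpdm_inverse_map : projection of a pose vector back to the latent space.
    likelihood       : function of (location, pose) returning an image likelihood.
    """
    n = len(latent_particles)
    # Resample according to the previous layer's weights.
    idx = np.random.choice(n, size=n, p=weights / weights.sum())
    # Diffuse in the low-dimensional latent space and in the location space.
    lat = latent_particles[idx] + np.random.randn(n, latent_particles.shape[1]) * sigma_latent
    loc = loc_particles[idx] + np.random.randn(n, loc_particles.shape[1]) * sigma_loc
    # Map latent particles to full poses and weight them against the images.
    poses = np.array([gpdm_map(z) for z in lat])
    w = np.array([likelihood(l, p) for l, p in zip(loc, poses)])
    w /= w.sum()
    # Optimal pose: weighted average in the data space, projected back to latent space.
    mean_pose = (w[:, None] * poses).sum(axis=0)
    omega_hat = gpdm_inverse_map(mean_pose)
    return lat, loc, w, omega_hat
```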
Fig. 2. Losing and recovering the tracked target despite mis-tracking in the previous frame
pose was estimated correctly, the tracker will be able to choose the most suitable one from the tested poses. At the same time, if the pose in the previous frame was miscalculated, the tracker will still consider poses that are quite different. As these poses are expected to receive higher values of the weighting function, the next annealing layers will generate many particles from them. In this way the pose is likely to be estimated correctly despite the mis-tracking in the previous frame, as shown in Fig. 2. Another advantage of our approach is that the generated poses are, in most cases, natural. In the case of CONDENSATION or the annealed particle filter, the large variance in the data space can cause the generation of unnatural poses. Poses produced from latent-space points with low variance are usually natural, and therefore the number of effectively used particles is higher, which enables more accurate tracking.
3.2 Obtaining a Better Tracker
The problem with such a two-stage approach is that the Gaussian field is not capable of describing all possible poses. As mentioned above, the approach resembles using probabilistic PCA to reduce the data dimensionality. However, for tracking we are interested in obtaining a pose estimate as close as possible to the actual pose. Therefore, we add an additional annealing layer as the last step. This layer consists of only one stage. We use the data states generated in the previous two-stage annealing layer to generate data states for this layer. This is done with very low variances in all dimensions, which are practically the same for all actions, since the purpose of this layer is to make only slight changes to the final estimated pose. Thus it does not depend on the actual frame rate, in contrast to the original annealed particle filter tracker, where a change of frame rate requires updating the model parameters (the variances for each layer).
4
Action Classification
The classification of the actions is based on the sequences of poses detected by the tracker during the performed motion. We use the Fréchet distance [6] to determine the class of the motion, i.e., walking, kicking,
waving, etc. The Fréchet distance between two curves measures the resemblance of the curves, taking their direction into consideration, and is quite tolerant to position errors. Suppose there are K different motion types. Each type k is represented by a model Mk, which is a sequence of lk + 1 latent coordinates Mk = {μ0, ..., μlk}. The GPAPF tracker generates a sequence of l + 1 latent coordinates Γ = {ϕ0, ..., ϕl}. We define a polygonal curve P^E as a continuous, piecewise linear curve made of segments connecting the vertices E = {v0, ..., vn}. The curve can be parameterized by α ∈ [0, n], where P^E(α) refers to a position on the curve, P^E(0) denotes v0, and P^E(n) denotes vn. The distance between two curves is defined as

$F\bigl(P^{M_k},P^{\Gamma}\bigr)=\min_{\alpha,\beta}\,\bigl\{\,f\bigl(P^{M_k}(\alpha),P^{\Gamma}(\beta)\bigr)\;:\;\alpha:[0,1]\to[0,l_k],\;\beta:[0,1]\to[0,l]\,\bigr\},$

where $f\bigl(P^{M_k}(\alpha),P^{\Gamma}(\beta)\bigr)=\max\bigl\{\,\|P^{M_k}(\alpha(t))-P^{\Gamma}(\beta(t))\|_2\;:\;t\in[0,1]\,\bigr\}$ and α(t) and β(t) range over continuous, increasing functions with α(0) = 0, α(1) = lk, β(0) = 0, β(1) = l. The model with the smallest distance is chosen as the type of the action. While the Fréchet distance is hard to compute in general, Alt et al. [6] have presented an efficient algorithm for computing it between two piecewise linear curves.
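In practice the latent trajectories are sampled point sequences, so a discrete Fréchet distance computed by dynamic programming is a common stand-in for the continuous definition above. The sketch below uses that approximation rather than the algorithm of Alt et al. [6], and the function names are ours.

```python
import numpy as np

def discrete_frechet(P, Q):
    """Discrete Frechet distance between two polygonal curves given as
    (n, d) and (m, d) arrays of vertices (dynamic-programming formulation)."""
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    n, m = len(P), len(Q)
    dist = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)
    ca = np.empty((n, m))
    for i in range(n):
        for j in range(m):
            d = dist[i, j]
            if i == 0 and j == 0:
                ca[i, j] = d
            elif i == 0:
                ca[i, j] = max(ca[0, j - 1], d)
            elif j == 0:
                ca[i, j] = max(ca[i - 1, 0], d)
            else:
                ca[i, j] = max(min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]), d)
    return ca[n - 1, m - 1]

def classify(track, models):
    """Return the index of the model curve closest to the tracked latent trajectory."""
    return int(np.argmin([discrete_frechet(track, M) for M in models]))
```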
5
Results
We tested the GPAPF tracking algorithm on the HumanEva dataset. The dataset contains different activities, such as walking and boxing, and provides the correct 3D locations of body joints, such as the hips and knees, for evaluating the results and comparing to other tracking algorithms. We compared the results produced by the GPAPF tracker with those produced by the annealed particle filter body tracker [20]. The error measures the average 3D distance between the joint locations provided by the MoCap system and those estimated by the tracker [20]. Fig. 3 shows the actual poses estimated for this sequence, projected onto the first and second cameras: the first two rows show the results of the GPAPF tracker, and the last two rows show the results of the annealed particle filter. Fig. 4(a) shows the error graphs produced by the GPAPF tracker (blue circles) and by the annealed particle filter (red crosses) for the walking sequence captured at 30 fps. The graph suggests that the GPAPF tracker produces more accurate estimates. We also compared the performance of the tracker with and without the additional annealing layer, using 5 double-stage annealing layers in both cases and adding one single-stage layer for the second tracker. Fig. 4(b) shows the errors of the GPAPF tracker with the additional layer (blue circles) and without it (red crosses); Fig. 5 shows sample poses projected onto the cameras. The improvement is not dramatic, which is explained by the fact that the difference between the pose estimated using only the latent-space annealing and the actual pose is not very large. This suggests that the latent space accurately represents the data space. We also created a database containing videos with similar actions performed by a different actor. The
Fig. 3. Tracking results of annealed particle filter tracker and GPAPF tracker. Sample frames from the walking sequence. First row: GPAPF tracker, first camera. Second row: GPAPF tracker, second camera. Third row: annealed particle filter tracker, first camera. Forth row: annealed particle filter tracker, second camera.
Fig. 4. (a) The errors of the annealed tracker (red crosses) and the GPAPF tracker (blue circles) for a walking sequence captured at 30 fps. (b) The errors of the GPAPF tracker with the additional annealing layer (blue circles) and without it (red crosses) for a walking sequence.
frame rate was 15 fps. We manually marked some of the sequences in order to produce the training sets needed for GPDM. After learning, we validated the results on the other sequences containing the same behavior. We experimented with different numbers of particles. With 100 particles per layer, the computational cost was 30 seconds per frame. Using the same number of particles and layers, the annealed particle filter algorithm takes 20 seconds per frame. However, the annealed particle filter was not capable of tracking the body pose with such a low number of particles at 30 fps and 15
Fig. 5. GPAPF algorithm with (a) and without (b) additional annealed layer
Fig. 6. Tracking results of annealed particle filter tracker and GPAPF tracker. Sample frames from the running, leg movements and object lifting sequences.
fps. Therefore, we had to increase the number of particles used in the annealed particle filter to 500.
5.1 Motion Classification
The classification algorithm was tested on two different data sets. The first set contained 3 different activities: (1) lifting an object, kicking with (2) the left and (3) the right leg. For each activity 5 different sequences were captured. We have used one sequence for each motion type in order to construct the models. The latent space was learned based on the poses in these models (Fig. 1.a). The latent space had a clear and very distinguishable separation between these 3 actions. Therefore, although the results of the tracker contained much noise as shown in Fig. 7, the algorithm was able to perform perfect classification. The second set contained 5 different activities: (1) hand waving, (2) lifting an object, (3) kicking, (4) sitting down, and (5) punching. Once again 5 different sequences were captured for each activity. The cross validation procedure was used to classify the sequences (see Fig. 1.b). The accuracies of the classification,
Fig. 7. Tracking trajectories in the latent space for different activities: (a) lifting an object, kicking with (b) the left and (c) the right leg. In each image the black lines represent incorrect activities, the red line represents the correct one, and the other colored lines represent the trajectories produced by the GPAPF tracker.
Table 1. The accuracies of the classification for 5 different activities: hand waving, object lifting, kicking, sitting down, and punching. The rows represent the correct motion type; the columns represent the classification results.

                 Hand waving  Object lifting  Kicking  Sitting down  Punching
Hand waving           15            0            0           0           5
Object lifting         0           17            0           3           0
Kicking                0            0           20           0           0
Sitting down           0            3            1          16           0
Punching               6            0            0           0          14
as shown in Table 1, are 75, 85, 100, 80 and 70 percent for activities (1)-(5), respectively. The lower classification rates of the actions involving hand gestures are due to the similarity between these actions. The lower classification rates of the sitting down and object lifting actions are due to strong self-occlusions, which caused the tracker to estimate the actual poses incorrectly.
6
Conclusion and Future Work
In this paper we have introduced an approach to articulated body tracking and human motion classification using a low-dimensional latent space. The latent space is constructed from pose samples of different motion types. The tracker generates trajectories in the latent space, which are classified using the Fréchet distance. An interesting problem that has not yet been solved is the classification of interactions between multiple actors. The main difficulty is constructing the latent space: while a single person's poses can be described in a low-dimensional space, this may not be the case for multiple people.
References 1. Christoudias, C.M., Darrell, T.: On modelling nonlinear shape-and-texture appearance manifolds. In: Proc. CVPR, vol. 2, pp. 1067–1074 (2005) 2. Elgammal, A., Lee, C.: Inferring 3d body pose from silhouettes using activity manifold learning. In: Proc. CVPR, vol. 2, pp. 681–688 (2004) 3. Lawrence, N.: Gaussian process latent variable models for visualization of high dimensional data. In: NIPS. Information Processing Systems, vol. 16, pp. 329–336 (2004) 4. Wang, J., Fleet, D., Hetzmann, A.: Gaussian process dynamical models. In: NIPS. Information Processing Systems, pp. 1441–1448 (2005) 5. Deutscher, J., Blake, A., Reid, I.: Articulated body motion capture by annealed particle filtering. In: Proc. CVPR, pp. 2126–2133 (2000) 6. Alt, H., Knauer, C., Wenk, C.: Matching polygonal curves with respect to the fr`echet distance. In: Ferreira, A., Reichel, H. (eds.) STACS 2001. LNCS, vol. 2010, pp. 63–74. Springer, Heidelberg (2001) 7. Isard, M., Blake, A.: Condensation - conditional density propagation for visual tracking. International Journal of Computer Vision 29(1), 5–28 (1998) 8. Sidenbladh, H., Black, M., Fleet, D.: Stochastic tracking of 3d human figures using 2d image motion. In: Vernon, D. (ed.) ECCV 2000. LNCS, vol. 1842, pp. 702–718. Springer, Heidelberg (2000) 9. Mikolajczyk, K., Schmid, K., Zisserman, A.: Human detection based on a probabilistic assembly of robust part detectors. In: Proc. ECCV, vol. 1, pp. 69–82 (2003) 10. Rohr, K.: Human movement analysis based on explicit motion models. MotionBased Recognition 8, 171–198 (1997) 11. Wang, Q., Xu, G., Ai, H.: Learning object intrinsic structure for robust visual tracking. In: Proc. CVPR, vol. 2, pp. 227–233 (2003) 12. Tenenbaum, J., de Silva, V.: A global geometric framework for nonlinear dimensionality reduction. Science 290, 2319–2323 (2000) 13. Roweis, S., Saul, L.: Nonlinear dimensionality reduction by locally linear embedding. Science 290, 2323–2326 (2000) 14. Urtasun, R., Fleet, D., Fua, P.: 3d people tracking with gaussian process dynamical models. In: Proc. CVPR, vol. 1, pp. 238–245 (2006) 15. Sato, K., Aggarwal, J.: Recognizing two-person interactions in outdoor image sequences. In: IEEE Workshop on Multi-Object Tracking, IEEE Computer Society Press, Los Alamitos (2001) 16. Park, S., Aggrawal, J.: Recognition of human interactions using multiple features in a grayscale images. In: Proc. ICPR, vol. 1, pp. 51–54 (2000) 17. Hongeng, S., Bremond, F., Nevatia, R.: Representation and optimal recognition of human activities. In: Proc. CVPR, vol. 1, pp. 818–825 (2000) 18. Park, J., Park, S., Aggrawal, J.: Video retrieval of human interactions using modelbased motion tracking and multi-layer finite state automata. In: Bakker, E.M., Lew, M.S., Huang, T.S., Sebe, N., Zhou, X.S. (eds.) CIVR 2003. LNCS, vol. 2728, Springer, Heidelberg (2003) 19. Deutscher, J., Reid, I.: Articulated body motion capture by stochastic search. International Journal of Computer Vision 61(2), 185–205 (2004) 20. Balan, A., Sigal, L., Black, M.: A quantitative evaluation of video-based 3d person tracking. In: VS-PETS. IEEE Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 349–356. IEEE Computer Society Press, Los Alamitos (2005)
Gait Identification Based on Multi-view Observations Using Omnidirectional Camera Kazushige Sugiura, Yasushi Makihara, and Yasushi Yagi Osaka University 8-1 Mihogaoka, Ibaraki, Osaka, 567-0047, Japan {sugiura,makihara,yagi}@am.sanken.osaka-u.ac.jp
Abstract. We propose a method of gait identification based on multi-view gait images captured with an omnidirectional camera. We first transform omnidirectional silhouette images into panoramic ones and obtain a spatio-temporal Gait Silhouette Volume (GSV). Next, we extract frequency-domain features by Fourier analysis based on gait periods estimated by autocorrelation of the GSVs. Because the omnidirectional camera makes it possible to observe a straight-walking person from various views, multi-view features can be extracted from the GSVs composed of multi-view images. In the identification phase, the distance between a probe and a gallery feature of the same view is calculated, and the distances for all views are then integrated for matching. Gait identification experiments including 15 subjects observed from 5 views demonstrate the effectiveness of the proposed method.
1
Introduction
There is a growing need in modern society to identify individuals in many situations, including surveillance and access control. For personal identification, many biometrics-based authentication methods have been proposed using a wide variety of cues: fingerprint, iris, face, and gait. Among these, gait identification has recently gained considerable attention because gait promises to enable surveillance systems to ascertain identity at a distance. Many gait identification approaches have been proposed, both model-based [1][2] and appearance-based [3][4]. One of the difficulties facing these approaches is the appearance change due to changes in viewing or walking direction. Yu et al. [5] discussed the effects of view angle variation on gait identification and reported a performance drop when the view difference is large. To cope with view changes, Kale et al. [6] proposed a view transformation method based on perspective projection of the sagittal plane; the method does not, however, work well when the view difference is large. Shakhnarovich et al. [7] proposed a visual hull-based method, but it needs multiple-view synchronized images for all subjects. As a training-based method, a View Transformation Model (VTM) in the frequency domain was proposed [8]. Once the VTM is trained using sets of gait features of multiple views and subjects, a few-view reference can be transformed
into an arbitrary-view gallery so as to match a probe view. It was also reported that the verification rate increases as the number of reference views increases [9]. Moreover, a method of multi-view gait identification using walking direction changes within a sequence was proposed [10], and the verification rate was reported to increase as the number of walking directions increases. It is, however, troublesome to capture gait images many times to acquire many references in the registration phase. In addition, it is unreasonable to assume that subjects always change their walking directions enough for multi-view identification. Therefore, we propose a method of gait identification based on multi-view observations from an omnidirectional camera. Note that an omnidirectional camera makes it possible to observe multi-view gait images even if the subject walks straight. Observation views are estimated from the azimuth angles of the tracked person regions in the omnidirectional image and the walking trajectory on the floor. Then, for each gallery and probe sequence, silhouette-based gait features are extracted for multiple basis views that are common to both the gallery and the probe. Finally, the extracted multi-view gait features are matched view by view, and the matching results are integrated for better identification. The outline of this paper is as follows. First, the construction of a Gait Silhouette Volume (GSV) is addressed, with silhouette extraction and panoramic expansion, in Section 2. Next, extraction and matching of multi-view frequency-domain gait features are described in Section 3. Finally, experimental results for gait identification are presented with an analysis of the effect of view variation in Section 4. Section 5 contains conclusions and a discussion of further work.
2 GSV Construction
2.1 Extraction of Gait Silhouette Images
The first step in constructing a GSV is to extract gait silhouette images from the omnidirectional images by background subtraction. First, the background is modeled by the average color vector u(x, y) and its covariance matrix Σ(x, y) at each position (x, y), using a background image sequence, as

$u(x,y) = \frac{1}{N}\sum_{n=1}^{N} u(x,y,n)$   (1)

$\Sigma(x,y) = \frac{1}{N}\sum_{n=1}^{N} u(x,y,n)\,u(x,y,n)^{T} - u(x,y)\,u(x,y)^{T},$   (2)

where u(x, y, n) is the background color vector at position (x, y) in the nth frame, and N is the total number of frames in the background training sequence. Second, to extract foreground regions, the Mahalanobis distance D(x, y, n) between an input image c(x, y, n) and the modeled background is calculated at each position (x, y) in each nth frame as

$d(x,y,n) = c(x,y,n) - u(x,y)$   (3)

$D(x,y,n) = \sqrt{\,d(x,y,n)^{T}\,\Sigma(x,y)^{-1}\,d(x,y,n)\,}.$   (4)
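A vectorized sketch of eqs. (1)-(4) is given below; the regularization term, array layout, and function names are our assumptions, and shadow removal and morphological filtering are not included.

```python
import numpy as np

def learn_background(frames):
    """frames: (N, H, W, 3) float array of background images.
    Returns the per-pixel mean color and inverse covariance, eqs. (1)-(2)."""
    u = frames.mean(axis=0)                                  # (H, W, 3)
    d = frames - u
    # Per-pixel 3x3 covariance: E[cc^T] - u u^T, via the centered samples.
    cov = np.einsum('nhwi,nhwj->hwij', d, d) / frames.shape[0]
    cov += 1e-6 * np.eye(3)                                  # regularize for inversion
    return u, np.linalg.inv(cov)

def foreground_mask(image, u, cov_inv, thresh=12.0):
    """Per-pixel Mahalanobis distance, eqs. (3)-(4), thresholded at Dthresh."""
    d = image - u                                            # (H, W, 3)
    D2 = np.einsum('hwi,hwij,hwj->hw', d, cov_inv, d)
    return np.sqrt(np.maximum(D2, 0.0)) > thresh
```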
A foreground region is defined as the set of pixels whose Mahalanobis distance D(x, y, n) is larger than a threshold Dthresh. Here, the threshold Dthresh is set to 12.0 empirically. Figure 1 shows an input image and the result of background subtraction; the person region is extracted correctly. Background subtraction, however, sometimes fails because of cast shadows and changes in illumination. To overcome these difficulties, shadow removal is performed based on the color vector angle between background and foreground, and a morphological closing filter is applied to improve silhouette quality.
2.2 Panorama Extension
The second step is panorama extension of the silhouettes in the omnidirectional image [11]. Let P(X, Y, Z) be a point in the world coordinate system and p(x, y) the point in the omnidirectional image onto which P is projected. Let ρ and Z be the azimuth angle and the vertical position in a cylindrical coordinate system whose central axis passes through the mirror focal point Om and the camera center Oc, and whose radius is RP. The panorama extension is then expressed as

$\tan\rho = Y/X = y/x$   (5)

$Z = R_P \tan\alpha + c,$   (6)

where $\alpha = \tan^{-1}\frac{(b^2+c^2)\sin\gamma - 2bc}{(b^2-c^2)\cos\gamma}$ and $\gamma = \tan^{-1}\frac{f}{\sqrt{x^2+y^2}}$ are the viewing directions defined in Fig. 2, respectively, and b and c are the mirror parameters.
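The mapping of eqs. (5)-(6) can be written as in the following sketch; it maps a single omnidirectional image point to cylindrical coordinates and leaves out the resampling of the full panoramic image, and all names are illustrative.

```python
import numpy as np

def panorama_extension(x, y, b, c, f, R_P):
    """Map an omnidirectional image point (x, y) to cylindrical coordinates
    (rho, Z) on a cylinder of radius R_P, following eqs. (5)-(6).
    b and c are the hyperboloidal mirror parameters and f the focal length."""
    rho = np.arctan2(y, x)                                   # azimuth angle, eq. (5)
    gamma = np.arctan2(f, np.sqrt(x**2 + y**2))              # viewing direction to the pixel
    alpha = np.arctan2((b**2 + c**2) * np.sin(gamma) - 2.0 * b * c,
                       (b**2 - c**2) * np.cos(gamma))        # viewing direction from the mirror
    Z = R_P * np.tan(alpha) + c                              # cylinder height, eq. (6)
    return rho, Z
```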
2.3 Scaling and Registration of Silhouette Images
The third step is scaling and registration of the panoramic silhouettes to acquire normalized gait patterns.
(a) Input image with omnidirectional camera
(b) Background subtraction
Fig. 1. Result of background subtraction
Fig. 2. Projection to cylindrical surface and floor surface
Fig. 3. Definition of the person region for scaling and registration: (a) omnidirectional image coordinates; (b) panoramic image coordinates
First, person regions are tracked in the omnidirectional image by considering the connected regions' area sizes and the position differences between adjacent frames. Next, in order to normalize the silhouette by the person region height, the maximum radius (head point) rmax and the minimum radius (foot point) rmin of the person region in the polar coordinates (r, ρ) of the omnidirectional image are found (see Fig. 3(a)). Then, in order to register the horizontal position, the median azimuth angle ρmed of the person region is found (see Fig. 3(a)). Note that the radius and azimuth angle correspond to the vertical and horizontal positions in the panoramic image, respectively. As a result, the head position, foot position, and horizontal center in the panoramic image are represented by Zmax, Zmin, and ρmed, as shown in Fig. 3(b). Second, the silhouette images are scaled so that the height (Zmax − Zmin) in the panoramic image becomes exactly 30 pixels while the aspect ratio of each region is kept. Then, we produce a 20 × 30 pixel image in which the horizontal median ρmed corresponds to the horizontal center of the produced
(a) front-oblique
(b) fronto-parallel
(c) rear-oblique
(d) Definition of observation view
Fig. 4. GSV examples for multiple observation views
image. A GSV is finally constructed by aligning the images on the temporal axis. Figure 4 shows GSV examples for multiple observation views. We can clearly see appearance changes in each view.
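The scaling and registration step could look like the sketch below, which resizes the cropped panoramic silhouette to a height of 30 pixels and pastes it into a 20 × 30 frame centred on the median azimuth column; the coordinate conventions and border handling are simplified assumptions, not the authors' implementation.

```python
import cv2
import numpy as np

def normalize_silhouette(panorama_sil, z_min, z_max, rho_med_col,
                         out_w=20, out_h=30):
    """Scale the silhouette rows [z_min, z_max] so the person height becomes
    out_h pixels (aspect ratio kept) and paste it into an out_h x out_w frame
    centred on the median azimuth column rho_med_col."""
    region = panorama_sil[z_min:z_max + 1, :]
    scale = out_h / float(region.shape[0])
    resized = cv2.resize(region, None, fx=scale, fy=scale,
                         interpolation=cv2.INTER_NEAREST)
    frame = np.zeros((out_h, out_w), dtype=panorama_sil.dtype)
    centre = int(round(rho_med_col * scale))
    left = centre - out_w // 2
    # Clip the horizontal crop to the resized image borders.
    x0, x1 = max(left, 0), min(left + out_w, resized.shape[1])
    frame[:resized.shape[0], (x0 - left):(x0 - left) + (x1 - x0)] = resized[:, x0:x1]
    return frame
```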
3 Multi-view Feature Extraction
3.1 Frequency-Domain Feature Extraction
The second step in the proposed method is frequency-domain feature extraction from the constructed GSV. First, the gait period Ngait is detected by maximizing the normalized autocorrelation

$C(N) = \frac{\sum_{x,y}\sum_{n=0}^{T(N)} g(x,y,n)\,g(x,y,n+N)}{\sqrt{\sum_{x,y}\sum_{n=0}^{T(N)} g(x,y,n)^{2}\;\sum_{x,y}\sum_{n=0}^{T(N)} g(x,y,n+N)^{2}}}$   (7)

of the GSV g(x, y, n) over an N-frame shift along the temporal axis, where Ntotal is the total number of frames in the sequence and T(N) = Ntotal − N − 1 is the number of overlapped frames. The domain of N is set to [25, 45] empirically for natural gait periods; various gait types such as running, brisk walking, and 'ox walking' are not within the scope of this paper. For the autocorrelation-based period detection, adjacent gait-period sequences need to be similar to each other; we assume that the walker's trajectory is smooth to some extent and that appearance changes between adjacent gait-period sequences are small. Next, a subsequence S_ns is picked from the complete sequence S; the frame range of the subsequence S_ns is [ns, ns + Ngait − 1]. A Discrete Fourier Transform (DFT) Gns(x, y, k) along the temporal axis is then applied to the subsequence, and the amplitude spectra Ans(x, y, k) are calculated as

$G_{n_s}(x,y,k) = \sum_{n=n_s}^{n_s+N_{gait}-1} g(x,y,n)\,e^{-j\omega_0 k n}$   (8)

$A_{n_s}(x,y,k) = \frac{1}{N_{gait}}\,|G_{n_s}(x,y,k)|,$   (9)

where ω0 is the base angular frequency for the gait period Ngait. In this paper, the direct-current elements (k = 0; the averaged silhouette) and the low-frequency elements (k = 1, 2) are chosen as gait features. Let a be the feature vector composed of the elements of the amplitude spectra A(x, y, k). As a result, the dimension of the feature vector a is 20 × 30 × 3 = 1800.
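A compact sketch of the period detection of eq. (7) and the frequency features of eqs. (8)-(9) is given below; it uses an FFT along the temporal axis, whose amplitudes coincide with eq. (8) up to an irrelevant phase offset, and the array shapes and function names are assumptions.

```python
import numpy as np

def gait_period(gsv, n_min=25, n_max=45):
    """Detect the gait period by maximizing the normalized autocorrelation
    C(N) of the GSV over temporal shifts N in [n_min, n_max], eq. (7).
    gsv: (T, 30, 20) array of normalized silhouettes."""
    T = gsv.shape[0]
    best_n, best_c = n_min, -1.0
    for N in range(n_min, n_max + 1):
        a, b = gsv[:T - N], gsv[N:]
        c = (a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum() + 1e-12)
        if c > best_c:
            best_n, best_c = N, c
    return best_n

def frequency_features(gsv, n_start, n_gait, ks=(0, 1, 2)):
    """Amplitude spectra of the temporal DFT over one gait period,
    eqs. (8)-(9); concatenating k = 0, 1, 2 gives a 20 x 30 x 3 = 1800-D vector."""
    sub = gsv[n_start:n_start + n_gait]                      # (N_gait, 30, 20)
    G = np.fft.fft(sub, axis=0)                              # DFT along the temporal axis
    A = np.abs(G[list(ks)]) / n_gait                         # eq. (9)
    return A.reshape(-1)
```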
3.2 Observation View Estimation
In this section, observation view estimation for multi-view feature extraction is addressed. The observation view θ is defined as

$\theta = (180 - \phi) + \rho,$   (10)

where ρ is the azimuth angle and φ is the walking direction (see Fig. 4(d)). The azimuth angle ρ is simply defined as the direction of the vector (x, y), where (x, y) is the foot point in the omnidirectional image. The walking direction φ is estimated from the trajectory of the subject's foot points F(X, Y) in the floor coordinate system. Let (Rf, ρ) be polar coordinates on the floor. If the floor plane is regarded as an image plane, the distance Hr from the mirror focal point Om to the floor can be seen as the focal length to the floor image plane. The radius Rf is then calculated as follows [11]:

$R_f = \frac{-(b^2-c^2)\,H_r\,r_f}{(b^2+c^2)f - 2bc\sqrt{r_f^2+f^2}}$   (11)

Thus, the walking trajectory on the floor is obtained as a time series of the floor points (Rf, ρ). Next, the walking direction φ is defined as the tangential direction of the estimated walking trajectory. Let (Xn, Yn) and (VXn, VYn) be the foot point's position and velocity at the nth frame. The velocity is obtained by central differences as

$V_{X_n} = \frac{X_{n+\Delta n} - X_{n-\Delta n}}{2\Delta n},\quad V_{Y_n} = \frac{Y_{n+\Delta n} - Y_{n-\Delta n}}{2\Delta n}$   (12)

Here, Δn is set to 15 frames considering velocity smoothness. Finally, the walking direction φn at the nth frame is defined as the direction of the velocity vector (VXn, VYn).
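The view estimation of eqs. (10) and (12) can be sketched as follows, given a foot-point trajectory already projected onto the floor via eq. (11); the angle conventions and the handling of the first and last Δn frames are our simplifications.

```python
import numpy as np

def observation_views(floor_points, delta_n=15):
    """Observation view per frame from foot points on the floor, eqs. (10)-(12).
    floor_points: (T, 2) array of (X, Y) positions in the floor coordinate system."""
    X, Y = floor_points[:, 0], floor_points[:, 1]
    views = np.full(len(X), np.nan)
    for n in range(delta_n, len(X) - delta_n):
        vx = (X[n + delta_n] - X[n - delta_n]) / (2.0 * delta_n)   # eq. (12)
        vy = (Y[n + delta_n] - Y[n - delta_n]) / (2.0 * delta_n)
        phi = np.degrees(np.arctan2(vy, vx))                        # walking direction
        rho = np.degrees(np.arctan2(Y[n], X[n]))                    # azimuth angle
        views[n] = ((180.0 - phi) + rho) % 360.0                    # eq. (10)
    return views
```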
3.3 Multi-view Feature Extraction
In this section, multi-view feature extraction based on the estimated observation views is introduced. First, multiple basis views θi (i = 1, 2, ...) are chosen from the observation views; the interval between basis views is set to 15 deg empirically. Next, a basis frame nθi corresponding to a basis view θi is found in the complete sequence, and a subsequence is picked up as the set of Ngait frames around the basis frame nθi, as shown in Fig. 5(a). Concretely, the start frame ns in eq. (9) is replaced by ns = nθi − Ngait/2.
(a) Overview of multi-view feature extraction
(b) Multi-view features for each subject (every 15 deg)
Fig. 5. Multi-view feature extraction
Results of multi-view feature extraction for multiple subjects are shown in Fig. 5(b). In this figure, each block corresponds to one subject, and the rows and columns indicate observation view and frequency, respectively. We can see individual differences, for example, the difference in swing motion between subjects 2 and 4 in the double-frequency component of the 270-deg features. In addition, we can also see view differences for each subject. Thus, by integrating the different types of features across views, gait identification performance should improve compared with the single-view case. The next section describes how to match the multi-view features.
3.4 Matching Features
A matching measure between two subsequences must first be defined. Let S^P and S^G be complete sequences for the probe and the gallery, respectively, and let S^P_θi and S^G_θi be their subsequences for basis view θi. Also let a(S_θi) be the feature vector for subsequence S_θi. The matching measure is simply chosen as the Euclidean distance $d(S^{P}_{\theta_i}, S^{G}_{\theta_i}) = \|a(S^{P}_{\theta_i}) - a(S^{G}_{\theta_i})\|$. Complete sequences have variations in general and may contain outliers. Because the median is robust to such noise, the measure between complete sequences is defined as the median of the per-view results:

$D(S^{P}, S^{G}) = \mathrm{Median}_i\,\{\,d(S^{P}_{\theta_i}, S^{G}_{\theta_i})\,\}$   (13)
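The fusion of eq. (13) reduces to a median over per-view Euclidean distances, as in the following sketch; the dictionary-based data layout and function names are assumptions.

```python
import numpy as np

def sequence_distance(probe_features, gallery_features):
    """Fuse per-view feature distances by their median, eq. (13).
    Both arguments map a basis view (deg) to its 1800-D feature vector;
    only the views common to probe and gallery are compared."""
    common = sorted(set(probe_features) & set(gallery_features))
    d = [np.linalg.norm(probe_features[v] - gallery_features[v]) for v in common]
    return float(np.median(d))

def identify(probe_features, gallery):
    """Return the gallery subject with the smallest fused distance.
    gallery: dict mapping subject id -> dict of per-view feature vectors."""
    scores = {sid: sequence_distance(probe_features, feats)
              for sid, feats in gallery.items()}
    return min(scores, key=scores.get)
```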
4 Experiment
4.1 Datasets
A total of 60 gait sequences from 15 subjects were used for the experiments. Each sequence consisted of approximately 10 steps of a straight walk in front of the omnidirectional camera, and it included 5 basis views: 240, 255, 270,
285, and 300 deg. The camera was a Sony DCR-VX2000, and images were captured at a size of 720 × 480 pixels at 30 fps. The hyperboloidal mirror and camera parameters were a = 13.722, b = 11.708, c = 18.038, f = 427.944 (unit: mm). The dataset was captured over two days, with two sequences per day for each subject. A test set is composed of one gallery sequence from one day and two probe sequences from the other day; therefore, four combinations of test sets were generated in total.
4.2 Results
The gait identification experiments were performed for the above four combinations of datasets, and the average performance was evaluated by Receiver Operating Characteristic (ROC) curves [12]. An ROC curve shows the relation between the verification rate PV and the false alarm rate PF as the acceptance threshold is varied. A curve closer to the top-left corner of the graph indicates higher performance, i.e., a high verification rate at a low false alarm rate. In addition, the effect of the number of observation views and of the view combinations on performance is analyzed to validate the effectiveness of multi-view observations. First, ROC curves for each single view are shown in Fig. 6(a). The figure shows that performance varies greatly among basis views and that it is difficult to obtain sufficient performance when an arbitrary single-view feature is used for matching. Next, ROC curves for two-view combinations are shown in Fig. 6(b), where the best and the worst three combinations are plotted. The performance order is judged by the Equal Error Rate (EER), i.e., the error rate at which the false alarm rate PF equals the false rejection rate (1 − PV). In the worst cases, the view differences are small (within 15 deg except for the worst 2), whereas in the best cases the view differences are relatively large (more than 30 deg). It is therefore clear that combinations with a large view difference are effective for identification. Moreover, ROC curves for each number of observation views are shown in Fig. 6(c); the verification rates in this graph are averaged over all combinations for each number of observation views. We can see that performance improves as the number of observation views increases. Finally, the verification rates at a 3% false alarm rate are picked out for each number of observation views. Figure 6(d) shows the best, the worst, and the average performance over all combinations. For the best combinations, the verification rate already reaches its highest value with two observation views; thus a small number of observation views is enough when the combination can be chosen appropriately. For the worst combinations, the verification rate improves steadily as the number of observation views increases: because the worst combinations are usually composed of adjacent views (as seen in the two-view case), increasing the number of observation views directly increases the view variation. In summary, it is validated that observation view variation greatly contributes to performance improvement.
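The ROC and EER quantities used here can be computed from genuine and impostor distance lists as in the sketch below; this is a generic evaluation routine, not the protocol code used for the experiments, and all names are illustrative.

```python
import numpy as np

def roc_points(genuine, impostor):
    """Verification rate P_V and false alarm rate P_F over a sweep of
    acceptance thresholds, given genuine and impostor distance lists."""
    genuine, impostor = np.sort(genuine), np.sort(impostor)
    thresholds = np.unique(np.concatenate([genuine, impostor]))
    pv = np.array([(genuine <= t).mean() for t in thresholds])
    pf = np.array([(impostor <= t).mean() for t in thresholds])
    return pf, pv

def equal_error_rate(genuine, impostor):
    """EER: the error rate where the false alarm rate equals the
    false rejection rate (1 - P_V)."""
    pf, pv = roc_points(genuine, impostor)
    frr = 1.0 - pv
    i = int(np.argmin(np.abs(pf - frr)))
    return (pf[i] + frr[i]) / 2.0
```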
(a) ROC curves for single views
(b) ROC curves for two views
(c) ROC curves for each number of views
(d) Verification rate at 3% false alarm for each number of views
Fig. 6. Experimental results
5
Conclusion
This paper has described a method of gait identification based on multi-view gait images using an omnidirectional camera. The omnidirectional silhouette images are first transformed into panoramic ones, and a spatio-temporal Gait Silhouette Volume (GSV) is obtained. Next, frequency-domain features are extracted by Fourier analysis. Because the omnidirectional camera makes it possible to observe a person from various views, multi-view features can be extracted from the GSVs composed of multi-view images. In the identification phase, the distance between a probe and a gallery feature of the same view is calculated, and the distances for all views are then integrated for matching. The effect of observation view variation on gait identification performance was analyzed through experiments including 15 subjects observed from 5 views. As a result, the average performance increases from 82% (single view) to 93% (5 views), and it is clear that observation view variation contributes to gait identification performance. In this paper, basis views are chosen only from the views common to a gallery and a probe; in future work it would be possible to use other-view features interpolated by the View Transformation Model (VTM) [8] for better performance. Moreover, the subjects in this experiment walked within 5 m of the omnidirectional camera, so relatively high-resolution silhouettes (approximately 60 pixels in height) were obtained. The effects of the distance from the camera, or of silhouette resolution, on identification performance should therefore be analyzed. That also leads to an analysis of the optimal placement of the omnidirectional camera to capture multi-view
gait images effectively, considering both silhouette resolution and observation view variation.
References

1. Urtasun, R., Fua, P.: 3D tracking for gait characterization and recognition. In: Proc. of the 6th IEEE Int. Conf. on Automatic Face and Gesture Recognition, pp. 17–22. IEEE Computer Society Press, Los Alamitos (2004)
2. Yam, C., Nixon, M., Carter, J.: Automated person recognition by walking and running via model-based approaches. Pattern Recognition 37(5), 1057–1072 (2004)
3. Sarkar, S., Phillips, J., Liu, Z., Vega, I., Grother, P., Bowyer, K.: The HumanID gait challenge problem: Data sets, performance, and analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence 27(2), 162–177 (2005)
4. Han, J., Bhanu, B.: Individual recognition using gait energy image. IEEE Trans. on Pattern Analysis and Machine Intelligence 28(2), 316–322 (2006)
5. Yu, S., Tan, D., Tan, T.: Modelling the effect of view angle variation on appearance-based gait recognition. In: Proc. of the 7th Asian Conf. on Computer Vision, vol. 1, pp. 807–816 (2006)
6. Kale, A., Roy-Chowdhury, A., Chellappa, R.: Towards a view invariant gait recognition algorithm. In: Proc. of IEEE Conf. on Advanced Video and Signal Based Surveillance, pp. 143–150. IEEE Computer Society Press, Los Alamitos (2003)
7. Shakhnarovich, G., Lee, L., Darrell, T.: Integrated face and gait recognition from multiple views. In: Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, vol. 1, pp. 439–446 (2001)
8. Makihara, Y., Sagawa, R., Mukaigawa, Y., Echigo, T., Yagi, Y.: Gait recognition using a view transformation model in the frequency domain. In: Proc. of the 9th European Conf. on Computer Vision, Graz, Austria, vol. 3, pp. 151–163 (2006)
9. Makihara, Y., Sagawa, R., Mukaigawa, Y., Echigo, T., Yagi, Y.: Which reference view is effective for gait identification using a view transformation model? In: Proc. of the IEEE Computer Society Workshop on Biometrics 2006, New York, USA (2006)
10. Makihara, Y., Sagawa, R., Mukaigawa, Y., Echigo, T., Yagi, Y.: Adaptation to walking direction changes for gait identification. In: Proc. of the 18th Int. Conf. on Pattern Recognition, Hong Kong, China, vol. 2, pp. 96–99 (2006)
11. Yamazawa, K., Yagi, Y., Yachida, M.: HyperOmni Vision: Visual navigation with an omnidirectional image sensor. Systems and Computers in Japan 28(4), 36–47 (1997)
12. Phillips, P., Moon, H., Rizvi, S., Rauss, P.: The FERET evaluation methodology for face-recognition algorithms. IEEE Trans. on Pattern Analysis and Machine Intelligence 22(10), 1090–1104 (2000)
Gender Classification Based on Fusion of Multi-view Gait Sequences Guochang Huang and Yunhong Wang Intelligent Recognition and Image Processing Lab, School of Computer Science and Engineering, Beihang University, Beijing 100083, China gc
[email protected],
[email protected]
Abstract. In this paper, we present a new method for gender classification based on the fusion of multi-view gait sequences. For each silhouette of a gait sequence, we first use a simple method to divide the silhouette into 7 (for the 90-degree, i.e., fronto-parallel view) or 5 (for the 0- and 180-degree, i.e., front and back views) parts, and then fit an ellipse to each of the regions. Next, features are extracted from each sequence by computing the ellipse parameters. For each view angle, every subject's features are normalized and combined into a feature vector. The combined feature vector contains enough information to perform well on gender recognition. The sum rule and an SVM are applied to fuse the similarity measures from 0°, 90°, and 180°. We carried out our experiments on the CASIA Gait Database, one of the largest gait databases known to us, and achieved a classification accuracy of 89.5%.
1 Introduction
Gait is an attractive biometric feature for human recognition and classification. In recent years, gait has received more and more attention from computer vision and biometrics researchers. Compared with other biometric features, such as fingerprint, face, and iris, gait has many desirable qualities: it is non-invasive, can be captured at a distance, and is hard for the subject to perceive. Gait analysis therefore plays an important role in surveillance. Gait analysis mainly consists of two areas. One is gait recognition, which identifies a subject's ID in environments where other biometrics are very difficult to capture. The other is gait classification, including gender recognition, action classification, and age estimation. Gender classification in particular has attracted much attention recently [2,4] because of its wide range of potential applications. Research on gait recognition and gait classification has a long history, but most articles on gait analysis focus on human recognition; articles on how to use gait to classify gender are few. In complex real surveillance scenarios, describing a subject's attributes, such as gender and age, is very important and necessary, because in these environments it is very difficult or even impossible to capture other features that could correctly identify the subject's ID.
The remainder of this paper is organized as follows. Section 2 summarizes related work on gender recognition. Gait representation and feature extraction are described in Section 3. Section 4 presents the classification and fusion scheme. Experiments and results are reported in Section 5, followed by the conclusions in Section 6.
2 Related Work
Point-light displays were an important tool for studies of walking manner and received much attention during the past few decades. Kozlowski and Cutting [3] were the first researchers to study gender recognition from human walking manner; they demonstrated that observers are able to recognize the gender of point-light walkers. Barclay et al. [1] conducted further research on gender recognition and investigated the influence of spatial and temporal factors on the correct rate. Their results show that at least two gait cycles are necessary for successful gender recognition, and that the speed of the walker also has a great influence on classification accuracy. In their experiments, the highest recognition accuracy was 68%. Most of these studies presented walkers to observers from the side view, while some experiments examined the effect of the view angle on gender recognition [1,8]; it was found that the front-view presentation contains more information than the side view for gender recognition [8]. In our method, we combine the front-view, back-view, and side-view presentations of gait, which leads to a higher correct rate than before. Troje [8] recently used linear pattern recognition techniques to analyze biological motion and presented a two-stage PCA framework for recognizing gender, reporting a 92.5% recognition rate. Davis and Gao [2] used a three-mode PCA model of point-light walkers for gender recognition; with 40 walkers, their best recognition rate was 95.5%. However, in real surveillance environments it is very difficult to attach small point lights to the main joints of a subject's body, so whether this approach would perform well on a video-based gait database is unknown. Most of the aforementioned studies used point-light displays to represent biological motion, which, as mentioned above, is a fatal limitation in surveillance environments. Lee and Grimson [5,4] recently proposed a computer vision algorithm to extract visual gait features from image sequences for gender recognition. Their experiments were performed on a database of 24 subjects and the approach achieved a recognition rate of 84%.
3 Gait Representation
In our method, three view angles are chosen: 0° (front view), 90° (fronto-parallel view), and 180° (back view). Many studies have demonstrated that the front-view presentation contains more information for gender classification, and some research [11] shows that fusing gait sequences with an angle difference
greater than or equal to 90° achieves a larger improvement than fusing sequences with an acute angle difference. Therefore, 0°, 90°, and 180° are chosen for gender classification in our method. First of all, we assume that silhouettes have been extracted from the original video files. The extracted silhouette sequences are then normalized to the same size, so that all silhouettes have the same height. The horizontal center of each normalized silhouette is obtained, and all normalized silhouettes are aligned according to the horizontal center.
Fig. 1. Examples of normalized silhouettes at different view angles from the same subject. The top row shows the 0° normalized silhouettes, the middle row the 90°, and the bottom row the 180°.
The gait image representation method was first proposed by Lee [5], but in Lee's thesis it was only applied to 90° gait images. In this paper, we apply the method to divide the 0° and 180° images into five parts, and the experimental results demonstrate that this segmentation of the 0° and 180° images is effective for gender classification. For the gait images at 0° and 180°, we proportionally divide the silhouette into 5 parts, as shown in Fig. 2(a). These 5 regions roughly correspond to: $Z_1$, head region; $Z_2$, left of torso; $Z_3$, right of torso; $Z_4$, left leg; $Z_5$, right leg. For the gait images at 90°, we proportionally divide the silhouette into 7 parts, as shown in Fig. 3(a). These 7 regions roughly correspond to: $N_1$, head region; $N_2$, front of torso; $N_3$, back of torso; $N_4$, front thigh; $N_5$, back thigh; $N_6$, front foot; $N_7$, back foot. For each of the parts, we fit an ellipse to the foreground in the region, as shown in Fig. 2(b) and Fig. 3(b). The intuition behind the segmentation of the 0° and 180° silhouettes is that men show a larger extent of lateral sway of the upper body than women do, and the orientation of the major axis can reflect this phenomenon; the difference in the shoulder-hip ratio between men and women can also be reflected by the orientation of the major axis. To facilitate the description of each part of the silhouette, we divide the 90° silhouettes into 7 regions, each of which roughly corresponds to one part of the human body. For each ellipse fitted to these regions, we compute four parameters: the centroid $(\bar{X}, \bar{Y})$, the elongation of the ellipse ($L$), and the orientation of the
Fig. 2. Example of the 0 and 180 degree silhouette which is divided into 5 regions, and five ellipses are fitted to these regions
Fig. 3. Example of the 90 degree silhouette which is divided into 7 regions, and seven ellipses are fitted to these regions
major axis ($\alpha$). The details of how to calculate these four parameters are given in [4] and are not repeated here. The 4 parameters of each region form the region feature vector $R_i$:

$R_i = (\bar{X}_i, \bar{Y}_i, L_i, \alpha_i)$   (1)
where $i = 1, \ldots, 5$ (for 0° and 180° images) or $1, \ldots, 7$ (for 90° images). For each image there are therefore 20 parameters (5 regions × 4 parameters) or 28 parameters (7 regions × 4 parameters), and these parameters form the image feature vector $I_j$:

$I_j = (R_1, \ldots, R_{5(\mathrm{or}\,7)})_j = (\bar{X}_1, \bar{Y}_1, L_1, \alpha_1, \ldots, \bar{X}_{5(7)}, \bar{Y}_{5(7)}, L_{5(7)}, \alpha_{5(7)})_j$   (2)
where $j = 1, \ldots, n$ and $n$ is the total number of images in one gait sequence. By computing the mean value of the image feature vectors in one sequence, we obtain the sequence feature vector $S_p(k)$:

$S_p(k) = \mathrm{mean}(I_1(k), \ldots, I_n(k))_p$   (3)
For example,

$S_p(1) = S_p(\bar{X}_1) = \mathrm{mean}(I_1(\bar{X}_1), \ldots, I_n(\bar{X}_1))_p = \mathrm{mean}(I_1(1), \ldots, I_n(1))_p$   (4)

where $p$ is the index of sequences, $p = 1, \ldots,$ total number of sequences, $n$ is the total number of images in one sequence, and $k$ is the index of features,
$k = 1, \ldots, 20$ (or 28). There are thus 20 features in total for each 0° and 180° sequence and 28 features for each 90° sequence.
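Purely as an illustration of this feature pipeline (the exact region boundaries and moment formulas follow Lee [5] and [4] and are not reproduced here), a per-region feature of the form (X bar, Y bar, L, alpha) can be computed from the second-order moments of a binary silhouette roughly as follows; region_slices is a hypothetical list of (row, column) slices standing in for the 5- or 7-part division.

import numpy as np

def region_ellipse_features(silhouette, region_slices):
    """Fit an ellipse (via second-order moments) to the foreground of every region and
    return the concatenated (centroid_x, centroid_y, elongation, orientation) features."""
    feats = []
    for rows, cols in region_slices:
        region = silhouette[rows, cols]
        ys, xs = np.nonzero(region)
        if len(xs) < 2:                      # empty region: pad with zeros
            feats.extend([0.0, 0.0, 0.0, 0.0])
            continue
        cx, cy = xs.mean(), ys.mean()        # centroid (X bar, Y bar)
        cov = np.cov(np.vstack([xs, ys]))    # second-order central moments
        evals, evecs = np.linalg.eigh(cov)
        major, minor = evals[1], max(evals[0], 1e-6)
        elongation = np.sqrt(major / minor)  # ratio of major to minor axis (L)
        vx, vy = evecs[:, 1]
        orientation = np.arctan2(vy, vx)     # orientation of the major axis (alpha)
        feats.extend([cx, cy, elongation, orientation])
    return np.array(feats)

def sequence_feature(silhouettes, region_slices):
    """Average the per-frame feature vectors over one gait sequence (cf. Eq. 3)."""
    return np.mean([region_ellipse_features(s, region_slices) for s in silhouettes], axis=0)

# Toy usage with a 5-part front/back-view style division of a 128x64 silhouette.
h, w = 128, 64
sil = np.zeros((h, w), dtype=np.uint8)
sil[10:120, 20:44] = 1
slices = [(slice(0, 20), slice(0, w)),        # head
          (slice(20, 70), slice(0, w // 2)),  # left torso
          (slice(20, 70), slice(w // 2, w)),  # right torso
          (slice(70, h), slice(0, w // 2)),   # left leg
          (slice(70, h), slice(w // 2, w))]   # right leg
print(sequence_feature([sil, sil], slices).shape)   # (20,) = 5 regions x 4 parameters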
4 Gender Classification

4.1 Similarity Measure
Once the sequence feature vectors are obtained for each subject, the similarity measures are calculated next. Here, the similarity measures are computed for the three view angles separately. We randomly choose some male and female subjects from the database to construct the testing and training sets. In both sets, the number of female subjects must equal the number of male subjects, because otherwise the experimental results may be biased by the separability of the larger class. According to the gender attribute, the training set is further divided into a female training subset and a male training subset. The mean Euclidean distance from the testing set to the female or male training subset is then calculated. Let $M$ be the total number of female (or male) training sequences, and $S_t(k)$ the $k$-th feature of the $t$-th testing sequence. The mean Euclidean distance of the $k$-th feature between the testing sequence and the female training set, $DF_t(k)$, is defined as

$DF_t(k) = \frac{1}{M} \sum_{n=1}^{M} \mathrm{Euclidean}(S_t(k), S_n(k))$   (5)

where $n = 1, \ldots, M$, $S_n \in$ female training set, $S_t \in$ testing set, and $k = 1, \ldots, 20$ (or 28). The mean Euclidean distance of the $k$-th feature between the testing sequence and the male training set, $DM_t(k)$, is defined as

$DM_t(k) = \frac{1}{M} \sum_{m=1}^{M} \mathrm{Euclidean}(S_t(k), S_m(k))$   (6)
where $m = 1, \ldots, M$, $S_m \in$ male training set, $S_t \in$ testing set, and $k = 1, \ldots, 20$ (or 28). Both distances, $DF_t$ and $DM_t$, are regarded as the similarity measures of the $t$-th sequence; they express the degree of similarity between the testing sequence and the two subsets. $DF_t$ and $DM_t$ are 20- (or 28-) dimensional vectors.

4.2 Fusion Scheme
Based on the similarity measure introduced in Section 4.1, we obtain two vectors, $DF_t$ and $DM_t$, for each view angle. In total there are three female vectors and three male vectors: $DF_t^{0°}$, $DF_t^{90°}$, $DF_t^{180°}$, $DM_t^{0°}$, $DM_t^{90°}$, $DM_t^{180°}$. We concatenate the three female vectors into one vector and the three male vectors into another vector:
$CF_t = \mathrm{concatenate}(DF_t^{0°}, DF_t^{90°}, DF_t^{180°})$   (7)

$CM_t = \mathrm{concatenate}(DM_t^{0°}, DM_t^{90°}, DM_t^{180°})$   (8)

$DF_t^{0°}$ and $DF_t^{180°}$ are 20-dimensional vectors and $DF_t^{90°}$ is a 28-dimensional vector, so $CF_t$ is a 68-dimensional vector, and the same holds for $CM_t$. Before fusing the similarity measures of all features, we normalize them to a common range $[0, 1]$ using the Min-Max normalization method [6]:

$CF_t(k) = \frac{CF_t(k) - \min}{\max - \min}$   (9)

$CM_t(k) = \frac{CM_t(k) - \min}{\max - \min}$   (10)

where max and min denote the maximum and the minimum value of the $k$-th feature in the $t$-th sequence, respectively, and $CF_t(k)$ and $CM_t(k)$ denote the normalized gender similarity measures of the $k$-th feature in the $t$-th sequence.
Sum Rule. Snelick et al. [7] found that Min-Max normalization followed by the sum-of-scores fusion method outperforms other schemes, so we adopt this fusion scheme:

$CF_t = \sum_{k=1}^{N} CF_t(k)$   (11)

$CM_t = \sum_{k=1}^{N} CM_t(k)$   (12)
where $N$ denotes the dimension of $CF_t$ and $CM_t$, $N = 68$. $CF_t$ and $CM_t$ are the fused gender similarity measures. By comparing $CF_t$ and $CM_t$, we decide whether the testing sequence belongs to a female or a male subject: the testing sequence is assigned the gender corresponding to the minimum of $CF_t$ and $CM_t$,

$\mathrm{gender} = \begin{cases} \mathrm{female}, & CF_t < CM_t \\ \mathrm{male}, & CF_t > CM_t \end{cases}$   (13)

Support Vector Machine (SVM). The SVM attempts to maximize the distance between the hyperplane and the closest training samples on either side of the hyperplane. It is a powerful technique for classification and, in particular, for solving binary classification problems, so we choose the SVM as our second classifier. First, we construct feature vectors for training and testing the SVM classifier: we concatenate the normalized vectors $CF_t$ and $CM_t$ into one vector, which is used to represent the sequence,

$G_t = \mathrm{concatenate}(CF_t, CM_t)$   (14)
We construct such a feature vector $G$ for each training sequence. Then $G_t$ (where $t = 1, \ldots,$ total number of training sequences) can be used as the input vector to train the SVM classifier.
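For clarity, the sum-rule branch of the fusion scheme can be sketched as below; this assumes DF and DM are the concatenated 68-dimensional distance vectors of Eqs. (7) and (8), and it takes the per-feature min and max over the female/male pair, which is one reading of Eqs. (9) and (10).

import numpy as np

def min_max_normalise(cf, cm):
    """Map the female/male similarity measures of each feature to a common [0, 1] range."""
    lo = np.minimum(cf, cm)
    hi = np.maximum(cf, cm)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero
    return (cf - lo) / span, (cm - lo) / span

def sum_rule_gender(cf, cm):
    """Sum-rule decision of Eqs. (11)-(13): the smaller fused distance wins."""
    cf_n, cm_n = min_max_normalise(cf, cm)
    return "female" if cf_n.sum() < cm_n.sum() else "male"

# Hypothetical 68-dimensional concatenated distance vectors for one testing sequence.
rng = np.random.default_rng(1)
DF = rng.uniform(0.5, 1.5, 68)       # distances to the female training subset
DM = DF + rng.uniform(0.0, 0.5, 68)  # slightly larger distances to the male subset
print(sum_rule_gender(DF, DM))       # -> "female"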
5 Experiments and Results

5.1 Data
We carried out our experiments on the CASIA Gait Database [10], one of the largest shared databases in the current gait-research community. There are 124 subjects in the database, of which 93 are male and 31 are female. We only chose the normal sequences from the 0-degree, 90-degree, and 180-degree views to construct the experimental data set, which therefore contains 2,232 (124 subjects × 6 sequences × 3 angles) video sequences in total. As mentioned above, if the numbers of females and males are unbalanced in the training and testing sets, the experimental results may be biased by the separability of the larger class. Since the CASIA database contains only 31 females, in our experiments we randomly choose 25 males and 25 females from the database to form the training set and randomly choose 5 males and 5 females from the remaining subjects to form the testing set. There are thus 600 (50 × 6) sequences in the training set and 60 (10 × 6) sequences in the testing set. Once a subject is assigned to the training set or the testing set, all sequences of that subject are assigned to the corresponding set as well.

5.2 Experimental Results
Since the training set and testing set are chosen randomly, we repeat our experiments two hundred times and use the mean of these experimental results as the final recognition accuracy, so the recognition rates listed here reliably reflect the performance of our method. First, we use the Sum Rule scheme to fuse the 20 features from the 0-degree view, the 28 features from the 90-degree view, and the 20 features from the 180-degree view separately, in order to see what performance can be achieved before fusing all three view angles; the recognition results are listed in Table 1. Then, we use the Sum Rule scheme to fuse all the features from the three view angles, 68 features in total, and achieve a recognition accuracy of 87.7%. Fig. 4 shows the scatter plot of the test data and the decision boundary. The x axis denotes the dissimilarity measure for male, i.e., the distance between the testing subject and the male class; the y axis denotes the dissimilarity measure for female, i.e., the distance between the testing subject and the female class. Three kernels, including Linear, 2nd-degree Polynomial, and RBF (width of RBF network = 1), are used to train the SVM. The results of fusing the features of the three view angles separately are shown in Table 2, and the results of fusing all the features from the three view angles are shown in Table 3.

Table 1. Results of fusing the features of the three view angles separately using the sum rule

  View angle (degree)   Recognition rate (%)
  0                     83.0
  90                    85.5
  180                   85.5

Fig. 4. Two-dimensional scatter plot showing the testing data and decision boundary. Red 'x' denotes male and green 'o' denotes female.

Table 2. Results of fusing the features of the three view angles separately using Linear, Polynomial (d = 2), and RBF (g = 1) kernels (d = degree of polynomial, g = width of RBF network)

  View angle (degree)   Kernel       Recognition rate (%)
  0                     Linear       79.5
  0                     Polynomial   88.0
  0                     RBF          80.0
  90                    Linear       82.0
  90                    Polynomial   85.0
  90                    RBF          82.5
  180                   Linear       86.0
  180                   Polynomial   88.0
  180                   RBF          86.0

Table 3. Results of fusing all the features from the three view angles using Linear, Polynomial (d = 2), and RBF (g = 1) kernels

  Kernel       Recognition rate (%)
  Linear       89.5
  Polynomial   89.5
  RBF          88.5
Table 4. Comparison of experimental results

  Authors                        Database       Representation   View         Recognition rate (%)
  Kozlowski and Cutting (1977)   6 subjects     Point-light      Side-view    63
  Troje (2002)                   40 subjects    Point-light      Mixed        92.5
  Lee and Grimson (2002)         24 subjects    Video-based      Side-view    84
  Davis and Gao (2004)           40 subjects    Point-light      Front-view   95.5
  This paper                     124 subjects   Video-based      Mixed        89.5

5.3 Discussions
Based on the above results, the following conclusions can be drawn. Using fusion schemes improves the performance of gender recognition systems; in particular, with the SVM fusion scheme the performance reaches 89.5%. Comparing the results of the Sum Rule and SVM schemes, we note that the SVM scheme has advantages over the Sum Rule, showing an improvement of up to 5%. From Table 2 we find that the 2nd-degree Polynomial kernel performs better than the other kernels, which leads us to believe that the Polynomial kernel is more effective for the gender recognition task. In Table 3, the Linear kernel performs as well as the 2nd-degree Polynomial kernel. Compared with other methods, the recognition rate of our method is higher than most; Table 4 lists the recognition accuracy of our method and of related methods. Although our recognition rate is slightly lower than those of Davis and Gao and of Troje, our experiments are carried out on a larger database and are based on video sequences, so our method is more suitable for gender recognition in surveillance environments. We also implemented Lee and Grimson's method and ran it on the same CASIA database; its recognition rate is 85%, which equals the result of using only the 90° sequences in our experiments. The remaining methods in Table 4 are difficult to implement and run on the same CASIA database because they use the point-light representation, in which points of light must be attached to the body joints in order to accurately locate the joint positions; in video sequences it is hard or even impossible to locate the joint positions accurately. It is therefore not meaningful to re-implement these methods and compare them with ours on the same CASIA database.
6 Conclusion
In this paper, we have presented a new gender classification scheme based on fusing the similarity measures from multi-view gait sequences. The experimental results show that our method achieves higher performance than most other methods and is well suited to surveillance scenarios. The proposed fusion method helps improve the performance of gender classification.
Acknowledgments. This work was supported by the Program for New Century Excellent Talents in University, the National Natural Science Foundation of China (No. 60332010), a Joint Project supported by the National Natural Science Foundation of China and the Royal Society of the UK (No. 60710059), and the Hi-Tech Research and Development Program of China (No. 2006AA01Z133). Portions of the research in this paper use the CASIA Gait Database collected by the Institute of Automation, Chinese Academy of Sciences.
References

1. Barclay, C.D., Cutting, J.E., Kozlowski, L.T.: Temporal and spatial factors in gait perception that influence gender recognition. Perception & Psychophysics 23(2), 145–152 (1978)
2. Davis, J.W., Gao, H.: Gender recognition from walking movements using adaptive three-mode PCA. In: IEEE CVPR Workshop on Articulated and Nonrigid Motion, IEEE Computer Society Press, Los Alamitos (2004)
3. Kozlowski, L.T., Cutting, J.E.: Recognizing the sex of a walker from a dynamic point-light display. Perception & Psychophysics 21(6), 575–580 (1977)
4. Lee, L.: Gait Analysis for Classification. Technical report, MIT AI Lab (2003)
5. Lee, L., Grimson, W.E.L.: Gait analysis for recognition and classification. In: FG 2002. IEEE International Conference on Automatic Face and Gesture Recognition, IEEE Computer Society Press, Los Alamitos (2002)
6. Nandakumar, K., Jain, A.K., Ross, A.A.: Score normalization in multimodal biometric systems. Pattern Recognition 38(12), 2270–2285 (2005)
7. Snelick, R., Indovina, M., Yen, J., Mink, A.: Multimodal biometrics: Issues in design and testing. In: Proceedings of the Fifth International Conference on Multimodal Interfaces, Vancouver (2003)
8. Troje, N.F.: Decomposing biological motion: A framework for analysis and synthesis of human gait patterns. Journal of Vision 2(5), 371–387 (2002)
9. Wang, Y., Yu, S., Wang, Y., Tan, T.: Gait recognition based on fusion of multi-view gait sequences. In: Zhang, D., Jain, A.K. (eds.) Advances in Biometrics. LNCS, vol. 3832, Springer, Heidelberg (2005)
10. Yu, S., Tan, D., Tan, T.: A framework for evaluating the effect of view angle, clothing and carrying condition on gait recognition. In: ICPR 2006. Proc. of the 18th International Conference on Pattern Recognition, Hong Kong, China (2006)
11. CASIA Gait Database, http://www.sinobiometrics.com
MAPACo-Training: A Novel Online Learning Algorithm of Behavior Models

Heping Li, Zhanyi Hu, Yihong Wu, and Fuchao Wu

National Laboratory of Pattern Recognition and Digital Content Technology Research Center, Institute of Automation, Chinese Academy of Sciences, P.O. 2728, Beijing 100080, P.R. China
{hpli,huzy,yhwu,fcwu}@nlpr.ia.ac.cn
Abstract. The traditional co-training algorithm, which needs a great number of unlabeled examples in advance and then trains classifiers by an iterative learning approach, is not suitable for online learning of classifiers. To overcome this barrier, we propose a novel semi-supervised learning algorithm, called MAPACo-Training, by combining co-training with the principle of Maximum A Posteriori adaptation. MAPACo-Training is an online multi-class learning algorithm and has been successfully applied to online learning of behaviors modeled by Hidden Markov Models. The proposed algorithm is tested on Li's dataset as well as on Schuldt's dataset.
1 Introduction
Behavior modeling is driven by a wide range of applications, such as advanced user interfaces, visual surveillance, and virtual reality. Most existing works in this field focus on modeling behaviors with manual labeling, e.g., [1,2,3,4]. For example, Li and Greenspan [1] built a multi-scale model from time-varying contours, and Gong and Xiang [2] learned a Dynamically Multi-Linked Hidden Markov Model (DML-HMM). However, manual labeling of behavior patterns is laborious, impractical and error prone [5]. Recently, some behavior modeling methods based on semi-supervised or unsupervised learning [5,6,7,8] have been proposed. For instance, Xiang and Gong [5] discovered natural groupings of behavior patterns through unsupervised model selection and feature selection, and Zelnik-Manor and Irani [6] used the normalized-cut approach to automatically cluster the data and then build a statistical behavior model. Unfortunately, these methods need a great number of unlabeled examples beforehand; they are therefore unsuitable for online learning of behavior models and cannot automatically adjust the model parameters as the environment changes. The co-training approach proposed by Blum and Mitchell [9] is also a semi-supervised learning method. Levin et al. [10] used the co-training framework in the context of boosted binary classifiers to build automobile detectors.
Yan and Naphade [11] proposed a multi-view semi-supervised learning algorithm which avoids the requirement of the co-training approach that each view of the examples be sufficient for learning the target concepts. However, these methods belong to the off-line learning category. Javed et al. [12] combined the co-training approach and boosting to propose an algorithm for online detection and classification of moving objects, but behavior modeling is not considered there. In this paper, we present a novel semi-supervised learning method called MAPACo-Training, which combines the co-training approach with the principle of Maximum A Posteriori adaptation [8,13,16]. The proposed method can simultaneously train the parameters of multi-class models. We have successfully applied the method to online learning of the parameters of behaviors modeled by Hidden Markov Models (HMMs). Since it only needs a small labeled sample set beforehand, our method alleviates the problem of the methods [1,2,3,4]; and unlike the approaches [5,6,7,8], it can automatically adjust the parameters with the current example online. The remainder of this paper is organized as follows. Motion signature representation is outlined in Section 2. Section 3 is a detailed description of MAPACo-Training. Experimental results are reported in Section 4, and conclusions as well as future research directions are given in Section 5.
2 Motion Signature Representation

2.1 Feature Extraction
Background subtraction is used to detect the foreground. In our approach, two types of features are considered: (1) a shape feature and (2) an optical flow feature [14]. The size of the foreground region varies with the distance of the object to the camera, the camera parameters, and the size of the object, so we need to normalize the foreground region. First, we equidistantly divide the bounding rectangle of the foreground into $U \times V$ non-overlapping sub-blocks. Then, the normalized value of each sub-block is calculated as follows:

$x_i^1 = s\_sub(i) / \max, \quad i = 1, 2, \ldots, num$   (1)

where $num = U \times V$ is the number of sub-blocks, $s\_sub(i)$ is the number of foreground pixels in the $i$-th sub-block, and $\max$ is the maximum value of $\{s\_sub(i), i = 1, 2, \ldots, num\}$. The optical flow value of each sub-block is calculated as follows:

$x_i^j = f\_sub(i, j) / sum(i), \quad i = 1, 2, \ldots, num, \; j = 2, 3$   (2)

where $f\_sub(i, j)$ with $j = 2, 3$ is the sum of the horizontal and vertical optical flow, respectively, in the $i$-th sub-block, and $sum(i)$ is the number of pixels in the $i$-th sub-block. The feature vector at frame $t$ from shape and optical flow is then

$o_t^d = [x_1^d, x_2^d, \ldots, x_{num}^d], \quad d = 1, 2, 3.$
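A minimal sketch of the sub-block normalisation of Eqs. (1) and (2) is given below (not the authors' implementation); fg is a hypothetical binary foreground mask for the bounding rectangle and flow a hypothetical dense optical-flow field of the same size.

import numpy as np

def subblock_features(fg, flow, U=9, V=5):
    """Shape feature x^1 and optical-flow features x^2 (horizontal), x^3 (vertical)
    computed over U x V non-overlapping sub-blocks of the foreground bounding box."""
    H, W = fg.shape
    hs = np.array_split(np.arange(H), U)
    ws = np.array_split(np.arange(W), V)
    s_sub, fx_sub, fy_sub, npix = [], [], [], []
    for r in hs:
        for c in ws:
            block = np.ix_(r, c)
            s_sub.append(fg[block].sum())             # foreground pixels in the block
            fx_sub.append(flow[..., 0][block].sum())  # sum of horizontal flow
            fy_sub.append(flow[..., 1][block].sum())  # sum of vertical flow
            npix.append(len(r) * len(c))              # pixel count of the block
    s_sub, npix = np.array(s_sub, float), np.array(npix, float)
    x1 = s_sub / max(s_sub.max(), 1.0)                # Eq. (1): shape feature
    x2 = np.array(fx_sub) / npix                      # Eq. (2), j = 2
    x3 = np.array(fy_sub) / npix                      # Eq. (2), j = 3
    return x1, x2, x3

# Toy foreground mask and flow field.
fg = np.zeros((90, 50), dtype=np.uint8); fg[10:85, 15:35] = 1
flow = np.random.default_rng(2).normal(size=(90, 50, 2))
x1, x2, x3 = subblock_features(fg, flow)
print(x1.shape, x2.shape, x3.shape)   # each (45,) for U=9, V=5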
2.2 Motion Signature Representation
Given the observation feature sequences $O_T^d = \{o_1^d, o_2^d, \cdots, o_t^d, \cdots, o_T^d\}$, two different Hidden Markov Models (HMMs) are adopted to build the behavior models: one is a single continuous HMM from shape, and the other is like a Parallel Hidden Markov Model (PHMM) with two continuous HMMs from optical flow. These HMM topologies are shown in Figure 1, where shaded circles denote observation nodes and clear circles hidden nodes. For the optical flow model, the two HMMs are learned independently. The output probability density function is the following Gaussian Mixture Model (GMM):

$p(o_t^d|\theta) = \sum_{k=1}^{K} \alpha_k p_k(o_t^d | \mu_k, \Sigma_k)$   (3)

where $\theta = \{\alpha_k, \mu_k, \Sigma_k, k = 1, 2, \ldots, K\}$ represents the parameters of the GMM, including the weight $\alpha_k$, mean $\mu_k$ and covariance matrix $\Sigma_k$ of every mixture component, and $\sum_{k=1}^{K} \alpha_k = 1$.
Fig. 1. HMM topology: (a) shape model; (b) optical flow model
Using the forward procedure, we compute the observation probabilities $P(O_T^d|\lambda_c^d)$, $c = 1, 2, \ldots, C$, for the observation feature sequences $O_T^d$, where $C$ is the number of behavior classes and $\lambda_c^d$ is the HMM parameter set of the $c$-th behavior class. Since the output probability density function is a GMM, the probabilities are normalized [15] as follows:

$\bar{P}(O_T^d|\lambda_i^d) = P(O_T^d|\lambda_i^d) \Big/ \sum_{c=1}^{C} P(O_T^d|\lambda_c^d)$   (4)

For the optical flow model (Figure 1(b)), the following operation is further performed:

$\bar{P}(O_T^{2,3}|\lambda_i^{2,3}) = \frac{\bar{P}(O_T^2|\lambda_i^2)\,\bar{P}(O_T^3|\lambda_i^3)}{\sum_{c=1}^{C} \big[\bar{P}(O_T^2|\lambda_c^2)\,\bar{P}(O_T^3|\lambda_c^3)\big]}$   (5)

The Bayes classifier is used as our base classifier. According to the Bayes rule, the posterior
$P(c|O_T^d) = \bar{P}(O_T^d|\lambda_c^d) P(c) / P(O_T^d)$, where $P(c) = 1/C$, so $P(c|O_T^d) \propto \bar{P}(O_T^d|\lambda_c^d)$. Thus, if $\bar{P}(O_T^d|\lambda_{c_0}^d) = \max_c \bar{P}(O_T^d|\lambda_c^d)$, then $O_T^d$ belongs to the $c_0$-th behavior. In our proposed algorithm, we set $f_1^i = \bar{P}(O_T^1|\lambda_i^1)$ and $f_2^i = \bar{P}(O_T^{2,3}|\lambda_i^{2,3})$.
MAPACo-Training
In this section, we propose a new semi-supervised learning algorithm called Maximum A Posteriori Adaptation Co-Training (MAPACo-Training) which attempts to learn behavior models online. We first describe the principle of MAP adaptation, and then give the details of MAPACo-Training. 3.1
MAP Adaptation
MAP adaptation has widely been used in speaker and face verification [13]. Recently, Zhang et al. in [8,16] used it for unusual event detection and meeting event recognition. During the course of learning the parameters of GMM-based HMM in [8,16], the state-transition probabilities are kept fixed while the mean, variance and mixture weights are adapted as follows: (1) According to the existing parameters, new statistical values are computed: K d d αk pk (odt |μk , Σk ) (6) P (i|ot ) = αi pi (ot |μi , Σi ) k=1
αnew i = μnew i
T t=1
T Σinew =
=
t=1
T t=1
P (i|odt )
T
T odt P (i|odt )
t=1
(7) P (i|odt )
P (i|odt )(odt − μnew )(odt − μnew )T i i T d t=1 P (i|ot )
(8)
(9)
(2) New parameters are estimated as follows: + (1 − ρ) · αold α ˆ i = ρ · αnew i i
(10)
+ (1 − ρ) · μold μ ˆi = ρ · μnew i i
(11)
T ˆi = ρ · Σinew + (1 − ρ) · [Σiold + (ˆ Σ μi − μold μi − μold i )(ˆ i ) ]
(12)
where ρ(0 ≤ ρ ≤ 1) is the scale factor. We use the principle of MAP adaptation into our algorithm. More details about MAP adaptation can be found in [8,13,16].
476
3.2
H. Li et al.
MAPACo-Training Algorithm
The traditional co-training algorithm [9] needs to get a great number of unlabeled samples in advance and then train models by an approach of iterative learning. It is an off-line learning method. By combining the co-training and the MAP adaptation, we propose a novel online multi-class learning algorithm called MAPACo-Training as follows: Input: Labeled data L including a small training sample set Ltr and a small validation sample set Lv with two views V1 and V2 , threshold value Th > 1 and Tnum ≥ 1. Output: a classifier from the probabilitiesf 1 , f 2 , ..., f C . MAPACo-Training 1. Create f1i and f2i (i = 1, 2, . . . , C) using Ltr on V1 and V2 . Set the new training sample set of each one of the C classes Lib = φ (b=1,2); 2. For k = 1, 2, . . . , C (a) For current sample S, assume n = max{f1j ,j = 1, 2, . . . , C, j = k}, j
m = max{f2j , j = 1, 2, . . . , C, j = k}, j
i. if f1k /f1n ≥ Th , the view V2 of sample S is added into Lk2 as a new sample; ii. if f2k /f2m ≥ Th , the view V1 of sample S is added into Lk1 as a new sample; iii. if 1 < f1k /f1n < Th and 1 < f2k /f2m < Th , the view V1 of sample S is added into Lk1 as a new sample and the view V2 of sample S is added into Lk2 as a new sample. (b) if the sample number in Lkb equals Tnum , the parameters of model fbk are updated according to MAP equations (6)∼(12) with these samples in Lkb and the scale factor ρ is decided by validation sample set Lv . And then let Lkb = φ. 3. Combine f i = ω1 f1i + ω2 f2i (ω1 + ω2 = 1) using Lv . 4. Create a new classifier using f i according to the Bayes theory. Similar to co-training, two base classifiers of every class model need to be trained on separate features of the same sample. How to select samples to train the models? In this algorithm, we use a threshold Th to do it. The conditions (i)(ii) show if one base classifier can predict the label of the sample confidently, then we add this sample into the training set of the other base classifier of the corresponding class. The condition (iii) means that both base classifiers can get the same label according to the bayes rule, but neither of them is confident, which shows the sample is useful for improving the performance of the two classifiers. During the course of updating parameters by MAP adaptation equations (6)∼(12), we use validation set Lv to decide the scale factor ρ. If the class prediction for a sample from the conditions (i)∼(iii) is not correct, which means the sample is a noise, the sample is no longer used for further learning by setting
MAPACo-Training: A Novel Online Learning Algorithm of Behavior Models
477
ρ = 0 according to Lv . In our experiment, we assume the possible value of ρ is 0 or a constant¯ ρ (0 ≤ ρ¯ ≤ 0.5). Remark: From the equations (6)∼(12), we can see that the MAP adaptation only uses the current samples to calculate the new statistical values and then gets the new parameters by simple weighted estimation. It avoids to directly train the HMM parameters from a great number of samples by EM algorithm and improves the computational efficiency. The MAPACo-Training algorithm starts from a small label sample set Ltr and then updates the parameters by the MAP adaptation. So the algorithm is suitable for online multi-class learning.
4
Experiments
We test our method from two datasets: Li’s dataset [17] and Schuldt’s dataset [18]. In the experiments, U =9 and V =5 are used for dividing the bounded rectangle of foreground. To each type of features such as shape, horizontal optical flow and vertical optical flow, the Principal Component Analysis (PCA) is used to reduce the 45-dimensional features to the 8-dimensional ones. 4.1
Results on Li’s Dataset
38
24
8
36
22
7
34
20
32
18
30
16
28
14
26
12
24
10
22 0
500
1000
1500
2000
6
HTER(%)
HTER(%)
HTER(%)
We get a video consisting of five kinds of behaviors from Li’s dataset [17], of which each one is performed by 18 subjects. Image size is of 160 × 120 pixels and frame rate is of 6 frames/sec. The video totals 38120 frames including “box”
8 0
2500
5
4
3
2
500
number of samples
1000
1500
2000
1 0
2500
500
number of samples
(a)
(b)
22
1000
1500
2000
2500
number of samples
(c)
25
22
20 20 18
14
18
12 10
HTER(%)
20
HTER(%)
HTER(%)
16
15
16
14
8 12 6 4 0
500
1000
1500
number of samples
(d)
2000
2500
10 0
500
1000
1500
number of samples
(e)
2000
2500
10 0
500
1000
1500
2000
2500
number of samples
(f)
Fig. 2. The learning curves: (a) box; (b) kick; (c) lookround; (d) standup (e) wave; (f) average HTER
478
H. Li et al. Table 1. Initial confusion matrix box kick lookround standup wave
box 37.14 12.86 1.43 1.43 16.67
kick 34.29 72.86 1.90 33.80 2.38
lookround 7.62 1.90 92.86 0.48 14.76
standup 8.57 12.38 3.33 63.81 5.24
wave 12.38 0 0.48 0.48 60.95
Table 2. Final confusion matrix box kick lookround standup wave
box 54.76 0 0 0.95 10.95
kick 18.10 86.67 0 3.80 0.95
lookround 3.33 1.43 95.72 0.48 1.43
standup 7.14 11.42 3.33 94.29 5.24
wave 16.67 0.48 0.95 0.48 81.43
(8000 frames), “kick” (7600 frames), “lookround” (7820 frames), “standup” (7040 frames) and “wave” (7660 frames). We slice this video sequence into 3810 segments with the fixed time duration of 20 frames and the step length of 10 frames, where 25 segments in every class are selected for the small training sample setLtr , 12 segments for the validation sample set Lv , 210 segments for the test sample set and the remaining segments for online learning. Parameters in our algorithm are preset as: Th =1.5, Tnum =5 and ρ¯ = 0.2. MAP adaptation is only used to update the means. The proposed algorithm is implemented in Matlab 6.0 and tested on a 2.0 GHz Pentium 4 PC with 256MB memory. The average time per frame is about 0.228s. As a result, our algorithm at the correct implement could be used for those applications with a frame rate of 6∼10 frames/sec. Figure 2 gives the learning curves for behavior models of “box”, “kick”, “lookround”, “standup”, “wave” and average half-total error rate (HTER), where HTER=(FAR+FRR)/2 [8], FAR is false acceptance rate and FRR is false rejection rate. The horizontal axis shows the number of effective samples for estimating the parameters in the MAPACo-Training algorithm. The vertical axis shows the HTER. Figure 2(f) is the average HTER curve of all behaviors. From these curves, we can see the learning performance of behavior models can be markedly improved by MAPACo-Training, and after about 500 samples are used, the curves almost become stable. Table 1 gives the initial confusion matrix from the initial behavior models trained by the small training set Ltr , and Table 2 shows the final confusion matrix from the final behavior models by our algorithm. From these tables, we can see that when the initial recognition rate is low, those for “box”, “kick”, “standup” and “wave”, the final recognition rate is clearly improved. And when the initial recognition rate is high, that for “lookround”, the final recognition rate is still high.
MAPACo-Training: A Novel Online Learning Algorithm of Behavior Models
4.2
479
Results on Schuldt’s Dataset
We get a video sequence of 49813 frames from Schuldt’s dataset [18] including “box” (8370 frames), “clap” (8476 frames), “wave” (8275 frames), “run” (7945 frames), “jog” (8170 frames) and “walk” (8577 frames). We slice this video sequence into 4978 segments with the fix time duration of 25 frames and the step length of 10 frames, where 30 segments in each class are selected for the small training sample set Ltr , 16 segments for the validation sample set Lv , 240 segments for the test sample set, and the remaining segments for online learning. Parameters in our algorithm are preset as: Th =1.5, Tnum =5 and ρ¯ = 0.4. MAP adaptation is only used to update the means and variances. Figure 3 shows the learning curves. We can see that the learning results for all the behaviors except “run” are very good. For the behavior “run”, the main reason of poor performance is that running of some people is very similar to the jogging of the others in this dataset [18], which is difficult to distinguish. From the initial confusion matrix (Table 3) and the final confusion matrix (Table 4), 15
30
22 20
25 18
5
16
HTER(%)
HTER(%)
20
HTER(%)
10
15
10
14 12 10 8
5 6 0 0
500
1000
1500
2000
2500
0 0
3000
500
number of samples
1000
1500
2000
2500
4 0
3000
500
1000
number of samples
(a)
(b)
25
40
24
38
1500
2000
2500
3000
number of samples
(c) 34 32 30
36
23
28
21
HTER(%)
HTER(%)
HTER(%)
34 22
32 30
26 24 22 20
20
28 18
19
26
18 0
24 0
500
1000
1500
2000
2500
3000
number of samples
16 500
1000
1500
2000
2500
3000
14 0
500
1000
number of samples
(d)
1500
2000
2500
3000
number of samples
(e)
(f)
24
15 14
22
13 12
HTER(%)
HTER(%)
20
18
16
11 10 9 8 7
14
6 12 0
500
1000
1500
2000
number of samples
(g)
2500
3000
5 0
500
1000
1500
2000
2500
3000
number of samples
(h)
Fig. 3. The learning curves: (a) box; (b) clap; (c) wave; (d) run; (e) jog; (f) walk; (g) average HTER (h) run+jog
480
H. Li et al. Table 3. Initial confusion matrix box clap wave run jog walk
box 79.17 24.17 6.25 0.42 2.50 7.08
clap 2.50 53.75 8.75 0 4.58 1.67
wave 2.92 9.17 60.00 0 0 0
run 3.33 3.33 2.50 80.00 50.42 25.83
jog 3.33 2.08 18.75 14.16 31.67 8.75
walk 8.75 7.50 3.75 5.42 10.83 56.67
jog 0 0 0 30.42 59.58 20.00
walk 0.84 0 0 4.58 7.08 57.92
Table 4. Final confusion matrix box clap wave run jog walk
box 94.58 3.33 0 0.83 0.42 0.41
clap 2.50 94.17 7.50 0 0.42 0.42
wave 2.08 2.50 91.25 0 2.08 3.33
run 0 0 1.25 64.17 30.42 17.92
we can see the confusion values between a pair of behaviors other than “run” and “jog” are not high. But for “run” and “jog”, the HTER about “run” is only increased from 18.5% to 23% and nearly unchanged after about 2000 samples while the HTER about “jog” is declined from 38.5% to 24.5%. When we regard “run” and “jog” as one behavior “run+jog”, the result becomes quite satisfactory as shown in Figure 3(h).
5 Conclusion
In this paper, we proposed a semi-supervised learning algorithm called MAPACo-Training, which combines the traditional co-training algorithm with the principle of MAP adaptation. The algorithm is suitable for online learning of behaviors modeled by HMMs, and experiments on two datasets validate our method. In the future, we will explore a better way to train the models of similar behaviors such as "run" and "jog".

Acknowledgment. This work was supported by the National Natural Science Foundation of China under grants (60633070, 60475009) and by the National Key Technology R&D Program under grants (2006BAH02A03, 2006BAH02A13).
References

1. Li, H., Greenspan, M.: Multi-scale gesture recognition from time-varying contours. In: IEEE Int'l Conf. on Computer Vision, pp. 236–243. IEEE Computer Society Press, Los Alamitos (2005)
2. Gong, S.G., Xiang, T.: Recognition of group activities using dynamic probabilistic networks. In: IEEE Int'l Conf. on Computer Vision, pp. 742–749. IEEE Computer Society Press, Los Alamitos (2003)
3. Bobick, A.F., Davis, J.W.: The recognition of human movement using temporal templates. IEEE Trans. on Pattern Analysis and Machine Intelligence 23, 257–267 (2001)
4. Laptev, I., Lindeberg, T.: Space-time interest points. In: IEEE Int'l Conf. on Computer Vision, pp. 432–439. IEEE Computer Society Press, Los Alamitos (2003)
5. Xiang, T., Gong, S.G.: Video behaviour profiling and abnormality detection without manual labeling. In: IEEE Int'l Conf. on Computer Vision, pp. 1238–1245. IEEE Computer Society Press, Los Alamitos (2005)
6. Zelnik-Manor, L., Irani, M.: Event-based analysis of video. In: IEEE Conf. on Computer Vision and Pattern Recognition, pp. 123–130. IEEE Computer Society Press, Los Alamitos (2001)
7. Zhong, H., Shi, J., Visontai, M.: Detecting unusual activity in video. In: IEEE Conf. on Computer Vision and Pattern Recognition, pp. 819–826. IEEE Computer Society Press, Los Alamitos (2004)
8. Zhang, D., Gatica-Perez, D., Bengio, S., McCowan, I.: Semi-supervised adapted HMMs for unusual event detection. In: IEEE Conf. on Computer Vision and Pattern Recognition, pp. 611–618. IEEE Computer Society Press, Los Alamitos (2005)
9. Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: 11th Annual Conference on Computational Learning Theory (1998)
10. Levin, A., Viola, P., Freund, Y.: Unsupervised improvement of visual detectors using co-training. In: IEEE Int'l Conf. on Computer Vision, pp. 626–633. IEEE Computer Society Press, Los Alamitos (2003)
11. Yan, R., Naphade, M.: Semi-supervised cross feature learning for semantic concept detection in videos. In: IEEE Conf. on Computer Vision and Pattern Recognition, pp. 657–663. IEEE Computer Society Press, Los Alamitos (2005)
12. Javed, O., Ali, S., Shah, M.: Online detection and classification of moving objects using progressively improving detectors. In: IEEE Conf. on Computer Vision and Pattern Recognition, pp. 696–701. IEEE Computer Society Press, Los Alamitos (2005)
13. Reynolds, D.A., Quatieri, T.F., Dunn, R.B.: Speaker verification using adapted Gaussian mixture models. Digital Signal Processing 10, 19–41 (2000)
14. Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: DARPA Image Understanding Workshop (April 1981)
15. Lv, F., Nevatia, R.: Recognition and segmentation of 3-D human action using HMM and multi-class AdaBoost. In: Proc. European Conference on Computer Vision, vol. IV, pp. 359–372 (2006)
16. Zhang, D., Gatica-Perez, D., Bengio, S.: Semi-supervised meeting event recognition with adapted HMMs. In: ICME. IEEE International Conference on Multimedia and Expo (2005)
17. Li, H., Hu, Z., Wu, Y., Wu, F.: Behavior modeling and recognition based on space-time image features. In: International Conference on Pattern Recognition, vol. 1, pp. 243–246 (2006)
18. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM approach. In: International Conference on Pattern Recognition, vol. 3, pp. 32–36 (2004)
Optimal Learning High-Order Markov Random Fields Priors of Colour Image

Ke Zhang, Huidong Jin, Zhouyu Fu, and Nianjun Liu

Research School of Information Sciences and Engineering (RSISE), Australian National University, and National ICT Australia (NICTA), Canberra Lab, ACT, Australia
{ke.zhang,huidong.jin,zhouyu.fu,nianjun.liu}@rsise.anu.edu.au
Abstract. In this paper, we present an optimised learning algorithm for learning parametric prior models for high-order Markov random fields (MRFs) of colour images. Compared to the priors used by conventional low-order MRFs, the learned priors have richer expressive power and can capture the statistics of natural scenes. Our optimal learning algorithm is achieved by simplifying the estimation of the partition function without compromising the accuracy of the learned model. The parameters in the MRF colour image priors are learned alternately and iteratively in an EM-like fashion by maximising their likelihood. We demonstrate the capability of the proposed learning algorithm for high-order MRF colour image priors with the application of colour image denoising. Experimental results show the superior performance of our algorithm compared to the state-of-the-art colour image prior in [1], although we use a much smaller training image set. Keywords: Markov random fields, image prior, colour image denoising.
1 Introduction
The need for prior models of image structure arises in many computer vision problems, including stereo, optical flow, denoising, super-resolution, and image-based rendering. Whenever an observed "scene" must be inferred from noisy, degraded, or partially missing image information, a natural image prior is required [2]. Modeling image priors is a challenging task, because of the high dimensionality of images, their non-Gaussian statistics, and the need to model correlations in image structure over extended image neighbourhoods [3]. Some researchers have attempted to use sparse coding approaches to address the modeling of complex image structure. Based on a variety of simple assumptions, they obtained sparse representations of local image structure in terms of the statistics of filters that are local in position, orientation, and scale [4][5]. However, these methods, which focus on image patches, provide no direct way of modeling the statistics of whole images [3]. Markov random fields (MRFs), on the other hand, have been widely used in computer vision but exhibit serious limitations. In particular, as MRF priors typically exploit handcrafted clique potentials and small neighbourhood systems, the expressiveness of the models is limited, and they only crudely capture the statistics
of natural images [6]. Since typical MRF models consider simple nearest-neighbour relations and model first-derivative filter responses, the extremely local (e.g., first-order) priors employed by most MRF methods may not show any advantage compared with the rich, patch-based priors obtained by sparse coding methods [3]. However, Roth and Black [3] went beyond this limitation with the Fields of Experts (FoE) model, a generic MRF model of image priors over extended neighbourhoods. To make the model practical, they represented the MRF potentials as a Product of Experts (PoE) [7]. As the FoE takes the product over all neighbourhoods of each image patch, the number of parameters is determined only by the size of the maximal cliques in the MRF model and the number of filters defining the potential [3]. Furthermore, because of the homogeneity of the potential functions, the model places no restriction on the size of the images [3]. As shown in their experiments, the FoE achieves state-of-the-art performance for monochromatic image denoising and inpainting. Building on the work of Roth and Black [3], McAuley et al. [1] proposed an MRF colour image prior model by generalising the FoE model to capture the correlations between different colour channels. Their model was compared with the original FoE monochromatic prior for colour image denoising and showed performance improvements, although their learning algorithm is clearly sub-optimal. In this paper we build on McAuley et al.'s [1] contribution to further improve the learning algorithm for colour image priors. The proposed model of colour image priors is akin to the one in [1], yet by improving the estimation of the model partition function, both the high-dimensional filters and their corresponding weights can be optimally learned by maximising the likelihood. The experimental results show improvements over the results reported by McAuley et al. [1] on colour image denoising, although we use a much smaller training image set. The remainder of this paper is organised as follows. In Section 2, we briefly describe the MRF image prior models and their original learning approaches. Our optimised learning algorithm is introduced in Section 3. In Section 4, we demonstrate the performance of our learning algorithm and compare the denoising quality with three other methods (McAuley et al.'s [1], bilateral filtering, and a wavelet-based denoising approach). Finally, Section 5 concludes this paper.
2 MRF Prior Model

2.1 Monochromatic MRF Prior Model
In [3], Roth and Black merged the ideas of learning in MRFs and sparse image coding in order to develop a high-order MRF prior model in which the cliques are square image patches [8][4]. According to the Hammersley-Clifford theorem, the joint probability distribution of an MRF with clique set $C$ can be written as

$P(x) = \frac{1}{Z(\Theta)} \prod_{c \in C} \phi_c(x_c)$   (1)

where $\phi_c(x_c)$ is a potential function and $Z(\Theta)$ is the partition function.
The potential functions over these cliques are assumed to be Products of Experts [7], i.e., products of individual functions $\phi_f$ (with a parameter $\alpha_f$) given the response of a filter $J_f$ to the image patch $x_c$:

$\phi_c(x_c; J, \alpha) = \prod_{f=1}^{F} \phi_f(x_c; J_f, \alpha_f)$   (2)
In the prior model, the potential functions are assumed to be stationary, i.e., every clique in the image has the same parameter set $\Theta = \{J_f, \alpha_f : 1 \le f \le F\}$. The particular form they postulated for the individual experts is related to the Student-t distribution and is given by [3]

$\phi_f(x_c; J_f, \alpha_f) = \left(1 + \tfrac{1}{2}\langle J_f, x_c\rangle^2\right)^{-\alpha_f}$   (3)
The problem of learning MRF priors can then be recast in a parameter estimation setting: the optimal $J$'s and $\alpha$'s are recovered by maximising the likelihood given by the joint probability in Equation 1. However, calculating the true partition function $Z(\Theta)$ is intractable due to its high complexity, so approximate procedures are often used to estimate it, such as the contrastive divergence method used by Roth and Black [3].

2.2 Colour Image MRF Prior Model
McAuley et al. [1] extended the monochromatic MRF prior to handle colour images. They proposed a "higher"-order MRF prior model, i.e., using 3 × 3 × 3 cliques instead of 3 × 3 cliques, to represent the correlations between colour channels over the local neighbourhood. To deal with the dramatic increase in computational load caused by the significant rise in data dimension in this colour model, they adopted a simple gradient-ascent-based learning algorithm rather than learning by maximising the likelihood. In their learning algorithm, they performed singular value decomposition (SVD) on the covariance matrix of the training data to learn the filters $J$'s, and only updated the $\alpha$'s along the gradient direction with the filters fixed. The estimation of the $\alpha$'s in their model takes the following form. Let $D = \{X_1, X_2, \cdots, X_M\}$ be a set of training images, $R = \{Y_1, Y_2, \cdots, Y_N\}$ a set of random images, and $P(D|\Theta)$ the likelihood of the training images given the model. Let $\Theta = \{\theta_1, \theta_2, \cdots, \theta_F\}$, where $\theta_f = (J_f, \alpha_f)$, be the set of filters and their corresponding weights. Then the likelihood of the training images, $P(D|\Theta)$, is given by

$P(D|\Theta) = \prod_{i=1}^{M} \frac{1}{Z(\Theta)} \prod_{c \in C} \phi_c(x_c^i; J, \alpha)$   (4)

where $Z(\Theta)$ is the partition function. They used the arithmetic mean of the responses of the random images, $\hat{Z}_{am}(\Theta)$, to approximate the real value of the partition function:
$Z(\Theta) \propto \hat{Z}_{am}(\Theta) = \frac{1}{N} \sum_{i=1}^{N} \prod_{c \in Y_i} \phi_c(x_c^i; J, \alpha)$   (5)

From Eq. 4, the gradient of the log-likelihood function with respect to $\alpha_k$ is given by

$\frac{\partial}{\partial \alpha_k} \log P(D|\Theta) = \sum_{i=1}^{M} \sum_{c \in X_i} \psi_c(J_k, x_c^i) - M \frac{\partial}{\partial \alpha_k} \log Z(\Theta)$   (6)
where $\psi(a, b) = -\log(1 + \tfrac{1}{2}\langle a, b\rangle^2)$. From Eq. 5, the gradient of the log partition function is given by

$\frac{\partial}{\partial \alpha_k} \log Z(\Theta) = \frac{\sum_{i=1}^{N} \big[\big(\sum_{c \in Y_i} \psi_c(J_k, x)\big)\big(\prod_{c \in Y_i} \phi_c(x_c; J, \alpha)\big)\big]}{\sum_{i=1}^{N} \prod_{c \in Y_i} \phi_c(x_c; J, \alpha)}$   (7)
and the α’s are updated along the gradient ascent direction which can be obtained by Eq. 6 and Eq. 7. Note that the algorithm proposed by McAuley et al. [1] only updates the α’s with fixed values of filters. It is not optimal for several reasons: (1) the filters obtained from SVD are sub–optimal because they ideally should be learned by maximising the likelihood; (2) the α’s in their learning algorithm must be initialised to zero for numerical reasons [1], and the gradient ascent dose not work (absolute values of α’s are not convergent and their relative values remain the same) after the first iteration according to our implementation. Our aim is to learn a set of filters and their corresponding weights that maximises the likelihood. This can be achieved via standard gradient ascent method given initial estimates of the model parameters. However, as the Z(Θ) estimated in [1] is just proportional to the true Z(Θ), we can not obtain a reliable model likelihood by Eq.4. Furthermore, when we performed the partial derivative with respect to J and implemented it, we found that the J–α gradient iterations did not converge due to numerical instability in the estimation of Z(Θ).
3 An Optimised Learning Algorithm

3.1 Estimation of Partition Function
Although the approximation of Z(Θ), \hat{Z}_{am}(\Theta), has a clear physical meaning (when the images used to approximate Z(Θ) cover all possible images, this estimate equals the true Z(Θ)), the main problem lies in the parameter update, i.e., the complicated form of Eq. 7, which makes the gradient iteration error-prone due to numerical problems. To solve these problems, we use the geometric mean \hat{Z}_{gm}(\Theta), which is more robust in the case of non-Gaussian distributions [9], instead of the arithmetic mean \hat{Z}_{am}(\Theta) (Eq. 5). The geometric mean of the partition function can be written as

\hat{Z}_{gm}(\Theta) = \left( \prod_{i=1}^{N} \prod_{c \in Y_i} \phi_c(x_{ic}; J, \alpha) \right)^{1/N}.   (8)
According to Jensen’s inequality [10], we can obtain the upper and lower boundaries of Zˆam (Θ) and Zˆgm (Θ) N N N N fi (ε)2 1 1 1 1 −1 fi (ε) ≥ ( fi (ε)) N ≥ N ( (9) ( i=1 )2 ≥ ) , N N i=1 f (ε) i=1 i=1 i where fi (ε) = c∈Yi φc (xic ; J, α). As shown in Fig. 1, the log values of Zˆgm (Θ) and Zˆam (Θ) are very close along the increasing number of random images. Furthermore, we found that the standard deviations of log Zˆgm (Θ) is smaller than those of log Zˆam (Θ) over various amount of random images tests. Therefore, we can say that Zˆgm (Θ) is a robust approximation to the mean of the partition function, and can use a small set of random images to estimate the mean values of partition function in log form. In the calculation of the model likelihood given a set of parameters, ˆ we can use Z(Θ) = T × Zˆgm (Θ) to estimate the true value of Z(Θ). Here, T is the number of assignments for all the possible pixel values of the images patch, i.e., T = 2563×3×3 (3 × 3 clique size) or T = 2565×5×3 (5 × 5 clique size). Based ˆ on this observation, the approximation of log–partition function Z(Θ) can be rewritten as: ˆ log Z(Θ) = log T + log Zˆgm (Θ) = log T +
N F 1 αf ψf (Jf , xic ). N i=1
(10)
c∈Xi f =1
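A minimal sketch of how this geometric-mean estimate of the log-partition function (Eq. 10) could be computed from a set of random image patches is given below. The function name and the data layout (each random image reduced to an array of flattened cliques) are assumptions, not the authors' code:

```python
import numpy as np

def log_partition_estimate(random_cliques, J, alpha, T_log):
    """Geometric-mean estimate of log Z(Theta) as in Eq. 10.

    random_cliques: list of length N; element i is an array of shape
                    (num_cliques_i, d) with the flattened cliques of the
                    i-th random image.
    J:              array of shape (F, d), one filter per row.
    alpha:          array of shape (F,), expert weights.
    T_log:          log of the number of possible patch assignments,
                    e.g. 27 * np.log(256) for a 3x3x3 clique.
    """
    N = len(random_cliques)
    total = 0.0
    for cliques in random_cliques:
        responses = cliques @ J.T                  # <J_f, x_c> for all cliques and filters
        psi = -np.log1p(0.5 * responses ** 2)      # psi_f(J_f, x_c)
        total += np.sum(psi * alpha[None, :])      # sum over cliques and filters
    return T_log + total / N
```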
3.2 Proposed Learning Algorithm
The log-likelihood of the MRF prior model (Eq. 4) can be rewritten as follows:

\log P(D|\Theta) = \sum_{i=1}^{M} \sum_{c \in X_i} \sum_{f=1}^{F} \alpha_f \psi_f(J_f, x_{ic}) - M \log T - \frac{M}{N} \sum_{i=1}^{N} \sum_{c \in Y_i} \sum_{f=1}^{F} \alpha_f \psi_f(J_f, x_{ic}),   (11)
where the parameters have the same denotations as in Section 2.1. As we can expect to obtain a more accurate value of the log model likelihood than the one suggested by [1], Eq. 11 can serve as an indicator of whether a given set of parameters has a higher likelihood in our learning algorithm. Furthermore, based on Eq. 11, the partial derivatives with respect to both the filters J's and their corresponding weights α's are significantly simplified:

\frac{\partial \log P(D|\Theta)}{\partial J_k} = \sum_{i=1}^{M} \sum_{c \in X_i} \frac{-\alpha_k x_{ic} \langle J_k, x_{ic}\rangle}{1 + \tfrac{1}{2}\langle J_k, x_{ic}\rangle^2} - \frac{M}{N} \sum_{i=1}^{N} \sum_{c \in Y_i} \frac{-\alpha_k x_{ic} \langle J_k, x_{ic}\rangle}{1 + \tfrac{1}{2}\langle J_k, x_{ic}\rangle^2},   (12)

\frac{\partial \log P(D|\Theta)}{\partial \alpha_k} = \sum_{i=1}^{M} \sum_{c \in X_i} \psi_c(J_k, x_{ic}) - \frac{M}{N} \sum_{i=1}^{N} \sum_{c \in Y_i} \psi_c(J_k, x_{ic}).   (13)
Our learning algorithm is summarised as follows:

1. Initialise the filters (J's) by performing SVD over the training images. The initial values of the α's are randomly generated.

2. Update the α's by applying a line search in the gradient direction given by Eq. 13. The step size μ_α is chosen such that the highest log-likelihood in Eq. 11 is reached:

\alpha \leftarrow \alpha + \mu_\alpha \left\{ \frac{\partial}{\partial \alpha_i} \log P(D|\Theta) \right\}.   (14)

3. Update the J's by applying a line search in the gradient direction given by Eq. 12. The step size μ_J is, again, chosen by maximising the log-likelihood in Eq. 11:

J \leftarrow J + \mu_J \left\{ \frac{\partial}{\partial J_i} \log P(D|\Theta) \right\}.   (15)

4. Repeat steps 2–3 until the log-likelihood of the model does not change.

As Eq. 14 and Eq. 15 indicate, both the α's and the J's are updated along the direction that maximises the model likelihood, so the proposed learning algorithm is optimal with respect to the model likelihood in Eq. 11. Since the update step sizes (μ_α and μ_J) are very sensitive to the input parameters, it is quite difficult to specify them as constants. In our implementation, we employ a back-tracking line search to find the optimal step in each update [11].
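The following sketch illustrates the alternating line-search updates of steps 2–3 under simplifying assumptions: `eval_ll`, `grad_alpha` and `grad_J` are assumed to evaluate Eq. 11, Eq. 13 and Eq. 12 respectively on NumPy arrays, and the backtracking rule simply shrinks the step until the log-likelihood improves. It is an illustration of the procedure, not the authors' implementation:

```python
def backtracking_step(params, grad, eval_ll, step0=1.0, shrink=0.5, max_tries=20):
    """Backtracking line search along `grad`; returns the updated parameters."""
    base_ll = eval_ll(params)
    step = step0
    for _ in range(max_tries):
        candidate = params + step * grad
        if eval_ll(candidate) > base_ll:
            return candidate
        step *= shrink
    return params  # no improving step found

def learn_prior(J, alpha, grad_J, grad_alpha, eval_ll, tol=1e-6, max_iters=100):
    """Alternate line searches on alpha (Eqs. 13/14) and J (Eqs. 12/15)."""
    prev_ll = eval_ll((J, alpha))
    for _ in range(max_iters):
        alpha = backtracking_step(alpha, grad_alpha(J, alpha),
                                  lambda a: eval_ll((J, a)))
        J = backtracking_step(J, grad_J(J, alpha),
                              lambda j: eval_ll((j, alpha)))
        ll = eval_ll((J, alpha))
        if abs(ll - prev_ll) < tol:   # stop when the log-likelihood no longer changes
            break
        prev_ll = ll
    return J, alpha
```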
Fig. 1. The log values of the two Z(Θ) estimation methods as the number of random images increases
3.3 Inference
After obtaining the MRF prior model, in order to perform inference (i.e. denoising in our experiments), we adopted a standard gradient-based approach, as used by McAuley et al. [1]. Gradient ascent is a valid technique in the case of denoising, since the noisy image is "close to" the original image, meaning that a local maximum is likely to be a global one [1]. In the denoising problem, the purpose is to infer the most likely correction for the image given the image prior and the noise model. The noise model assumed in our experiments, as in [1], is i.i.d. Gaussian: P(y|x) \propto \prod_j \exp\left(-\frac{1}{2\sigma^2}(y_j - x_j)^2\right). Here, j ranges over all the pixels in the image, y_j denotes the colour value of the noisy image at pixel j, x_j denotes the colour to be estimated at pixel j, and σ denotes the standard deviation of the Gaussian noise. Combining the noise model and the MRF prior (Eq. 1), the gradient of the log-posterior becomes [1]

\nabla_x \log P(x|y) = -\sum_{f=1}^{F} \alpha_f J_f^{-} * \frac{J_f * x}{1 + \tfrac{1}{2}(J_f * x)^2} + \frac{\lambda}{\sigma^2}(y - x),   (16)

where * denotes matrix convolution, and the algebraic operations above are performed in an elementwise fashion on the corresponding convolution matrix. J_f^{-} denotes the mirror image of J_f in two dimensions. λ is a critical parameter that gauges the relative importance of the prior and the image terms. The updated image is then simply computed by

x^{t+1} = x^{t} + \delta \frac{\partial \log P(x|y)}{\partial x},   (17)

where δ is the step size of the gradient ascent. We find that the inference result is not sensitive to it, and it can be selected empirically.
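A sketch of one inference update (Eqs. 16–17) for a single colour channel is given below, assuming the learned 2-D filters, λ, σ and δ are available. The correlation/convolution conventions and boundary handling are assumptions and would need to match those used during learning:

```python
import numpy as np
from scipy.ndimage import correlate, convolve

def denoise_step(x, y, filters, alpha, lam, sigma, delta):
    """One gradient-ascent update x <- x + delta * grad log P(x|y).

    x, y:    current estimate and noisy observation (2-D arrays, one channel).
    filters: list of 2-D filter kernels J_f.
    alpha:   iterable of expert weights alpha_f.
    """
    grad = (lam / sigma ** 2) * (y - x)                # data (noise-model) term
    for J_f, a_f in zip(filters, alpha):
        r = correlate(x, J_f, mode='nearest')          # filter response J_f * x
        inner = -a_f * r / (1.0 + 0.5 * r ** 2)        # derivative of the log-expert
        grad += convolve(inner, J_f, mode='nearest')   # applies the mirrored filter J_f^-
    return x + delta * grad
```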
4 Experimental Results and Comparison
In our experiments, to initialise our filters J's, we randomly selected 8,000 3 × 3 × 3 and 5 × 5 × 3 patches, cropped from 200 images in the Berkeley Segmentation Database [12], and performed singular value decomposition (SVD) over their covariance matrices. Thus, we obtained 27 and 75 filters for the two clique sizes, with 27 and 75 dimensions per filter respectively. The α's were initialised to a set of random values with the same dimension as the number of filters. There was no constraint on the scale of the initial α's, and both their absolute and relative values converged after several update steps. In the updating process, we randomly selected 2,000 training image patches and 2,000 random image patches from the same image database for each update step. The sizes of the training/random image patches for the 3 × 3 × 3 and 5 × 5 × 3 cliques were, respectively, 7 × 7 × 3 and 13 × 13 × 3.
Fig. 2. Typical denoising results. The first column displays the original image (up), the noisy image (middle) with σ = 75 (red), 25 (green), 15 (blue) (PSNR = 14.97), and the result of bilateral filtering with a 5 × 5 window (down, PSNR = 23.55); the second column shows the result of the wavelet-based approach with a 3 × 3 window (up, PSNR = 23.32), the result of McAuley et al.'s 3 × 3 prior (middle, PSNR = 25.03) and our 3 × 3 prior (down, PSNR = 25.90); the third column shows the result of the 5 × 5 window wavelet-based approach (up, PSNR = 23.84), McAuley et al.'s 5 × 5 prior (middle, PSNR = 25.99) and our 5 × 5 prior (down, PSNR = 26.82).
In the inference, we did not need to eliminate the filter with the highest variance since in our algorithm the α's are normalised (to the range [0, 1]). The least important filter is ignored automatically in the denoising process because its corresponding weight will be zero. In the selection of λ, we used images other than those involved in our experiments. This was done by denoising a test image with several candidate λ values, and selecting whichever one yields the best results [1]. The step size δ was chosen to grow linearly with the noise level, which was found to work well in practice. The denoising performance was evaluated by PSNR, 10\log_{10}\left(\frac{255^2}{\mathrm{MSE}}\right), where \mathrm{MSE} = \frac{1}{IJK}\sum_{i,j,k}(R_{i,j,k} - O_{i,j,k})^2, R denotes the restored image and O denotes the original image. In Figure 2 we show results obtained for denoising an image in which a different amount of noise is applied to each of the three channels, and compare these with the state-of-the-art [1], simple bilateral filtering (using the MATLAB code from [13]), and
Table 1. (3 × 3 × 3 window) Average denoising performance over 50 testing images. Results are measured in PSNR.

image/σ                   5       15      25      50
Noisy image               34.10   24.70   20.15   14.16
McAuley et al.^1          36.19   29.17   26.04   22.45
Our algorithm^1           37.12*  29.78*  26.98*  23.38*
McAuley et al.^2          36.83   29.74   26.69   23.15
Bilateral filtering       28.11   27.18   25.75   21.87
Wavelet-based denoising   36.11   28.99   25.98   22.41

* Indicates significant difference in performance compared with the upper one.
Table 2. (5 × 5 × 3 window) Average denoising performance over 50 testing images. Results are measured in PSNR.

image/σ                   5       15      25      50
Noisy image               34.10   24.70   20.15   14.16
McAuley et al.^1          36.57   29.59   26.55   22.79
Our algorithm^1           37.56*  30.11*  27.41*  23.69*
McAuley et al.^2          37.28   30.08   27.16   23.41
Bilateral filtering       29.32   27.78   25.82   21.90
Wavelet-based denoising   36.41   29.50   26.32   22.46

* Indicates significant difference in performance compared with the upper one.
Wavelet-based denoising [14]. In the bilateral filtering and wavelet-based denoising experiments, the RGB test images were converted into YCbCr format, which has less correlation between colour channels, before processing. As these results show, the denoising performance of the priors learned by our algorithm is significantly better than that obtained with the priors of McAuley et al. [1] and with the other two denoising approaches. Tables 1 and 2 summarise the results of denoising 50 test colour images (from the Berkeley Segmentation Database) in which all three channels have been equally corrupted, and compare our algorithm with McAuley et al.'s priors and two well-known methods in the 3 × 3 × 3 and 5 × 5 × 3 case, respectively. As the results show, the performance of the priors learned by our proposed algorithm for both model sizes is statistically significantly superior (paired t-test at the 0.05 level) to its counterparts that use the same training/random image set. Furthermore, the performance of our priors learned from 2,000/2,000 training/random image patches is comparable with that of the priors learned in [1], which used 100,000/50,000 training/random image patches.
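For reference, the PSNR measure used in Tables 1 and 2 can be computed as follows (a minimal sketch over all pixels and channels; function name is hypothetical):

```python
import numpy as np

def psnr(restored, original):
    """PSNR = 10 * log10(255^2 / MSE), with MSE averaged over pixels and channels."""
    diff = restored.astype(np.float64) - original.astype(np.float64)
    mse = np.mean(diff ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```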
5 Conclusion
In this paper, we have proposed a learning algorithm for high-order MRF prior models for colour image denoising. By collecting a relatively small set of sample

^1 Indicates priors learned from 2,000/2,000 training/random patches.
^2 Indicates priors learned from 100,000/50,000 training/random patches.
colour image patches from a standard colour image database, we have learned priors specific to colour images using several gradient ascent updates under the maximum likelihood rule. Results comparing the colour prior models learned by our algorithm to a state-of-the-art colour image prior model [1] show performance improvements.
References

1. McAuley, J., Caetano, T., Smola, A., Franz, M.: Learning high-order MRF priors of color images. In: ICML. LNCS, vol. 4503, pp. 617–624. Springer, Heidelberg (2006)
2. Freeman, W., Pasztor, E., Carmichael, O.: Learning low-level vision. International Journal of Computer Vision 40, 25–47 (2000)
3. Roth, S., Black, M.: Fields of experts: A framework for learning image priors. In: ICCV, pp. 860–867 (2005)
4. Olshausen, B., Field, D.: Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Research 37, 3311–3325 (1997)
5. Welling, M., Hinton, G., Osindero, S.: Learning sparse topographic representations with products of Student-t distributions. In: NIPS, vol. 15, pp. 1359–1366 (2003)
6. Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. PAMI 6(6), 721–741 (1984)
7. Hinton, G.: Products of experts. In: 9th ICANN, pp. 1–6 (1999)
8. Zhu, S., Wu, Y., Mumford, D.: Filters, random fields and maximum entropy (FRAME): Towards a unified theory of texture modeling. International Journal of Computer Vision 27, 107–126 (1998)
9. Abramowitz, M., Stegun, I.A.: The process of the arithmetic–geometric mean. In: Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, 9th printing, p. 571 (1972)
10. Krantz, S.: Handbook of Complex Variables, p. 118. Birkhäuser, Boston, MA (1999)
11. Moré, J., Thuente, D.: Line search algorithms with guaranteed sufficient decrease. ACM Trans. Math. Software 20, 286–307 (1994)
12. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: 8th ICCV, pp. 416–423 (2001)
13. http://mesh.brown.edu/dlanman/photos/Bilateral
14. Portilla, J., Strela, V., Wainwright, M., Simoncelli, E.: Image denoising using scale mixtures of Gaussians in the wavelet domain. IEEE Trans. Image Processing 12(11), 1338–1351 (2003)
Hierarchical Learning of Dominant Constellations for Object Class Recognition

Nathan Mekuz and John K. Tsotsos

Center for Vision Research (CVR) and Department of Computer Science and Engineering, York University, Toronto, Canada M3J 1P3
{mekuz,tsotsos}@cse.yorku.ca
Abstract. The importance of spatial configuration information for object class recognition is widely recognized. Single isolated local appearance codes are often ambiguous. On the other hand, object classes are often characterized by groups of local features appearing in a specific spatial structure. Learning these structures can provide additional discriminant cues and boost recognition performance. However, the problem of learning such features automatically from raw images remains largely uninvestigated. In contrast to previous approaches which require accurate localization and segmentation of objects to learn spatial information, we propose learning by hierarchical voting to identify frequently occurring spatial relationships among local features directly from raw images. The method is resistant to common geometric perturbations in both the training and test data. We describe a novel representation developed to this end and present experimental results that validate its efficacy by demonstrating the improvement in class recognition results realized by including the additional learned information.
1
Introduction
Humans are highly adept at classifying and recognizing objects with significant variations in shape and pose. However, the complexity and degree of variance involved make this task extremely challenging for machines. Current leading edge methods use a variety of tools including local features [1,2,3,4], global [5] and region [6] histograms, dominant colors [7], textons [8] and others, collecting features sparsely at detected key points, at random locations, or densely on a grid and at single or multiple scales. In practice, different types of features are often complementary and work well in different scenarios, and good results are often achieved by combining different classifiers. Of the above approaches, much focus has recently been dedicated to learning with local appearance descriptors, which have been shown to be extremely effective thanks to their discriminant qualities and high degree of resistance to geometric and photometric variations as well as partial occlusions. A very effective and widely-used technique that enables the use of efficient search methods borrowed from the text retrieval field is vector quantization,
whereby each patch is associated with a label (visual word) from a vocabulary. The vocabulary is usually constructed offline by means of some clustering algorithm. To avoid aliasing effects arising from boundary conditions, soft voting is employed, whereby each vote is distributed into several nearby words using some kernel function. Finally, images are coded as histograms of their constituent visual words. While the importance of local features’ spatial configuration information for object class recognition is widely recognized, the basic scheme described above is typically employed on sets of isolated local appearance descriptors. However, for the most part, local appearance descriptors were designed to recognize local patches. When used for recognizing objects, the spatial layout that they appear in is of paramount importance. The SIFT algorithm [9], for example, represents local features in a way that is invariant to geometric perturbations. However, it also stores the parameters of the local geometry, and subsequently applies a Hough transform to select from potential hypotheses a model pose that conforms to the geometry associated with a large number of identified keys. Current systems that capture spatial information do so by learning and enforcing local relationships [10,11], global relationships [12,13], using dense sampling [2,1], or at multiple levels [14,15]. In [14], the system learns groups of local features that appear frequently in the training data, followed by global features composed of local groupings. In [11], spatial consistency is enforced by requiring a minimum number of features to co-occur in a feature neighborhood of fixed size. The authors of [16,12] demonstrate the benefit of learning the spatial relationships between various components in an image from a vocabulary of relative relationships. In [2], appearance models are built where clusters are learned around object centers and the object representation encodes the position and scale of local parts within each cluster. Significant performance gains are reported resulting from the inclusion of location distribution information. Fergus et al. [1] learn a scale-normalized shape descriptor for localized objects. However, the shapes are not normalized with respect to any anchor point. Consequently, some preprocessing of the input images is required. A boosting algorithm that combines local feature descriptors and global shape descriptors is presented in [13], however extracting global shape is extremely difficult under occlusion or cluttered background. We take a different approach and seek to learn object class-specific hierarchies of constellations, based on the following principles: Unsupervised learning. A clear tradeoff exists between the amount of training data required for effective learning, and the quality of its labeling. Given the high cost of manual annotation and segmentation, and the increased availability (e.g. on the internet) of images that are only globally annotated with a binary class label, a logical goal is the automatic learning of constellation information from images with minimal human intervention. Specifically, this precludes manual segmentation and localization of objects in the scene. Invariance to shift, scale and rotation. In order to be able to train with and recognize objects in various poses, we require a representation that
captures spatial information, yet is resistant to common geometric perturbations.

Robustness. In order to successfully learn in an unsupervised fashion, the algorithm must be robust to feature distortions and partial occlusion. A common approach for achieving robustness is voting.

Learn with no spatial restrictions. We would like to learn spatial relationships over the entire image, without restrictions of region or prior (e.g. Gestalt principles). This allows grouping discontinuous features, e.g. features that lie on the outline of an object with a variable-texture interior.

The main contribution of this paper is a novel representation that captures spatial relationship information in a scale and rotation invariant manner. The constellation descriptors are made invariant by anchoring with respect to one local feature descriptor, similar to the way the SIFT local descriptor anchors with respect to the dominant orientation. We present a framework for learning spatial configuration information by collecting inter-patch statistics hierarchically in an unsupervised manner. To tackle the combinatorial complexity problem, higher-level histograms are constructed by successive pruning. The most frequently occurring constellations are learned and added to the vocabulary as new visual words. We also describe an efficient representation for matching learned constellations in novel images for the purpose of object class recognition or computing similarity.

The remainder of this paper is organized as follows: in Section 2 we describe our proposed constellation representation. This is followed by implementation details of the voting scheme in Section 3, and the matching algorithm in Section 4. The results obtained on images of various categories are presented in Section 5, and finally, Section 6 concludes with a discussion.
2 Invariant Constellation Representation
The constellation representation captures the types and relative positions, orientations and scales of the constituent parts. An effective representation must
Fig. 1. An illustration of the constellation representation. Local features are represented with circles, with arrows emanating out of them to indicate dominant orientation. Feature f1 is selected as anchor, and the positions, scales and orientations of the remaining features are expressed relative to it.
be resistant to minor distortions arising from changes in pose or artifacts of the local feature extraction process. Pose changes can have a significant effect on the coordinates of local features. Another key requirement is a consistent frame of reference. In the absence of models of localized objects, a frame of reference can be constructed as a function of the constituent features. One option is to use the average attributes of local features [17]. However, since our method uses local features that are quantized into discrete visual words, we opt for the simpler alternative of pivoting at the feature with the lowest vocabulary index. This results in a more compact representation (and in turn, computational complexity savings) by eliminating the need to store spatial information for the anchor feature. On the downside, the anchor feature may not lie close to the geometric center of the constellation, reducing the granularity of position information for the other features. Whatever method is used for selecting the pivot, detection of the constellation depends on the reliable recovery of the pivot feature. However, even if the pivot feature cannot be recovered (e.g. under occlusion), subsets of the constellation may still be detected. Our representation is illustrated graphically in Figure 1, with the local appearance features represented as circles, and their dominant orientation as arrows. Feature f1 is selected as anchor and the coordinate system representing the remaining constellation features is centered about it and rotated to align with its dominant orientation. More formally, given a set of local features F_i = ⟨Γ_i, t_i, x_i, y_i, α_i⟩, where Γ_i is the index of the visual word corresponding to feature F_i, t_i is its scale, x_i and y_i its position in the image and α_i its orientation relative to the global image coordinate system, we select the anchor F_* as F_* = argmin_i Γ_i, and construct the constellation descriptor encoding the anchor feature's type Γ_*, as well as the following attributes for each remaining feature F_j:

Type: Γ_j
Scale ratio: t_j / t_*
Relative orientation: α_j − α_*
Relative position: atan2(y_j − y_*, x_j − x_*) − α_*, where atan2 is the quadrant-sensitive arctangent function. This attribute ignores distances, and merely provides a measure of F_j's polar angle relative to F_*, using F_*'s coordinate system.

As an example, using this representation, a pair of local features {F_1, F_2} with Γ_1 < Γ_2 is represented as ⟨Γ_1, Γ_2, t_2/t_1, α_2 − α_1, atan2(y_2 − y_1, x_2 − x_1) − α_1⟩. A constellation of m local features is represented as an n-tuple with n = 4m − 3 elements. To maintain a consistent representation, the descriptor orders the local features by their vocabulary index. If the lowest vocabulary index is not unique to one local feature, we build multiple descriptors, just as the SIFT algorithm creates multiple descriptors at each keypoint where multiple dominant orientations exist.
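A minimal sketch of how such a descriptor could be assembled from quantized local features is shown below. The tuple layout and function name are assumptions, and the case of a non-unique lowest vocabulary index (which would produce multiple descriptors) is omitted for brevity:

```python
import math

def constellation_descriptor(features):
    """Build the invariant descriptor for a constellation of local features.

    features: list of tuples (word, scale, x, y, orientation), orientation in radians.
    Returns (anchor word, then word, scale ratio, relative orientation and
    relative polar angle for every other feature), i.e. 4m - 3 elements.
    """
    feats = sorted(features, key=lambda f: f[0])          # order by vocabulary index
    w0, t0, x0, y0, a0 = feats[0]                         # anchor = lowest word index
    desc = [w0]
    for w, t, x, y, a in feats[1:]:
        desc.extend([
            w,                                            # type
            t / t0,                                       # scale ratio
            (a - a0) % (2 * math.pi),                     # relative orientation
            (math.atan2(y - y0, x - x0) - a0) % (2 * math.pi),  # relative position angle
        ])
    return tuple(desc)
```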
3 Voting by Successive Pruning
The learning phase performs histogram voting in order to identify the most frequently occurring constellations in each category, using the representation described above. Since the descriptor orders the local features by their type attribute, each resulting histogram takes the shape of a triangular hyper-prism, with the Γ (type) axes along the hyper-triangular bases and the other attributes forming the rectangular component of the prism. Spatial information is encoded in 8 × 8 × 8 bins. The relative orientation and relative position attributes encode the angle into one of eight bins, similar to the way this is done in SIFT. Scale ratios are also placed into a bin according to log_2(t_j/t_*) + 3. Co-occurrences with scale ratios outside the range [1/16, ..., 32] are discarded. As in SIFT, in order to avoid aliasing effects due to boundary conditions, we use soft voting whereby, for each attribute, each vote is hashed proportionally into two neighboring bins. For the type attribute, the vote is distributed into several nearby visual words using a kernel. We use a Gaussian kernel with σ set to the average cluster radius, although other weighting formulas are certainly possible. It is also possible to threshold by distance rather than fixing the number of neighbors. The exponential complexity of the problem, and in particular the size of the voting space, call for an approximate solution. A simple method that has often been used successfully is successive pruning. In computer vision, good results have been reported in [18], although some human supervision was necessary at the highest levels of the hierarchy. In some sense, the successive pruning strategy can be viewed as a coarse-to-fine refinement process. Starting with coarse bins, the algorithm identifies areas of the search space with a high number of counts. It then iteratively discards bins with a low number of counts, re-divides the rest of the voting space into finer bins, and repeats the voting process. In the case of multi-dimensional histograms, coarse bins can also be created by collapsing dimensions. This latter approach is more convenient in our case since it fits
Fig. 2. (a) A depiction of a triangular histogram used for voting for the most frequently co-occurring pairs. (b) Histogram pruning: bins with a low number of counts are discarded. Finer bins are allocated for each bin with a number of counts above a threshold.
naturally with the notion of hierarchical learning, creating larger constellations from smaller ones. Also, collapsing the dimensions associated with spatial information offers computational advantages by allowing early termination of the voting in the discarded bins, since visual word indices for local features are available immediately. Figures 2(a) and 2(b) illustrate the structures used in the two-phase voting process to identify pairs of local features that appear frequently in a particular spatial configuration. In the first phase, local feature descriptors are extracted and cached from all images belonging to an object class, and a triangular histogram (Fig. 2(a)) is collected to count the number of times each pair of local features co-occurs, regardless of geometry. In the second phase (Fig. 2(b)), sub-histograms are allocated for the bins with a high number of counts, and the remaining bins are discarded. Each sub-histogram consists of 8 × 8 × 8 bins and captures spatial information for its associated bin in the triangular histogram. Finally, the vocabulary is augmented with new visual words corresponding to the most frequent constellations in each class identified in phase 2.
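The following sketch illustrates the two-phase voting for pairs under simplifying assumptions: hard binning is used instead of soft voting, the pair-frequency threshold is a free parameter, and the data layout (each image as a list of (word, scale, x, y, orientation) tuples) is hypothetical:

```python
import math
from collections import defaultdict

def spatial_bins(anchor, other):
    """Map an ordered feature pair to (scale, orientation, position) bins, 8 each."""
    _, t0, x0, y0, a0 = anchor
    _, t1, x1, y1, a1 = other
    s_bin = int(math.floor(math.log2(t1 / t0) + 3))           # bin from log2 of scale ratio
    if not 0 <= s_bin <= 7:
        return None                                           # discard out-of-range ratios
    o_bin = int(((a1 - a0) % (2 * math.pi)) / (2 * math.pi) * 8) % 8
    p_bin = int(((math.atan2(y1 - y0, x1 - x0) - a0) % (2 * math.pi))
                / (2 * math.pi) * 8) % 8
    return s_bin, o_bin, p_bin

def vote_pairs(images, pair_threshold):
    """Phase 1: count word pairs; phase 2: spatial sub-histograms for frequent pairs."""
    pair_counts = defaultdict(int)
    for feats in images:
        for i, f in enumerate(feats):
            for g in feats[i + 1:]:
                a, b = sorted((f, g), key=lambda u: u[0])     # anchor = lower word index
                pair_counts[(a[0], b[0])] += 1
    frequent = {p for p, c in pair_counts.items() if c >= pair_threshold}

    sub_hists = defaultdict(lambda: defaultdict(int))         # (w1, w2) -> bins -> count
    for feats in images:
        for i, f in enumerate(feats):
            for g in feats[i + 1:]:
                a, b = sorted((f, g), key=lambda u: u[0])
                if (a[0], b[0]) in frequent:
                    bins = spatial_bins(a, b)
                    if bins is not None:
                        sub_hists[(a[0], b[0])][bins] += 1
    return sub_hists
```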
4 Indexing Constellation Descriptors for Efficient Matching
Given novel input images, the system compares constellations extracted from these images against the learned constellations stored in its vocabulary. To achieve this, an exhaustive search of all local feature combinations in the input images is not necessary. A more efficient search is possible by indexing the learned constellation information offline, as depicted in Figure 3. At the first level, the structure consists of a single array indexed by local feature type index. Given a moderate number of learned constellations, the resulting first-level array is sparse. Local feature types for which learned constellations exist have their array entries point to arrays of stored constellation descriptors, sorted lexicographically by the other type indices. The matching algorithm works by constructing an inverted file [19] of the local features in the image. A sparse inverted file containing only links to vocabulary
Fig. 3. A depiction of an efficient indexing structure for fast lookup of constellation features
entries that are in use suffices thanks to the sorted second-level arrays: a match is sought simply by traversing both lists simultaneously. Spatial relationship information is compared only when all type attributes in a constellation descriptor are matched in the inverted file.
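A simplified sketch of this two-level lookup is given below; the descriptor layout and the geometric-consistency test are placeholders, and the sorted-list traversal is replaced by a plain membership check for clarity:

```python
from collections import defaultdict

def build_index(constellations):
    """First level: anchor word -> descriptors sorted by their other word indices.

    Each constellation is assumed to be a dict with keys 'words' (tuple of member
    word indices, anchor first) and 'geometry' (stored spatial attributes).
    """
    index = defaultdict(list)
    for c in constellations:
        index[c['words'][0]].append(c)
    for anchor in index:
        index[anchor].sort(key=lambda c: c['words'][1:])
    return index

def match_image(index, image_words, geometry_matches):
    """Return learned constellations whose word set occurs in the image.

    image_words:      set of visual-word indices present in the image.
    geometry_matches: callable checked only after all type attributes match.
    """
    hits = []
    for anchor in image_words:
        for c in index.get(anchor, []):
            if all(w in image_words for w in c['words'][1:]) and geometry_matches(c):
                hits.append(c)
    return hits
```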
5 Evaluation
We tested our technique by examining the effect of using spatial relationship information captured using our descriptors on object class recognition performance. In order to isolate the effect of our constellation learning algorithm on class detection, we limited our evaluation to still greyscale images. We used the SIFT detector and descriptor to extract and represent local appearance features. We constructed a vocabulary of 13,000 visual words by extracting features from the first 800 hits returned by Google Images for the keyword ‘the’ and clustering with k-means. The result is a neutral vocabulary, that is not tuned specifically for any object category. For training and test data, we used 600 images of faces, airplanes, watches, bonsai trees and motorbikes from the PASCAL 2006 data set [20], divided equally into training and test images. All images were converted to greyscale, but no other processing was performed. In the training phase, image descriptors were collected for each of the training images encoding histograms of their constituent visual words. We used our neutral vocabulary constructed as described above, and quantized each feature descriptor to its 15 nearest neighbors using a Gaussian kernel with σ set to the average radius covered by each vocabulary entry. In the testing phase, each image was matched and classified using a simple unweighted nearest neighbor classifier against the trained image descriptors. As is standard practice, we used
(a)
               Faces  Airplanes  Watches  Bonsai trees  Motorcycles
Faces           88.3       53.3     45.0          21.7         70.0
Airplanes        3.3       36.7     11.7           1.7          5.0
Watches          5.0        0.0     26.7           1.7          1.7
Bonsai trees     1.7        8.3     16.7          71.7          8.3
Motorcycles      1.7        1.7      0.0           3.3         15.0

(b)
               Faces  Airplanes  Watches  Bonsai trees  Motorcycles
Faces           91.7       50.0     40.0          16.7         66.7
Airplanes        3.3       43.3     10.0           3.3          5.0
Watches          5.0        0.0     35.0           1.7          0.0
Bonsai trees     0.0        5.0     13.3          76.7          6.7
Motorcycles      0.0        1.7      1.7           1.7         21.7

Fig. 4. Confusion matrices (a) using a vocabulary of only local appearance features, (b) using an augmented vocabulary with an additional 50 constellation words per category.
Fig. 5. Object class categorization performance as a function of the number of constellation visual words used. The error bars represent a margin of 3 standard errors.
a stop list to discard the 2% most frequently occurring visual words. Although more elaborate classifiers (e.g. SVM) and weighting schemes (e.g. tf-idf [21]) are possible, we opted for the simple scheme described here in order to focus on the effect of the additional spatial information. We expect the tf-idf scheme to place increased weights on the constellation features, since they carry more class-specific discriminant information. We tested the effect of augmenting the vocabulary with 10, 20 and 50 constellation words per object class on class recognition performance. It is worth noting that the vocabulary used for constructing these additional constellation words was again our generic neutral vocabulary: the only class-specific information captured in the training phase was the most frequently co-occurring pairs and their spatial relationships in each class. A 2% stop list was again used on the visual words associated with the local features but not on the pairs. Figure 4 presents confusion matrices for the categorization tests (a) using no spatial information, and (b) using 50 additional constellation visual words. Perhaps surprisingly, poor performance is realized in the motorcycles category, where local feature-based methods typically excel. A likely explanation is that normally the vocabulary is constructed using images of the modeled class, and
captures features such as wheels in the case of the motorcycle class, whereas in our experiments we used a neutral vocabulary that was not trained specifically for any class. More importantly, however, we note that the addition of a few visual words corresponding to learned spatial features clearly boosted recognition performance in all classes, with average gains of about 5% using 50 constellation words. Figure 5 shows correct recognition results (corresponding to the diagonal of the confusion matrices) with different numbers of constellation words. The general trend shows recognition performance improving as more constellation features are used. The error bars represent an interval of 3 standard errors.
6 Discussion
This paper has presented a novel approach for representing constellation information that is learned directly from raw image data in a hierarchical fashion. The method is capable of learning spatial configuration information from possibly cluttered images where objects appear in various poses and possibly partly occluded. Novel images are tested for the presence of learned configurations in a way that is robust to common geometric perturbations. Additionally, the paper presents implementation details for an efficient voting algorithm that allows collecting robust co-occurrence statistics in a computationally highly complex voting space, and efficient indexing structures that allow fast lookup in the matching phase. Our experimental results confirm the importance of spatial structure to the class recognition problem, and show that the proposed representation can provide significant benefit with constellations consisting of pairs. We are currently exploring richer constellation structures corresponding to higher levels of the hierarchy and looking at ways for visualizing the learned constellations.
Acknowledgments The authors are grateful to Erich Leung and Kosta Derpanis for many helpful discussions. This work was supported by OGSST and Precarn incorporated.
References

1. Fergus, R., Perona, P., Zisserman, A.: Object class recognition by unsupervised scale-invariant learning. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 264–271 (2003)
2. Leibe, B., Mikolajczyk, K., Schiele, B.: Efficient clustering and matching for object class recognition. In: British Machine Vision Conference, Edinburgh, England (2006)
3. Berg, A.C., Berg, T.L., Malik, J.: Shape matching and object recognition using low distortion correspondences. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 26–33. IEEE Computer Society Press, Los Alamitos (2005)
4. Dorko, G., Schmid, C.: Object class recognition using discriminative local features (2005)
5. Ortega, M., Rui, Y., Chakrabarti, K., Mehrotra, S., Huang, T.S.: Supporting similarity queries in MARS. In: ACM International Conference on Multimedia, pp. 403–413. ACM Press, New York (1997)
6. Carson, C., Thomas, M., Belongie, S., Hellerstein, J., Malik, J.: Blobworld: a system for region-based image indexing and retrieval. Technical report, Berkeley, CA, USA (1999)
7. Mukherjea, S., Hirata, K., Hara, Y.: Amore: a world-wide web image retrieval engine. In: CHI 1999. Extended abstracts on human factors in computing systems, pp. 17–18. ACM Press, New York (1999)
8. Malik, J., Belongie, S., Shi, J., Leung, T.K.: Textons, contours and regions: Cue integration in image segmentation. In: IEEE International Conference on Computer Vision, pp. 918–925. IEEE Computer Society Press, Los Alamitos (1999)
9. Lowe, D.G.: Object recognition from local scale-invariant features. In: IEEE International Conference on Computer Vision, vol. 1150, IEEE Computer Society Press, Los Alamitos (1999)
10. Lazebnik, S., Schmid, C., Ponce, J.: Affine-invariant local descriptors and neighborhood statistics for texture recognition. In: IEEE International Conference on Computer Vision, vol. 649, IEEE Computer Society, Los Alamitos (2003)
11. Sivic, J., Zisserman, A.: Video Google: A text retrieval approach to object matching in videos. In: IEEE International Conference on Computer Vision, vol. 2, pp. 1470–1477 (2003)
12. Lipson, P., Grimson, E., Sinha, P.: Configuration based scene classification and image indexing. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 1007, IEEE Computer Society, Los Alamitos (1997)
13. Zhang, W., Yu, B., Zelinsky, G.J., Samaras, D.: Object class recognition using multiple layer boosting with heterogeneous features. In: IEEE Conference on Computer Vision and Pattern Recognition
14. Amit, Y., Geman, D.: A computational model for visual selection. Neural Comput. 11, 1691–1715 (1999)
15. Agarwal, A., Triggs, W.: Hyperfeatures - multilevel local coding for visual recognition. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, Springer, Heidelberg (2006)
16. Sinha, P.: Image invariants for object recognition. Invest. Opth. & Vis. Sci. 34(6) (1994)
17. Shokoufandeh, A., Dickinson, S.J., Jönsson, C., Bretzner, L., Lindeberg, T.: On the representation and matching of qualitative shape at multiple scales. In: European Conference on Computer Vision, pp. 759–775. Springer, Heidelberg (2002)
18. Fidler, S., Berginc, G., Leonardis, A.: Hierarchical statistical learning of generic parts of object structure. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 182–189. IEEE Computer Society Press, Los Alamitos (2006)
19. Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edn. Morgan Kaufmann Publishers Inc., San Francisco (1999)
20. Everingham, M., Zisserman, A., Williams, C.K.I., Van Gool, L.: The PASCAL Visual Object Classes Challenge. In: VOC2006 (2006)
21. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison-Wesley, Reading (1999)
Multistrategical Approach in Visual Learning

Hiroki Nomiya and Kuniaki Uehara

Graduate School of Science and Technology, Kobe University
[email protected],
[email protected]
Abstract. In this paper, we propose a novel visual learning framework to develop flexible and accurate object recognition methods. Currently, most visual-learning-based recognition methods adopt a monostrategy learning framework using a single feature. However, real-world objects are so complex that it is quite difficult for a monostrategy method to classify them correctly. Thus, utilizing a wide variety of features is required to distinguish them precisely. In order to utilize various features, we propose multistrategical visual learning by integrating multiple visual learners. In our method, multiple visual learners are collaboratively trained. Specifically, a visual learner L intensively learns the examples misclassified by the other visual learners. In turn, the other visual learners learn the examples misclassified by L. As a result, a powerful object recognition method can be developed by integrating various visual learners even if they have mediocre recognition performance.
1 Introduction
To achieve flexible and accurate recognition in computer vision, an effective framework called visual learning has been proposed, which introduces machine learning techniques into computer vision. However, conventional visual learning methods adopt monostrategy learning. That is, they are based on a single learning strategy and are able to utilize only a few features. Thus, they often fail to distinguish complex objects. Since an image is described with various features such as contour, color and texture, a wide variety of features are required for flexible recognition. Therefore, it is essential to integrate multiple features. To solve this problem, multistrategy learning was developed [1]. Since it can deal with multiple learning frameworks, it is potentially more capable. For example, Nomiya et al. proposed a multistrategy visual learning method by integrating two types of object recognition methods (appearance-based and region-based) using decision trees and discriminant analysis [2], but the integration method is so simple that the recognition performance is inadequate. Most existing multistrategy learning methods simply integrate the learning results of all base learners using, for example, a linear combination. Moreover, each learner is trained separately. Since the features are mutually interrelated, the visual learners should be trained collaboratively. Thus, we propose an effective learning framework in which the visual learners can cooperate with each other. If a visual learner L frequently misclassifies some examples, it seems to be difficult for L to correctly discriminate the examples. Then, L never learns the
examples and they are learned by the other visual learners. Instead, L learns the examples misclassified by the other learners. As a result, each visual learner is specialized in discriminating the objects which have particular features. For example, an appearance-based visual learner will be specialized in discriminating the objects whose shapes are unique while they have various colors and textures. Conversely, a region-based visual learner will be specialized in discriminating the objects whose colors and textures are unique while they have similar shapes. To classify an example (i.e. an image), a more accurate prediction can be obtained by determining the most suitable visual learner based on the learning result. Therefore, our method can be more efficient than the conventional multistrategy learning methods. In order to develop this learning scheme, we propose the intensive and collaborative learning framework in the following section.
2 Multistrategical Visual Learning
We propose an effective multistrategy learning model based on AdaBoost [3]. We improve its weighting algorithm so that each visual learner can be trained collaboratively. First, we extend AdaBoost to solve multiclass problems. AdaBoost.M1 [3] is a direct extension to the multiclass problems, but it shares with AdaBoost the property that the classification error of each weak learner must be less than 1/2. This is a crucial constraint when the number of classes is large. AdaBoost.M2 [3] gives a solution to this problem by using a pseudo-loss. A hypothesis whose pseudo-loss is lower than 1/2 is much more easily generated for multiclass problems. Thus, we adopt the AdaBoost.M2 algorithm. It takes a training set {(x_1, y_1), ..., (x_m, y_m)}, where m is the number of training examples, x_i is an element of the instance space X, and y_i is an element of the label space Y = {1, ..., C}, where C is the number of classes. In object recognition problems, an example corresponds to an image and a class label corresponds to the object in the image. AdaBoost.M2 can be extended to our multistrategy learning framework as follows. The label weighting functions q_t^{X^l} are computed for each base learner using the weight vectors of the l-th base learner, w_{t,y}^{X^l}(i), which is the i-th example's weight for the y-th class (y ≠ y_i) at the t-th round:

q_t^{X^l}(i, y) = \frac{w_{t,y}^{X^l}(i)}{W_t^{X^l}(i)},

where w_{1,y}^{X^l}(i) = \frac{1}{m(C-1)} for all l and i, and

W_t^{X^l}(i) = \sum_{y \neq y_i} w_{t,y}^{X^l}(i).

Then, the weight distribution D_t^{X^l} for the l-th base learner at the t-th round can be computed based on the weight vectors of the other base learners as follows:

D_t^{X^l}(i) = \frac{1}{n-1} \sum_{j \neq l}^{n} \frac{W_t^{X^j}(i)}{\sum_{i=1}^{m} W_t^{X^j}(i)}.   (1)
Equation (1) represents the intensive and collaborative learning framework. It is obvious that the weight distribution D_t^{X^l} depends on the learning results of the other base learners X^j (j ≠ l). If X^j misclassifies an example, then X^l's weight for that example is increased. Thus, X^l intensively learns the examples misclassified by X^j. Conversely, if X^l misclassifies some examples, X^j intensively learns them. Consequently, each base learner is collaboratively trained by leaving the examples misclassified by itself to the other base learners and training on the examples misclassified by the other base learners. This intensive and collaborative learning framework leads to better performance than conventional multistrategy learning, which integrates only the learning results of the individual base learners. The pseudo-loss ε_t^{X^l} of the l-th base learner at the t-th round is given by

\varepsilon_t^{X^l} = \frac{1}{2} \sum_{i=1}^{m} D_t^{X^l}(i) \left( 1 - [h_t^{X^l}(x_i) = y_i] + \sum_{y \neq y_i} q_t^{X^l}(i, y)\, [h_t^{X^l}(x_i) = y] \right).
For any predicate π, we define [π] = 1 if π holds, and [π] = 0 otherwise. At the next round, the new weight vectors are calculated as

w_{t+1,y}^{X^l}(i) = w_{t,y}^{X^l}(i) \exp\left( \frac{\beta_t^{X^l}}{2} \left( 1 + [h_t^{X^l}(x_i) = y_i] - [h_t^{X^l}(x_i) = y] \right) \right),

where \beta_t^{X^l} = \log\{(1 - \varepsilon_t^{X^l}) / \varepsilon_t^{X^l}\}. We define the final hypothesis H^{X^l} of the base learner X^l as follows:

H^{X^l}(x) = \operatorname*{argmax}_{c \in Y} H_T^{X^l}(c, x); \qquad H_t^{X^l}(c, x) = \sum_{\tau=1}^{t} \beta_\tau^{X^l}\, [h_\tau^{X^l}(x) = c],
βτX [hX τ (x) = c].
τ =1
where T is the number of rounds. To evaluate each hypothesis, we compute the class separability at each round. The class separability is a criterion calculated using a confusion matrix. If most of the training examples belonging to a class are correctly classified, then the class separability is high. Conversely, if most of the training examples are confused with the other class(es), then the class l l separability is low. We define the class separability sX t (c) of X for the c-th class at the t-th round as follows: ⎧ l Xl ⎨ sX t+ (c) if argmax Ht (y, x) = c l X y∈Y (2) st (c) = ⎩ sX l (c) otherwise. t− where,
C C
l
l sX t+ (c)
nX t,c,c
= C
i=1 l
l
nX t,c,i
,
l sX t− (c)
l
i=c
j=c
nX t,i,j
i=c
j=1
nX t,i,j
= C C
l
and nX t,i,j denotes the number of the examples whose class label is i and classified l
into the class j by HtX .
Multistrategical Approach in Visual Learning
505
As the learning proceeds, each base learner will be specialized in discriminating a kind of objects which are relatively easy to classify. As a result, the base learner can very precisely classify some kinds of objects but may sometimes misclassify the other objects. Thus, we estimate the confidence of the predictions of all the base learners to determine reliable base learners. We utilize the class separability of each weak hypothesis to estimate the confidence of the prediction of a base learner. Especially, the weak hypotheses generated at the beginning of the learning are more suitable because the weak hypotheses in the later stage of the learning are specialized. Thus, we define the confidence K based on the class separability of each weak hypothesis, emphasizing the suitable hypotheses. Xl (c) + KtXl (c) = Kt−1
t
l sˆX i (c)
(3)
i=1 X where K0X (c) = 0 for all c and sˆX t (c) is calculated by replacing Ht in equation X (2) with ht . The final hypothesis H is computed by integrating the learning results of all the base learners considering their confidence values as follows:
L X Xl l KT (c)HT (x) . (4) H(x) =argmax c
l=1
That is, the final hypothesis contains two prediction steps. The first step is to predict which base learner can correctly classify the example. The second step is to predict the class label of the example by the hypotheses of the base learners.
3
Base Learners
The appearance is an essential feature to recognize objects. Thus, we utilize a set of straight lines extracted from the contour. We call the lines contour fragments. Since a contour fragment is too simple, we discriminate objects by finding meaningful combinations of the contour fragments called patterns. In the learning process, we first extract contour fragments using stick growing method [4]. Next, we find meaningful combinations of contour fragments. We show the process in Figure 1. In Figure 1, (a) and (b) represent the original image and the extracted contour fragments respectively. P1 , P2 and P3 in (c) are the patterns found by searching mutually adjacent contour fragments.
Fig. 1. An example of the pattern extraction process
506
H. Nomiya and K. Uehara
A frequent pattern, which is common to the examples in a certain class is useful. We define a frequent pattern as the pattern which satisfies the condition nc that N > ρ, where, for the c-th class, nc is the number of examples which c nc contain the pattern. Nc is the number of examples in the c-th class. Thus, N c corresponds to the probability that the pattern is included in the c-th class. ρ is the frequency threshold. To find useful frequent patterns, we define a criterion to evaluate the usefulness U of a frequent pattern p for the c-th class as follows: nc . (5) Uc (p) = C i=1 ni where C is the number of classes. When a test example is given, the frequent patterns {pi } (i = 1, · · · , m) are extracted from the object, where m is the number of frequent patterns. Each pattern pi is compared with each useful frequent pattern qic (i = 1, · · · , mc ) for each class, where mc is the number of frequent patterns in the c-th class. The similarity σ(pi , qi ) between pi and qi is calculated for each frequent pattern. Then, the confidence Sc of the test example for the c-th class is calculated as follows. Sc corresponds to the possibility that the class label of the example is c. Sc =
M
σ(pi , qi ) where M = min{m, mc }.
(6)
i=1
If p is similar to q, σ(pi , qi ) is Uc (qi ), otherwise 0. In equation (6), pi and the corresponding frequent pattern qi are determined so that Sc is minimized. To determine whether a pattern p is similar to a pattern q, we define the following conditions for each contour fragment lip and liq in p and q: Condition 1: Condition 2: Condition 3:
np = nq = n. p q 1 r < |li |/|li | < r (i = 1, · · · , n). p q A(li , li ) < θ (i = 1, · · · , n).
where np and nq are the numbers of the contour fragments included in p and q. A(x, y) represents the angle between x and y. r and θ are the thresholds. When all the contour fragments in p and q satisfy these conditions, p is similar to q. The test example is classified into the class which has the highest confidence. The region component which represents the color and texture is a discriminative feature. We consider the region component in the minimum region that contains all contour fragments represented by the encircled regions in Figure 2. Figure 2 (a) is the minimum region in the original image. (b) is the corresponding contour fragments. We use the pixel intensity values in the minimum region. But there is the problem of the computational cost caused by a large amount of pixels. To solve this problem, we introduce Generic Fourier Descriptor (GFD) [5] and reduce the dimensionality of the feature vectors. GFD is a rotation-invariance descriptor derived by applying a 2D polar Fourier transform to the polar image as shown in Figure 2 (c). The transform is given by F (ρ, φ) =
−1 R−1 T r=0 i=0
2πi r φ) f (r, θi ) exp j2π( ρ + R T
(7)
Multistrategical Approach in Visual Learning
507
Fig. 2. An example of the minimum region and GFD
where R and T are the radial and angular resolutions and θi = 2πi T (0 ≤ i < T ). Figure 2 (d) represents the Fourier coefficients. We use the GFD feature vector as the feature vector calculated as follows:
GF D =
|F (0, n)| |F (m, n)| |F (0, 0)| ,···, ,···, area |F (0, 0)| |F (0, 0)|
(8)
where area represents the area of the polar image. m and n are the maximum numbers of the radial and angular frequencies respectively. GFD is calculated to obtain the feature vector of the minimum region. The similarity between two images is calculated as the distance between the two GFD feature vectors. The similarity S(x, t) between a test example x and a training example t is defined as the Euclidian distance between the GFD feature vectors as follows: S(x, t) =
d
− 12 {GF Di (x) − GF Di (t)}
2
(9)
i=1
where d is the number of dimensions of the GFD feature vectors. GF Di (x) and GF Di (t) are the i-th dimension’s values of the examples x and t respectively. The example x is classified into the class which has the highest average similarity.
4
Experiments
We carried out some experiments to verify the performance of our method with real-world images. The images are in the ETH-80 Image Set database [6]. This data set contains 8 different objects; apple, car, cow, cup, dog, horse, pear, and tomato. Each class contains 410 images (10 kinds of objects from 41 different directions). We used a total of 656 images as the training set and the other 2624 images as the test set. The number of rounds is experimentally set to 100. In this experiment, we used the following six base learners. The first and second learners are the appearance and region based methods proposed in this paper. The third learner is based on feature tree [2]. This method combines some predefined features into some decision trees called feature trees. The fourth learner is based on Scale Invariant Feature Transform (SIFT) [7]. This method generates the deformation-invariant descriptors by finding some keypoints in an object. The fifth learner is based on PCA-SIFT [8]. It can generate more distinctive and compact descriptors than SIFT by introducing Principal Component
508
H. Nomiya and K. Uehara
Analysis. The sixth learner is based on shape context method [9]. It discriminates an object using a set of points on its contour called shape context. 4.1
Comparison with Other Object Recognition Methods
We compare the recognition performance of our method with the following six object recognition methods. First, the shape context by Belongie et al. [9]1 . Second, an appearance-based method called multidimensional receptive histogram by Schiele et al. [10]. This method describes local shapes of objects using statistical representations to recognize the objects. Third, a region-based method based on color indexing by Swain et al. [11]. It discriminates an object using RGB histograms calculated from the pixels in the object. Fourth, a regionbased method using local invariant features by Grauman et al. [12]. It utilizes deformation-invariant local features generated by a gradient-based descriptor. Fifth, a region-based method based on boosting by Tu et al. [13]. In this method, probabilistic boosting-tree framework is introduced to construct discriminative models. Finally, a visual learning method by Mar´ee et al. [14]. In this method, an object is described by randomly extracted multi-scale subwindows in the image. The random decision tree ensembles are constructed using the subwindows. The recognition accuracy for each method is shown in Table 1. Table 1. The recognition accuracy (in %) of each method methods Swain accuracy(%) 64.85
Mar´ee 74.51
Tu 76
Schiele 79.79
Grauman 81
Belongie 81.06
proposed 84.49
Our method outperforms all the other recognition methods. This result reflects the effectiveness of the multistrategical learning. Although the base learner using shape context method greatly contributes the high accuracy, our multistrategy learning is fully effective because the recognition performance is considerably improved compared with the recognition accuracy of a single base learner. 4.2
Recognition Performance of Multistrategical Learning Methods
In order to verify the effectiveness of our multistrategy learning from the viewpoint of the number of base learners, we construct five object classifiers using two, three, four, five and six different base learners and compare them. We call these classifiers L2 , L3 , L4 , L5 and L6 respectively. Li contains a total of i base 1
It is reported in [6] that the shape context method achieved 86.40% accuracy. However, this recognition accuracy has been achieved using over 98% of examples in the training set. Since this method discriminates objects by matching with each object in the training set, the accuracy is proportional to the number of the training examples as shown in [9]. Thus, we show in Table 1 the recognition accuracy of the shape context method using 20% of training examples in the same way as our experiments.
Fig. 3. The recognition accuracy with multiple base learners
Fig. 4. The recognition accuracy with hard example elimination
learners. In addition, since we use the shape context method as the sixth base learner, we also show the recognition accuracy of the shape context method for comparison. The result of the experiment is shown in Figure 3. The recognition accuracy increases with the number of base learners. The third learner (feature tree) and the fourth learner (SIFT) give new features to L3 and L4, so that more complex objects can be precisely described. The fifth learner (PCA-SIFT) is similar to the fourth learner because it is based on SIFT. Thus, the recognition performance is slightly improved in L5. Introducing the sixth base learner (shape context) considerably improved the total recognition accuracy. Although this improvement is due to the high recognition performance of the shape context method, the recognition accuracy of L6 is significantly higher than that of the shape context method alone. In addition, the recognition accuracy of our method for apples does not degrade when the shape context method is introduced, in spite of the low recognition accuracy of the shape context method on that class. However, there is room for improvement because the classification accuracy of our method for cars, cows and cups is lower than that of the shape context method.
The main reason for this result is that our method is vulnerable to hard examples. Hard examples are noisy examples such as deformed or occluded images. In our method, the examples misclassified by all the base learners are regarded as hard examples. In AdaBoost, it is a crucial problem that hard examples often cause overfitting and degrade the classification performance. In order to investigate the influence of hard examples, we performed an additional experiment by introducing NadaBoost [15], which can detect hard examples. We used NadaBoost instead of AdaBoost.M2 and eliminated the hard examples detected by NadaBoost during the learning process. The result of the experiment is shown in Fig. 4. By eliminating hard examples, the overall recognition performance of our method is improved and our method outperforms the shape context method on all the objects. From this result, we confirmed the influence of hard examples and the necessity to appropriately detect and eliminate them.
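To make the hard-example criterion above concrete, the following Python sketch is an illustration only: the authors use NadaBoost for detection, whereas here the "misclassified by every base learner" rule from the text is applied, and the classifier interface is an assumption. It removes such examples and renormalizes the boosting weights.

import numpy as np

def remove_hard_examples(X, y, base_learners, weights):
    """Drop examples misclassified by all base learners and renormalize the weights.

    base_learners : fitted classifiers assumed to expose a .predict(X) method.
    weights       : current boosting weights over the training examples.
    """
    preds = np.stack([clf.predict(X) for clf in base_learners])  # (n_learners, n_examples)
    hard = np.all(preds != y, axis=0)   # hard example: wrong under every base learner
    keep = ~hard
    w = weights[keep]
    w = w / w.sum()                     # renormalize the surviving weights
    return X[keep], y[keep], w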
4.3 Effectiveness of the Method to Integrate Visual Learners
To verify the effectiveness of our integration method, we compare our method with a voting method using the six base learners. The voting method separately trains the base learners and combines them using a voting method without weighting. The result using 5-fold cross validation is shown in Table 2.

Table 2. The recognition accuracy (in %) of our method and the voting method

          apple  car    cow    cup    dog    horse  pear   tomato  total
voting    84.51  85.06  69.45  82.38  70.07  70.43  83.60  75.79   77.66
proposed  90.30  89.88  76.10  87.01  78.23  79.27  89.63  85.49   84.49
For all objects and for the total recognition accuracy, the accuracy of our method is significantly better than that of the voting method according to a t-test at the 5% significance level. The voting method treats the predictions of all the base learners equally, even if an object is often misclassified by one base learner while another base learner correctly classifies it. In addition, each base learner is trained separately. As a result, using the voting method deteriorates the total recognition performance. Our method avoids this problem by selecting a suitable base learner depending on the given object. This result shows the advantage of our integration method and our learning framework.
5 Conclusion and Future Work
In this paper, we proposed an effective object recognition method based on multistrategical visual learning. Since our method collaboratively trains and integrates multiple visual learners, the discrimination performance can be improved compared with monostrategy visual learning methods. Through the experiments, we verified the performance of our method. The experimental results show that complex objects can be correctly discriminated by integrating diverse visual learners. However, the recognition accuracy of our method is still
inadequate in the presence of hard examples. As the additional experiment indicates, future work should make our method more robust to hard examples by appropriately detecting and eliminating them.
References
1. Levin, A., Viola, P., Freund, Y.: Unsupervised improvement of visual detectors using co-training. In: Proc. of the 9th IEEE International Conference on Computer Vision, pp. 626–633. IEEE Computer Society Press, Los Alamitos (2003)
2. Nomiya, H., Uehara, K.: Feature construction and feature integration in visual learning. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds.) ECML 2005. LNCS (LNAI), vol. 3720, pp. 86–95. Springer, Heidelberg (2005)
3. Freund, Y., Schapire, R.E.: A decision theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1), 119–139 (1997)
4. Nelson, R.C.: Finding line segments by stick growing. IEEE Transactions on Pattern Analysis and Machine Intelligence 16(5), 519–523 (1994)
5. Zhang, D., Lu, G.: Enhanced generic fourier descriptors for object-based image retrieval. In: Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 3668–3671. IEEE Computer Society Press, Los Alamitos (2002)
6. Leibe, B., Schiele, B.: Analyzing appearance and contour based methods for object categorization. In: Proc. of International Conference on Computer Vision and Pattern Recognition, pp. 409–415 (2003)
7. Lowe, D.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
8. Ke, Y., Sukthankar, R.: PCA-SIFT: A more distinctive representation for local image descriptors. In: Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 506–513 (2004)
9. Belongie, S., Malik, J., Puzicha, J.: Matching shapes. In: Proc. of the 8th IEEE International Conference on Computer Vision, pp. 454–463. IEEE Computer Society Press, Los Alamitos (2001)
10. Schiele, B., Crowley, J.L.: Recognition without correspondence using multidimensional receptive field histograms. International Journal of Computer Vision 36(1), 31–50 (2000)
11. Swain, M.J., Ballard, D.H.: Color indexing. International Journal of Computer Vision 7(1), 11–32 (1991)
12. Grauman, K., Darrell, T.: Efficient image matching with distributions of local invariant features. In: Proc. of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 627–634 (2005)
13. Tu, Z.: Probabilistic boosting-tree: Learning discriminative models for classification, recognition, and clustering. In: Proc. of the 10th IEEE International Conference on Computer Vision, vol. 2, pp. 1589–1596 (2005)
14. Marée, R., Geurts, P., Piater, J., Wehenkel, L.: Random subwindows for robust image classification. In: Proc. of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 34–40 (2005)
15. Nakamura, M., Nomiya, H., Uehara, K.: Improvement of boosting algorithm by modifying the weighting rule. Annals of Mathematics and Artificial Intelligence 41, 95–109 (2004)
Cardiac Motion Estimation from Tagged MRI Using 3D-HARP and NURBS Volumetric Model
Jia Liang1, Yuanquan Wang2, and Yunde Jia1
1 School of Computer Science & Technology, Beijing Institute of Technology, Beijing 100081, P.R. China
2 School of Computer Science, Tianjin University of Technology, Tianjin 300191, P.R. China
{liangjia,yqwang,jiayunde}@bit.edu.cn
Abstract. Concerning the analysis of tagged cardiac MR images, harmonic phase (HARP) is a promising technique with the largest potential for clinical use in terms of rapidity and automation without tag detection and tracking. However, it is usually applied to 2D images and only provides "apparent motion" information. In this paper, HARP is integrated with a nonuniform rational B-spline (NURBS) volumetric model to densely reconstruct the 3D motion of the left ventricle (LV). The NURBS model represents the anatomy of the LV compactly, and the displacement information that HARP provides within short-axis and long-axis images drives the model to deform. After estimating the motion at each phase, we smooth the NURBS models temporally to achieve a 4D continuous time-varying representation of LV motion. Experimental results on in vivo data show that the proposed strategy can estimate the 3D motion of the LV rapidly and effectively, benefiting from both HARP and the NURBS model. Keywords: Tagged MRI, LV, motion estimation, HARP, NURBS model.
1 Introduction
Tagged MRI [1,2] is a noninvasive technique that can be used for quantitative assessment of myocardial function and the dynamic behavior of the human heart, which is invaluable in the diagnosis of myocardial diseases [3]. In tagged MRI, tags move with the underlying tissue during the cardiac cycle, providing unsurpassed information about myocardial motion which can be used to calculate local strain and deformation indices from different myocardial regions. Hence, many studies have contributed to analyzing the deformation of tags to derive a motion model of the underlying tissue, such as [4,5,6]. These methods for tag detection and tracking are mostly manual or semiautomatic with human interaction, which makes cardiac motion analysis more time-consuming and more dependent on the validity of the detected tags. Therefore it is imperative to develop an automatic method for estimating heart motion. Recently, Osman et al. [7-9] have introduced a harmonic phase (HARP) technique for cardiac motion tracking without extracting tag features from an image. The approach treats the harmonic phase, which is computed from the inverse Fourier transform of the first off-center isolated spectral peak in the Fourier domain of a tagged MR image, as a
material property for tracking underlying motion and calculating myocardial strain. It is a promising technique with the largest potential for clinical use in terms of rapidity, simplicity, robustness and automation. However, the HARP technique is mostly applied to 2D images, typically on short-axis (SA) image planes, and thus only provides information about "apparent motion" while the true motion of the heart is 3D. Some work has been done to deal with this problem. Ryf et al. [10] presented a combined 3D tagging and imaging approach. Haber and Westin [11] constructed a finite element model (FEM) within the LV wall using the HARP phase computed on a sparse collection of image planes. Pan et al. [12] developed a mesh model approximating the mid-wall of the LV and then applied HARP to track the mesh. In this paper, we present a novel method that integrates HARP with a nonuniform rational B-spline (NURBS) volumetric model to densely reconstruct the 3D motion of the LV. First we extend HARP from 2D to 3D mathematically, making full use of the information afforded by SA and LA images, each with a grid tagging pattern, to obtain three mutually orthogonal components of the LV motion displacements. This process retains the rapidity and automation of the HARP method. Then a NURBS volumetric model is employed to fit the complex anatomy of the LV. The model offers an immediate and compact representation of a wide variety of shapes, and can model the geometry of the LV well with only a few control points coming from the sparse image planes. The model also interpolates the known sparse displacements to obtain a dense 3D motion field reflecting the natural continuity and smoothness of the three-dimensional tissue deformations. Finally, after smoothing the model temporally, we can estimate the 3D motion of the LV at any time, and consequently the local strain and deformation indices for different myocardial regions can be computed.
2 3D Extension of HARP
Harmonic phase (HARP) is an image processing technique developed to analyze the motion of the heart rapidly using MR tagging [7,8]. It is mostly applied to 2D images, typically on SA image planes, and thus only provides information about "apparent motion". The actual motion of the heart is not confined to the imaging plane but likely moves out of it; hence, 2D-HARP motion computations do not yield the true motion. Therefore we extend this method to 3D-HARP, which yields a more comprehensive description of the full 3D motion of myocardial tissues. This extension is also based on the principle that the harmonic phase value of a material point is time-invariant, a material property we can use to track the motion of material points of the heart in three-dimensional space. In tagged MRI, one tag pattern can only provide one component of the underlying motion. To achieve full 3D tracking of any point, the information coming from different mutually orthogonal tagging patterns has to be combined and interpolated in space and time. Hence, we need to acquire SA image planes with two orthogonal tag orientations to obtain the in-plane motion and several LA image planes to capture the third directional component of tissue motion, normal to the short-axis image planes. In the LA images, which are arranged radially, the tagging planes are usually applied orthogonal to the long axis and appear as parallel lines measuring longitudinal compression. Since the HARP
method needs 2D tagged images, an additional orthogonal tag pattern should be applied to the LA images during imaging. SA and LA images, each with a grid tagging pattern, are acquired for 3D-HARP. To detail the procedure of 3D-HARP analysis formally, we define a material point p ∈ IR^3 that lies in the cross-section of an SA image and an LA image. Points y_S and y_L are its projections onto the SA and LA images. To extract useful information from the Fourier transforms of the SA and LA images, we use the band-pass filter [8] to isolate the off-center (non-dc) spectral peak in one tag direction and rotate this filter by 90 degrees to isolate the spectral peak in the orthogonal tag direction. Zero padding the rest of the Fourier transform and performing an inverse Fourier transform then yields a complex image whose angle is called a harmonic phase image. The harmonic phase image gives a detailed picture of myocardial motion in the corresponding direction. The harmonic phases (φ_1, φ_2)^T computed from the SA image in the x/y tag directions correspond to y_S, and (φ_3, φ_4)^T computed from the LA image in the z/z' (orthogonal to z) tag directions correspond to y_L; they remain time-invariant throughout the motion. Thus, we track the point that has the same phase value in the filtered image sequence. Here the phase value of p is denoted by a vector φ = (φ_1, φ_2, φ_3)^T, where φ_1, φ_2 and φ_3 come from three mutually orthogonal tagging directions; φ_4 is used only for tracking the point on the LA image with 2D-HARP. Assume a material point is located at p_t at time t. If p_{t+1} is the position of this point at time t + 1, then

φ_n(p_t, t) = φ_n(p_{t+1}, t + 1),   n = 1, 2, 3.   (1)

This relationship provides the basis for tracking p_t from time t to time t + 1. In practice, φ cannot be calculated and visualized from the data directly, so its principal value, produced by a wrapping function, often takes its place. Despite the wrapping artifact, the principal value is also a material property of the tagged tissue and remains time-invariant. The remaining work is to find y_S with the same harmonic phase (φ_1, φ_2)^T in the SA image and y_L with the same harmonic phase (φ_3, φ_4)^T in the LA image. The displacement field u = (u_x, u_y, u_z)^T for all intersections can then be obtained.
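As a rough Python sketch of the harmonic phase extraction described above (an illustration, not the authors' implementation: the Gaussian band-pass shape, its width, and the location of the spectral peak are assumptions supplied by the user), one off-center spectral peak of a tagged image is isolated and the angle of the inverse transform gives the wrapped phase.

import numpy as np

def harmonic_phase(image, peak_uv, sigma=5.0):
    """Return the (wrapped) harmonic phase image for one tag direction.

    image   : 2D tagged MR image (numpy array).
    peak_uv : (u, v) offset of the off-center spectral peak from DC, in FFT bins.
    sigma   : width of the assumed Gaussian band-pass filter.
    """
    F = np.fft.fftshift(np.fft.fft2(image))
    rows, cols = image.shape
    u = np.arange(rows) - rows // 2
    v = np.arange(cols) - cols // 2
    V, U = np.meshgrid(v, u)
    # Gaussian band-pass centered on the chosen spectral peak; everything else ~ 0.
    bandpass = np.exp(-((U - peak_uv[0]) ** 2 + (V - peak_uv[1]) ** 2) / (2.0 * sigma ** 2))
    complex_img = np.fft.ifft2(np.fft.ifftshift(F * bandpass))
    return np.angle(complex_img)   # wrapped phase in (-pi, pi]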
3 Nonuniform Rational B-Splines Model
3.1 Principle
NURBS, an acronym for nonuniform rational B-splines, have become the de facto standard for computational geometric representation [13]. Their predominance lies in parametric continuity, local support, a compact and unified mathematical representation, and an extra degree of freedom in the form of weights, apart from the knot vector and control points, which can be used in designing a wide variety of shapes.
The NURBS volumetric model can be expressed as

P(u, v, w) = \frac{\sum_{i=0}^{m} \sum_{j=0}^{n} \sum_{f=0}^{h} ω_{i,j,f} D_{i,j,f} N_{i,k_1}(u) N_{j,k_2}(v) N_{f,k_3}(w)}{\sum_{i=0}^{m} \sum_{j=0}^{n} \sum_{f=0}^{h} ω_{i,j,f} N_{i,k_1}(u) N_{j,k_2}(v) N_{f,k_3}(w)} ,   (2)
where D_{i,j,f} represents a control point and ω_{i,j,f} is the corresponding weight. The quantities u, v and w are location parameters, and N(·) is the B-spline basis function. The variables k_1, k_2 and k_3 are the orders (one more than the degree) of the model in the directions of u, v and w respectively. After defining the knot vectors U = {u_1, …, u_{ms}}, V = {v_1, …, v_{ns}} and W = {w_1, …, w_{hs}}, each a nondecreasing sequence of real numbers called knots, a NURBS volume model is uniquely defined. The remaining work is to solve for the control points and weights in a least-squares sense. Given a set of discrete points Γ = {P_{p,q,r} = (x_{p,q,r}, y_{p,q,r}, z_{p,q,r})} and a corresponding set of weights Λ = {ξ_{p,q,r}} that quantify the relative confidence in the measurement of each corresponding point, where p ∈ [0, s_1], q ∈ [0, s_2], r ∈ [0, s_3], the weighted least-squares error criterion E_ξ^x for the x-coordinate is
E_ξ^x = \sum_{p,q,r} ξ_{p,q,r} \left( x_{p,q,r} - \frac{\sum_{i=0}^{m} \sum_{j=0}^{n} \sum_{f=0}^{h} ω_{i,j,f} D_{i,j,f}^x N_{i,k_1}(u_p) N_{j,k_2}(v_q) N_{f,k_3}(w_r)}{\sum_{i=0}^{m} \sum_{j=0}^{n} \sum_{f=0}^{h} ω_{i,j,f} N_{i,k_1}(u_p) N_{j,k_2}(v_q) N_{f,k_3}(w_r)} \right)^2 ,   (3)
where D_{i,j,f}^x is the x component of the control point D_{i,j,f}; the other parameters are defined as before. It is known that rational B-splines can be produced by projecting nonrational B-splines onto the hyperplane w = 1. Therefore we substitute the 3D control points D_i = {D_i^x, D_i^y, D_i^z} = {x_i, y_i, z_i} in 3D space by the 4D homogeneous coordinates D_i' = {D_i^{wx}, D_i^{wy}, D_i^{wz}, D_i^w} = {w_i x_i, w_i y_i, w_i z_i, w_i}, and then the weighted least-squares error criterion E_ξ^{wx} for the x-homogeneous coordinate is given by
E_ξ^{wx} = \sum_{p,q,r} ξ_{p,q,r} \left( x_{p,q,r} D_{p,q,r}^w - \sum_{i=0}^{m} \sum_{j=0}^{n} \sum_{f=0}^{h} D_{i,j,f}^{wx} B_{i,j,f}^{k_1,k_2,k_3}(u_p, v_q, w_r) \right)^2 ,   (4)
where

B_{i,j,f}^{k_1,k_2,k_3}(u_p, v_q, w_r) = N_{i,k_1}(u_p) N_{j,k_2}(v_q) N_{f,k_3}(w_r) ,   (5)
D_{p,q,r}^w = \sum_{i=0}^{m} \sum_{j=0}^{n} \sum_{f=0}^{h} ω_{i,j,f} B_{i,j,f}^{k_1,k_2,k_3}(u_p, v_q, w_r) .   (6)
Similar equations are used for y and z components. Then we follow the methodology given by Tustison and Amini [14] to solve the equations for the control points and the weights.
3.2 Anatomy of the LV
The image planes acquired by the tagged MRI imaging technique are usually sparse 2D images, while the actual anatomy of the LV is very complex and looks like a prolate spheroid. In order to cover the full geometry we employ a NURBS volumetric model, since it can be created with only a few control points. From the sparse SA and LA images, a finite set of discrete points located on the intersections of the contours and the images is observed. The phantom of the LV anatomy and the parametric directions of the model are shown in Fig. 1(d). Applying the fitting algorithm in Section 3.1, the NURBS volumetric model is obtained. Owing to its parametric continuity and compact representation, the model can represent any point of the 3D LV very well.
3.3 3D Motion Estimations
After building the NURBS volumetric model of the current phase, 3D-HARP is applied to the SA and LA images at the next phase to obtain the sparse 3D displacements. These displacements then drive the model to deform. Owing to the model's parametric continuity, local support and differentiability, we can obtain the dense 3D motion displacement field by differencing the NURBS volumetric models.
3.4 Four-Dimensional (4D) NURBS Model
After the 3D NURBS volumetric model is generated across 3D space for each phase, a time dimension is added to smooth it along all continuous time points. Given the orders of the model, the 4D grid of control points, the corresponding weights, and the knot vector sequences, the 4D NURBS model can be written as

P(u, v, w, t) = \frac{\sum\sum\sum\sum N(u) N(v) N(w) N(t) ω D}{\sum\sum\sum\sum N(u) N(v) N(w) N(t) ω} ,   (7)
where t denotes the time instant. The meanings of other parameters are the same as defined previously, and the subscripts are omitted for concise writing.
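For concreteness, the following Python sketch evaluates the tensor-product form of Eq. (2) at one parameter triple using the Cox-de Boor recursion. This is an illustrative implementation rather than the authors' code: the array layouts of the control points, weights and knot vectors are assumptions, and adding the time dimension of Eq. (7) amounts to one more basis factor and summation.

import numpy as np

def bspline_basis(i, k, t, knots):
    """Cox-de Boor recursion: B-spline basis N_{i,k}(t) of order k (degree k-1)."""
    if k == 1:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left = 0.0
    if knots[i + k - 1] != knots[i]:
        left = (t - knots[i]) / (knots[i + k - 1] - knots[i]) * bspline_basis(i, k - 1, t, knots)
    right = 0.0
    if knots[i + k] != knots[i + 1]:
        right = (knots[i + k] - t) / (knots[i + k] - knots[i + 1]) * bspline_basis(i + 1, k - 1, t, knots)
    return left + right

def nurbs_volume_point(u, v, w, ctrl, weights, U, V, W, k1, k2, k3):
    """Evaluate P(u, v, w) of Eq. (2). ctrl: (m+1, n+1, h+1, 3) control points."""
    m, n, h = ctrl.shape[0] - 1, ctrl.shape[1] - 1, ctrl.shape[2] - 1
    num = np.zeros(3)
    den = 0.0
    for i in range(m + 1):
        Nu = bspline_basis(i, k1, u, U)
        if Nu == 0.0:
            continue
        for j in range(n + 1):
            Nv = bspline_basis(j, k2, v, V)
            if Nv == 0.0:
                continue
            for f in range(h + 1):
                Nw = bspline_basis(f, k3, w, W)
                b = weights[i, j, f] * Nu * Nv * Nw
                num += b * ctrl[i, j, f]
                den += b
    return num / den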
4 Strain Analysis
Strain is a dimensionless quantity measuring the percent change in length at different points to describe the internal deformation of a continuum body. It is an appealing tool to study and quantify myocardial deformation.
Given the spatial coordinates of a point in the material, X at time t = 0 and x at time t > 0 , the deformation gradient tensor F includes both the rotation and deformation around a point in the material and can be calculated by
F_{pq} = ∂x_p / ∂X_q ,   (8)
where the subscripts p and q range from 1 to 3 and denote one of the 3D Cartesian coordinates. The Lagrangian strain tensor E only includes the deformation of the material with respect to its initial configuration, and is related to F as follows:
E = \frac{1}{2} \left( F^T F − I \right) ,   (9)
where the superscript T represents the matrix transpose and I represents the identity matrix. The Lagrangian strain E is used to describe systolic deformation in a region surrounding a point in the heart wall relative to its initial position at end-diastole.
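A minimal numerical sketch of Eqs. (8) and (9) follows (an illustration only; the regular-grid sampling and voxel spacing are assumptions): given a dense displacement field u(X) = x − X, the deformation gradient is F = I + ∂u/∂X and the Lagrangian strain follows directly.

import numpy as np

def lagrangian_strain(disp, spacing=(1.0, 1.0, 1.0)):
    """Compute the Lagrangian strain tensor E at every voxel.

    disp    : displacement field of shape (X, Y, Z, 3), u(X) = x - X.
    spacing : assumed grid spacing along each axis.
    Returns E of shape (X, Y, Z, 3, 3).
    """
    # grads[..., p, q] = d u_p / d X_q, estimated by finite differences
    grads = np.stack(
        [np.stack(np.gradient(disp[..., p], *spacing), axis=-1) for p in range(3)],
        axis=-2,
    )
    I = np.eye(3)
    F = I + grads                                   # Eq. (8): F_pq = dx_p/dX_q
    Ft_F = np.einsum('...ij,...ik->...jk', F, F)    # F^T F
    return 0.5 * (Ft_F - I)                         # Eq. (9)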
5 Experiments on in vivo Data
5.1 Imaging Protocol
For the studies shown in this paper, the following imaging protocol was utilized on a 1.5T Siemens Sonata scanner. Two sequences consisting of 11 phases of 9 tagged MR SA images and 9 LA images, each with a grid tagging pattern, a 256 × 208 acquisition matrix and 8 mm slice thickness, were acquired throughout the cardiac cycle from a normal healthy volunteer. The LA images were taken perpendicular to the SA images and arranged radially every 20 degrees around the long axis of the LV. Fig. 1 shows the relative spatial positions of the SA and LA images.
5.2 Initial NURBS Volumetric Model
To construct the reference NURBS volumetric model of the LV, we manually segment the endocardial and epicardial contours of the LV from the end-diastolic images. Then the initial reference NURBS (cubic spline) volumetric model of the LV is built, as illustrated in the top-left panel of Fig. 2. Here the color carries no meaning, and the cyan pentagrams denote the points at the reference phase.
Fig. 1. (a) The spatial relative position of the SA and LA images of LV. Red circles denote the origins of parallel SA images. (b) LA images locations (denoted by red lines) on a sample SA image. (c) SA images locations (denoted by red lines) on a sample LA image. (d) Parametric directions of the NURBS volumetric model of LV.
Fig. 2. The NURBS volumetric models of the LV at six different instants and the corresponding displacements. The cyan pentagrams denote the points at the reference phase and the blue pentagrams denote the corresponding positions at the current phase. The pink vectors roughly represent the displacements. The color bar denotes the length of the displacements. The top-left panel shows the initial NURBS volumetric model of the real LV and the reference points.
5.3 3D Motion Reconstruction
Once the initial reference NURBS volumetric model of the LV is obtained, the phases of the material points on the model are known. The next step is to track the 3D motion of this model according to the material property of phase invariance using the 3D-HARP method. Fig. 2 shows the NURBS volumetric models and the displacement field during cardiac systole. These results are similar to those reported in [15].
5.4 4D NURBS Model and Strain Analysis
The 4D NURBS model is generated by smoothing the 3D NURBS volumetric models over time using cubic splines, which is well suited to the real heart. The movement of each myocardial point over time can be captured accurately by assigning u, v and w any fractional values. In addition, the shape of the LV at any time instant can be obtained by setting t to any desired value. Using this model, the changes of displacement and strain over time can be obtained at all myocardial points with sub-pixel accuracy. Due to the LV geometry, it is appropriate to calculate the myocardial strains in the radial, circumferential, and longitudinal directions. The basal and midcavity portions of the LV are each divided into six regions in the SA view: antero-septal, anterior, lateral, posterior, inferior, and infero-septal. Normal LV strains, i.e., average radial, circumferential, and longitudinal strains, are given in Fig. 3. These results are similar to those reported in [16]. The radial strains mostly remain positive, indicative of the systolic thickening of the LV. The circumferential and longitudinal strains are negative, denoting
Fig. 3. Average Lagrangian normal strains are plotted for the six basal and midcavity regions of the left ventricle of a normal human volunteer. The different geometric shapes (star, diamond, and square) represent the radial, longitudinal, and circumferential strain values, respectively. The x axis marks the time point during systole.
shortening in the circumferential direction and compression in the longitudinal direction during LV contraction.
6 Conclusion
In this paper, we have proposed a novel method for dense 3D motion estimation of the LV without tag detection and tracking. This method takes advantage of the rapidity and automation of the HARP technique. It also benefits from NURBS properties, such as parametric continuity, local support, and a compact and unified mathematical representation for a wide variety of shapes. Under this framework, we have created a compact representation of the complex LV anatomy and reconstructed the motion of the LV on in vivo data; experimental results show that the dense 3D motion estimation and the local strains can be calculated rapidly and effectively. It is strongly felt that this tool will help take MR tagging from the ranks of a valuable scientific research tool into the ranks of a valuable diagnostic clinical tool. This method is also suitable for the right ventricle and even for the atria, with only a different model initialization.
Acknowledgments. We are grateful to Prof. Pheng Ann Heng of the Chinese University of Hong Kong for providing the in vivo human heart data. This work was supported by the Natural Science Foundation of China under grant 60602050 and the 973 Program of China (No. 2006CB303105).
References
1. Zerhouni, E.A., Parish, D., Rogers, W., Yang, A., Shapiro, E.: Human heart: tagging with MR imaging — a method for non-invasive assessment of myocardial motion. J. Radiology 169, 59–63 (1988)
2. Axel, L., Dougherty, L.: MR imaging of motion with spatial modulation of magnetization. J. Radiology 171, 841–845 (1989)
3. Masood, S., Yang, G.-Z., Pennell, D.J., Firmin, D.N.: Investigating intrinsic myocardial mechanics: the role of MR tagging, velocity phase mapping, and diffusion imaging. J. Magn. Reson. Imag. 12, 873–883 (2000)
4. Guttman, M.A., Prince, J.L., McVeigh, E.R.: Tag and contour detection in tagged MR images of the left ventricle. IEEE Trans. Med. Imag. 13, 74–88 (1994)
5. Amini, A.A., Chen, Y., Curwen, R.W., Mani, V., Sun, J.: Coupled B-snake grids and constrained thin-plate splines for analysis of 2-D tissue deformations from tagged MRI. IEEE Trans. Med. Imag. 17, 344–356 (1998)
6. Young, A.: Model tags: direct 3D tracking of heart wall motion from tagged magnetic resonance images. In: Wells, W.M., Colchester, A.C.F., Delp, S.L. (eds.) MICCAI 1998. LNCS, vol. 1496, pp. 92–101. Springer, Heidelberg (1998)
7. Osman, N., Kerwin, W., McVeigh, E., Prince, J.: Cardiac motion tracking using CINE harmonic phase (HARP) magnetic resonance imaging. J. Magn. Reson. Med. 42, 1048–1060 (1999)
8. Osman, N., McVeigh, E., Prince, J.: Imaging heart motion using harmonic phase MRI. IEEE Trans. Med. Imag. 19, 186–202 (2000)
9. Osman, N.F., McVeigh, E.R., Prince, J.L.: Visualizing myocardial function using HARP MRI. J. Phys. in Med. and Biol. 45, 1665–1682 (2000)
10. Ryf, S., Spiegel, M.A., Gerber, M., Boesiger, P.: Myocardial tagging with 3D-SPAMM. J. Magn. Res. Imag. 16, 320–325 (2002)
11. Haber, I., Westin, C.F.: Model-based 3D tracking of cardiac motion in HARP images. In: Int. Soc. Mag. Reson. Med., Honolulu, HI (2002)
12. Pan, L., Prince, J.L., Lima, J.A.C., Osman, N.F.: Fast tracking of cardiac motion using 3D-HARP. IEEE Trans. BioMed. Eng. 52, 1425–1435 (2005)
13. Piegl, L., Tiller, W.: The NURBS Book. Springer, Berlin (1997)
14. Tustison, N.J., Amini, A.A.: Biventricular myocardial kinematics based on tagged MRI from anatomical NURBS models. In: IEEE Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 514–519 (2004)
15. Luo, G., Heng, P.A.: LV Shape and Motion: B-Spline-Based Deformable Model and Sequential Motion Decomposition. IEEE Trans. Inform. Technol. BioMed. 9, 430–446 (2005)
16. Moore, C., Lugo-Olivieri, C., McVeigh, E., Zerhouni, E.: Three-dimensional systolic strain patterns in the normal human left ventricle: characterization with tagged MR imaging. J. Radiology 214, 453–466 (2000)
Fragments Based Parametric Tracking
Prakash C, Balamanohar Paluri, Nalin Pradeep S, and Hitesh Shah
Sarnoff Innovative Technologies Private Limited, Asha arch, Magrath Road, Bangalore-560025, India
Abstract. The paper proposes a parametric approach for color based tracking. The method fragments a multimodal color object into multiple homogeneous, unimodal fragments. The fragmentation process consists of multi-level thresholding of the object color space followed by an assembling step. Each homogeneous region is then modelled using a single parametric distribution and the tracking is achieved by fusing the results of the multiple parametric distributions. The advantage of the method lies in tracking complex objects under partial occlusions and various deformations such as non-rigid, orientation and scale changes. We evaluate the performance of the proposed approach on standard and challenging real world datasets.
1 Introduction
Two prominent components of a tracking system are the object descriptor and the search mechanism. The object descriptor is the representation of the object to be tracked using a set of features that capture various properties of the object such as appearance, shape, texture, etc. Given an object descriptor, the search mechanism, like [1,2], locates the region in a new image that best matches the object description. Multiple methods have been suggested in the literature for object descriptors. Most of the successful methods for tracking employ a non-parametric object descriptor like the histogram [1,3,4,5,6,7], as it faithfully captures the variability in the features of the object to be tracked. However, with an increase in the number of objects to be tracked or the features to be considered, the histogram size grows exponentially, which is undesirable. To address this issue, we propose a parametric object descriptor for color based tracking. An N-dimensional Gaussian distribution is employed as the object descriptor in the proposed approach. Such a descriptor can accurately model a unimodal object. But objects under consideration for tracking are generally multimodal in color space, making an N-d Gaussian descriptor insufficient. Hence, we need to convert multimodal objects to a unimodal representation. Primarily, there are two ways to achieve this conversion:
– By projecting the multimodal object into a space where it becomes unimodal
– By representing each mode separately
An approach similar to Collins et al. [5] can be used to find a linear transformation to project the multimodal color object into a unimodal space. A non-linear transformation, as suggested by Larry et al. [7], can be used to the same effect. However, in both cases the search for such an optimal transformation is not exhaustive over the entire space of possible transformations, as that would be computationally expensive. Hence the obtained transformation is suboptimal. For representing a multimodal object, Gaussian mixture model (GMM) distributions can also be used. However, computation of the GMM parameters is expensive and a priori knowledge of the number of modes is essential, rendering it inapplicable as an object descriptor for tracking. Therefore, in this paper we propose a method based on fragmenting multimodal objects into multiple homogeneous models using discriminant analysis. The fragmentation process finds the fragments online, as opposed to fragmenting the object into fixed sizes as suggested in [4]. Each fragment is then modelled using a single N-dimensional Gaussian distribution and tracked separately. These parametric distributions are used to generate a probability density function termed the strength image. The maximum likelihood (ML) framework proposed in [2] is used to estimate the location (mean) and shape (covariance) of the best matching region in the subsequent frame. The paper is organized as follows: Section 2 explains the proposed approach, experimental results are presented in Section 3 to illustrate the performance of the tracker, and Section 4 concludes the work and outlines future work.
2 Proposed Method
The proposed tracking approach is color based; hence, in modelling an object we use the color values of its pixels. Our initial step involves fragmenting based on the color values of the pixels. Prior work involved the application of multi-level thresholding techniques to segment an illumination/gray image [8]; but these techniques cannot be applied directly in our case, as our objective is to group regions similar in color rather than in illumination (gray). Hence, multi-level thresholding is done in color space. The input template is in the RGB space. Multi-level thresholding on the histogram generated using all three channels is not possible due to the immense size of the histogram (256 × 256 × 256). So, given the color template of the object in the RGB space, we first transform the input to HSV space. Since Hue represents the color component alone, multi-level thresholding on Hue gives the desired results. The grouped regions similar in color are then modelled using a single parametric distribution. The uni/multi modal classification and the fragmentation process both use the Hue image.
2.1 Fragmentation
The given region of interest (ROI) is initially divided into uniform blocks of size M × N. Each block (B) with mean (μ), variance (σ) and Hue histogram (H) is then classified as homogeneous if either of the following two criteria is satisfied:
1. If the variance of the region is less than a certain pre-defined value. The variance of the region is given by

σ^2 = \sum_{i ∈ B} (P_i − μ)^2   (1)

where P_i is the hue value at i and μ is the mean of the block.
2. If, when the block is divided into two classes C1 with values [1, …, t] and C2 with values [t + 1, …, L] using the optimal threshold given by

\arg\max_t \sum_{i=1}^{N} W_i (μ_i − μ)^2   (2)

where N is the number of classes, W_i is the total number of pixels in class i, μ_i is the mean of the i-th class and μ is the mean of the block, the Separability Factor (SF) of the block,

SF = \frac{BCV}{TV}   (3)

is less than a certain pre-defined value. Here TV is the total variance of the block given by (1) and BCV is the between-class variance given by

BCV = \sum_{i=1}^{N} W_i (μ_i − μ)^2   (4)
The fragmentation process is applied only to non-homogeneous regions. The multi-level thresholding is carried out until the SF of the block is less than a pre-defined value Th_SF. The TV of the region is constant and is used for normalization purposes. The BCV will be high when fragments of similar color are grouped and dissimilar colors are separated. The multi-level thresholding is done in the following way: each time, the fragment/class with the maximum within-class variance is selected (initially, the entire block is treated as one class), since a high within-class variance signifies that the class is non-homogeneous. The division is done by finding the optimal threshold given by (2). The process is repeated until the SF of the block is less than Th_SF. The class pool thus created needs to be assembled based on color similarity. The assembled regions will signify the multiple unimodal regions of the multimodal object. The assembling process is started with a new region which includes the first class of the first region. This is followed by a merging process which finds the classes similar to this class. The criterion for similarity is the difference of the mean values of the two classes. The class with the least difference is identified and, if the difference of the means is less than a pre-defined value Th_mean, the class is merged into the region. If none of the classes can be merged into any of the existing regions, a new region is created by picking up the class which
Fig. 1. Parrot sequence: The input image (a) is fragmented into five parts. (b) represents the body of the parrot (green), (c) represents the forehead (white), (d) represents the hair (blue), (e) represents the cheeks (red) and (f) represents the beak (yellow).
has the largest difference in the mean value from the existing regions. Then the unclassified classes are again tried for merging into this new region. The process is repeated until all the classes are merged into regions. The regions thus obtained form the unimodal fragments of the multimodal object. An example of the fragmentation is shown in Figure 1.
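The two-class split of Eq. (2) and the separability factor of Eq. (3) can be sketched in Python as follows (an illustration only; the Hue range and bin count are assumptions, and the recursive splitting and assembling logic described above is omitted for brevity).

import numpy as np

def optimal_threshold(hue_values, levels=256):
    """Two-class split of Eq. (2): choose t maximizing the between-class variance."""
    hist, _ = np.histogram(hue_values, bins=levels, range=(0, levels))
    total = hist.sum()
    mu = np.sum(np.arange(levels) * hist) / total
    best_t, best_bcv = 0, -1.0
    for t in range(1, levels):
        w1, w2 = hist[:t].sum(), hist[t:].sum()
        if w1 == 0 or w2 == 0:
            continue
        mu1 = np.sum(np.arange(t) * hist[:t]) / w1
        mu2 = np.sum(np.arange(t, levels) * hist[t:]) / w2
        bcv = w1 * (mu1 - mu) ** 2 + w2 * (mu2 - mu) ** 2   # Eq. (4) with N = 2
        if bcv > best_bcv:
            best_bcv, best_t = bcv, t
    return best_t, best_bcv

def separability_factor(hue_values, bcv):
    """SF of Eq. (3): between-class variance normalized by the total variance, Eq. (1)."""
    tv = np.sum((hue_values - hue_values.mean()) ** 2)
    return bcv / tv if tv > 0 else 0.0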
2.2 Modelling the Object
Each fragment obtained after the fragmentation process is modelled separately. Each region R ⊂ Regions is described by the color values {R, G, B}; thus the feature descriptor at an image location x = (x, y)^t is computed as f(x) = [R(I, x) G(I, x) B(I, x)]^t. The region covariance C of the feature descriptors in R is computed as

C = \frac{1}{|R|} \sum_{x ∈ R} (f(x) − μ)(f(x) − μ)^t   (5)

where μ = \frac{1}{|R|} \sum_{x ∈ R} f(x) is the mean feature descriptor in R and |R| is the number of pixels in the region R. A simple covariance matrix computed with color features contains the information needed to capture the appearance of the object. The color distribution in the target region is estimated by a Gaussian distribution. The ML estimate of the parameters of the Gaussian, Θ = (μ, C), is the target model. The probability density function (PDF), also termed the strength image, is computed over the new image. The value of each pixel in
the strength image signifies the probability with which the pixel belongs to the target model. In the remainder of this paper we denote this value as p(x|Θ), where x is the pixel location. The PDF in this case is computed as:

p(x|Θ) ∝ exp(−(f(x) − μ)^t C^{−1} (f(x) − μ))   (6)
The PDF is calculated for each of the unimodal regions of the object obtained through the fragmentation process. The PDF has high values for pixels which belong to the particular parametric distribution and low values otherwise. In the next section, we show how the PDF computed for the image can be used to track the region accurately in the presence of various deformations.
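As a concrete illustration of Eqs. (5) and (6), the following Python sketch fits the Gaussian model of one fragment and evaluates its strength image over a new frame (a sketch with assumed array layouts; the small regularization term is an assumption not in the paper).

import numpy as np

def fit_fragment_model(pixels):
    """Fit the Gaussian target model Theta = (mu, C) of Eq. (5).

    pixels: (N, 3) array of RGB values belonging to one fragment.
    """
    mu = pixels.mean(axis=0)
    diff = pixels - mu
    C = diff.T @ diff / len(pixels)
    return mu, C

def strength_image(frame, mu, C, reg=1e-6):
    """Strength image of Eq. (6): p(x|Theta) for every pixel of an RGB frame."""
    h, w, _ = frame.shape
    f = frame.reshape(-1, 3).astype(float) - mu
    C_inv = np.linalg.inv(C + reg * np.eye(3))   # small regularizer (an assumption)
    maha = np.einsum('ni,ij,nj->n', f, C_inv, f)
    return np.exp(-maha).reshape(h, w)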
2.3 ML Framework
The region to be tracked, R0, is represented by an ellipse in our case. The position and shape of the object are described by the mean M0 and covariance V0 of the pixels in the region. Given the target model Θ, the objective of the search mechanism is to find a region R in the new frame, described by mean and covariance (M, V), that maximizes the function

J(M, V) = \sum_{x ∈ R} p(x|Θ) L(x|M, V)   (7)

where the term

L(x|M, V) ∝ exp(−(x − M)^t V^{−1} (x − M))   (8)

prevents pixel locations that are farther from the original region from distracting the tracker. As a pixel's contribution falls off with the distance from the original region, this helps both in reducing the effect of outlier pixels on the search and in preventing the tracker from drifting away from the object. As shown in [2,9,10], the maximum-likelihood estimates of M and V can be obtained via an EM-like iterative procedure. The key to the method is to assume a set of hidden variables w(x). Starting with an initial estimate M0, V0 of R, the EM iteration proceeds as below:
– E-Step: Given the current estimates M^k and V^k of the mean and covariance of the region in the k-th iteration, compute the hidden variables w^k(x):

w^k(x) = \frac{p(x|Θ) L(x|M^k, V^k)}{\sum_{x' ∈ R} p(x'|Θ) L(x'|M^k, V^k)}   (9)

– M-Step: Using the hidden variables computed above, compute the next estimates of the mean and covariance of the region, M^{k+1} and V^{k+1}, that maximize J(·, ·):

M^{k+1} = \sum_{x ∈ R} w^k(x) x   (10)

V^{k+1} = \sum_{x ∈ R} w^k(x)(x − M^{k+1})(x − M^{k+1})^t   (11)
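The E- and M-steps above translate directly into a few lines of numpy; the following sketch (an illustration, not the authors' implementation: it sums over the whole image rather than restricting to the region R, and the regularizer is an assumption) iterates Eqs. (9)-(11) given a precomputed strength image.

import numpy as np

def em_ellipse_update(strength, M, V, n_iter=10, reg=1e-6):
    """EM-like iteration of Eqs. (9)-(11) for the region mean M and covariance V.

    strength : 2D strength image p(x|Theta) over the current frame.
    M, V     : initial 2D mean and 2x2 covariance of the tracked ellipse.
    """
    ys, xs = np.mgrid[0:strength.shape[0], 0:strength.shape[1]]
    X = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    p = strength.ravel()
    for _ in range(n_iter):
        d = X - M
        V_inv = np.linalg.inv(V + reg * np.eye(2))
        L = np.exp(-np.einsum('ni,ij,nj->n', d, V_inv, d))      # Eq. (8)
        w = p * L
        w = w / (w.sum() + 1e-12)                                # E-step, Eq. (9)
        M = (w[:, None] * X).sum(axis=0)                         # M-step, Eq. (10)
        d = X - M
        V = np.einsum('n,ni,nj->ij', w, d, d)                    # M-step, Eq. (11)
    return M, V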
The optimal values for M and V are obtained by iterating the above steps until convergence. Our experimental results demonstrate that the search mechanism described above is both efficient and robust to a wide variety of changes in the shape of the object. In Algorithm 1, we explain the complete tracking algorithm.

Algorithm 1. Track(Video V, Region R0)
1: I0 ← Initial Frame(V)
2: (M0, V0) ← Fit Ellipse(R0)
3: H ← HSV(R0)
4: Class Pool ← Multi Level Thresholding(H)
5: Fragments ← Assembling(Class Pool)
6: Θ ← Region Covariance(R0, Fragments)
7: for each frame Ii in V do
8:   (Mi, Vi) ← (Mi−1, Vi−1)
9:   S ← Strength Image(Ii, Θ)
10:  k ← 0
11:  while not converged do
12:    compute weights wk using equation (9)
13:    update estimates (Mi, Vi) using equations (10), (11)
14:    k ← k + 1
15:  end while
16: end for

3 Experimental Results
The tracking algorithm was tested on various challenging datasets [11]. It was also tested on a few low-contrast videos taken from the Internet. The tracker performance was encouraging when tested for its ability to handle the following aspects:
Non-rigid deformations: Tracking non-rigid objects is a challenging problem. A couple of examples are highlighted in Figures 2 and 4. In the Cat sequence (Figure 2), the cat is tracked accurately under considerable deformations (sitting, jumping and running). Also, note that the contrast between the cat and the background is quite low. In the case of Figure 4, the monkey is tracked successfully under extreme deformations. In both cases, the tracked ellipse changes accurately to handle the non-rigid deformations of the object.
Orientation: Change in the orientation of objects is a common scenario in tracking. Tracking the object with accurate orientation is possible in our case since we track the object using an ellipse. The mean of the ellipse characterizes the location and the covariance signifies the scale and orientation. In Figure 4(b,c), the monkey undergoes considerable changes in orientation. The orientation of the ellipse changes according to the orientation of the object. Figure 3 is another example where the fish is tracked accurately in the presence of rapid orientation changes.
Fig. 2. Cat sequence: The cat is tracked successfully in the presence of changes in scale and non-rigid deformations
Fig. 3. Fish sequence: An example of a low quality video containing partial occlusions and orientation changes. Note the other fish in the tank with a similar color (but not pattern) to the object being tracked. Many existing trackers fail in such cases.
Fig. 4. Monkey sequence: In spite of changes in orientation and non-rigid deformations, the monkey is tracked precisely
Scale: Earlier trackers relied on techniques such as searching through an exhaustive search space [12] or using templates of the object at different scales [13]. In our case, the EM-like algorithm enables efficient handling of scale changes by estimating the covariance of the tracked ellipse. Figure 2 shows how the tracking handles scale changes.
Partial Occlusion: Partial and full occlusions occur frequently in tracking scenarios and the tracker needs to handle these successfully. Even if an object is
Fig. 5. Caviar sequence: The sequence shows the handling of partial occlusion of the person (blue ellipse) when he crosses two other people
Fig. 6. Parrot sequence: The multimodal object parrot is decomposed into multiple unimodal regions and tracked separately
Fig. 7. Parrot sequence: The multimodal object is tracked successfully. Note that the ellipse completely fits the entire parrot, enclosing all the homogeneous regions.
completely occluded for a considerable time, the tracker should be able to track the object on reappearance. A scenario with complete and partial occlusions is shown in Figures 3 and 5. Figure 3(b) shows a case where the fish is partially occluded. In Figure 3(c), the fish reappears after being completely occluded and the tracker was able to relocate the object. On a standard dataset, as in Figure 5, the ellipse fits the partially visible person even when a major portion of the person is occluded by two other people.
Handling Multimodal: Many of the datasets to be tracked contain multimodal objects. We handle multimodal objects by fusing information from each unimodal region.
Figures 2, 3 and 7 show tracking results on multimodal objects. For instance, Figure 7 shows the example of the parrot, where each homogeneous region is extracted and modelled separately as explained previously, and Figure 6 shows the tracking of each unimodal region.
Videos with low quality and contrast: The quality of multimedia data available on the web varies significantly owing to various compression and transmission techniques. Several tests using videos with low quality and contrast were carried out to test our technique. In the case of Figures 3 and 4, taken from Google Videos, the quality is poor owing to compression. In these videos, the background color merges more with the color of the object. The tracker performance is very good and insensitive to these variations in video.
4 Conclusion
We have proposed a fragment based tracking approach in which multimodal objects are fragmented into homogeneous regions based on hue. These unimodal regions are then tracked using single parametric distributions, and these distributions are fused to form the final tracking result of the entire object. The proposed tracker is also complemented with an efficient search mechanism that makes the system robust to non-rigid deformations, occlusions, and scale and orientation changes. To model the object more effectively, current research is focused on combining other cues such as motion, edge and texture with the present color based tracker.
References
1. Comaniciu, D., Meer, P.: Mean shift analysis and applications. In: ICCV, vol. 2, pp. 1197–1203 (1999)
2. Zivkovic, Z., Krose, B.: An EM-like algorithm for color-histogram-based object tracking. In: Proceedings of the International Conference on Computer Vision and Pattern Recognition, pp. 798–803 (2004)
3. Comaniciu, D., Ramesh, V., Meer, P.: Real-time tracking of non-rigid objects using mean shift. In: IEEE Conf. on Comp. Vis. and Pat., pp. 142–151. IEEE Computer Society Press, Los Alamitos (2000)
4. Adam, A., Rivlin, E., Shimshoni, I.: Robust fragments-based tracking using the integral histogram. In: CVPR 2006. Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 798–805. IEEE Computer Society Press, Los Alamitos (2006)
5. Leordeanu, M., Collins, R.T., Liu, Y.
6. Birchfield, S.T., Rangarajan, S.: Spatiograms versus histograms for region-based tracking. In: Proceedings of the Computer Vision and Pattern Recognition, vol. 2, pp. 1158–1163. IEEE Computer Society, Los Alamitos (2005)
7. Han, B., Davis, L.: Object tracking by adaptive feature extraction. In: Proceedings of the International Conference on Image Processing (2004)
8. Liao, P.S., Chen, T.S., Chung, P.C.: A fast algorithm for multilevel thresholding.
9. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological) 39(1), 1–38 (1977)
10. Neal, R.M., Hinton, G.E.: A new view of the EM algorithm that justifies incremental, sparse and other variants. In: Jordan, M.I. (ed.) Learning in Graphical Models, pp. 355–368. Kluwer Academic Publishers, Dordrecht (1998)
11. EC Funded CAVIAR project/IST 2001 37540: found at http://homepages.inf.ed.ac.uk/rbf/caviar/ (2004)
12. Porikli, F., Tuzel, O.: Covariance tracking using model update based on means on Riemannian manifolds. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE Computer Society Press, Los Alamitos (2006)
13. Birchfield, S.: Elliptical head tracking using intensity gradients and color histograms. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 232. IEEE Computer Society, Los Alamitos (1998)
Spatiotemporal Oriented Energy Features for Visual Tracking
Kevin Cannons and Richard Wildes
York University, Department of Computer Science and Engineering, Toronto, Ontario, Canada
{kcannons,wildes}@cse.yorku.ca
Abstract. This paper presents a novel feature set for visual tracking that is derived from "oriented energies". More specifically, energy measures are used to capture a target's multiscale orientation structure across both space and time, yielding a rich description of its spatiotemporal characteristics. To illustrate utility with respect to a particular tracking mechanism, we show how to instantiate oriented energy features efficiently within the mean shift estimator. Empirical evaluations of the resulting algorithm illustrate that it excels in certain important situations, such as tracking in clutter with multiple similarly colored objects and environments with changing illumination. Many trackers fail when presented with these types of challenging video sequences.
1 Introduction
Target tracking is a critically important aspect of a wide range of computer vision applications, including surveillance, smart rooms, and human-computer interfaces. Significant contributions have been made to the field, but no general-purpose tracker has been found that can operate effectively in every real-world setting [1]. Scenarios that are present in realistic sequences and challenge many trackers include changes in illumination, small targets, and significant clutter. In general, to facilitate accurate tracking, features must be selected that distinguish targets from the background and from one another, even while being robust to photometric and geometric distortions. In response to these requirements, many different proposals have been made; here, representative examples are provided. Perhaps the simplest approach is to make use of image intensity-based templates for feature definition [2,3,4]. To provide robustness to photometric distortions, consideration has been given to discrete features [5,6,7]. To encompass object outlines, methods have emerged that use contours and silhouettes [8,9,10]. Other features (e.g., color, texture) have been derived on a more regional basis [11,12,13]. Recovered motion also has been used in feature definitions [14,15,16]. Limited attention has been given to the integrated analysis of both the spatial and temporal domains when considering features for visual tracking. Potential benefits of a more integrated approach include the ability to combine static and dynamic target information in a natural fashion as well as simplicity of
design and implementation. In response to this observation, the present paper documents a novel feature set for visual tracking that uses energy measures to capture a target's multiscale, spatiotemporal orientation structure. A considerable body of research has emerged on the use of orientation selective filters in the spatiotemporal domain for the purpose of analyzing motion [17,18,19]. However, it appears that no previous research has explored the use of multiscale, spatiotemporal oriented energies that uniformly encompass space and time as the basis for defining features in the service of visual tracking. To illustrate the use of the proposed oriented energy feature set, we make use of the mean shift tracking paradigm [13,20,21,22], a framework onto which these features readily map. The energy features are, however, also applicable to alternative paradigms, e.g., those that preserve within-target spatial relationships, since the oriented energies are calculated locally. In light of previous research, the main contributions of this paper are as follows. (1) A novel oriented energy feature set is defined for visual tracking. This representation captures the spatiotemporal characteristics of a target in an integrated, compact fashion. (2) Oriented energy features are instantiated with respect to the mean shift estimator. (3) The performance of the resulting system is documented both qualitatively and quantitatively. Our algorithm outperforms a color-based mean shift implementation in three common, real-world situations: substantial clutter; multiple targets with similar color; and illumination changes.
2 Technical Approach
2.1 Oriented Energy Features
Oriented Energy Computation. Events in a video sequence will generate diverse structures in the spatiotemporal domain. For instance, a textured, stationary object produces a much different signature in image space-time than if the same object were moving. One method of capturing the spatiotemporal characteristics of a video sequence is through the use of oriented energies [17]. These energies are derived using the filter responses of orientation selective bandpass filters when they are convolved with the spatiotemporal volume produced by a video stream. Responses of filters that are oriented parallel to the image plane are indicative of the spatial pattern of observed surfaces and objects (e.g., spatial texture); whereas, orientations that extend into the temporal dimension capture dynamic aspects (e.g., velocity and flicker). The basis of our approach is that energies computed at orientations which span the space-time domain can provide a rich description of a target for visual tracking. Here, multiscale processing is also important, as coarse scales capture gross spatial pattern and overall target motion while finer scales capture detailed spatial pattern and motion of individual parts (e.g., limbs). With regard to dynamic aspects, simple motion is captured (orientation along a single spatiotemporal diagonal) as well as more complex phenomena, e.g., multiple juxtaposed motions as limbs cross (multiple orientations in a spatiotemporal region). By encompassing both spatial and temporal target characteristics in an integrated fashion,
tracking is supported in the presence of significant clutter. Further, as detailed below, such representations can be made invariant to local image contrast to support tracking throughout substantial illumination changes. For this work, filtering was performed using broadly tuned, steerable, separable filters based on the second derivative of a Gaussian, G2, and their corresponding Hilbert transforms, H2 [23], with responses pointwise rectified (squared) and summed. Filtering was executed across θ = (η, ξ) 3D orientations (η, ξ specifying polar angles) and σ scales using a Gaussian pyramid formulation. Hence, a measure of local energy, e, can be computed according to

e(x; θ, σ) = [G_2(θ, σ) ∗ I(x)]^2 + [H_2(θ, σ) ∗ I(x)]^2 ,   (1)
where x = (x, y, t) corresponds to spatiotemporal image coordinates, I is the image sequence, and ∗ denotes convolution. This initial measure of local energy is dependent on image contrast. To attain a purer measure of the relative contribution of the orientations irrespective of contrast, (1) is normalized as

ê(x; θ, σ) = e(x; θ, σ) / ( \sum_{θ̃, σ̃} e(x; θ̃, σ̃) + ε ),   (2)
where ε is a bias term to avoid instabilities when the energy content is small and the summation in the denominator covers all scale and orientation combinations. (In this paper, our convention is to mark variables of summation with a tilde.) For illustrative purposes, Fig. 1 displays a subset of the energies that are computed for a single frame of a MERL traffic sequence [24]. Here, there is a white car moving to the left near the center of the frame. Notice how the energy channel that is tuned for leftward motion is very effective at distinguishing this car from the static background. Consideration of the channel tuned for horizontal structure shows how it captures the overall orientation structure of the white car. In contrast, while the channel tuned for vertical textures captures the outline of the crosswalks, it shows little response to the car, as it is largely devoid of vertical structure at the scales considered. Finally, note how the energies become more diffuse and capture more gross structure at the coarser scale. Given that the tracking problem is being considered, the goal is to locate the target’s position as precisely as possible. However, as seen in Fig. 1, the energies computed at coarser scales are diffuse due to the downsampling/upsampling that is employed in pyramid processing. Coarse energies are important because they provide information regarding the target’s gross shape and motion, but a method is required to improve their localization for accurate tracking. To that end, a set of weights is applied to the normalized energies of (2) according to

Ê(x; θ, σ) = ê(x; θ, σ) b(x; θ),   (3)
where b are pixel-wise weighting factors for a particular orientation channel, θ. The weighting factors for a specific orientation are computed by integrating the energies across all scales and applying a threshold, T_θ, according to

b(x; θ) = [ \sum_{σ̃} ê(x; θ, σ̃) > T_θ ].   (4)
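To make the preceding definitions concrete, the following is a minimal NumPy/SciPy sketch of the energy computation in (1)–(4). It is not the authors’ implementation: the broadly tuned 3D G2/H2 steerable, separable filters of [23,25] are assumed to be supplied as precomputed kernels (one per orientation–scale pair), the Gaussian-pyramid handling of scale is elided, and the interpretation of the per-orientation threshold as a multiple of the channel’s mean energy follows the setting reported in Section 3.

```python
import numpy as np
from scipy.ndimage import convolve

def oriented_energies(volume, g2_bank, h2_bank, bias=1e-3, thresh_scale=2.75):
    """Schematic normalized, weighted oriented energies, following Eqs. (1)-(4).

    volume           : (T, Y, X) spatiotemporal image volume
    g2_bank, h2_bank : dicts mapping (orientation, scale) -> 3D kernel arrays,
                       stand-ins for the G2/H2 steerable filters of [23]
    """
    raw = {}
    for key in g2_bank:                          # Eq. (1): quadrature-pair energy
        g = convolve(volume, g2_bank[key], mode='nearest')
        h = convolve(volume, h2_bank[key], mode='nearest')
        raw[key] = g ** 2 + h ** 2

    total = sum(raw.values()) + bias             # Eq. (2): divisive normalization
    norm = {key: e / total for key, e in raw.items()}

    weighted = {}
    orientations = {theta for theta, _ in norm}
    for theta in orientations:
        # Eq. (4): binary weights from the scale-summed, normalized energies
        summed = sum(e for (t, _), e in norm.items() if t == theta)
        b = (summed > thresh_scale * summed.mean()).astype(volume.dtype)
        for (t, sigma), e in norm.items():
            if t == theta:
                weighted[(t, sigma)] = e * b     # Eq. (3)
    return weighted
```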
Fig. 1. Frame 29 of the MERL traffic video sequence with select corresponding energy channels. Finer and coarser scales are shown in rows two and three, resp. From left to right, the energy channels roughly correspond to horizontal structure, vertical structure, and leftward motion.
When computing the weights, summing across scales allows the better localized fine scales to sharpen the coarse scales, while the coarse scales help to smooth the responses of the fine scales. Furthermore, by calculating weights separately for each orientation, we avoid being prejudiced toward any particular type of oriented structure (e.g., static vs. dynamic). Two significant advantages of the proposed oriented energy feature set must be further highlighted. First, normalized energy, as defined by (1) and (2), captures local spatiotemporal structure at a particular orientation and scale with a degree of robustness to scene illumination: By virtue of the bandpass filtering, (1), invariance will be had to changes that are manifest in the image as additive offsets to image brightness; by virtue of the normalization, (2), invariance will be had to changes that are manifest in the image as multiplicative offsets. Second, the calculation of the defined normalized oriented energies requires nothing more than 3D separable convolution and pointwise nonlinear operations, and is thereby amenable to compact, efficient implementation [25]. Histogram Representation. As defined, oriented energies provide local characterization of image structure. Therefore, the energy measurements could be used to provide pointwise descriptors for target tracking (e.g., in conjunction
with spatial template-based matching). Alternatively, the pointwise measurements can be aggregated over target support to provide region-based descriptors (e.g., in conjunction with mean shift tracking). Here, we pursue the second option and demonstrate the efficacy of the features as regional descriptors. With an eye to mean shift tracking, we collapse the spatial information in our initial energy measurements and represent the target as a histogram. Each histogram bin corresponds to the weighted energy content of the target at a particular scale and orientation. Specifically, the template histogram that defines the target in the first frame is given by

q̂_u = C \sum_{i=1}^{n} k(‖x*_i‖^2) Ê(x*_i; φ_u),   (5)
where k is the profile of the tracking kernel, C is a normalization constant to ensure the histogram sums to unity, x∗i = (x∗ , y ∗ ) is a single target pixel at some temporal instant, i ranges so that x∗i covers the template support, and φu is the scale and orientation combination which corresponds to bin u of the histogram. When tracking a target, it may be necessary to evaluate several target candidates for the current frame. Candidate histograms are defined as
p̂_u(y) = C_h \sum_{i=1}^{n_h} k(‖(y − x*_i)/h‖^2) Ê(x*_i; φ_u),   (6)

where y is the center of the target candidate’s tracking window, h is the bandwidth of the tracking kernel and i ranges so that x*_i covers the candidate support. A sample energy histogram for the target region shown in Fig. 1 (represented by the white box) is shown in Fig. 2. The bin corresponding most closely to leftward motion at the finest scale (bin 5) has by far the most energy. The next two high energy counts are found in bins 2 and 9, which are tuned to combinations of dynamic and static structure, with an emphasis on leftward motion and spatial orientation similar to that of the target. The overall horizontal structure of the car is captured by the energy in bins 1 and 4. In contrast, bins 3 and 6, which roughly represent static, vertical structure, do not have strong responses, given the nature of the car target. The histogram also shows that the oriented energies for the highest frequency structures have the strongest response, as the target is fairly small and dominated by relatively finer scale structure.

2.2 Oriented Energy Features in the Mean Shift Framework
Target Position Estimation. Under the mean shift framework, tracking an object involves locating the candidate position in the current frame that produces the histogram that is most similar to the template. Thus, a measure of similarity between two histograms is required. For histogram comparisons we utilize the Bhattacharyya coefficient, the sample estimate of which can be computed using

ρ[p̂(y), q̂] = \sum_{u=1}^{m} \sqrt{p̂_u(y) q̂_u},   (7)
Fig. 2. Oriented energy histogram for the target region in Fig. 1, plotting weighted energy against scale and orientation bin (high-, mid-, and low-frequency energies indicated)
where p̂(y) and q̂ are histograms with m bins apiece. Due to the definition of the Bhattacharyya coefficient, in order to minimize the distance between two histograms, (7) must be maximized with respect to the target position, y. The Bhattacharyya coefficient can be maximized via mean shift iterations [20]. The specific mean shift vector that can be used for this maximization is

ŷ_1 = \frac{\sum_{i=1}^{n_h} x*_i w_i g(‖(ŷ_0 − x*_i)/h‖^2)}{\sum_{i=1}^{n_h} w_i g(‖(ŷ_0 − x*_i)/h‖^2)},  where  w_i = \sum_{u=1}^{m} \sqrt{\frac{q̂_u}{p̂_u(ŷ_0)}} Ê(x*_i; φ_u),   (8)

g(x) = −k′(x) is the (negated) derivative of the tracking kernel profile, k, with respect to x, and ŷ_0 is the current target position. The Epanechnikov kernel has been shown to be effective [20] and is the most commonly used kernel for mean shift tracking. Thus, the position of the target in the current frame is estimated as follows. Starting from the target’s position in the previous frame, the mean shift vector is computed and the target candidate is moved to the position indicated by the mean shift vector. These steps are repeated until convergence has been reached or a fixed number of iterations have been executed.

Template and Scale Updates. When tracking an object through a long video sequence, it is common that its characteristics will change. To combat the changes a target may incur over time (e.g., due to alterations in velocity or rotation), our tracker includes a simple template update mechanism defined as

q̂_{i+1} = α π q̂_i + (1 − α)(1 − π) p̂(y_i),   (9)

where α is a weighting factor to control the speed of the updates, q̂_i is the template at frame i, and π = ρ[p̂(y_i), q̂_i] is the Bhattacharyya coefficient between the current template and the optimal candidate found in the ith frame. Empirically, α was set to 0.85. Following each application of (9), the resulting template is renormalized and thereby remains consistent with our overall formulation. Owing to dependence on the Bhattacharyya coefficient, the template update rule
indicates that if the template and the optimal candidate are well-matched, the update to the template will be minimal. The size of a target may change during a video sequence as well. Although there are more effective methods of dealing with changes of object scale in the mean shift framework [21,22], in the current implementation we employ a simple approach, similar to that taken in [20]. In particular, our system performs mean shift optimization three times per frame using three different bandwidth values, h. Unless stated otherwise, h values of ±5% are used. We obtain the new bandwidth, h_new, by combining the best of the three bandwidths evaluated at the current frame, h_opt, with the previous target size, h_prev, according to

h_new = γ h_opt + (1 − γ) h_prev.   (10)
Empirically, we set γ = 0.15.
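As a concrete illustration of Section 2.2, the following is a hedged NumPy sketch of the main quantities (5)–(10); it is a schematic reimplementation, not the authors’ code. The per-pixel weighted energies Ê are assumed to be available as an array of shape (bins, height, width), image coordinates are taken in the candidate’s frame, and the Epanechnikov kernel is used, for which g in (8) is constant on the kernel support so that the mean-shift step reduces to a weighted average.

```python
import numpy as np

def energy_histogram(E, center, h):
    """Kernel-weighted oriented energy histogram, Eqs. (5)/(6).
    E: (n_bins, H, W) weighted energies; center = (cx, cy); h: bandwidth."""
    ys, xs = np.mgrid[0:E.shape[1], 0:E.shape[2]]
    d2 = ((xs - center[0]) ** 2 + (ys - center[1]) ** 2) / h ** 2
    k = np.clip(1.0 - d2, 0.0, None)            # Epanechnikov profile (up to a constant)
    hist = (E * k).reshape(E.shape[0], -1).sum(axis=1)
    return hist / (hist.sum() + 1e-12)          # normalization constant C / C_h

def bhattacharyya(p, q):
    return np.sqrt(p * q).sum()                 # Eq. (7)

def mean_shift_step(E, q_hat, y0, h):
    """One mean-shift update, Eq. (8); g is constant for the Epanechnikov kernel."""
    p_hat = energy_histogram(E, y0, h)
    w_bins = np.sqrt(q_hat / (p_hat + 1e-12))
    w_pix = np.tensordot(w_bins, E, axes=1)     # w_i = sum_u sqrt(q_u/p_u) E(x_i; phi_u)
    ys, xs = np.mgrid[0:E.shape[1], 0:E.shape[2]]
    inside = ((xs - y0[0]) ** 2 + (ys - y0[1]) ** 2) <= h ** 2
    w = w_pix * inside
    denom = w.sum() + 1e-12
    return np.array([(w * xs).sum() / denom, (w * ys).sum() / denom])

def update_template(q_hat, p_hat, alpha=0.85):
    pi = bhattacharyya(p_hat, q_hat)            # Eq. (9), followed by renormalization
    q_new = alpha * pi * q_hat + (1 - alpha) * (1 - pi) * p_hat
    return q_new / (q_new.sum() + 1e-12)

def update_bandwidth(h_opt, h_prev, gamma=0.15):
    return gamma * h_opt + (1 - gamma) * h_prev  # Eq. (10)
```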
3 Empirical Evaluation
The performance of the oriented energy-based mean shift tracker has been evaluated on an illustrative set of test sequences. For comparative purposes, a mean shift tracker based on RGB color space was also developed and tested. Other than the use of different histograms, the two trackers were identical. The colorbased tracker was implemented in a similar manner to [20], whereby each color channel was quantized into 16 levels (yielding a histogram with 163 bins). In our current implementation of the energy-based tracker, energies were computed at 3 scales with 10 different spatiotemporal orientations per scale. Hence, the energy-based histograms contained 30 bins. For the oriented energy feature set, 10 orientations were selected because they span the space of 3D orientations for the highest order filters that we use (H2 ) [23]; in particular, the selected orientations correspond to the normals to the faces of an icosahedron with antipodal directions counted once, which provides a uniform tessellation of a sphere. For all results in this paper, an Epanechnikov kernel, K, was used. The thresholds for (4) were empirically set as 2.75× the mean energy for each orientation channel. The color and energy-based trackers were hand-initialized with identical target regions in the first frame of each video. Figure 3 illustrates the effectiveness of oriented energy-based features in dealing with illumination changes. An individual starts walking in a poorly lit area;
Fig. 3. Video sequence (x × y × t = 360 × 240 × 60) of a man walking through shadows. From left to right, frames 4, 18, 31, and 55 are shown. Tracked regions are highlighted with white boxes.
Fig. 4. Video sequence (x × y × t = 360 × 240 × 50) of people walking through a room with similar colored clothing. From left to right, frames 6, 18, 32, and 50 are shown. Tracked regions are highlighted with white boxes.
Fig. 5. MERL traffic video sequence (x × y × t = 368 × 240 × 64) where a white car is tracked as it travels through an intersection. From left to right frames 13, 24, 38, and 58 are shown. Tracked regions are highlighted with white boxes.
then, he travels into and out of the bright region as he walks across the room. Using our proposed feature set, the tracker appeared to be relatively unaffected by the changes in illumination. This robustness arises from the normalization performed in (2). In comparison, our color-based mean shift tracker completely lost track of the target after only a few frames, even when histograms created using normalized RG-space [20] were utilized. Figure 4 shows a case where two persons with similar colored clothing walk in opposite directions and the individual starting on the right side is being tracked. Despite the full occlusion that occurs for several frames, the tracker using energy features is capable of following the true target throughout the video. The different texture patterns and velocities of the walkers were sufficient cues for the energy-based tracker to achieve success, as the representation spans the spatiotemporal domain. In comparison, our color-based tracker became distracted by the other walker as the individuals have near-identical color distributions. Figure 5 shows a real-life, grayscale video sequence of a cluttered traffic scene that was obtained from MERL [24] (a portion also used in Fig. 1). As the figure shows, our proposed system experiences some slight difficulty when tracking the vehicle as it passes over the crosswalk (e.g. notice off-centered tracking in frames 13 and 24). This performance decrease occurs because the lack of contrast (essentially uniform white on white) between the car and the crosswalk yields little energy for the involved portions of the car. Nevertheless, the tracker never loses the target; indeed, the frames shown are representative of the worst case performance in this video. Our feature set was also successfully used when tracking people and vehicles in videos obtained from the PETS2001 dataset [26]. Figure 6 shows an example
Fig. 6. PETS2001 video sequence (x × y × t = 384 × 288 × 85) where a cyclist is being tracked. From left to right frames 18, 32, and 73 are displayed. Tracked regions are highlighted with white boxes.
Fig. 7. Video sequence (x × y × t = 360 × 240 × 100) showing an individual walking in an erratic pattern. From left to right frames 22, 74, 86, and 100 are displayed. Tracked regions are highlighted with white boxes.
of our results on this dataset where a cyclist is tracked. The tracker that utilizes oriented energy features is successful despite the fact that the cyclist is partially occluded by another individual near the beginning of the sequence. The results on this data sequence are impressive given that the video accurately reflects real-world surveillance settings where targets of interest are often small and of low-resolution. In contrast, our implementation of the color-based mean shift tracker drifted off the target after only a few frames. In Fig. 7 an individual is shown walking erratically, making sudden changes in direction and moving at a wide variety of speeds. Since the oriented energy features encompass both spatial and temporal information, tracking of the target continues throughout each change in velocity. In particular, at instances where the target motion changes radically, the spatially-based components of the representation keep the tracker on target. Subsequently, template updates, (9), incorporate changes to adapt the model for further tracking. Figure 8 shows footage that one might obtain from overhead surveillance cameras in public areas. The oriented energy-based tracker follows the target of interest even though there are multiple similar walkers with little texture, cast shadows, and complex reflectance effects, as the video was recorded through a window. Using the oriented energy feature set, the target is not lost, even during the partial occlusion. The tracker does lag behind the target for a few frames immediately following the occlusion; however, it ultimately follows the correct person. Indeed, frame 39 is representative of its worst-case performance for this
Fig. 8. Video sequence (x × y × t = 320 × 240 × 70) showing multiple people in motion that are similar in appearance. From left to right frames 9, 31, 39, and 59 are displayed. Tracked regions are highlighted with white boxes.
Fig. 9. Bhattacharyya coefficients over the entire video sequence for the MERL and PETS2001 videos
video. In comparison, our color-based implementation was only able to follow the true target for approximately 30 frames. Quantitative performance analysis was performed for the video sequences that are publicly available — MERL and PETS2001. Specifically, Fig. 9 shows the Bhattacharyya coefficient vs. frame number for these two sequences. The Bhattacharyya coefficient is a measure of the system’s confidence in the target found in each frame, with 1 being the largest possible value. For the MERL video, the decreased level of performance at the crosswalks that was qualitatively observed is also indicated quantitatively. In particular, Fig. 9 shows two slight decreases in the Bhattacharyya coefficient at frames 12 and 58 — precisely the frames when the vehicle is passing over the crosswalks. For the PETS video sequence, the significant deviation the Bhattacharyya coefficient experiences is a result of the partial occlusion of the cyclist by the walker (approximately frames 15 - 34). The other, less substantial decreases are a result of the significant background clutter (e.g., parked cars). Also of note is that an average of 3 mean shift iterations were required to reach convergence for these two videos. Twenty iterations, the maximum we allow, was observed only three times.
4 Summary
Spatiotemporal oriented energy features provide a rich, yet compact representation of a target’s characteristic structure across both space and time. In particular,
by encompassing a range of orientations and scales, the proposed feature set provides a natural integration of the static (e.g., spatial texture) and dynamic (e.g., motion) aspects of a target. To illustrate their usefulness with respect to a particular tracking mechanism, we provide an instantiation with respect to the mean shift estimator. In our experiments over a wide range of video sequences, the energy-based tracker was considered to perform as well as or better than an identical algorithm that used color histograms. Of primary interest in our work were surveillance-inspired video sequences that included challenges such as substantial background clutter, targets that contained similar colors to other objects in the scene, and changes in illumination. Tracking with the use of oriented energy features was shown to be robust to these challenges. Acknowledgments. Portions of this work were funded by an Ontario Graduate Scholarship to K. Cannons and an NSERC Discovery Grant to R. Wildes.
References 1. Yilmaz, A., Javed, O., Shah, M.: Object tracking: A survey. Comp. Surv. 38(4), 1–45 (2006) 2. Lucas, B., Kanade, T.: An iterative image registration technique with application to stereo vision. In: DARPA IUW, pp. 121–130 (1981) 3. Anandan, P.: A computational framework and an algorithm for the measurement of visual motion. IJCV 2(3), 283–310 (1989) 4. Shi, J., Tomasi, C.: Good features to track. CVPR 1, 593–600 (1994) 5. Sethi, I., Jain, R.: Finding trajectories of feature points in monocular images. PAMI 9(1), 56–73 (1987) 6. Deriche, R., Faugeras, O.: Tracking line segments. IVC 8(4), 261–270 (1991) 7. Rangarajan, K., Shah, M.: Establishing motion correspondence. CVGIP 54(1), 56– 73 (1991) 8. Terzopoulos, D., Szeliski, R.: Tracking with kalman snakes. In: Blake, A., Yuille, A. (eds.) Active Vision, pp. 553–556. MIT Press, Cambridge (1992) 9. Isard, M., Blake, A.: Contour tracking by stochastic propagation of conditional density. In: Buxton, B.F., Cipolla, R. (eds.) ECCV 1996. LNCS, vol. 1064, pp. 343–354. Springer, Heidelberg (1996) 10. Haritaoglu, L., Harwood, D., Davis, L.: W4: Real-time surveillance of people and their activities. PAMI 22(8), 809–830 (2000) 11. Birchfield, S.: Elliptic head tracking with intensity gradients and color histograms. CVPR 1, 232–237 (1998) 12. Sigal, L., Sclaroff, S., Athitsos, V.: Estimation and prediction of evolving color distributions for skin segmentation under varying illumination. CVPR 2, 152–159 (2000) 13. Elgammal, A., Duraiswami, R., Davis, L.: Probabilistic tracking in joint featurespatial spaces. CVPR 1, 781–788 (2003) 14. Bolgomolov, Y., Dror, G., Lapchev, S., Rivlin, E., Rudzsky, M.: Classification of moving targets based on motion and appearance. In: BMVC, pp. 142–149 (2003) 15. Cremers, D., Schnorr, C.: Statistical shape knowledge in variational motion segmentation. IVC 21(1), 77–86 (2003)
16. Sato, K., Aggarwal, J.: Temporal spatio-velocity transformation and its application to tracking and interaction. CVIU 96(2), 100–128 (2004) 17. Adelson, E., Bergen, J.: Spatiotemporal energy models for the perception of motion. JOSA 2(2), 284–299 (1985) 18. Heeger, D.: Optical flow from spatiotemporal filters. IJCV 1(4), 297–302 (1988) 19. Enzweiler, M., Wildes, R., Herpers, R.: Unified target detection and tracking using motion coherence. Wrkshp. Motion & Video Comp. 2, 66–71 (2005) 20. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. IEEE PAMI 25(5), 564–575 (2003) 21. Collins, R.: Mean-shift blob tracking through scale space. CVPR 2, 234–240 (2003) 22. Zivkovic, Z., Krose, B.: An EM-like algorithm for color-histogram tracking. CVPR 1, 798–803 (2004) 23. Freeman, W., Adelson, E.: The design and use of steerable filters. IEEE PAMI 13(9), 891–906 (1991) 24. Brand, M., Kettnaker, V.: Discovery and segmentation of activities in video. IEEE PAMI 22(8), 844–851 (2000) 25. Derpanis, K., Gryn, J.: Three-dimensional nth derivative of Gaussian separable steerable filters. ICIP 3, 553–556 (2005) 26. PETS (2006), http://peipa.essex.ac.uk/ipa/pix/pets/
Synchronized Ego-Motion Recovery of Two Face-to-Face Cameras

Jinshi Cui1, Yasushi Yagi2, Hongbin Zha1, Yasuhiro Mukaigawa2, and Kazuaki Kondo2

1 State Key Lab on Machine Perception, Peking University, China
{cjs,zha}@cis.pku.edu.cn
2 Department of Intelligent Media, Osaka University, Japan
{yagi,mukaigawa,kondo}@am.sanken.osaka-u.ac.jp
Abstract. A movie captured by a wearable camera affixed to an actor’s body gives audiences the sense of being immersed in the movie. The raw footage captured by such a wearable camera needs to be stabilized because of jitter caused by ego-motion. However, conventional approaches often fail to estimate ego-motion accurately when moving objects dominate the image and the background region provides too few feature pairs. To address this problem, we propose a new approach that utilizes an additional, synchronized video captured by a camera attached to the foreground object (another actor). Formally, we configure this sensor system as two face-to-face moving cameras and derive the relations between four views, consisting of two consecutive views from each camera. The proposed solution has two steps: first, the extrinsic relationship of the two cameras is calibrated via an AX = XB formulation; second, the ego-motion is estimated using the calibration matrix. Experiments verify that this approach can recover from failures of the conventional approach and provides acceptable stabilization results on real data. Keywords: Wearable camera, synchronized ego-motion estimation, stabilization, two face-to-face cameras, extrinsic calibration.
1 Introduction

The goal of this work is to recover the ego-motion of two face-to-face moving cameras simultaneously. The work targets situations in which ego-motion estimation with a single camera may fail, and uses a second camera to provide additional information. Ego-motion estimation of a moving camera is the task of recovering the camera motion trajectory from a set of 2D image frames; it has many applications, including the stabilization required in our setting. Most existing methods address one of the following two cases. For static scenes, the problem of fitting a 3D scene compatible with the images is well understood and essentially solved [1,2]. The second case deals with dynamic scenes, where the segmentation into independently moving objects and the motion estimation for each object have to be solved simultaneously [3,4]. These methods may fail in camera ego-motion estimation if: (1) the foreground occupies too much space in the image, (2) there are insufficient features in the background
Fig. 1. Two image pairs captured by one wearable camera with a moving foreground. Left image pair: camera motion can be computed using the background region, which provides enough feature point matches. Right image pair: there are very few feature matches in the background region, so it is impossible to estimate ego-motion without additional information; moreover, the motion of the foreground point matches depends on both the camera motion and the person’s motion. If the foreground person’s motion is known, the camera ego-motion can be estimated.
Fig. 2. Two face-to-face cameras in our application of “Dive into Movie”. One camera is attached to the body of each person.
region of the image pair, or (3) there is too much repeated structure for features to be matched reliably. Fig. 1 shows a situation in which almost the whole image is covered by a moving foreground; it is impossible to estimate ego-motion in this case. Additional information can be utilized, such as inertial data [5] or synchronized image frames from another camera. When another camera is used, there are two possibilities. The first is that the additional camera is fixed somewhere, watching either person 1 or person 2. If it watches person 1, the motion of the wearable camera is estimated directly by pose estimation. If it watches person 2, the motion of person 2 is estimated first, and the camera motion is then obtained by eliminating that motion from the foreground motion observed by the wearable camera. In both cases, the fixed camera must always keep the moving person in view. The second possibility is that the additional camera is simply the one attached to the foreground object (i.e., the other person’s body). This configuration is very natural in our application (see Fig. 2). The motivation for this work comes from a new application of computer vision technology in entertainment, the so-called “Dive into Movie”. In this application, a movie captured by a wearable camera attached to an actor’s body gives audiences the sense of being immersed in the movie. The raw footage captured by the wearable camera needs to be stabilized because of jitter and the ego-motion of the actor, and accurate ego-motion estimation of a moving camera is not easy when there are moving objects in the
Fig. 3. Overview of the proposed approach at time k
image. In this application, there are at least two face-to-face interacting actors in a scene, and the audience can choose any one of the actors in order to watch the movie from different views. One camera is attached to each actor. For simplicity, in this paper we consider only the case of two actors in the scene. Our goal is then to recover the ego-motion of the two face-to-face cameras using information from both of them. To address this problem, we first configure the sensor system as two face-to-face moving cameras and then derive the relationship between the four views consisting of two consecutive views from each camera. In the estimation stage, the two cameras are calibrated first, and the ego-motion is then estimated using the calibration result. The calibration problem is formulated as AX = XB, and we draw on the solutions developed for traditional robotic hand-eye calibration [6,7,8,9]. Compared with the consistent motion of hand and eye in traditional hand-eye calibration, we deal with two independently moving cameras. To our knowledge, there is no other work reported on this problem. In [10], a similar configuration is proposed that uses two face-to-face static cameras; the epipolar geometry of these mutual cameras is studied and used to improve the performance of a structure-from-motion approach. In contrast to [10], our approach estimates the ego-motion of two moving face-to-face cameras. The flowchart of the proposed system is shown in Fig. 3. First, the input videos are pre-processed to segment out the background region and the object region that moves consistently with the opposing camera. SIFT features are extracted and matched between two consecutive images for the background region and the object region, respectively. If there are enough reliable point matches in the background region, the ego-motion is estimated and a stabilized frame is output. These steps are performed for both cameras. Second, if estimation with the background region fails, the algorithm proceeds to the synchronized estimation step, which has two stages: the extrinsic parameters of the two cameras are calibrated in the first stage, which requires at least three consecutive images from each camera, and the ego-motions are then estimated with the calibration result. The following section presents the two-camera geometry, Section 3 describes the estimation procedure, and the evaluation of the experiments is given in Section 4.
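As an illustration of the conventional branch of Fig. 3 (background SIFT matches followed by two-view motion estimation, Section 2.1), here is a small OpenCV sketch. It is a schematic stand-in rather than the authors’ implementation: the matcher settings (Lowe’s ratio test with a 0.75 threshold) and the RANSAC threshold are assumptions not specified in the paper, and the background mask is taken as given by the pre-processing step.

```python
import cv2
import numpy as np

def two_view_motion(img_prev, img_cur, K, bg_mask=None, min_matches=20):
    """Background SIFT matching + essential-matrix estimation (Section 2.1).
    K: 3x3 intrinsic matrix (assumed known); bg_mask optionally restricts
    feature detection to the segmented background region."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img_prev, bg_mask)
    kp2, des2 = sift.detectAndCompute(img_cur, bg_mask)
    if des1 is None or des2 is None:
        return None
    pairs = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)
    good = [p[0] for p in pairs
            if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
    if len(good) < min_matches:
        return None                      # fall back to synchronized estimation (Section 3)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
    E, inl = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inl)
    return R, t                          # translation recovered up to scale
```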
2 Two-Camera Geometry

Our application of ego-motion estimation is stabilization for “Dive into Movie”. The cameras are affixed to the actors’ bodies and move consistently with them (see Fig. 2). First of all, it is convenient to assign frames of reference.
W: a fixed frame of reference;
C1(k): the camera1 frame, located at the optical center of camera1 with the positive z axis along the optical axis at time k; it is attached to person 1, watches camera2, and varies with camera1’s motion;
C2(k): the camera2 frame, located at the optical center of camera2 with the positive z axis along the optical axis at time k; it is attached to person 2, watches camera1, and varies with camera2’s motion.

The relation between any two coordinate frames is represented by a rotation matrix R_{a→b} ∈ SO(3) and a translation vector t_{a→b} ∈ R^3. The 4×4 matrix T_{a→b} = [R_{a→b} t_{a→b}; 0_{1×3} 1] is the transformation from frame a to frame b: if a point X_a is expressed with respect to frame a, then X_b = T_{a→b} X_a. We assume that the internal parameters of the cameras are known. Given enough correct feature matches in two views (of a static scene) captured by the same camera, the camera ego-motion can be computed easily. In the following, we first recall the two-view geometry of a conventional static scene; then foreground motion is taken into account; finally, the four-view (two from each camera) geometry is derived by 3D motion analysis on the two moving cameras.

2.1 Two-View Geometry: Epipolar Constraint and Essential Matrix
As is well known, the Essential matrix constrains the motion of points between two views from one camera; it encodes the epipolar constraint and the motion matrix. The set of homogeneous image points {x_i}, i = 1, ..., n, in the first image is related to the set {x′_i}, i = 1, ..., n, in the second image by the Essential matrix through

x′_i^T E x_i = 0,   E = T̂ R,   T̂ = [ 0  −t_3  t_2 ;  t_3  0  −t_1 ;  −t_2  t_1  0 ].   (1)
From the above equation, given feature matches between the two views, the Essential matrix can be determined, and the rotation matrix and translation vector can then be computed up to a universal scale. We used RANSAC [1] for the transformation matrix estimation.

2.2 Two-View Geometry with Moving Foreground
Let the set of homogeneous 3D space points {X_{F,i}(k)}, i = 1, ..., n, be the positions of foreground points at time k in the view of camera1, undergoing a rigid motion that is independent of camera1’s motion. The motion of these points in C1(k) can be represented as
X_{F_1,C_1}(k) = T_{F_1,C_1}(k) X_{F_1,C_1}(k−1) = T_{C_1}^{−1}(k) T_{C_1←W}(k−1) T_{F_1,W}(k) T_{W←C_1}(k−1) X_{F_1,C_1}(k−1),   (2)
where T_{F_1,C_1}(k) represents the 3D foreground motion in C1’s coordinates from time k−1 to k, T_{C_1}(k) is C1’s motion, and T_{F_1,W}(k) is the foreground motion in world coordinates. T_{F_1,C_1}(k) and T_{C_1}(k) can be computed with the two-view geometry described in Section 2.1, using feature matches in the foreground region and background region, respectively. If there are not enough background feature matches to compute T_{C_1}(k), but T_{F_1,W}(k) is given in some other way, then T_{C_1}(k) can be computed using Equation (2).

2.3 Four-View Geometry of Two Face-to-Face Cameras
In this case (see Fig. 2), the motion of camera1’s foreground points F1 in C2(k) coordinates is the same as C2’s motion T_{C_2}(k), i.e., T_{F_1,C_2}(k) = T_{C_2}(k). Then

T_{F_1,W}(k) = T_{W←C_2}(k−1) T_{F_1,C_2}(k) T_{C_2←W}(k−1) = T_{W←C_2}(k−1) T_{C_2}(k) T_{C_2←W}(k−1).   (3)
Now let us derive the relations among the four 3D motion transformation matrices T_{F_1,C_1}(k), T_{C_2}(k), T_{C_1}(k) and T_{F_2,C_2}(k). With these relations, given any three of the four matrices, the remaining unknown matrix can be computed; T_{C_2}(k) and T_{C_1}(k) are the target matrices in this paper. From Equations (2) and (3), we have

T_{F_1,C_1}(k) = T_{C_1}^{−1}(k) T_{C_1←W}(k−1) T_{W←C_2}(k−1) T_{C_2}(k) T_{C_2←W}(k−1) T_{W←C_1}(k−1) = T_{C_1}^{−1}(k) T_{C_1←C_2}(k−1) T_{C_2}(k) T_{C_1←C_2}^{−1}(k−1).

If we let T_{C_2←C_1} = T_{C2−1} (and likewise T_{C_1←C_2} = T_{C1−2}) for simplicity, then we have

T_{F_1,C_1}(k) = T_{C_1}^{−1}(k) T_{C1−2}(k−1) T_{C_2}(k) T_{C1−2}^{−1}(k−1).   (4)
Similarly, considering the foreground points of camera2, we obtain

T_{F_2,C_2}(k) = T_{C_2}^{−1}(k) T_{C2−1}(k−1) T_{C_1}(k) T_{C2−1}^{−1}(k−1).   (5)
Now let us check the relations between the above matrices and the image observations:
a) T_{F_1,C_1}(k): motion of the foreground points (belonging to person 2) in camera1;
b) T_{F_2,C_2}(k): motion of the foreground points (belonging to person 1) in camera2;
c) T_{C_2}(k): computed from the motion of the background points in camera2;
d) T_{C_1}(k): computed from the motion of the background points in camera1;
e) T_{C2−1}(k−1): extrinsic calibration matrix between camera1 and camera2.
Quantities a)–d) can be computed using the two-view relations described in Section 2.1; e) cannot be computed directly and is determined in Section 3.1.
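The relations above are easy to sanity-check numerically. The short NumPy snippet below (not from the paper) builds synthetic rigid motions, verifies relation (4), and shows how camera1’s ego-motion is recovered once the extrinsic matrix is known, which is precisely how Eq. (10) is used in Section 3.2.

```python
import numpy as np

def make_T(R, t):
    """4x4 homogeneous transform from rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

rng = np.random.default_rng(0)
T_C1   = make_T(rot_z(0.10), rng.normal(size=3))   # camera1 ego-motion T_C1(k)
T_C2   = make_T(rot_z(-0.05), rng.normal(size=3))  # camera2 ego-motion T_C2(k)
T_C1_2 = make_T(rot_z(1.00), rng.normal(size=3))   # extrinsic T_{C1<-C2}(k-1)

# Relation (4): foreground motion seen by camera1
T_F1_C1 = np.linalg.inv(T_C1) @ T_C1_2 @ T_C2 @ np.linalg.inv(T_C1_2)

# Given the other three matrices, camera1's motion follows by rearranging (4):
T_C1_rec = T_C1_2 @ T_C2 @ np.linalg.inv(T_C1_2) @ np.linalg.inv(T_F1_C1)
assert np.allclose(T_C1_rec, T_C1)
```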
3 Synchronized Estimation

Recall the overview of the algorithm in Fig. 3. The synchronized estimation stage is divided into two steps: extrinsic calibration using the frames at times k−3, k−2 and k−1, and motion estimation using the frames at times k−1 and k.

3.1 Extrinsic Calibration of Two Face-to-Face Cameras
First we present an outline of our calibration procedure; the details of each step are given below. The extrinsic calibration of the two cameras is broken down into the following steps: a) for times k−3, k−2 and k−1, compute T_{F_1,C_1}, T_{F_2,C_2}, T_{C_2} and T_{C_1} with the steps of Section 2.3; b) compute the extrinsic matrix T_{C1−2}(k−1) (equivalently T_{C2−1}(k−1) = T_{C1−2}^{−1}(k−1)) from Equations (6)–(9) below. To obtain a unique solution, at least three views from one camera are necessary [6], while avoiding special configurations of view angles. In the following equations, all matrices other than the extrinsic matrix T_{C1−2} can be calculated from the image observations and are treated as known. Using Equation (4) at time k−1, and noting that the extrinsic matrix evolves as T_{C1−2}(k−1) = T_{C_1}^{−1}(k−1) T_{C1−2}(k−2) T_{C_2}(k−1), we obtain

T_{F_1,C_1}(k−1) = T_{C1−2}(k−1) T_{C_2}(k−1) T_{C1−2}^{−1}(k−1) T_{C_1}^{−1}(k−1),   (6)

or, equivalently, the AX = XB form

T_{F_1,C_1}(k−1) T_{C_1}(k−1) T_{C1−2}(k−1) = T_{C1−2}(k−1) T_{C_2}(k−1).   (7)

Applying (6) at time k−2, together with T_{C1−2}(k−2) = T_{C_1}(k−1) T_{C1−2}(k−1) T_{C_2}^{−1}(k−1), gives

T_{F_1,C_1}(k−2) = T_{C_1}(k−1) T_{C1−2}(k−1) T_{C_2}^{−1}(k−1) T_{C_2}(k−2) T_{C_2}(k−1) T_{C1−2}^{−1}(k−1) T_{C_1}^{−1}(k−1) T_{C_1}^{−1}(k−2),   (8)

and hence a second AX = XB form,

T_{C_1}^{−1}(k−1) T_{F_1,C_1}(k−2) T_{C_1}(k−2) T_{C_1}(k−1) T_{C1−2}(k−1) = T_{C1−2}(k−1) T_{C_2}^{−1}(k−1) T_{C_2}(k−2) T_{C_2}(k−1).   (9)
In the estimation of the extrinsic motion, we decompose T into R and t. The problem can then be simplified to computing the X that satisfies AX = XB in Equations (7) and (9) for X = R_{C1−2}(k−1); t_{C1−2}(k−1) can easily be obtained from R_{C1−2}(k−1) and Equations (7) and (9). Here, both A and B are known, and X is the unknown to be solved for. While solutions to this question have been studied when A and B are general n × n matrices, here we need solutions that belong to the Euclidean group. In the context of robot sensor calibration, [6] first motivated this equation and provided a closed-form solution. Their approach is based on geometric interpretations
of the eigenvalues and eigenvectors of a rotation matrix; both translation and orientation are calculated simultaneously using least-squares fitting. [7] used this formulation of the problem and developed a non-linear optimization technique to solve it. Park and Martin [8] derived a closed-form solution as a linear least-squares fit. [9] formulated the problem using canonical coordinates of the rotation group, which enables a particularly simple closed-form solution. In [6], conditions for the uniqueness of solutions are discussed: the solution cannot be found from only one measurement, and the parameters can be uniquely estimated from two relative camera motions, provided the rotation angles are neither zero nor π. In this paper, we use the approach described in [8].

3.2 Ego-Motion Estimation
Given T_{F_1,C_1}(k), the motion of the foreground points in the view of camera1, T_{C_2}(k), the motion of the background points in the view of camera2, and T_{C1−2}(k−1), obtained in Section 3.1,
T_{C_1}(k) is computed using Equation (4):

T_{F_1,C_1}(k) = T_{C_1}^{−1}(k) T_{C1−2}(k−1) T_{C_2}(k) T_{C1−2}^{−1}(k−1).   (10)
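For concreteness, here is a hedged sketch of the calibration step. Pairs (A_i, B_i) are formed from Equations (7) and (9), e.g. A = T_{F_1,C_1}(k−1) T_{C_1}(k−1) and B = T_{C_2}(k−1); the rotation part of X is recovered with an SVD/Procrustes step on the rotation logarithms, which coincides with the closed form of Park and Martin [8] for consistent data, and the translation follows from a linear least-squares problem. This is a schematic reimplementation, not the authors’ code.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def solve_ax_xb(As, Bs):
    """Closed-form AX = XB calibration in the spirit of Park and Martin [8].
    As, Bs: lists of 4x4 homogeneous motions built from Eqs. (7) and (9)."""
    alphas = [Rotation.from_matrix(A[:3, :3]).as_rotvec() for A in As]
    betas = [Rotation.from_matrix(B[:3, :3]).as_rotvec() for B in Bs]
    # rotation: minimize sum ||alpha_i - R_X beta_i||^2 (orthogonal Procrustes)
    H = sum(np.outer(b, a) for a, b in zip(alphas, betas))
    U, _, Vt = np.linalg.svd(H)
    R_X = Vt.T @ np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)]) @ U.T
    # translation: stack (R_Ai - I) t_X = R_X t_Bi - t_Ai and solve least squares
    C = np.vstack([A[:3, :3] - np.eye(3) for A in As])
    d = np.concatenate([R_X @ B[:3, 3] - A[:3, 3] for A, B in zip(As, Bs)])
    t_X = np.linalg.lstsq(C, d, rcond=None)[0]
    X = np.eye(4)
    X[:3, :3], X[:3, 3] = R_X, t_X
    return X

# Once X = T_C1-2(k-1) is known, camera1's ego-motion follows from Eq. (10):
#   T_C1(k) = X @ T_C2(k) @ np.linalg.inv(X) @ np.linalg.inv(T_F1C1_k)
```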
4 Evaluations

Both simulated and real data are used for the evaluations. With synthetic data, we check the accuracy of the approach and its sensitivity to various levels of noise. With real data, the procedure outlined in Fig. 3 is implemented along with the proposed calibration and estimation approach; furthermore, stabilization results using the estimated ego-motion matrices are shown to demonstrate the feasibility and accuracy of the approach.

4.1 Evaluations with Simulated Data
The simulated data were created using a set of known 3D points and transformations. The transformations between the two cameras and the ego-motions of both cameras were constructed with random rotation axes, angles and translation vectors. In order to analyze the influence of noise, data sets were generated with Gaussian noise of three different levels added to the 2D pixel points. The resulting error in the calibration transformation is plotted in Fig. 4. The error in the final estimated transformation matrix, or residual error, is defined as ‖T_true − T_estimated‖_F, where ‖·‖_F is the Frobenius norm of the matrix. Referring to the results (Figs. 4 and 5), some interesting observations can be made. The proposed approach produces results with low error. In Fig. 5, we also show the residual error resulting from noise in the calibration matrix; this noise does not greatly affect the final estimation result, presumably because the error introduced in the calibration stage is largely cancelled by an inverse computation.
Fig. 4. Residual error of the calibration result (residual error vs. data set number) for different levels of Gaussian noise (variance 1, 2 and 4 pixels) added to the 2D image points
Fig. 5. Residual error of the estimated motion matrices (residual error vs. data set number) for different levels of noise (5, 10, 20 and 45 degrees of rotation angle error) on the calibration matrix
4.2 Experimental Results with Real Data
For the real experiments, recall the overview of the algorithm in Fig. 3. We used real video data with the cameras affixed to the actors’ bodies. Before estimation, the synchronized input videos from both cameras are pre-processed to segment out the background region and the object region. In this step, color-distribution-based mean-shift region tracking [11] is used for the object region. SIFT features [12] are extracted and matched between two consecutive images for the background region and the object, respectively; Fig. 6 shows the resulting SIFT feature matches. The synchronized estimation step has two stages. In the first stage, the extrinsic parameters of the two cameras are calibrated from a total of four image pairs (two for each camera, using data at three time steps) with the method of Section 3.1; the two-view transformation matrices for the foreground and background regions are computed using RANSAC [1], and the calibration matrix is computed using the approach described in [8]. The ego-motion is then estimated using the method of Section 3.2.
Fig. 6. One set of data used for the transformation computations (A and B) in the calibration stage. Left column, top to bottom: point matches in the background region of video2; point matches in the foreground region of video2; point matches in the background region of video1. Right column, top to bottom: stabilization result using the background region of video2; stabilization result using the foreground region of video2; stabilization result using the background region of video1.
Fig. 7. Stabilization results. Top left: point matches in the background region of video2; there are not enough features, and the conventional approach failed in this case. Top right: original image before the motion. Bottom: stabilization result using the proposed approach. Compared with the original image at the top right, our approach provides an acceptable stabilization result.
Finally, a 2D affine transformation is derived from the motion matrix for stabilization, considering only the effect of rotation: x′ = sRx, where we set s = 1/R_33 for simplicity; x and x′ are the homogeneous image points before and after the motion. Since the main purpose of this paper is ego-motion recovery, stabilization has not been treated carefully and is left as future work. Fig. 7 shows that stabilization using only the background region or only the foreground region fails, whereas the stabilization result with the proposed approach, compared with the original image, is acceptable.
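The warp itself is straightforward; the following OpenCV sketch applies the rotation-only correction x′ = sRx described above. It is an illustrative, assumption-laden sketch rather than the authors’ implementation: whether R or its inverse is applied depends on the direction convention of the estimated motion, and conjugating by the intrinsics K (the usual infinite-homography form) is offered as an option the paper does not spell out.

```python
import numpy as np
import cv2

def stabilize_frame(frame, T_cam, K=None):
    """Rotation-only stabilization sketch: x' = s R x with s = 1/R_33.
    T_cam: estimated 4x4 camera motion. If the intrinsics K are supplied, the
    rotation is conjugated by K; otherwise R is applied to homogeneous pixel
    coordinates directly, as written in the text. Use np.linalg.inv(T_cam)
    instead if the motion convention requires undoing the estimated motion."""
    R = T_cam[:3, :3]
    H = R / R[2, 2]                      # s = 1 / R_33
    if K is not None:
        H = K @ H @ np.linalg.inv(K)
    h, w = frame.shape[:2]
    return cv2.warpPerspective(frame, H, (w, h))
```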
5 Conclusion

Accurate estimation of ego-motion is difficult when there is a moving foreground, and in some situations it is almost impossible. To address this problem, we proposed a new approach that utilizes an additional video captured by a camera attached to the foreground object (i.e., another actor in our application). We first configure the sensor system as two face-to-face moving cameras and then derive the relationship between four views from the two cameras. In the estimation stage, the two cameras are calibrated first, with the extrinsic relationship expressed as an AX = XB problem, and the ego-motion is then estimated. Experiments with simulated and real data verify that this approach can provide acceptable ego-motion estimation and stabilization results.
Acknowledgment. This work was supported in part by the NKBRPC (No. 2006CB303100), NSFC Grant (No. 60333010), NSFC Grant (No. 60605001) and the Key Grant Project of the Chinese Ministry of Education (No. 103001).
References 1. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000) 2. Faugeras, O., Luong, Q.T., Papadopoulo, T.: The geometry of multiple images. MIT Press, Cambridge (2001) 3. Schindler, K., Suter, D.: Two-view multibody structure-and-motion with outliers through model selection. IEEE T-PAMI 28(6), 983–995 (2006) 4. Wolf, L., Shashua, A.: Two-body segmentation from two perspective views. In: Proc. CVPR, pp. 263–270 (2001) 5. Makadia, A., Daniilidis, K.: Correspondenceless Ego-Motion Estimation Using an IMU. In: Proceedings of the IEEE International Conference on Robotics and Automation (2005) 6. Shiu, Y.C., Ahmad, S.: Calibration of wrist-mounted robotic sensors by solving homogenous transform equations of the form AX = XB. IEEE Transactions on Robotics and Automation 5(1), 16–29 (1989) 7. Li, M.: Kinematic calibration of an active head-eye system. IEEE Transactions on Robotics and Automation 14(1), 153–157 (1998)
554
J. Cui et al.
8. Park, F.C., Martin, B.J.: Robot sensor calibration: Solving AX = XB on the Euclidean group. IEEE T-RA 10(5), 717–721 (1994) 9. Neubert, J., Ferrier, N.J.: Robust active stereo calibration. In: Proceedings of the IEEE International Conference on Robotics and Automation, vol. 3, pp. 2525–2531 (2002) 10. Sato, J.: Recovering Multiple View Geometry from Mutual Projections of Multiple Cameras. Int. J. Comput. Vision 66(2), 123–140 (2006) 11. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-Based Object Tracking. IEEE Trans. Pattern Analysis Machine Intell. 25(5), 564–575 (2003) 12. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
Optical Flow–Driven Motion Model with Automatic Variance Adjustment for Adaptive Tracking

Kazuhiko Kawamoto

Kyushu Institute of Technology, 1-1 Sensui-cho, Tobata-ku, Kitakyushu 804-8550, Japan
[email protected]
Abstract. We propose a statistical motion model for sequential Bayesian tracking, called the optical flow–driven motion model, and present an adaptive particle filter algorithm based on it. The model predicts the current state with the help of optical flows, i.e., it explores the state space using information derived from the current and previous images of an image sequence. In addition, we introduce an automatic method for adjusting the variance of the motion model, a parameter that is set manually in most particle filters. In experiments with synthetic and real image sequences, we compare the proposed motion model with a random walk model, which is widely used for tracking, and show that the proposed model outperforms the random walk model in terms of accuracy even though their execution times are almost the same.
1 Introduction
Particle filters [1] have proven to be a powerful and popular tool for visual tracking. One strength of particle filters is the ability to deal with a wide range of statistical models in sequential Bayesian estimation. In particle filters, a filtering distribution is approximately represented by a finite number of weighted samples, referred to as particles, and is updated by propagating the particles through time. The probability distribution most commonly used for propagation is a prior model [2,3,4], which describes the state dynamics. However, this often gives a poor estimate if unexpected motions occur, because the model explores the state space without any additional information on the current state. For adaptive particle propagation, we propose a statistical motion model which predicts the current state with the help of sparse optical flows. We call it the optical flow–driven motion model. With current computing power, the real-time computation of sparse optical flows has become possible even when sophisticated methods, such as robust estimation and hierarchical search, are employed; hence, constructing the motion model is not expensive. In addition, we introduce a method for adjusting the variance of the motion model. The variance affects the robustness to unexpected motions and the accuracy of the Monte Carlo approximation. This adjustment is based on an error propagation technique [5] and is fully automatic, i.e., manual setting of the variance is not required.
This motion model becomes more effective in combination with observation models based on global image features, such as color-histogram-based models [6], because optical flow and such global image features are complementary. In the experiments, we implement the particle filter algorithm using a color-histogram-based observation model. This paper is organized as follows. In Section 2, we review particle filters and related work. In Section 3, we propose the optical flow–driven motion model and show how to construct it. In Section 4, we show experimental results with synthetic and real image sequences.
2 Related Works
Visual tracking can be formulated as the problem of estimating recursively in time the filtering distribution p(x_t | y_{1:t}) of the state x_t, given the sequence of observations y_{1:t} ≡ {y_k | k = 1, 2, ..., t}. In the context of visual tracking, x_t represents the state of a target, such as the position, the rotational angle, and the velocity, and y_t might be an intensity pattern, feature points, or a histogram. The states x_k, k = 1, 2, ..., t, are assumed to be Markovian given an initial distribution p(x_0) and a transition distribution p(x_t | x_{t−1}). The observations y_k, k = 1, 2, ..., t, are conditionally independent with distribution p(y_k | x_k) given the state x_k. A recursive estimation of the filtering distribution p(x_t | y_{1:t}) can be achieved by iterating two steps:

prediction: p(x_t | y_{1:t−1}) = \int p(x_t | x_{t−1}) p(x_{t−1} | y_{1:t−1}) dx_{t−1},   (1)
filtering: p(x_t | y_{1:t}) ∝ p(y_t | x_t) p(x_t | y_{1:t−1}).   (2)
For linear–Gaussian state space models, the Kalman filter gives an analytical solution to this recursive estimation. However, for more general models, including the model studied in this paper, analytical evaluation is impossible. Particle filters give a numerical solution to the recursive estimation with Monte Carlo methods. The basic idea of particle filtering is to approximately represent the filtering distribution in the pointwise form p(x_t | y_{1:t}) ≈ \sum_{i=1}^{N} w_t^{(i)} δ(x_t − x_t^{(i)}), with \sum_{i=1}^{N} w_t^{(i)} = 1, where δ(·) denotes the Dirac delta function, x_t^{(i)} is an independent and identically distributed ith random sample at time t, called a particle, drawn from p(x_t | y_{1:t}), and w_t^{(i)} is the normalized weight associated with x_t^{(i)}. There are several sampling methods for the prediction step in the recursive estimation. One of the most widely used is to draw particles x_t^{(i)}, i = 1, ..., N, from a prior model p(x_t | x_{t−1}) [2,3,4]. As prior models, “smooth” motion models such as the random walk and constant velocity models
x_{t+1} = x_t + v_t,   (3)
x_{t+1} = 2x_t − x_{t−1} + v_t   (4)
are often used for visual tracking, where v_t is a white Gaussian noise. Although the motion models in Eqs. (3) and (4) work well in many situations, they cannot, by their nature, deal with rapid changes in the target’s motion. Thus, if a rapid motion change happens, the particles drawn from such a smooth model can give a poor Monte Carlo approximation of the filtering distribution. To improve the Monte Carlo approximation, adaptive sampling methods can be useful. The auxiliary particle (AP) filter [7], the self-organizing state space model, and the sequential importance sampling (SIS) filter [8,9] (and their variants) include mechanisms for adaptive sampling. The AP filter [7] draws particles from

q(x_t, i | y_{1:t}) ∝ w_{t−1}^{(i)} p(y_t | μ_t^{(i)}) p(x_t | x_{t−1}^{(i)}),   (5)

where μ_t^{(i)} is some characterization of x_t given x_{t−1}^{(i)}. This distribution is conditioned on the current observation y_t, so it adapts to y_t. This adaptation leads to a more efficient sampling method, but if p(x_t | x_{t−1}^{(i)}) is far from the true transition, q(x_t, i | y_{1:t}) can also be poor. The self-organizing state space model [10,11] builds an augmented state vector x̄_t ≡ (x_t θ_t), where θ_t collects unknown parameters of the system, called hyper-parameters. The dynamics of the augmented state can be decomposed into

p(x̄_t | x̄_{t−1}) = p(x_t | x_{t−1}, θ_t) p(θ_t | θ_{t−1}),   (6)
assuming p(x_t | x_{t−1}, θ_t, θ_{t−1}) = p(x_t | x_{t−1}, θ_t) and p(θ_t | x_{t−1}, θ_{t−1}) = p(θ_t | θ_{t−1}). Particles are therefore drawn from a distribution conditioned on the hyper-parameter θ_t. In particular, θ_t can be taken to be the variance of the prior model; in this case, each particle x_t^{(i)} is diffused based on its own variance θ_t^{(i)}. Since particles with large variances are expected to diffuse widely in the state space, such particles are likely to capture rapid motion changes. As a result, the self-organizing state space model is robust to unexpected motions. Although the model is a flexible approach for adaptive sampling, it doubles the dimensionality of the state space, which makes sampling less efficient because more particles are required. The SIS filter [8,9] draws particles from a proposal distribution q(x_t | x_{1:t−1}, y_{1:t}). This distribution includes the prior model p(x_t | x_{t−1}) and Eq. (5) as special cases. The optimal proposal distribution is

p(x_t | x_{t−1}^{(i)}, y_t) ∝ p(y_t | x_t) p(x_t | x_{t−1}^{(i)}).   (7)
In visual tracking, this optimal distribution is often not available because p(y_t | x_t) is unknown in most cases. Hence an alternative proposal distribution is necessary. ICondensation [12] constructs a proposal distribution by detecting skin color regions in the image for hand tracking. However, such specific knowledge about objects is not always available for a wide range of objects.
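To fix notation for what follows, here is a minimal sketch (not tied to any particular method cited above) of one sampling-importance-resampling step implementing the recursion (1)–(2); `propagate` and `likelihood` are placeholders for a transition model and an observation model supplied by the caller, and the random-walk prior of Eq. (3) is given as an example propagation.

```python
import numpy as np

def particle_filter_step(particles, weights, propagate, likelihood, rng):
    """One SIR update: prediction via the transition model (Eq. (1)),
    reweighting by the observation likelihood (Eq. (2)), then resampling.
    propagate(particles, rng) draws from p(x_t | x_{t-1});
    likelihood(particles) evaluates p(y_t | x_t) for each particle."""
    particles = propagate(particles, rng)          # prediction
    weights = weights * likelihood(particles)      # filtering (unnormalized)
    weights = weights / (weights.sum() + 1e-300)
    n = len(weights)
    if 1.0 / np.sum(weights ** 2) < n / 2:         # resample when the ESS is low
        idx = rng.choice(n, size=n, p=weights)
        particles, weights = particles[idx], np.full(n, 1.0 / n)
    return particles, weights

def random_walk(particles, rng, sigma=3.0):
    """The random-walk prior of Eq. (3) as an example propagation model."""
    return particles + rng.normal(scale=sigma, size=particles.shape)
```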
3 Optical Flow-Driven Motion Model
We propose a statistical motion model, formally expressed by

x_t = f(x_{t−1}, u_t, v_t),   v_t ∼ N(0, Σ_v),   (8)
where v_t is a white Gaussian noise with mean 0 and covariance matrix Σ_v. Unlike the smooth models in Eqs. (3) and (4), this model includes a term u_t which is estimated from sparse optical flows between two successive images. The term u_t helps the particles to capture the object of interest by guiding them based on the current and previous images. We call this the optical flow–driven motion model, because the state is mainly driven by optical flows. In the experiments, the object of interest is represented by its bounding box (more general object models are also possible). The reference bounding box to be tracked is specified by the user at time 0. The bounding box is modeled by (w, h, t_x, t_y), where w and h are the width and the height of the bounding box, respectively, and (t_x, t_y) is its center. The state vector at time t is therefore defined by x_t = (w_t, h_t, t_{xt}, t_{yt})^T. The covariance matrix Σ_v is assumed to be a diagonal matrix Σ_v = diag(σ_{v,w}^2, σ_{v,h}^2, σ_{v,tx}^2, σ_{v,ty}^2) in what follows. In this setting, Eq. (8) can be specified by

x_t = ( (u_{wt} + v_{wt}) w_{t−1},  (u_{ht} + v_{ht}) h_{t−1},  t_{x,t−1} + u_{tx,t} + v_{tx,t},  t_{y,t−1} + u_{ty,t} + v_{ty,t} )^T,   (9)

where v_t = (v_{wt}, v_{ht}, v_{tx,t}, v_{ty,t})^T and u_t = (u_{wt}, u_{ht}, u_{tx,t}, u_{ty,t})^T, the latter estimated from optical flows. The underlying motion in Eq. (9) is an anisotropic similarity transformation, which includes planar rigid transformations. Since the elements of u_t change at every time step, the model in Eq. (9) is adaptively updated.
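In code, drawing particles from (9) is a one-liner per component; the sketch below (an illustrative reimplementation, not the author’s code) propagates an (N, 4) array of bounding-box states given an estimate of u_t and per-component noise standard deviations.

```python
import numpy as np

def propagate_state(particles, u, sigma_v, rng):
    """Draw particles from the optical flow-driven model of Eq. (9).

    particles : (N, 4) array of states (w, h, tx, ty)
    u         : (u_w, u_h, u_tx, u_ty) estimated from optical flows, Eq. (12)
    sigma_v   : per-component noise std. dev. (adjusted as in Section 3.2)"""
    v = rng.normal(scale=sigma_v, size=particles.shape)
    out = np.empty_like(particles)
    out[:, 0] = (u[0] + v[:, 0]) * particles[:, 0]   # width:    multiplicative update
    out[:, 1] = (u[1] + v[:, 1]) * particles[:, 1]   # height:   multiplicative update
    out[:, 2] = particles[:, 2] + u[2] + v[:, 2]     # centre x: additive update
    out[:, 3] = particles[:, 3] + u[3] + v[:, 3]     # centre y: additive update
    return out
```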
3.1 Robust Estimation of u_t from Optical Flows
The estimation of u_t from optical flows is a common problem in computer vision. Let r_{αt} = (x_{αt}, y_{αt})^T, α = 1, ..., M,¹ denote the αth feature point in the image at time t, and assume that r_{αt} is a feature point of the object. The geometrical relation between r_{α,t−1} and r_{αt} is written as

r_{αt} = D_t r_{α,t−1} + t_t,  where  D_t = [ u_{wt}  0 ;  0  u_{ht} ]  and  t_t = (u_{tx,t}, u_{ty,t})^T.   (10)

Therefore, if more than one pair of corresponding points, i.e., r_{αt} ↔ r_{α,t−1}, α = 1, 2, ..., is available, the least-squares estimate of u_t is obtained by solving the optimization problem

\sum_{α=1}^{M} ‖ r_{αt} − (D_t r_{α,t−1} + t_t) ‖^2 → min.   (11)

¹ The number of feature points M may vary from time to time, but the time subscript is suppressed for the sake of brevity.
The solution to Eq. (11) is calculated as

u_{wt} = (1/Δ_x) [ M \sum_{α=1}^{M} x_{α,t−1} x_{αt} − \sum_{α=1}^{M} x_{α,t−1} \sum_{α=1}^{M} x_{αt} ],
u_{ht} = (1/Δ_y) [ M \sum_{α=1}^{M} y_{α,t−1} y_{αt} − \sum_{α=1}^{M} y_{α,t−1} \sum_{α=1}^{M} y_{αt} ],
u_{tx,t} = (1/Δ_x) [ \sum_{α=1}^{M} x_{α,t−1}^2 \sum_{α=1}^{M} x_{αt} − \sum_{α=1}^{M} x_{α,t−1} \sum_{α=1}^{M} x_{α,t−1} x_{αt} ],
u_{ty,t} = (1/Δ_y) [ \sum_{α=1}^{M} y_{α,t−1}^2 \sum_{α=1}^{M} y_{αt} − \sum_{α=1}^{M} y_{α,t−1} \sum_{α=1}^{M} y_{α,t−1} y_{αt} ],   (12)
where

Δ_x = M \sum_{α=1}^{M} x_{α,t−1}^2 − ( \sum_{α=1}^{M} x_{α,t−1} )^2,   Δ_y = M \sum_{α=1}^{M} y_{α,t−1}^2 − ( \sum_{α=1}^{M} y_{α,t−1} )^2.   (13)
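The closed form (12)–(13) amounts to two independent simple linear regressions (current x on previous x, current y on previous y); a direct NumPy transcription is given below as an illustrative sketch, assuming matched inlier points are already available as (M, 2) arrays.

```python
import numpy as np

def estimate_u(prev_pts, cur_pts):
    """Closed-form least-squares estimate of u_t = (u_w, u_h, u_tx, u_ty),
    Eqs. (12)-(13), from matched feature points (inliers assumed)."""
    x0, y0 = prev_pts[:, 0], prev_pts[:, 1]
    x1, y1 = cur_pts[:, 0], cur_pts[:, 1]
    M = len(x0)
    dx = M * (x0 ** 2).sum() - x0.sum() ** 2            # Delta_x, Eq. (13)
    dy = M * (y0 ** 2).sum() - y0.sum() ** 2            # Delta_y, Eq. (13)
    u_w = (M * (x0 * x1).sum() - x0.sum() * x1.sum()) / dx
    u_h = (M * (y0 * y1).sum() - y0.sum() * y1.sum()) / dy
    u_tx = ((x0 ** 2).sum() * x1.sum() - x0.sum() * (x0 * x1).sum()) / dx
    u_ty = ((y0 ** 2).sum() * y1.sum() - y0.sum() * (y0 * y1).sum()) / dy
    return np.array([u_w, u_h, u_tx, u_ty])
```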
In practice a proportion of feature points, r αt , α = 1, . . . , M, may be outliers, i.e., some of them may be feature points of background or other objects. Since the least-squares estimate to eq. (12) is sensitive to outliers, a robust method for estimating ut is necessary. In order to remove outliers we employ the RANSAC (Random Sample Consensus) algorithm [13]. After removing outliers, we can obtain the least-squares estimate recalculated from only inliers. For RANSAC, an important parameter is the distance threshold d which is used to classify a given data into inliers and outliers. If the measurement error deviation σr , a of rαt is isotropic and Gaussian with zero mean and standard √ reasonable choice of the threshold is d = χ22 (0.95) σr ≈ 5.99 σr , where χ2m (α) is the α × 100 percentile of the χ2 distribution with m degrees of freedom. In practice σr is empirically determined because it is usually unknown (we set σr = 3 in the experiments). In what follows we assume the outliers are removed and all of r αt , α = 1, . . . , M, are inliers for simple notation. 3.2
3.2 Automatic Variance Adjustment by Error Propagation
The variance Σ_v of the stochastic term v_t in eq. (9) is important for an appropriate diffusion of particles. If the variance is too large, most of the particles may not contribute to the approximation of the filtering distribution, which results in inefficient sampling. If the variance is too small, the particles may not accurately capture the characteristics of the filtering distribution because of a loss of particle diversity. In order to adjust Σ_v automatically, we assume that the uncertainty of the model in eq. (8) is almost the same as that of u_t, i.e., Σ_v ≈ Σ_u, where Σ_u is the covariance matrix of u_t and is assumed to be a diagonal matrix Σ_u = diag(σ²_{u,w}, σ²_{u,h}, σ²_{u,tx}, σ²_{u,ty}). In fact we take the variance of v_t to be that of u_t. We
estimate the variance of u_t from optical flows using an error propagation technique [5]. If optical flows are inaccurate, the variance of u_t increases accordingly, and vice versa. The error propagation is generally based on the relation
σ²_θ = (∂f/∂x)² σ²_x,   (14)

where x and θ are related by θ = f(x) and σ²_x, σ²_θ are the variances of x and θ. From eqs. (12) and (14), we estimate the variances of u_t by

σ̂²_{u,w} = (M/Δ_x) σ²_r,   σ̂²_{u,h} = (M/Δ_y) σ²_r,
σ̂²_{u,tx} = (1/Δ_x) Σ_{α=1}^{M} x²_{αt−1} σ²_r,   σ̂²_{u,ty} = (1/Δ_y) Σ_{α=1}^{M} y²_{αt−1} σ²_r.   (15)

The variance σ²_r of r_{αt} is usually unknown, but an unbiased estimate of σ²_r is calculated by

σ̂²_r = ε̂² / (2M − 2),   (16)

where ε̂² is the sum of squared residuals of eq. (11). We therefore obtain the variances σ̂²_{u,w}, σ̂²_{u,h}, σ̂²_{u,tx}, σ̂²_{u,ty} by substituting σ̂²_r from eq. (16) for σ²_r in eq. (15).
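A minimal sketch of this variance adjustment, assuming the inlier correspondences and the estimated motion u from the previous step, is given below; function and variable names are illustrative.

```python
import numpy as np

def adjust_variance(prev_pts, curr_pts, u):
    """Estimate the diagonal of Sigma_u ~ Sigma_v from eqs. (15)-(16)."""
    x0, y0 = prev_pts[:, 0], prev_pts[:, 1]
    m = len(x0)
    dx = m * np.sum(x0 ** 2) - np.sum(x0) ** 2
    dy = m * np.sum(y0 ** 2) - np.sum(y0) ** 2
    # residual sum of squares of eq. (11)
    pred = np.column_stack([u[0] * x0 + u[2], u[1] * y0 + u[3]])
    eps2 = np.sum((curr_pts - pred) ** 2)
    sigma_r2 = eps2 / (2 * m - 2)                     # unbiased estimate, eq. (16)
    var_w  = m / dx * sigma_r2                        # eq. (15)
    var_h  = m / dy * sigma_r2
    var_tx = np.sum(x0 ** 2) / dx * sigma_r2
    var_ty = np.sum(y0 ** 2) / dy * sigma_r2
    return np.array([var_w, var_h, var_tx, var_ty])   # diagonal of Sigma_u
```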
3.3 Redetection of Feature Points
The number of feature points r_{αt}, α = 1, ..., M, decreases over time, because tracking of some feature points may fail and RANSAC may remove some feature points as outliers. With a small number of feature points, the estimates in eq. (12) may be inaccurate. In the worst case, no feature points may remain in the image, making it impossible to estimate u_t; the proposed tracking algorithm would then no longer work. Hence the redetection of feature points is necessary. To this end, feature points are redetected within the bounding box corresponding to the mode of the filtering distribution, calculated as x̂_t = argmax_{x_t^{(i)}} p(y_t | x_t^{(i)}), whenever the number of feature points falls below a threshold (set to 10 in the experiments).
4 Experiments with Synthetic and Real Image Sequences
We compare the performance of the proposed motion model with a prior model using synthetic and real image sequences. Specifically, the random walk model in eq. (3) is used as the prior model in both experiments. The elements of the noise v in the random walk model are assumed to be independent white Gaussian noises with variances σ²_{v,w} = 0.01 w_0, σ²_{v,h} = 0.01 h_0, σ²_{v,tx} = σ²_{v,ty} = 3.0² (pixel), where w_0 and h_0 are the width and height of the reference bounding box, respectively.
The observations are taken to be the normalized histograms of the bounding box in the RGB channels, denoted by y = (hR , hG , hB ). Then the observation model is defined as
p(y_t | x_t) ∝ exp( − [ B²(h^R, h^R_r) + B²(h^G, h^G_r) + B²(h^B, h^B_r) ] / (2σ²) ),   (17)

where (h^R_r, h^G_r, h^B_r) are the normalized histograms of the reference bounding box and B(h, h′) is the Bhattacharyya distance between normalized histograms with N_h bins, defined as B²(h, h′) = 1 − Σ_{i=1}^{N_h} sqrt(h_i h′_i). In both experiments, the parameter σ in eq. (17) is set to σ = 0.001. The number of particles is set to 200. The two particle filters are implemented on a computer with a Pentium 4 (3.4 GHz) CPU and 2 GB of main memory.
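For reference, a small sketch of this observation model, assuming pre-computed normalized RGB histograms, follows; it is one possible reading of eq. (17), not the authors' code.

```python
import numpy as np

def bhattacharyya_sq(h, h_ref):
    """Squared Bhattacharyya distance between two normalized histograms."""
    return 1.0 - np.sum(np.sqrt(h * h_ref))

def observation_likelihood(hist_rgb, ref_rgb, sigma=0.001):
    """Likelihood p(y_t | x_t) of eq. (17), up to a normalizing constant.
    hist_rgb, ref_rgb: lists of three normalized histograms (R, G, B)."""
    d2 = sum(bhattacharyya_sq(h, r) for h, r in zip(hist_rgb, ref_rgb))
    return np.exp(-d2 / (2.0 * sigma ** 2))
```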
4.1 Synthetic Example
The purpose of this example is to quantitatively evaluate the performance difference between the proposed model and the prior model, for which ground truth is necessary. We therefore generate the synthetic image sequence by transforming the “Lena” image in Fig. 1 (left) with the motion parameters in Fig. 1 (right). The white bounding box in Fig. 1 (left) is a manually selected region at time 0 of size 90 × 100 pixels, and the feature points within the box are used for tracking the region with the proposed model. The feature points are detected by the Harris detector [14] and tracked by the Lucas–Kanade–Tomasi tracker [15] with a pyramidal implementation [16]. We perform 100 simulations and evaluate the root mean squared error (RMSE) between the mean estimate of the state, calculated as x̄_t = (1/N) Σ_{i=1}^{N} x_t^{(i)}, and the ground truth. In Fig. 2 the RMSEs of the two models are presented. The results show that the RMSEs of the prior model increase sharply at time 6. This increase is caused by the large displacement of the target region from time 5 to 6 ((305, 215) → (320, 200)), as shown in Fig. 1 (right); i.e., the prior model cannot follow the target region because of the unexpected motion. In contrast, the proposed model provides more accurate and stable estimates. The average execution times with the proposed model and the prior model are 39
t     0    1    2    3    4    5    6    7    8    9    10
w_t   90   90   90   90   90   90   90   90   90   90   90
h_t   100  100  100  100  100  100  100  100  100  100  100
t_x   300  301  302  303  304  305  320  321  322  323  324
t_y   220  219  218  217  216  215  200  199  198  197  196
Fig. 1. The “Lena” image (left) and the true state parameters (right) used for generating the synthetic image sequence
[Fig. 2 consists of four panels — (a) width w_t, (b) height h_t, (c) x-directional translation t_{x,t}, (d) y-directional translation t_{y,t} — each plotting the error in pixels against time (frames) for the SIR (prior) model and the proposed model.]
Fig. 2. Root mean squared errors (RMSE) between the estimated mean of the state and the ground truth for the synthetic image sequence
msec/frame and 36 msec/frame, respectively. Consequently, the proposed model outperforms the prior model in terms of accuracy even though the execution times are almost the same.

4.2 Real Example
The real image sequence consists of 240 frames (8 sec) at 320 × 240 pixels resolution, in which a target object of size 80 × 100 pixels is moved by hand. Figure 3 shows the mean estimates (the white bounding box) provided by the prior model (upper row) and the proposed model (lower row) at times 126, 128, 130, 137, 139, and 141. These results are selected as typical tracking examples to show the difference between the two models. Because the object is moved rapidly at time 126, the prior model loses track of it, as shown at times 128 and 130. Similarly, after time 137, the prior model fails to track the object and locks onto a false object with a relatively similar color pattern (the can on the desk), as shown at time 141. In contrast, the proposed model successfully tracks the object throughout the image sequence. The execution times with the proposed model and the prior model are 14 msec/frame and 12 msec/frame, respectively.
[Fig. 3 shows frames #126, #128, #130, #137, #139, and #141 of the real sequence.]
Fig. 3. Mean estimate of the state with the three proposal distributions
5 Conclusion
We have proposed an adaptive statistical motion model, called the optical flow–driven motion model, for particle filter based tracking. This motion model explores the state space with the help of sparse optical flows, and the exploration can be carried out quickly because optical flow is estimated by a gradient method, which performs only a fast local search. Furthermore, we introduced a variance adjustment method for the motion model. This adjustment is derived from the error propagation technique and is fully automatic. The determination of the variance of the motion model is important because the variance affects tracking accuracy and efficiency; it is, however, manually determined in most particle filters. The experimental results with the synthetic and real image sequences show that the proposed model provides better performance than the random walk model.
Acknowledgements This work is supported by the Ministry of Education, Culture, Sports, Science and Technology, Japan, under a Grant-in-Aid (No.19700174).
References
1. Doucet, A., de Freitas, N., Gordon, N.J.: Sequential Monte Carlo Methods in Practice. Springer, Heidelberg (2001)
2. Gordon, N.J., Salmond, D.J., Smith, A.F.M.: Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proc.-F 140(2), 107–113 (1993)
3. Kitagawa, G.: Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. J. Comput. Graph. Stat. 5(1), 1–25 (1996)
4. Isard, M., Blake, A.: Condensation – Conditional density propagation for visual tracking. Int. J. Computer Vision 29(1), 5–28 (1998)
5. Haralick, R.M.: Propagating Covariance in Computer Vision. Int. J. Pattern Recognition and Artificial Intelligence 10(5), 561–572 (1996)
6. Nummiaro, K., Koller-Meier, E., Gool, L.V.: An adaptive color-based particle filter. Image and Vision Computing 21, 99–110 (2003)
7. Pitt, M.K., Shephard, N.: Filtering Via Simulation: Auxiliary Particle Filters. J. the American Statistical Association 94, 590–599 (1999)
8. Liu, J.S., Chen, R.: Sequential Monte Carlo methods for dynamical systems. J. the American Statistical Association 93(443), 1032–1044 (1998)
9. Doucet, A., Godsill, S., Andrieu, C.: On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing 10, 197–208 (2000)
10. Kitagawa, G.: Self-organizing state space model. J. the American Statistical Association 93(443), 1203–1215 (1998)
11. Ichimura, N.: Stochastic Filtering for Motion Trajectory in Image Sequences Using a Monte Carlo Filter with Estimation of Hyper-Parameters. Proc. Int. Conf. on Pattern Recognition IV, 68–73 (2002)
12. Isard, M., Blake, A.: ICondensation: Unifying low-level and high-level tracking in a stochastic framework. Proc. European Conf. Computer Vision 1, 893–908 (1998)
13. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Comm. ACM 24(6), 381–395 (1981)
14. Harris, C., Stephens, M.: A combined corner and edge detector. In: Proc. 4th Alvey Vision Conf., pp. 147–151 (August 1988)
15. Shi, J., Tomasi, C.: Good Features to Track. In: Proc. Computer Vision and Pattern Recognition, pp. 593–600 (1994)
16. Bouguet, J.Y.: Pyramidal Implementation of the Lucas Kanade Feature Tracker. Intel Corporation, Microprocessor Research Labs (2000)
A Noise-Insensitive Object Tracking Algorithm

Chunsheng Hua^{1,2}, Qian Chen^1, Haiyuan Wu^1, and Toshikazu Wada^1

^1 Graduate School of Systems Engineering, Wakayama University, 930 Sakaedani, Wakayama, 640-8510, Japan
^2 Institute of Scientific and Industrial Research, Osaka University, 8-1 Mihogaoka, Ibaraki, Osaka, 567-0047, Japan
Abstract. In this paper, we present a noise-insensitive pixel-wise object tracking algorithm whose kernel is a new reliable data grouping algorithm that introduces reliability evaluation into the existing K-means clustering (called RK-means clustering). RK-means clustering addresses two problems of the existing K-means clustering algorithm: 1) the unreliable clustering result when noise data exist; 2) the bad or wrong clustering result caused by an incorrectly assumed number of clusters. The first problem is solved by evaluating the reliability of classifying an unknown data vector according to the triangular relationship among it and its two nearest cluster centers; noise data are ignored by being assigned low reliability. The second problem is solved by introducing a new group merging method that detects pairs of “too near” data groups by checking their variance and average reliability and then combines them. We developed a video-rate object tracking system (called the RK-means tracker) with the proposed algorithm. Extensive experiments of tracking various objects in cluttered environments confirmed its effectiveness and advantages.
1 Introduction
Object tracking has drawn the attention of more and more researchers over the last decade, and numerous powerful tracking algorithms have been proposed, such as background subtraction [16], optical flow [17], CONDENSATION [8], template matching [15,19], mean shift [11], the EM algorithm [10], dynamic Bayesian networks [9], iterative clustering [6], the Kalman filter [7], etc. The common feature of these algorithms is that their success depends on checking the similarity between the target model and an unknown region or pixel. To measure such similarity, a threshold is usually applied. Since the target object may move under cluttered conditions, it is difficult to select a threshold that works stably under all conditions. Furthermore, there is no guarantee that the object with the maximum similarity is really the target. Collins et al. [13] argue that, in object tracking, the most important thing is the ability to discriminate the target object from its surrounding background: not only the target feature but also the background feature should be processed during tracking. They propose a method that switches the mean-shift tracking algorithm among different linear combinations of the RGB
colors, which selects the features that best distinguish the object from the surrounding background. However, the color histogram has little identification power, and in the case of high dimensional features like textures, the large number of color combinations (in fact, each image contains 49 such color combinations) prevents their method from achieving real-time performance. Similar ideas have been applied by Nguyen [14] and Zhang [18]. The performance of [18] will be unstable if the target object contains apertures, because the target object is assumed to be solid. In [14], although the more powerful Gabor filter is used to discriminate the target from the background, the long-term performance is questionable because the target is again assumed to be solid: when the target is non-rigid, there is no guarantee that the update of the target template is correct, and the multi-scale problem also remains in [14]. Therefore, almost all the above tracking algorithms share one common problem: when the target object is non-rigid and/or contains apertures, background pixels will be mixed into the target object; when the target object moves against a cluttered background, the continuously mixed background pixels will greatly degrade the purity of the target feature (such as color or texture). In this paper, this phenomenon is called background interfusion. In order to solve the background interfusion problem, Hua et al. [12] proposed a pixel-wise tracking algorithm called the “K-means tracker,” which applies K-means clustering to both the target and background samples to remove the mixed background pixels from the target object. However, two problems degrade the performance of the K-means tracker: 1) K-means clustering will wrongly classify noise data into some pre-defined cluster; 2) a wrongly assumed number of clusters sometimes leads to a wrong clustering result. Although some improvements [1,2,3,4,5] have been made to K-means clustering, such problems still affect the performance of the K-means tracker during object tracking. In this paper, we solve these two problems by introducing reliability estimation into K-means clustering. By considering the triangular relationship among an unknown data vector and its two nearest cluster centers, each data vector is given a reliability value, and noise data are ignored by being assigned low reliability. With a new merging method based on the average reliability and variance of each cluster, the second problem is also solved.
2 The RK-Means Clustering

2.1 Reliability Evaluation
Because noise data are usually distant from any cluster center, the distance from a noise datum to its nearest cluster center is generally longer than that of a normal datum. Thus, this distance can be regarded as a feature that tells noise data from normal data. However, since it is difficult to choose a proper threshold for examining such a feature (as claimed above), we instead use the triangular relationship among an unknown data vector and its two nearest cluster centers to measure whether it can reliably be classified into one of the clusters. High reliability is assigned to normal data and low reliability to noise data.
Fig. 1. The relationship between data vectors and their two closest cluster centers
As shown in Fig. 1, w_1 and w_2 are the cluster centers, and x_1 and x_2 are unknown data vectors. From d_11 and d_21 alone, we can only say that x_2 is closer to w_1 than x_1 is, but we cannot judge whether x_1 and x_2 are noise data vectors. However, according to the shapes of the triangles x_1 w_1 w_2 and x_2 w_1 w_2, we can judge that, compared with x_2, x_1 is more likely to be a noise data vector. While clustering an arbitrary data vector x_k of the data set X = {x_k; k = 1, ..., n}, the reliability value of x_k is defined as the ratio of the distance between its two closest cluster centers to the sum of the distances from x_k to those two cluster centers:

R(x_k) = || w_{f(x_k)} − w_{s(x_k)} || / (d_{kf} + d_{ks}),   (1)

d_{kf} = || x_k − w_{f(x_k)} ||,   d_{ks} = || x_k − w_{s(x_k)} ||,
f(x_k) = argmin_{i=1,...,c} || x_k − w_i ||,   s(x_k) = argmin_{i=1,...,c, i≠f(x_k)} || x_k − w_i ||.   (2)
f(x_k) and s(x_k) are the subscripts of the closest and the second-closest cluster centers to x_k. The degree μ_{kf} to which a data vector x_k belongs to its closest cluster f(x_k) is computed from d_{kf} and d_{ks}:

μ_{kf} = d_{ks} / (d_{kf} + d_{ks}).   (3)
Since R(x_k) indicates how reliably x_k can be classified, the probability that x_k belongs to its closest cluster can be computed as

t_{kf} = R(x_k) · μ_{kf}.   (4)
t_{kf} denotes the probability, under the reliability R(x_k), that x_k belongs to its closest cluster. Given the number of clusters and the initial cluster centers, the data grouping in the RK-means clustering algorithm is carried out in two steps: 1) for each data vector x_k, compute its probability of belonging to its nearest cluster center as in eq. (4); 2) update the clusters by minimizing the following objective function:

J_{rkm}(w) = Σ_{k=1}^{n} t_{kf} || x_k − w_{f(x_k)} ||².   (5)
[Fig. 2 panels: (a) input image, (b) K-means clustering (with a mistake area), (c) RK-means clustering.]
Fig. 2. RK-means clustering can detect the outliers with the reliability evaluation
The cluster centers w are obtained by solving the equation

∂J_{rkm}(w) / ∂w = 0.   (6)

The existence of a solution to eq. (6) can be proved easily if the Euclidean distance is assumed. The cluster centers can be updated separately as

w_j = ( Σ_{k=1}^{n} δ_j(x_k) t_{kf} x_k ) / ( Σ_{k=1}^{n} δ_j(x_k) t_{kf} ),   δ_j(x_k) = 1 if j = f(x_k), 0 otherwise.   (7)

The output of eq. (7) is used as the initial value for step one, and steps one and two are performed iteratively until w converges. A minimal sketch of one such iteration is given below.
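The following Python sketch implements one data-grouping iteration of eqs. (1)–(7); it assumes at least two cluster centers and plain Euclidean distance, and the small epsilon guard is an addition for numerical safety.

```python
import numpy as np

def rk_means_step(X, W):
    """One data-grouping iteration of RK-means clustering.

    X : (n, d) data vectors; W : (c, d) current cluster centers (c >= 2).
    Returns updated centers, per-point reliabilities R, and weights t_kf."""
    dists = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)   # (n, c)
    order = np.argsort(dists, axis=1)
    f, s = order[:, 0], order[:, 1]                  # nearest and second-nearest centers
    d_f = dists[np.arange(len(X)), f]
    d_s = dists[np.arange(len(X)), s]
    centre_gap = np.linalg.norm(W[f] - W[s], axis=1)
    R = centre_gap / (d_f + d_s + 1e-12)             # reliability, eq. (1)
    mu = d_s / (d_f + d_s + 1e-12)                   # membership degree, eq. (3)
    t = R * mu                                       # weight t_kf, eq. (4)
    W_new = W.copy()
    for j in range(len(W)):                          # weighted mean update, eq. (7)
        mask = (f == j)
        if t[mask].sum() > 0:
            W_new[j] = (t[mask, None] * X[mask]).sum(axis=0) / t[mask].sum()
    return W_new, R, t
```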
2.2 Redundant Cluster Deletion
When the assumed number of clusters is greater than the real number in a dataset, there will be some redundant clusters. Such redundant clusters scramble for the data vectors that should belong to one cluster; thus one cluster may be divided into two or more clusters forcibly, which makes the clustering process unstable and unreliable. To solve this problem, we merge two redundant clusters into one according to their variance and average reliability. Fig. 3(A) illustrates the reliability field in a two-dimensional space around two cluster centers. The data vectors located on the line connecting the two
(A) Reliability field  (B) Initial state  (C) Iteration: 5  (D) Iteration: 9
Fig. 3. The changes of two cluster centers in one crowd of data vectors during updating
cluster centers will have higher reliability than the others. Such data vectors attempt to attract the two cluster centers toward each other, and this attraction becomes stronger as the two cluster centers get closer. Fig. 3(B)–(D) show an example of clustering one crowd of data vectors with RK-means clustering when two initial clusters are given. Because the two cluster centers (red circles) get closer as the iterations proceed, the average reliability of the two clusters decreases continuously according to eq. (1). Therefore, we consider it possible to judge whether two clusters should be merged by checking their average reliability and variance. Here we describe the dispersion of cluster i as

v(i) = ( Σ_{x_k ∈ cluster(i)} (x_k − w_i)² ) / N,   (8)

where N is the number of data vectors in cluster i. The average reliability of cluster i is computed by

r(i) = ( Σ_{x_k ∈ cluster(i)} R(x_k) ) / N.   (9)

In Fig. 4, Cases 1–4 show the clustering results of data sets with RK-means clustering under different distributions. In order to check whether two clusters should be merged, we compare the dispersion and the average reliability of the two clusters before and after merging. The result of this comparison is obtained as

R_v(i, j) = ( v(i) + v(j) ) / v(i ∪ j),   R_r(i, j) = ( r(i) + r(j) ) / r(i ∪ j),   (10)

where i ∪ j indicates the cluster merged from clusters i and j. The graph in the right part of Fig. 4 shows the resulting R_v and R_r of the two clusters. By analyzing these results, we discovered that the ratio of R_r to R_v can be used to judge whether two clusters should be merged:

Merge(i, j) = P_m(i, j) = R_r(i, j) / R_v(i, j).   (11)

Fig. 4. Investigation on the possibility of merging two clusters
If P_m(i, j) ≤ 1 then cluster i and cluster j should be merged. This procedure is executed after each update of the cluster centers. Therefore, we add the redundant cluster deletion procedure as the third step of the data grouping process described in subsection 2.1. The RK-means clustering algorithm is summarized as:
1) Initialization: give the number of clusters c and the initial value of each cluster center w_i, i = 1, ..., c.
2) Iteration of data grouping:
   i) calculate f(x_k) and s(x_k) for each x_k;
   ii) update w_i, i = 1, ..., c, by solving eq. (6);
   iii) delete redundant clusters with the merging method described above (a code sketch of the merge test follows below).
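The sketch below evaluates the merge test P_m(i, j) = R_r/R_v ≤ 1 of eqs. (8)–(11) for a candidate pair of clusters. How the merged cluster's center is chosen (here simply the mean of the pooled data) and the reuse of the current reliability values are assumptions of this sketch.

```python
import numpy as np

def should_merge(Xi, Ri, wi, Xj, Rj, wj):
    """Merge test for clusters i and j (eqs. 8-11).

    Xi, Xj : member data vectors; Ri, Rj : their reliabilities;
    wi, wj : the two cluster centers."""
    def dispersion(X, w):
        return np.mean(np.sum((X - w) ** 2, axis=1))          # eq. (8)

    X_union = np.vstack([Xi, Xj])
    R_union = np.concatenate([Ri, Rj])
    w_union = X_union.mean(axis=0)                            # assumed merged center
    Rv = (dispersion(Xi, wi) + dispersion(Xj, wj)) / dispersion(X_union, w_union)
    Rr = (Ri.mean() + Rj.mean()) / R_union.mean()             # eqs. (9)-(10)
    return (Rr / Rv) <= 1.0                                   # eq. (11)
```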
3 Object Tracking Using the RK-means Algorithm
In object tracking, we address background interfusion by applying RK-means clustering to target and background samples. To describe the image feature properly, we assume that each object in the image is composed of one or several regions, and that all the pixels within each region 1) have similar colors and 2) are close to each other in 2D space. According to this assumption, the image feature is described by a combined color–position 5D feature vector f. In fact, any color system can be used in eq. (12) (such as RGB, YCbCr, or YUV):

f = (f_1, f_2, f_3, f_4, f_5)^T = (Y, U, V, αx, αy)^T,   (12)

where α is a weight factor adjusting the importance of the position information relative to the color information. Then, as shown in Fig. 5, the minimum dissimilarities from an unknown data vector f_u to the target clusters and to the surrounding background samples (selected from the ellipse contour) are calculated as

d_T = min_{i=1,...,K} || f_u − f_T(i) ||,   d_B = min_{j=1,...,m} || f_u − f_B(j) ||,   (13)

which also gives the nearest target cluster center f_T(i) and the nearest background center f_B(j) to f_u. The dissimilarity between f_T(i) and f_B(j) is calculated as

d_{TB} = || f_T(i) − f_B(j) ||.   (14)

The reliability of f_u is estimated as

R(f_u) = d_{TB} / (d_T + d_B).   (15)
[Fig. 5 labels: cross point, unknown pixel, target centers, nearest point, and the distances d_T, d_B, d_TB.]
Fig. 5. RK-means clustering on multi-color object with target and background samples
Then the probability that f_u belongs to target cluster i is calculated as

μ_T^{(i)}(f_u) = R(f_u) · d_B / (d_T + d_B).   (16)
According to eq. (7), the target centers are then smoothly updated in both the color and the position space. How to select the background samples and update the search area, the tracking failure detection and recovery, and the initialization process are reported in detail in [12].
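A compact sketch of the per-pixel classification step of eqs. (13)–(16) is given below, assuming the 5-D color–position features have already been built; the epsilon guard and the function interface are additions for illustration.

```python
import numpy as np

def pixel_membership(f_u, target_centers, background_samples):
    """Classify one 5-D colour-position feature f_u against the target
    cluster centers and the surrounding background samples."""
    d_t_all = np.linalg.norm(target_centers - f_u, axis=1)
    d_b_all = np.linalg.norm(background_samples - f_u, axis=1)
    i = int(np.argmin(d_t_all))                       # nearest target cluster, eq. (13)
    j = int(np.argmin(d_b_all))                       # nearest background sample
    d_t, d_b = d_t_all[i], d_b_all[j]
    d_tb = np.linalg.norm(target_centers[i] - background_samples[j])   # eq. (14)
    r = d_tb / (d_t + d_b + 1e-12)                    # reliability, eq. (15)
    mu = r * d_b / (d_t + d_b + 1e-12)                # membership to cluster i, eq. (16)
    return i, r, mu
```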
4 Experiment and Discussion

4.1 Evaluating the Effectiveness of RK-Means
To confirm the effectiveness of RK-means clustering (RKM), we compare it with the existing K-means (EKM) and fuzzy K-means (FKM) clustering. All the algorithms are given the same conditions (e.g., initial cluster centers and number of iterations). The upper row of Fig. 6 shows a comparative experiment where four clusters exist but the initial number of clusters is three. Here, the sky-blue markers denote the initial cluster centers, the yellow dots show the clustering result at each iteration, and the final results are pointed to by arrows. EKM only succeeds in setting one cluster center correctly, and FKM also fails to give a good clustering result. Our RKM successfully classifies three clusters according to the initial cluster centers, because the data vectors of the fourth, redundant cluster are given extremely low reliability by RKM: they cannot be classified reliably and should not be assigned to any of the given clusters. The lower row of Fig. 6 shows the opposite case, in which the real number of clusters is smaller than the assumed one. The hollow markers indicate the initial cluster centers, and the solid sky-blue markers show the resulting cluster centers. The EKM and FKM algorithms brutally divide the data vectors of group 2 into two separate clusters, whereas RKM successfully classifies the three clusters by removing one redundant assumed cluster in the way described in Section 2.2. In order to confirm the convergence of RKM, we applied it to the IRIS dataset¹, as shown in Fig. 7(A).
http://www.ics.uci.edu/˜mlearn/databases/iris/iris.data
[Fig. 6 shows two rows of scatter plots comparing EKM, FKM, and RKM: the upper row for the case where the real number of clusters is larger than the assumed number, and the lower row (groups 1–3) for the case where the real number of clusters is smaller than the assumed number.]
Fig. 6. Simulation experiment under different conditions
(A) convergence  (B) input  (C) iteration 1  (D) iteration 2  (E) iteration 3
Fig. 7. (A): Comparing the convergence of RKM and EKM. (B) ∼ (E): Image segmentation with redundant cluster deletion.
In Fig. 7(A), the pink curve shows the convergence of RKM and the blue curve that of EKM. We confirm that the convergence of RKM is as good as that of EKM, and its convergence speed is no slower. Fig. 7(B)–(E) show an image segmentation experiment with RKM and redundant cluster deletion. Panel (B) is extracted from the lower row (frame 285) of Fig. 8, and the yellow crosses indicate the target cluster centers. Although three initial cluster centers are given, RKM correctly merges them into one cluster according to the redundant cluster deletion described in Section 2.2.

4.2 Object Tracking with the RK-Means Tracker
Because Hua et al. [12] have already compared² the K-means tracker with several well-known tracking algorithms, here we only compare our RK-means tracker with the K-means tracker.
http://vrl.sys.wakayama-u.ac.jp/VRL/studyresult/study result 3 en.html
[Fig. 8 shows frames 220, 258, 285, and 305: results of the K-means tracker (upper row) and of the RK-means tracker (lower row).]
Fig. 8. Results of comparative experiment with K-means tracker[12] and RK-means tracker under complex scenes
[Fig. 9 shows frames 055, 075, 107, 117, 125, 138, 142, and 168 of the PETS2004 sequence.]
Fig. 9. Performance of RK-means tracker on PETS2004 where two people are fighting
Fig. 8 shows a hand-tracking sequence. The K-means tracker fails from frame 285 onward: since the color of some surrounding background parts (e.g., a corrugated carton) is somewhat similar to the skin color, the K-means tracker mistakenly takes them as target pixels. This causes an incorrect update of the search area and leads to the tracking failure. With our RK-means tracker, in frame 285 the influence of such background parts is effectively suppressed through the reliability evaluation, so the tracker detects the target area (the hand) and updates the search area correctly. Fig. 9 shows the performance of the RK-means tracker on the public PETS2004 database. In this sequence, although two persons are fighting and get entangled
in each other, the RK-means tracker successfully treats the occluding person as noise data by assigning him low reliability and ignoring those data. All experiments were performed on a desktop PC with a 3.06 GHz Intel Xeon CPU, and the image size was 640 × 480 pixels. With the target size varying from 100 × 100 to 200 × 200 pixels and the number of target colors from 1 to 6, the processing time of our algorithm was about 9–15 ms/frame.
5 Conclusion
In this paper, we have proposed a robust pixel-wise object tracking algorithm based on a new reliability-based K-means clustering algorithm (called the RK-means tracker). By considering the triangular relationship among an unknown data vector and its two nearest cluster centers, each data vector is assigned a reliability value, and noise data are ignored by being given low reliability. When the number of clusters is incorrectly assumed, a new group merging method that checks the variance and average reliability of each cluster deletes the redundant clusters. Through extensive experiments, we have confirmed that the proposed RK-means tracker works more robustly than other algorithms when the background is cluttered. Besides object tracking, we have also confirmed that the proposed RK-means clustering algorithm can be applied to image segmentation.

Acknowledgement. This research is partially supported by the Ministry of Education, Culture, Sports, Science and Technology, Grant-in-Aid for Scientific Research (A)(2) 16200014, (C) 18500131, and (C) 19500150.
References
1. Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithm. Plenum Press, New York (1981)
2. Krishnapuram, R., Keller, J.M.: A Possibilistic Approach to Clustering. IEEE Trans. Fuz. Sys. 1(2), 98–110 (1993)
3. Chintalapudi, K.K., Kam, M.: A noise-resistant fuzzy C means algorithm for clustering. FUZZ-IEEE 2, 1458–1463 (1998)
4. Jolion, J.M., et al.: Robust Clustering with Applications in Computer Vision. PAMI 13(8) (1991)
5. Zass, R., Shashua, A.: A Unifying Approach to Hard and Probabilistic Clustering. ICCV 1, 294–301 (2005)
6. Hartigan, J., Wong, M.: Algorithm AS136: A K-means clustering algorithm. Applied Statistics 28, 100–108 (1979)
7. Peterfreund, N.: Robust tracking of position and velocity with Kalman snakes. PAMI 22, 564–569 (2000)
8. Isard, M., Blake, A.: CONDENSATION – Conditional density propagation for visual tracking. IJCV 29(1), 5–28 (1998)
9. Toyama, K., Blake, A.: Probabilistic Tracking in a Metric Space. ICCV 2, 50–57 (2001)
10. Jepson, A.D., et al.: Robust Online Appearance Models for Visual Tracking. PAMI 25(10) (2003)
11. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-Based Object Tracking. PAMI 25(5), 564–577 (2003)
12. Hua, C., Wu, H., Wada, T., Chen, Q.: K-means Tracking with Variable Ellipse Model. IPSJ Transactions on CVIM 46(Sig 15 CVIM12), 59–68 (2005)
13. Collins, R., Liu, Y.: On-line Selection of Discriminative Tracking Features. ICCV 2, 346–352 (2003)
14. Nguyen, H.T., Smeulders, A.: Tracking aspects of the foreground against the background. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3022, pp. 446–456. Springer, Heidelberg (2004)
15. Rosenfeld, A., Kak, A.C.: Digital Picture Processing, Computer Science and Applied Mathematics. Academic Press, New York (1976)
16. Stauffer, C., Grimson, W.E.L.: Adaptive Background Mixture Models for Real-time Tracking. CVPR, pp. 246–252 (1999)
17. Barron, J., Fleet, D., Beauchemin, S.: Performance of optical flow techniques. IJCV 2(1), 42–77 (1994)
18. Zhang, C., Rui, Y.: Robust Visual Tracking via Pixels Classification and Integration. ICPR, 37–42 (2006)
19. Gräßl, C., et al.: Illumination Insensitive Template Matching with Hyperplanes. In: Michaelis, B., Krell, G. (eds.) DAGM 2003. LNCS, vol. 2781, pp. 273–280. Springer, Heidelberg (2003)
Discriminative Mean Shift Tracking with Auxiliary Particles

Junqiu Wang and Yasushi Yagi

The Institute of Scientific and Industrial Research, Osaka University, 8-1 Mihogaoka, Ibaraki, Osaka, Japan
[email protected]
Abstract. We present a new approach towards efficient and robust tracking by incorporating the efficiency of the mean shift algorithm with the robustness of the particle filtering. The mean shift tracking algorithm is robust and effective when the representation of a target is sufficiently discriminative, the target does not jump beyond the bandwidth, and no serious distractions exist. In case of sudden motion, the particle filtering outperforms the mean shift algorithm at the expense of using a large particle set. In our approach, the mean shift algorithm is used as long as it provides reasonable performance. Auxiliary particles are introduced to conquer the distraction and sudden motion problems when such threats are detected. Moreover, discriminative features are selected according to the separation of the foreground and background distributions. We demonstrate the performance of our approach by comparing it with other trackers on challenging image sequences.
1 Introduction
Tracking objects through image sequences is one of the fundamental problems in computer vision. Among the algorithms developed in the pursuit of robust and efficient tracking, two major successful approaches are the mean shift algorithm [1][5], which focuses on Target Representation and Localization, and particle filtering [7][9], which is developed based on Filtering and Data Association. Both of them have their respective advantages and drawbacks. This paper aims at developing a robust and efficient tracker that incorporates the efficiency of the mean shift algorithm with the multi-hypothesis characteristics of the particle filtering. The mean shift algorithm is a robust non-parametric probability density estimation method. Comaniciu et al. [5] define a spatially-smooth similarity function and reduce the state estimation problem to a search of the basin of attraction of this function. Since the similarity function is smooth, a gradient optimization method leading to fast localization is applied. Despite its efficiency and robustness, the mean shift algorithm is not good at coping with quick motions. The distractions in the neighborhood of the target are threats to successful tracking. In addition, the basic mean-shift algorithm assumes that the target representation is sufficiently discriminative against the background. This assumption is
not always true, especially when tracking is carried out against a dynamic background, such as surveillance with a moving camera. We introduce particles to deal with the first two problems because they are able to provide multiple hypotheses. Adaptive tracking is one possible solution to alleviate the third problem [3]; we update the target model according to the separation of the foreground and background distributions. Particle filtering stands out among filtering-based techniques due to its ability to represent multi-modal probability distributions using a weighted sample set S = {(s^(n), π^(n)) | n = 1, ..., N} that keeps multiple hypotheses of the states of targets [7][9]. When tracking is performed in a cluttered environment where multiple objects similar to the target can be present, particle filters are able to find the target by validation and association of the measurements. However, since the number of particles can be large, a potential drawback of particle filtering is its high computational cost. Moreover, the particle set can degenerate and diffuse in a long sequence: after tracking over many frames, only a few particles with high weights remain useful. Accurate models of shape and motion learned from examples have been used to deal with these problems [9]; one drawback of this approach, though, is that the construction of explicit models is sometimes hardly achievable because of viewpoint changes. Isard and Blake [10] proposed the ICONDENSATION algorithm, in which high- and low-level information are combined using importance sampling. However, it is complicated to model the dynamic characteristics accurately in an uncontrolled environment. Sullivan and Rittscher [14] noticed the advantages of the mean shift and particle filter algorithms and proposed particle filter-based tracking guided by a deterministic search based on an SSD-type cost function; the size of the particle set is adjusted according to the difficulty of the problem at hand, which is indicated by motion. Deterministic search using mean shift has also been applied in a hand tracking algorithm by embedding the mean-shift optimization into particle filtering to move particles to local peaks of the likelihood, which improves the sampling efficiency [13]. Although mean shift and particle filters have been combined in various ways in previous work, none of these methods deal with occlusions and distractions explicitly. Cai et al. [2] embed the mean-shift algorithm into the particle filter framework to stabilize the trajectories of the targets; however, it is necessary to learn classifiers for the targets in their work, which is not always possible in tracking applications. The mean shift tracking algorithm outperforms the particle filter when the representation of a target is discriminative enough, the target does not jump beyond the bandwidth, and no serious distractions exist. Although these conditions may seem too strict, we observe that they are met in a large percentage of real image sequences captured for surveillance or other applications. In this work, the mean-shift algorithm is adopted as the main tracker as long as these conditions are met; in other words, only one particle, driven by the mean-shift search, is used to estimate the state of the target. Auxiliary particles are introduced when sudden motion or distractions are detected. We compute log likelihood ratios of class conditional sample densities of the target
and its background. These ratios are applied in feature selection and distraction detection, and the target model is updated according to the feature selection results. Sudden motions are estimated using efficient motion filters [16]. The proposed method offers several advantages. It achieves high efficiency when the target moves smoothly; when sudden motions or distractions are detected, auxiliary particles are initialized to support the mean shift tracker. The help from particle filtering partially solves the problems resulting from sudden motions or distractions. The remainder of the paper is organized as follows. Section 2 gives a brief introduction of the target model. Section 3 describes the feature selection and model updating methods. Section 4 introduces motion estimation and distraction detection. Section 5 discusses the use of auxiliary particles. The performance of the proposed method is evaluated in Section 6 and conclusions are given in Section 7.
2 Target Modeling
The target model should be as discriminative as possible to distinguish a complex target from the background. We use an adaptive target model represented by the best features selected from shape-texture and color cues [17]. Color histograms are computed in three color spaces: RGB, HSV and normalized rg. There are 7 color features (R, G, B, H, S, r, g) in the candidate feature set, and each color channel is quantized into 12 bins. A color histogram is calculated using a weighting scheme in which the Epanechnikov kernel is applied [5]. A shape-texture cue is described by an orientation histogram, which is computed from image derivatives. The orientations are also quantized into 12 bins; each orientation is weighted and assigned to one of two adjacent bins according to its distance from the bin centers. The similarity between the model and its candidates is measured by the Bhattacharyya distance [5].
3 Feature Selection and Model Updating

3.1 Log-Likelihood Ratio Images
To determine the descriptive ability of different features, we compute log-likelihood ratio images [3][15] based on the histograms of the target and its background. Log-likelihood ratio images are also employed in detecting possible threats to the target. The likelihood ratio produces a function that maps feature values associated with the target to positive values and those associated with the background to negative values. The frequency of the pixels that fall into a histogram bin, p^{(bin)}, is calculated as ζ_f^{(bin)} = p_f^{(bin)} / n_f and ζ_b^{(bin)} = p_b^{(bin)} / n_b, where n_f is the number of pixels in the target region and n_b the number of pixels in the background.
The log-likelihood ratio of a feature value is given by

L^{(bin)} = max( −1, min( 1, log [ max(ζ_f^{(bin)}, δ_L) / max(ζ_b^{(bin)}, δ_L) ] ) ),   (1)
Feature Selection
Given md features for tracking, the purpose of the feature selection module is to find the best subset feature of size mm , and mm < md . Feature selection can help minimize the tracking error and maximize the descriptive ability of the feature set. We find the features with the largest corresponding variances. Following the method in [3], based on the equality var(x) = E[x2 ] − (E[x])2 , the variance of Equation(1) is computed as var(L; p) = E[(Lbin )2 ] − (E[Lbin ])2 . The variance ratio of the likelihood function is defined as [3]: VR = 3.3
var(L; (pf + pb )/2) var(B ∪ F ) = . var(F ) + var(B) var(L; pf ) + var(L; pb )
(2)
Updating the Target Model
It is necessary to update the target model due to the fact that the appearance of a target tends to change during a tracking process. Unfortunately, updating the target model adaptively may lead to tracking drift because of the imperfect classification of the target and background. To reliably update the target model, we propose an approach based on similarities between the initial and current appearance of the target. Similarity θ is measured by a simple correlation based template matching performed between the initial and current frames. The updating is done according to the similarity θ: Hm = (1 − θ)Hi + θHc ,
(3)
where the Hi is the histogram computed on the initial target; the Hc the histogram of the target current appearance, the Hm the updated histogram of the target. Template matching is performed between the initial model and the current candidates. Since we do not use the search window that is necessary in template matching-based tracking, the matching process is efficient and brings little computational cost to our algorithm. In unstable tracking period (When sudden motions or distractions are detected), the classification of the target and background is not reliable. It is difficult to reliably update the target model at these moments. Thus the model is updated when the tracker is in stable states.
580
4 4.1
J. Wang and Y. Yagi
Motion Estimation and Distraction Detection Motion Estimation
The number of particles is adjusted according to motion information of the target. Discriminative mean shift tracking is sufficient to determine the position of a target when it moves smoothly and slowly. More particles are necessary to estimate the correct position of the target when it moves quickly. We use the efficient motion filters that have been applied in pedestrian detection [16]. We estimate the motion of foreground and background region simultaneously and partially solve the problem brought by dynamic background. There are five motion filters computed on 5 image pairs: 1 τi |It (x) − It+1 (x)|, (4) Δi = nRg x∈Rg where It and It+1 are consequential images, nRg is the number of pixels in a specific region, and τi ∈ {, ←, →, ↑, ↓} which are image shift operators denoting no shift, shift left, shift right, shift up, and shift down for one pixel respectively. The motion filters are computed on the target and its background region respectively. The results of the last four motion filters (Δi , i ∈ {1, 2, 3, 4}) are compared with the absolute differences Δ0 : Mif = |Δfi − Δf0 |, Mib = |Δbi − Δb0 |
(5)
Mi represent the likelihood that a particular region is moving in a given direction. We compute the maximum motion likelihood to determine the number of particles for the tracking: Mmax = max(|Mif − Mib |)i=1,2,3,4. .
(6)
Given the high efficiency of the estimation method, it is performed in each frame before tracking is carried out. 4.2
Distraction Detection
Distractions in the neighborhood of the target have similar appearance to the target. They are possible threats to successful tracking. When the similarity between the target model and its candidate is less than a certain value (ρT ), distraction detection is performed using spatial reasoning [3] to find peaks besides the target in the log-likelihood ratio images. Note that the log-likelihood ratio images here are back-projection results of the conditional distributions based on selected features. Assuming that the region RT actually contains the target and the region RD is a possible distraction, we want to find the region that have maximum strength of threat to the target. A certain region where the sum of its log-likelihood ratios
Discriminative Mean Shift Tracking with Auxiliary Particles
581
has minimum difference with that in the target region is the distraction we want to find: min(| L(bin ) − L(bin ) |) (7) RD
RX
where RX is a region in the neighborhood of the target. It is too expensive to compare the sums of log-likelihood ratio in all the possible regions with that in the target region. The searching process can be accelerated using a Gaussian kernel [3]. The value at each pixel in the convolved log-likelihood ratio image with a Gaussian kernel is a weighted sum of the log-likelihood ratios in a circular region surrounding it, normalized by the total weight pixels in that region. First, the log-likelihood image is convolved using a Gaussian kernel. The peak DT which represents the target region can be found in the convolved image. Second, the target region in the log-likelihood image is removed and the result is convolved using a Gaussian kernel again. The most dangerous distraction is detected by searching for the peak DD in the convolved image. The difference between the two peaks represents the threat strength of the distraction: ρ = |DD − DT |, (8) The distraction may attract the mean shift tracker to the incorrect position if it is strong enough. We initialize a auxiliary particle set to track the distraction region when ρ is less than the given threshold ρT .
5
Auxiliary Particle Filtering
Particle filtering implements recursive Bayesian filter by Monte Carlo simulations. In the implementation, the posterior density is approximated by a weighted (n) (n) (n) (n) particle set {st , πt }n=1,···,J , where πt = p(zt |xt = st ). We initialize auxiliary particles when sudden motion or distraction are detected. Different strategies are adopted for the generation of particles under these two circumstances. 5.1
Particle Filtering for Sudden Motion
When a sudden motion is detected Np particles are generated using a stochastic motion model. The number of particles is determined from to the motion computed: (9) JS = max(min(J0 Mmax , Jmax ), Jmin ), where J0 is the coefficient; Jmin is the smallest number of particles and Jmax the largest number of particles to maintain reasonable particles. The motion model is a normal density centered on the previous pose with a constant shift vector: (10) xjt = xt−1 + xc + ujt ; where ujt is a standard normal random vector and xc a constant shift vector from the previous position according to the motion estimation results (it is set to one pixel to the motion direction).
582
5.2
J. Wang and Y. Yagi
Particle Filtering for Distraction
After distractions are detected, a joint particle filter with an MRF motion model is initialized [12]. The motion interaction between the target and the distraction ψ(Xit , Xjt ) is described by the Gibbs distribution ψ(Xit , Xjt ) ∝ exp(−g(Xit , Xjt ), where g(Xit , Xjt ) is a penalty function approximated by the distance between the target and the distraction. The posterior on the joint state Xt is approximated as a set of J weighted samples: (J) (J) ψ(Xit , Xjt ) πt−1 P (Xit |Xi(t−1) ), P (Xt |Z t ) ≈ kP (Zt |Xt ) ij∈E
J
i
where the samples are drawn from the joint proposal distribution; k is a normalizing constant that does not depend on the state variables; E is edges in the MRF model; the samples are weighted according to the factored likelihood function: 2 (s) (s) (s) (s) P (Zit |Xit ) ψ(Xit , Xjt ). πt = i
ij∈E
where Zit are measurement nodes. 5.3
Algorithm Summary
In summary, the detailed steps of the proposed tracking algorithm are: Algorithm: Discriminative Mean-Shift Tracking with Auxiliary Particles Input:
t video frames I1 , . . . , It ; Initial target region given in the first frame I1 target regions in I2 , . . . , It
Output: Initialization in I1 1. Save the initial target appearance for model updating; 2. Compute the similarity (S1 ) between the target model and the candidate. For each new frame Ij : Estimate the motion (Mj ) on the consequential frames; IF Mj > MT THEN initialize particles according to the motion estimated. ELSE IF the similarity is less than a given threshold (Sj−1 < S T ) THEN detect distractions in the neighborhood of the target If Distraction is detected (ρ < ρT ) Initialize MRF particles; Else Update the target model. End If End If End If Estimate the position of the target. Compute the similarity Sj for next frame. End For
Discriminative Mean Shift Tracking with Auxiliary Particles
2 Particle filtering
90
90
80 70 60 50
50 30
20
20
10
10
0
0
3
4
5
25
60
30
Tracking approaches
20 15 10 5
1
2
4
3
0
5
Tracking approaches
90
30 25
14 12 10
20
8
15
6
10
4
5
2
0
0
3
4
5
Dataset tracked(%)
100
16
Dataset tracked(%)
18
35
Tracking approaches
(d)
4
3
5
(c)
40
2
2
Tracking approaches
45
1
1
(b)
(a)
Dataset tracked(%)
5 The proposed tracker
Peak difference
70
40
2
4
80
40
1
3 Variance ratio
Dataset tracked(%)
100
100
Dataset tracked(%)
Dataset tracked(%)
1 Basic mean-shift
583
80 70 60 50 40 30 20 10
1
2
4
3
5
Tracking approaches
(e)
0
1
2
4
3
5
Tracking approaches
(f)
Fig. 1. Tracking results using different tracking approaches. Tests are performed on (a) EgTest01; (b) EgTest02; (c) EgTest03; (d) EgTest04; (e) EgTest05; and (f) Redteam.
6 Experimental Results
To illustrate the performance of the proposed tracker, we have implemented and tested it on a wide variety of challenging image sequences in different environments and applications. Due to space limitations, we only show the results on the public CMU datasets with ground truth [4]. The datasets include 6 sequences: EgTest01, EgTest02, EgTest03, EgTest04, EgTest05 and Redteam. Several factors make the tracking challenging: different viewpoints (these sequences are captured by moving cameras); similar objects nearby; sudden motions; illumination changes; reflectance variations of the targets; and partial occlusions. The tracking results are compared with those of the basic mean shift and particle filtering trackers. Since the proposed tracker updates the target model based on feature selection, it is reasonable to compare it with adaptive trackers as well; the variance ratio and peak difference [3] trackers are included for this purpose. In the particle filtering tracker, the target model is represented by 12 × 12 × 12-bin RGB histograms and there are 100 samples in the sample set. RGB histograms are also adopted in the basic mean shift algorithm, with the Bhattacharyya distance between the model and its candidate as the similarity measure. The most important criterion for the comparison is the percentage of the dataset tracked, which is the number of tracked frames divided by the total number of frames; the track is considered lost if the bounding box does not overlap the ground truth. The tracking success rates achieved by each tracker are compared and the results are shown in Fig. 1. The proposed tracker gives the
(a) f=1  (b) f=516  (c) f=1300  (d) Target appearance changes
Fig. 2. Tracking results of the EgTest02 sequence
best results (or ties with another tracker) in all the test sequences. These comparisons demonstrate that the proposed tracking algorithm performs better than the other trackers. In Fig. 2, the tracking results for EgTest02 are shown: despite the distractions and sudden motions in the sequence, the proposed tracker completes the tracking successfully, and Fig. 2(d) illustrates how the appearance of the target changes over time. There are sudden motions and image blur in EgTest04, which lead to the failure of the basic mean-shift tracker; the proposed tracker detects this motion successfully and initializes auxiliary particles, which help it overcome the problem caused by the sudden motion. The running time of the proposed tracker depends on the difficulty of the image sequence being tracked: if sudden motions or distractions happen frequently, its efficiency is low; otherwise it is highly efficient because the mean shift algorithm is used in most cases. The current implementation ran at 16 frames per second (average) on an Intel Centrino 1.6 GHz laptop with 1 GB RAM when applied to images of size 640 × 480. The average running time includes the time to run the main tracking algorithm, to read image files from a USB disk, and to display color images with the object bounding box overlaid.
7 Conclusion and Future Work
We describe a discriminative mean shift tracking algorithm with auxiliary particles in the pursuit of robust and efficient tracking. The arrangement of the particle filtering and the mean shift algorithm is based on the difficulty of the tracking which is indicated by sudden motions and distractions. The model updating strategy in our tracker can effectively deal with appearance changes of targets. The proposed approach provides better performance than those of the mean shift, particle filtering and other trackers.
We are going to investigate how to extend the proposed method to multi-target tracking, in which multiple mean shift searches are necessary.
References
1. Bradski, G.R.: Computer Vision Face Tracking as a Component of a Perceptual User Interface. In: Proc. of the IEEE Workshop on Applications of Computer Vision, pp. 214–219 (1998)
2. Cai, Y., de Freitas, N., Little, J.: Robust Visual Tracking for Multiple Targets. In: Proc. of European Conf. on Computer Vision, pp. 893–908 (2006)
3. Collins, R.T., Liu, Y.: On-line Selection of Discriminative Tracking Features. IEEE Trans. on Pattern Analysis and Machine Intelligence 27(10), 1631–1643 (2005)
4. Collins, R.T., Zhou, X., Teh, S.K.: An Open Source Tracking Testbed and Evaluation Web Site. In: PETS 2005. IEEE Int'l Workshop on Performance Evaluation of Tracking and Surveillance (January 2005)
5. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based Object Tracking. IEEE Trans. Pattern Analysis Machine Intelligence 25(5), 564–577 (2003)
6. Cover, T.M., Thomas, J.A.: Elements of Information Theory. John Wiley and Sons Press, Chichester (1991)
7. Gordon, N., Salmond, D., Smith, A.: Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proc. 140(2), 107–113 (1993)
8. Gevers, T., Smeulders, A.W.M.: Color based object recognition. Pattern Recognition 32(3), 453–464 (1999)
9. Isard, M., Blake, A.: Condensation – conditional density propagation for tracking. Int'l Journal of Computer Vision 29(1), 5–28 (1998)
10. Isard, M., Blake, A.: ICONDENSATION: unifying low-level and high-level tracking in a stochastic framework. In: Proc. of 5th European Conf. on Computer Vision, vol. I, pp. 893–908 (1998)
11. Jähne, B., Scharr, H., Körkel, S.: Handbook of Computer Vision and Applications. In: Jähne, B., Haußecker, H., Geißler, P. (eds.), vol. 2, pp. 125–151. Academic Press, London (1999)
12. Khan, Z., Balch, T., Dellaert, F.: An MCMC-based particle filter for tracking multiple interacting targets. In: Proc. of European Conf. on Computer Vision, vol. I, pp. 893–908 (2004)
13. Shan, C., Tan, T., Wei, Y.: Real-time hand tracking using a mean shift embedded particle filter. Pattern Recognition 40(7), 1958–1970 (2007)
14. Sullivan, J., Rittscher, J.: Guiding Random Particles by Deterministic Search. In: Proc. of Eighth IEEE Int'l Conf. on Computer Vision, vol. I, pp. 323–330 (2001)
15. Swain, M., Ballard, D.: Color Indexing. Int'l Journal of Computer Vision 7, 11–32 (1991)
16. Viola, P., Jones, M.J., Snow, D.: Detecting pedestrians using patterns of motion and appearance. Int'l Journal of Computer Vision 63(2), 153–161 (2005)
17. Wang, J., Yagi, Y.: Integrating Shape and Color Features for Adaptive Real-time Object Tracking. In: IEEE Int'l Conf. on Robotics and Biomimetics (2006)
Efficient Search in Document Image Collections
Anand Kumar1, C.V. Jawahar1, and R. Manmatha2
1 Center for Visual Information Technology, International Institute of Information Technology, Hyderabad, India - 500032
[email protected], [email protected]
2 Department of Computer Science, University of Massachusetts, Amherst, MA 01003, USA
[email protected]
Abstract. This paper presents an efficient indexing and retrieval scheme for searching in document image databases. In many non-European languages, optical character recognizers are not very accurate. Word spotting - word image matching - may instead be used to retrieve word images in response to a word image query. The approaches used for word spotting so far, dynamic time warping and/or nearest neighbor search, tend to be slow. Here, indexing is done using locality sensitive hashing (LSH) - a technique which computes multiple hashes - over word image features computed at the word level. Efficiency and scalability are achieved by content-sensitive hashing implemented through approximate nearest neighbor computation. We demonstrate that the technique achieves high precision and recall (in the 90% range) on a large image corpus consisting of seven books in the Telugu language by Kalidasa (a well-known Indian poet of antiquity). The accuracy is comparable to that of dynamic time warping and nearest neighbor search, while the speed is orders of magnitude better - 20,000 word images can be searched in milliseconds.
1
Introduction
Many document image collections are now being scanned and made available over the Internet or in digital libraries. Effective access to such information sources is limited by the lack of efficient retrieval schemes. The use of text search methods requires efficient and robust optical character recognizers (OCR), which are presently unavailable for Indian languages [1]. Another possibility is to search in the image domain using word spotting [2,3,4]. Direct matching of images is inefficient due to the complexity of matching and thus impractical for large databases. We solve this problem by directly hashing word image representations. We present an efficient mechanism for indexing and retrieval in large document image collections. First, words are automatically segmented. Then features are computed at the word level and indexed. Word retrieval is done very efficiently by using an approximate nearest neighbor retrieval technique called locality sensitive hashing (LSH). Word images are hashed into multiple tables using features computed at the word level. Content-sensitive hash functions are used to hash words
such that the probability of grouping similar words in the same index of the hash table is high. The sub-linear time content-sensitive hashing scheme makes the search very fast without degrading the accuracy. Experiments on a collection of books in Telugu by Kalidasa, the classical Indian poet of antiquity, demonstrate that 20,000 word images may be searched in a few milliseconds. The approach thus makes searching large document image collections practical. There are essentially three classes of techniques to search document image collections. The first is based on using a recognizer to convert an image to text and then searching the results using a text search engine. An example is the gHMM approach of Chan et al. [5], suggested for printed and handwritten Arabic documents. It uses gHMMs with a bi-gram letter transition model, and KPCA/LDA for letter discrimination. In this approach segmentation and recognition go hand in hand. The words are modeled at the letter level, where the likelihood of a word given a segment is used for discriminating words. The Byblos system [6] uses a similar approach to recognize documents, where a line is first segmented out and then divided into image strips. Each line is then recognized using an HMM and a bi-gram letter transition model. The second class, used by Rath et al. [7], involves the automatic annotation of word images with a lexicon and probabilities using a relevance-based language model. Here, words are segmented out and each word image is annotated using a statistical model with the entire lexicon and probabilities. A language model retrieval approach is then used to search the documents. The technique was successfully used to build a 1000-page demonstration for George Washington's handwritten manuscripts. The third approach, proposed by Rath and Manmatha [2,3], involves what is called word spotting, where word images are matched with each other and then clustered. Each cluster is then annotated by a person. Alternatively, Jawahar et al. [4] showed that in the case of printed books one can synthesize the query image from a textual query to make the system more usable. Word spotting has been tried for many different kinds of documents, both handwritten and printed. Rath and Manmatha [2] used dynamic time warping (DTW) to compute image similarities for handwriting. The word similarities are then used for clustering with K-means or agglomerative clustering techniques. This approach was adopted in Jawahar et al. [4] for printed Indian language document images. To simplify the process of querying, a word image is generated for each query and the cluster corresponding to this word is identified. In such methods, efficiency is achieved by significant offline computation. Ataer and Duygulu [8] tried word spotting for handwritten Ottoman documents, where they use successive pruning stages to eliminate irrelevant words. Gatos et al. [9] used word spotting for old Greek typewritten manuscripts for which OCRs did not work. One advantage of word spotting over traditional OCR methods is that it exploits the fact that within corpora such as books the word images are likely to be much more similar, which traditional OCRs do not do. In addition, techniques that work at the symbol level of word images, like [5], are very sensitive to segmentation errors. Segmentation of Indian language document images at the symbol level is very difficult due to the complexity of the scripts.
Many of these techniques (for example DTW) are computationally expensive and do not scale very well. In spite of this, Sankar et al. [10] successfully indexed 500 books in Indian languages using this approach by doing virtually all the computation off-line. Avoiding DTW, Rath et al. [3] demonstrated the use of direct clustering of word image features on historical handwritten manuscripts. However, clustering is itself an expensive operation. Image matching often involves offline nearest neighbor computations and storage for efficient access. These nearest neighbor techniques are expensive in high dimensions even when computed off-line. Indyk and Motwani [11] proposed an approximate nearest neighbor search technique called locality sensitive hashing (LSH) which is much more efficient. LSH has been applied to a number of problems, including some in computer vision. For example, LSH is used to efficiently index high dimensional pose examples by Shakhnarovich et al. [12]. Matei et al. [13] use LSH for 3D object indexing. LSH is different from the geometric hashing approaches used in model-based recognition of 3-D objects in occluded scenes from 2-D gray scale images [14] and also for finding documents from a set of camera-based document images [15].
Fig. 1. Sample document images from Kalidasa’s books in Telugu
Our work mainly aims at addressing some of the issues involved in effective and efficient retrieval from document images, using suitable representations of the word images. We demonstrate efficient retrieval through content-sensitive hashing on a collection of Kalidasa's writings. Sample pages from the Kalidasa collection are shown in Figure 1.
2
Content Sensitive Hashing
In the proposed retrieval technique, the index is built by hashing word level features of document images. The features are hashed using content sensitive hash functions, such that the probability of finding words with similar content in the same bucket is high. The same content sensitive hash functions are used to
Fig. 2. Hashing Method: Word image hashing for efficient search. Showing offline preprocessing and on-line query processing stages.
query similar words during the search. The major challenges in efficient indexing and retrieval are the preprocessing and word matching times. We overcome these challenges with the use of hashing. A conceptual block diagram of the technique is shown in Figure 2. Books are scanned and processed to index the document pages. The textual word query is first converted to an image by rendering, features are extracted from this image, and a search is then carried out to retrieve the relevant word images. To facilitate searching, scanned document images are preprocessed and segmented at the word level. A set of features is extracted to represent the word images to be indexed. Content-sensitive hash functions are used to hash the features such that similar word images are grouped in the same index of the hash table.
2.1 Word Image Representation
We employ a combination of scalar, profile, structural and transform domain feature extraction methods as used in [2,3,4]. Scalar features include the number of ascenders, descenders and the aspect ratio. The profile and structural features include projection profiles, background-to-ink transitions, and upper and lower word profiles. A fixed-length description of the features is obtained by computing the lower-order coefficients of a DFT (Discrete Fourier Transform); discarding the noisy high-order coefficients makes the representation more robust. We use 84 Fourier coefficients of the segmented profiles and ink transition features to represent the word images. Finding similar word images is now equivalent to the nearest neighbor search (NNS) problem: given a set of n points P = {p_1, ..., p_n} in some metric space X, we preprocess P so as to efficiently answer queries, which require finding the
point in P closest to a query point q ∈ X. Traditional data structures for similarity search suffer from the curse of dimensionality. Locality sensitive hashing (LSH) is a state-of-the-art technique introduced by Indyk and Motwani [11] to alleviate the problem of high-dimensional similarity search in large databases. The main idea in LSH is to hash points into bins based on a probability of collision. Thus, points that are far apart in the parameter space will have a high probability of landing in different bins, while close points will go into the same bucket. It has been shown that LSH outperforms tree-based structures such as the Sphere/Rectangle-tree (SR-tree) by at least an order of magnitude.
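As a rough illustration of the word-image representation of Sect. 2.1, the following sketch computes profile features and keeps their lower-order DFT coefficients. It is not the authors' implementation: the binarization convention, the choice of profiles and the number of coefficients kept per profile (21 × 4 = 84, to match the 84 coefficients mentioned above) are assumptions, and the scalar features (ascenders, descenders, aspect ratio) are omitted.

```python
import numpy as np

def word_descriptor(word_img, n_coeffs=21):
    """Profile features of a binarized word image, compressed with a DFT.

    word_img : 2-D array with background = 0 and ink = 1
    Assumes the word is at least about 2*n_coeffs pixels wide.
    """
    ink = word_img > 0
    h, _ = ink.shape

    profiles = [
        ink.sum(axis=0),                                                 # vertical projection profile
        np.where(ink.any(axis=0), ink.argmax(axis=0), h),                # upper word profile
        np.where(ink.any(axis=0), h - 1 - ink[::-1].argmax(axis=0), -1),  # lower word profile
        np.abs(np.diff(ink.astype(int), axis=0)).sum(axis=0),            # background-to-ink transitions
    ]

    feats = []
    for p in profiles:
        spectrum = np.fft.rfft(p)
        feats.append(np.abs(spectrum[:n_coeffs]))   # keep only lower-order, less noisy coefficients
    return np.concatenate(feats)                    # fixed-length word representation
```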
2.2 Hashing Technique
Let P = {x_1, x_2, ..., x_n} be the words in the document image collection. A word is represented by a feature vector x = {f_1, ..., f_D}, i.e., as a point x ∈ R^D in feature space, where f_j is computed by extracting features that describe the content of the word images. The extracted features satisfy the following assumptions.
1. A distance function d is given which measures the content-level similarity of the words, and a radius R in the feature space is given such that x_1, x_2 are considered similar iff d(x_1, x_2) < R.
2. For a randomly chosen word image, there exists with high probability a word image with similar feature values in the collection.
3. There are no significant variations in the feature vectors of words with similar content; in other words, the feature extraction process is unbiased.
The distance function and the similarity threshold are dependent on the particular task, and often reflect perceptual similarities between the words. The last assumption implies that there are no significant sources of variation in the word features for words that are similar in content. The content similarity search is done by efficient nearest neighbor searching with a content-sensitive hashing algorithm. The content-sensitive hashing is achieved by hashing words using a number of hash functions from a family H = {h : S → U} of functions. H is called content-sensitive if for any q, the function

p(t) = \Pr_H[h(q) = h(x) : \|q - x\| = t]   (1)
is strictly decreasing in t. That is, the probability of collision of points q and x decreases with the content dissimilarity (distance) between them. We concatenate several hash functions h ∈ H. In particular, we define a function family G = {g : S → U^k} such that g(x) = (h_1(x), ..., h_k(x)). For an integer L, the algorithm chooses L functions g_1, ..., g_L from G, independently and uniformly at random. During preprocessing, the algorithm stores each input point in the buckets g_j(x), for all j = 1, ..., L. Since the total number of buckets may be large, the algorithm retains only the non-empty buckets by resorting to hashing.
Algorithm 1. Content Sensitive Hashing
Input: Word images W_j, j = 1, ..., n
Output: Hash tables T_i, i = 1, ..., L
1: for each i = 1, ..., L do
2:   Initialize hash table T_i with hash function g_i
3: end for
4: for each i = 1, ..., L do
5:   for each j = 1, ..., n do
6:     Pre-process word image W_j (noise removal etc.)
7:     Extract features F_j of word image W_j
8:     Compute hash bucket I = g_i(F_j)
9:     Store word image W_j in bucket I of hash table T_i
10:  end for
11: end for
A D-dimensional word feature x is mapped onto a set of integers by each hash function h_{a,b}(x). Each hash function in the family is indexed by a choice of random a and b, where a is a D-dimensional vector with entries chosen independently from a p-stable distribution and b is a real number chosen uniformly from the range [0, w]. For fixed a, b the hash function h_{a,b} is given by

h_{a,b}(x) = \left\lfloor \frac{a \cdot x + b}{w} \right\rfloor   (2)
Generally w = 4. The dot product a · x projects each vector onto the real line. The real line is chopped into equi-width segments of appropriate size w, and hash values are assigned to vectors based on which segment they project onto. The value of k is chosen such that t_c + t_g is minimal, where t_c is the mean query time and t_g is the time to compute the hash functions in the L hash tables. The value of k is determined by estimating these times on a sample data set S ⊆ P. The details of such parameter settings and the hash functions are presented in [11,16]. Algorithm 1 summarizes the major steps of content-sensitive hashing. Given a query word image, it is represented with the set of features q. The first-level k hash functions are calculated and concatenated to get bucket ids g_i(q), i = 1, ..., L, in the L hash tables. Then all the features, and the corresponding words, in the buckets of the L tables are retrieved as the query results. Thus the nearest neighbor problem boils down to searching only the vectors in the buckets that have the same hash index value as the query. Algorithm 2 summarizes the major steps of querying. The hash-based search in a collection of document images is faster than other approaches, like exhaustive search with DTW and nearest neighbor techniques. Approaches presented in the literature take a long time for building the index and for retrieval due to preprocessing and complex matching procedures. This computational time can be reduced by eliminating costly processes like clustering. We achieve this by employing the faster content-sensitive hashing technique, and obtain interactive retrieval with retrieval speeds in milliseconds.
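A minimal sketch of the hash family of Eq. (2) is given below. The Gaussian entries of a (a 2-stable distribution) and w = 4 follow the text; the class name, the tuple used as bucket key and the random number generator are illustrative choices.

```python
import numpy as np

class PStableHash:
    """One g(.) = (h_1, ..., h_k), with h_{a,b}(x) = floor((a.x + b) / w) as in Eq. (2)."""

    def __init__(self, dim, k, w=4.0, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        self.a = rng.standard_normal((k, dim))      # entries from a 2-stable (Gaussian) distribution
        self.b = rng.uniform(0.0, w, size=k)        # b uniform in [0, w)
        self.w = w

    def __call__(self, x):
        # Concatenating the k hash values gives the bucket key g(x)
        return tuple(np.floor((self.a @ x + self.b) / self.w).astype(int))
```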
Algorithm 2. Word Retrieval
Input: Query word image w
Output: Similar word images
1: O ← ∅
2: for each i = 1, ..., L do
3:   Pre-process word image w (noise removal etc.)
4:   Extract features F_w of word image w
5:   Compute hash bucket I = g_i(F_w)
6:   O ← O ∪ {points found in bucket I of T_i}
7: end for
8: Return similar words from O by linear search
Time-consuming offline processing of the data is not required for creating the index. The hashing technique avoids complex image matching methods and searches in sub-linear time.
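Building on the hash-family sketch above, the following code illustrates Algorithms 1 and 2: L tables are filled with word features during indexing, and a query retrieves the union of the matching buckets followed by a linear scan within the query radius. The table count L, the hash length k and the data layout are assumptions, not the parameters used in the paper.

```python
import numpy as np
from collections import defaultdict

class WordIndex:
    """Sketch of Algorithms 1 and 2: L hash tables over word-image features."""

    def __init__(self, dim, L=20, k=10, w=4.0):
        self.tables = [defaultdict(list) for _ in range(L)]
        self.hashes = [PStableHash(dim, k, w) for _ in range(L)]

    def add(self, word_id, features):
        # Algorithm 1: store the word in bucket g_i(F) of every table T_i
        for table, g in zip(self.tables, self.hashes):
            table[g(features)].append((word_id, features))

    def query(self, query_features, radius):
        # Algorithm 2: union of the L matching buckets, then a linear scan within the radius
        candidates = {}
        for table, g in zip(self.tables, self.hashes):
            for word_id, feats in table.get(g(query_features), []):
                candidates[word_id] = feats
        return [wid for wid, feats in candidates.items()
                if np.linalg.norm(feats - query_features) < radius]
```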
3
Results and Discussions
We evaluated the proposed hash-based retrieval scheme on word image data sets obtained from a collection of 7 Kalidasa books. The books are printed in Telugu, an Indian language. The document images were scanned and preprocessed to get segmented words, with little manual effort needed to remove segmentation errors. The words were then represented by a set of features. Around 20 words were annotated in each book for experimentation and performance evaluation purposes. Given a textual query word, an image is rendered (generated), features are extracted from the query image, and these are hashed to search for and retrieve the relevant words. The book-wise performance, measured using precision, recall and F-score values, is shown in Table 1. The query image and example search results are shown in Figure 3. The first two rows show correct results. Sometimes other words may also appear somewhat visually similar, and the last column in the last two rows shows examples of such words being retrieved.

Table 1. Search performance: precision, recall and F-score values for retrieval experiments conducted on each book from the Kalidasa collection

Book                  # Pages   # Words   Precision   Recall   F-score
Maalavikaagnimitra       292     22,500      100.00    91.72     95.68
Vikramuurvashiyam        286     23,600      100.00    95.58     97.74
Abhijnanasakuntalam      312     22,500       96.79    91.27     93.96
Ritusamhara              142     11,000       94.65    93.67     94.16
Kumarasambhava           282     56,100       92.37    90.21     91.27
Raghuvamsha              300     36,000       93.23    92.60     92.91
Meghaduta                238     44,000       96.15    93.53     94.82
Fig. 3. Results: Example (Telugu) words searched for input queries
Fig. 4. Results: Words with small variations in style and size are retrieved
Examples of queries containing words of different sizes and style types are shown in Figure 4. Such results are obtained by querying the same word in multiple books of the collection. Using the same query across two different books of the collection retrieves words which are content-wise similar. Indian language words have small form variations. For example, the same word may have different case endings. Such words are also searched correctly using the proposed solution. Example results of such queries are shown in Figure 5 (row 2). The retrieved words have the same stem, which is due to the similarity in image content. There are limits to the font variations that can be handled by the proposed retrieval technique. Experiments show that we cannot use combinations of different font words but such combinations are very unlikely to occur in books. The proposed hashed based search is sub-linear and much faster than exhaustive nearest neighbor search. The plot in Figure 6(a) shows the time efficiency of our hash based search versus nearest neighbor search. The experiments were conducted on data sets of increasing size (by 5,000 words) in each iteration. The maximum number of words used were around 45,000. With the use of the maximum size data set, the maximum time to search relevant words was of the order of milliseconds. The experiments were conducted on an AMD Athlon 64 bit processor using 512 MB memory. The precision and recall values are controlled by the query radius (distance) value. Experiments were conducted on synthetic images of Telugu language to see
Fig. 5. Results: Words with small form variations are retrieved as relevant
Fig. 6. Performance comparison: (a) Hashing and exhaustive nearest neighbor search. (b) Effect of distance: Precision and recall change with the query radius.
the effect of the radius on the performance. Around 6,000 synthetic word images in the Telugu language were used for these experiments. Each word was repeated around 4-10 times in the whole collection. Figure 6(b) shows the change in precision and recall values with the radius. The degradation in performance with increasing radius indicates that many irrelevant words are added to the group of similar words. Similar results were obtained with datasets in different fonts for the Telugu language. The query distance therefore has to be determined experimentally. Table 2 compares this approach to one based on using the DTW score as a similarity measure. It shows that our method is much faster than the DTW-based exhaustive matching and search procedure, while the accuracy is similar.

Table 2. Performance: DTW-based exhaustive search is much slower, while the accuracy of the proposed method is similar to that of DTW matching

Book                 | Hash Based Search               | DTW Based NNS
                     | Precision  Recall  Time (sec)   | Precision  Recall  Time (sec)
Abhijnanasakuntalam  |   96.79    91.27      0.005     |   95.27    93.71      650
Ritusamhara          |   94.65    93.67      0.003     |   93.33    96.63      216
4
Conclusion and Future Work
We presented an efficient indexing and retrieval scheme for searching in large document image databases. Efficiency and scalability, along with high precision and recall values, are achieved by content-sensitive hashing. The retrieval speed is orders of magnitude better than that of exhaustive matching - the technique can search 20,000 word images in milliseconds. We have demonstrated that this technique is practical for searching printed documents rapidly. Future improvements could include feature selection using machine learning techniques to handle multiple fonts and styles.
References 1. Pal, U., Chaudhuri, B.: Indian script character recognition: A survey. Pattern Recognition 37, 1887–1899 (2004) 2. Rath, T.M., Manmatha, R.: Word image matching using dynamic time warping. In: Conference on Computer Vision and Pattern Recognition, vol. (2), pp. 521–527 (2003) 3. Rath, T.M., Manmatha, R.: Word spotting for historical documents. IJDAR 9(2), 139–152 (2007) 4. Balasubramanian, A., Meshesha, M., Jawahar, C.V.: Retrieval from document image collections. In: Bunke, H., Spitz, A.L. (eds.) DAS 2006. LNCS, vol. 3872, pp. 1–12. Springer, Heidelberg (2006) 5. Chan, J., Ziftci, C., Forsyth, D.A.: Searching off-line arabic documents. In: CVPR. Conference on Computer Vision and Pattern Recognition, vol. (2), pp. 1455–1462 (2006) 6. Lu, Z., Schwartz, R., Natarajan, P., Bazzi, I., Makhoul, J.: Advances in the bbn byblos ocr system. In: ICDAR, pp. 337–340 (1999) 7. Rath, T.M., Manmatha, R., Lavrenko, V.: A search engine for historical manuscript images. In: SIGIR, pp. 369–376 (2004) 8. Ataer, E., Duygulu, P.: Retrieval of ottoman documents. In: Multimedia Information Retrieval (MIR) workshop, pp. 155–162 (2006) 9. Konidaris, T., Gatos, B., Ntzios, K., Pratikakis, I., Theodoridis, S., Perantonis, S.J.: Keyword-guided word spotting in historical printed documents using synthetic data and user feedback. IJDAR 9(2), 167–177 (2007) 10. Sankar, K.P., Jawahar, C.V.: Probabilistic reverse annotation for large scale image retrieval. In: Conference on Computer Vision and Pattern Recognition, pp. 1–6 (2007) 11. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: SOTC, pp. 604–613 (1998) 12. Shakhnarovich, G., Viola, P., Darrell, T.: Fast pose estimation with parametersensitive hashing. In: Sebe, N., Lew, M.S., Huang, T.S. (eds.) Computer Vision in Human-Computer Interaction. LNCS, vol. 3766, pp. 750–757. Springer, Heidelberg (2005) 13. Matei, B., Shan, Y., Sawhney, H., Tan, Y., Kumar, R., Huber, D., Hebert, M.: Rapid object indexing using locality sensitive hashing and joint 3D-signature space estimation. IEEE Trans. PAMI 28(7), 1111–1126 (2006) 14. Lamdan, Y., Wolfson, H.: Geometric hashing: A general and efficient model-based recognition scheme. In: ICCV, pp. 238–249 (1988) 15. Nakai, T., Kise, K., Iwamura, M.: Use of affine invariants in locally likely arrangement hashing for camera-based document image retrieval. In: Bunke, H., Spitz, A.L. (eds.) DAS 2006. LNCS, vol. 3872, pp. 541–552. Springer, Heidelberg (2006) 16. Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In: Proceedings of the 25th VLDB conference, pp. 518–529 (1999)
Hand Posture Estimation in Complex Backgrounds by Considering Mis-match of Model
Akihiro Imai1, Nobutaka Shimada2, and Yoshiaki Shirai2
1 Dept. of Computer-Controlled Mechanical Systems, Osaka University, Yamadaoka, Suita, Osaka 565-0871, Japan
2 Dept. of Human and Computer Intelligence, Ritsumeikan University, Nojihigashi, Kusatsu, Shiga 525-8577, Japan
Abstract. This paper proposes a novel method of estimating 3-D hand posture from images observed in complex backgrounds. Conventional methods often make mistakes due to mis-matches of local image features. Our method considers the possibility of mis-matches between each posture model appearance and the other model appearances in a Bayesian stochastic estimation form by introducing a novel likelihood concept, the "Mistakenly Matching Likelihood" (MML). The correct posture model is discriminated from mis-matches by MML-based posture candidate evaluation. The method is applied to a hand tracking problem in complex backgrounds and its effectiveness is shown.
1
Introduction
Precise hand-finger shape estimation methods using visual cues have been developed [1][2][3][4][5][6] in order to implement gestural interfaces in a touch-less manner, which are utilized in interaction with virtual environments and automatic sign-language translation. One of the difficulties of implementing interfaces based on hand shape estimation lies in the situations where the interfaces are needed: complex backgrounds such as colorful and textured clothes, skin-colored regions such as a human face, and desktops on which various tools and objects are scattered. Since hand shape estimation even in simple backgrounds is a tough problem due to the great variety of postures, shape estimation with simultaneous segmentation is still a challenging problem. To solve the problem with feasible computing resources, some trials have been reported from the following two viewpoints: 1) how to reduce the number of posture candidates to consider (i.e., how to predict the posture), and 2) how to evaluate the matching degree between a posture candidate and the observed image features. From viewpoint 1), the Active Shape Model [7] has been proposed, which learns acceptable shape deformations and tracks the region contour or texture assuming smooth deformation and motion. Non-smooth deformation can be treated by introducing Switching Linear Dynamics [8]. 3-D model-based shape prediction and tracking, not based on appearance learning, has also been proposed [5]. Most of
Fig. 1. Mistake of the conventional method: (a) estimation result; (b) edges of the estimation result put on the input image; (c) shape similar to the input image; (d) edges of the similar shape put on the input image
Fig. 2. Correspondence of edges
those methods employ a parallel search scheme in tracking, like beam search or a particle filter, for robustness against temporal mis-estimation and tracking failure [5][9][10][11][12][13][14]. While many improvements from the first viewpoint have been reported, those from the second viewpoint, concerned with the evaluation of the matching degree, are comparatively few, and most implementations employ a simple feature correspondence and evaluation method: chamfer matching [15][16]. Chamfer matching makes correspondences between the features with the least distance in the image, and evaluates the matching degree by the sum of the distances (chamfer distance). This simple matching scheme, of course, often causes a wrong shape estimate on complicated backgrounds. Fig. 1 is an example of a wrong estimate of hand posture caused by chamfer matching. Because many edge textures are observed in the hand region, the finger tips of the posture model are mistakenly matched to the inner edges (see Fig. 2), and as a result its chamfer distance is evaluated as too small. While this problem is hard to avoid as long as chamfer matching is used, no more appropriate matching method than chamfer matching is available. Therefore the matching degree should be evaluated under the consideration that such wrong matching often happens. Nevertheless, the existence of mismatches caused by chamfer matching does not directly mean it is useless. The Embedding approach [15] evaluates the matching degree between an input image feature and not only one posture model but also several other reference models. For the example of Fig. 1, in addition to the correct posture candidate (c), the candidate (a) also has such a high matching
degree that (a) is picked as the estimate. However, if only (a) has a high matching degree when (a) is actually the correct match, these two cases can be discriminated by evaluating the matching degrees with both reference models (a) and (c). Since the Embedding approach only uses an ad-hoc way of evaluating the squared sum of the matching degrees of all reference models, its estimate is not optimal from a Bayesian point of view. This paper mathematically derives a Bayesian form of the Embedding approach. In its derivation, a novel concept of likelihood is introduced: the Mistakenly Matching Likelihood (MML), which predicts the high evaluation caused by wrong matching and gives the ability to discriminate the true estimate from false matches in a stochastic way. The derived MML-based candidate evaluation is applied to a hand tracking problem in complex backgrounds and its effectiveness is experimentally shown.
2
Acquisition of Typical Hand Posture Images
The 3-D hand model used in our research is originally a wireframe model. The model is modified into a shaded model. The joint bending angles are denoted by θ_{b,t,1}, θ_{b,t,2}, θ_{b,t,3}, θ_{b,i,1}, θ_{b,i,2}, θ_{b,i,3}, ... and the opening angles at the base joint of the fingers are denoted by θ_{o,t}, θ_{o,i}, ... (shown in Fig. 3). The posture of the whole hand model is represented by the translation t_x, t_y, t_z and the rotation θ_{r,x}, θ_{r,y}, θ_{r,z}. As a whole, the shape of the hand model has 26 degrees of freedom:

\theta = (\theta_{b,t,1}, \ldots, \theta_{r,z})   (1)
CG images of typical hand models are shown in Fig. 4. The finger joints move dependently in natural actions [5]. In the index, middle, ring and pinky fingers, adjacent joint angles are usually similar. Such joint constraints reduce
Fig. 3. Hand model: (a) wireframe model, (b) model after shading
Fig. 4. CG images of typical hand postures
Fig. 5. Edge images of hand model
Table 1. Quantization widths from the search center

Parameter                 Quantized values
Δθ_bend^c [°]             -ζ     0     ζ
Δθ_open^c [°]             -6     0     6
Δθ_{o,m}^c [°]            -30    0     30
Δθ_{r,x}^c [°]            -15    0     15
Δθ_{r,y}^c [°]            -6     0     6
Δθ_{r,z}^c [°]            -8     0     8
Δt_x^c, Δt_y^c [mm]       -40    0     40
Δt_z^c [mm]                      0
the number of possible postures. Fig. 5 shows the edge images generated from the CG images of the typical hand postures under the constraints. The posture parameters θ to be estimated are quantized. The changes are shown in Table 1 (the change of θ_b is represented by Δθ_bend^c, and that of θ_o other than the middle finger by Δθ_open^c). Each parameter of θ_b is assumed to change by 0 or ±ζ. Each parameter has its own ζ between 9° and 15°.
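For illustration, the quantized search neighbourhood around the previous estimate can be enumerated as sketched below; in the actual system the joint constraints tie groups of parameters together, so the product is taken over a much smaller, constraint-reduced parameter set. The function and argument names are illustrative.

```python
import itertools
import numpy as np

def candidate_postures(theta_prev, deltas):
    """Enumerate quantized posture candidates around the previous estimate (cf. Table 1).

    theta_prev : current estimate of the (constraint-reduced) pose parameters
    deltas     : per-parameter quantization width (zeta or the values of Table 1)
    """
    steps = [(-d, 0.0, d) for d in deltas]          # each parameter changes by -delta, 0 or +delta
    for offsets in itertools.product(*steps):
        yield theta_prev + np.asarray(offsets)
```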
3
Matching Method
The system has several hand models with various dimensions, i.e., lengths and widths of the palm and fingers. Since input image sequences are assumed to start from a predefined simple shape and an initial position, the dimensions are easily initialized at the first frame. The posture parameters to be estimated lie around the posture estimate of the previous image frame. When each input image is obtained, the best-matched model is determined by the maximum likelihood criterion. Let I denote the edges and skin regions extracted from an input image and Θ_j denote the quantized parameter vector of the j-th model. The criterion is

\hat{\Theta} = \arg\max_{\Theta_j} P(I|\Theta_j)   (2)
where P(I|Θ_j) denotes the likelihood of the j-th model for the input. The likelihood is defined based on the difference between the image I and the appearance of the shape model.
3.1 Difference of Image and Appearance of Shape Model
In this paper, the edge image I^(e) and the skin-color region image I^(s) are used as the image features of the input I. The difference between the silhouette of a typical hand model and that of an image is computed. Let A^(s)_θ be the silhouette generated from a typical hand model θ. The difference of the silhouettes, f_skin(A^(s)_θ; I^(s)), is defined as the area of A^(s)_θ that does not overlap with I^(s). The difference between I^(e) and the edge appearance A^(e)_θ, f_dist(A^(e)_θ; I^(e)), is computed by a modified chamfer matching, in which the edge points are classified by gradient direction and the edges are matched by the original chamfer matching within each direction class [13][18]. The distance is weighted by the edge contrast, and as a result f_dist is defined as follows:

f_{dist}(A^{(e)}_\theta; I^{(e)}) = \sum_j w_{\theta,j} \min_k \big( \|x_{\theta,j} - x_{I^{(e)},k}\| + f_{I^{(e)},k} + g(j,k) \big)   (3)

where x_{θ,j} and x_{I^(e),k} denote the j-th edge point of the model and the k-th edge point of the input edge image, ||·|| is the 2-dimensional Euclidean norm, w_{θ,j} is a weight, and f_{I^(e),k} is a penalty for an edge with low contrast:

w_{\theta,j} = \frac{d_{\theta,j}}{\sum_l d_{\theta,l}}   (4)

f_{I^{(e)},k} = -w_d \, d_{I^{(e)},k}   (5)

where w_d is a weight constant. The difference of gradient direction g(j,k) is defined in terms of the gradient direction φ_{θ,j} of the model edge and the gradient direction φ_{I^(e),k} of the input edge as

g(j,k) = w_\phi \|\phi_{\theta,j} - \phi_{I^{(e)},k}\|   (6)

where w_φ is a weight constant. All weights are experimentally determined. Using a distance transform, the modified chamfer matching is computed as fast as the original chamfer matching.
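The following sketch illustrates how f_dist of Eqs. (3)-(6) can be evaluated with one distance transform per gradient-direction class. It is only an approximation of the minimization in Eq. (3) (the nearest input edge within the same direction class is taken as the match), and the bin count, the fallback cost for empty classes and the weights w_d, w_phi are assumed values, not those of the paper.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def modified_chamfer(model_pts, model_dirs, model_contrast,
                     edge_map, edge_dirs, edge_contrast,
                     n_bins=8, w_d=0.5, w_phi=1.0):
    """Directed chamfer distance in the spirit of Eqs. (3)-(6).

    model_pts : (M, 2) integer (row, col) model edge coordinates
    edge_map  : (H, W) boolean edge image I^(e); edge_dirs / edge_contrast are per-pixel maps
    """
    H, W = edge_map.shape
    w = model_contrast / model_contrast.sum()                       # Eq. (4)

    # quantize gradient directions (modulo pi) into n_bins classes
    img_bin = ((np.mod(edge_dirs, np.pi) / np.pi) * n_bins).astype(int) % n_bins
    mod_bin = ((np.mod(model_dirs, np.pi) / np.pi) * n_bins).astype(int) % n_bins

    total = 0.0
    for b in range(n_bins):
        sel = mod_bin == b
        if not np.any(sel):
            continue
        mask = edge_map & (img_bin == b)
        if not np.any(mask):
            total += w[sel].sum() * np.hypot(H, W)                  # assumed fallback cost
            continue
        # distance to (and index of) the nearest input edge of this direction class
        dt, inds = distance_transform_edt(~mask, return_indices=True)
        iy, ix = inds
        r, c = model_pts[sel, 0], model_pts[sel, 1]
        ny, nx = iy[r, c], ix[r, c]
        dist = dt[r, c]
        penalty = -w_d * edge_contrast[ny, nx]                      # f_{I,k}, Eq. (5)
        g = w_phi * np.abs(model_dirs[sel] - edge_dirs[ny, nx])     # Eq. (6)
        total += np.sum(w[sel] * (dist + penalty + g))
    return total
```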
3.2 Discrimination Principle
An example of mis-matching by a conventional method was shown in Fig. 1 in Section 1. Let Θ_a and Θ_c respectively denote the posture parameters of model (a) (fist shape) and model (c) (flat shape). Suppose the true hand posture is Θ_c. In this section, we describe the principle of discriminating the true match from the wrong matches caused by complicated skin textures and backgrounds, and finally introduce its stochastic form, which gives the discrimination criterion. The probability that the appearance A_{Θ_c} is matched to the input image, p(A_{Θ_c}|Θ_c), should be large enough because the true posture is the same as the one which generates the appearance, Θ_c. On the other hand, the probability of A_{Θ_a} (the small fist shape), p(A_{Θ_a}|Θ_c), can also be large in spite of the posture difference between Θ_a and Θ_c because
Fig. 6. Likelihood for edge images: p(A_{Θ_a}|Θ_c) ≈ p(A_{Θ_a}|Θ_a), whereas p(A_{Θ_c}|Θ_a) < p(A_{Θ_c}|Θ_c)
almost all of the area of A_{Θ_a} is included and the inner texture edges can be wrongly matched to the finger contours. Therefore, conventional likelihood maximization can often choose Θ_a mistakenly for flat hand shapes like Θ_c due to image capture noise, inaccuracy of the 3-D shape model and quantization errors of the posture parameters. In order to resolve the mis-matches, we carefully analyze the behaviour of two more probabilities: p(A_{Θ_a}|Θ_a) and p(A_{Θ_c}|Θ_a). p(A_{Θ_a}|Θ_a) should be large and p(A_{Θ_c}|Θ_a) should be small, because A_{Θ_c} protrudes from the area of Θ_a. The four probabilities thus behave as follows: while p(A_{Θ_a}|Θ_c) and p(A_{Θ_c}|Θ_c) are both large for the posture Θ_c, p(A_{Θ_a}|Θ_a) is large and p(A_{Θ_c}|Θ_a) is small for the posture Θ_a (see Fig. 6). Therefore, when the appearances A_{Θ_a} and A_{Θ_c} are observed together, the posture should be estimated as Θ_c; when A_{Θ_a} is observed alone, the posture should be Θ_a. When the likelihood of an appearance A_{Θ_k} for a model Θ_j, p(A_{Θ_k}|Θ_j), is obtained for each possible combination of k and j in advance, the appropriate model can be chosen by taking all the p(A_{Θ_k}|Θ_j) values into account, as in the above discussion. When k and j are identical, p(A_{Θ_k}|Θ_j) is equivalent to the conventional likelihood function. Otherwise, it is the likelihood that an appearance A_{Θ_k} comes from a mistakenly chosen model Θ_j. We call this likelihood the "mistakenly matching likelihood" (MML).
3.3 Model Selection using Mistakenly Matching Likelihood
We introduce the stochastic discrimination criterion based on the principle described in the previous section, employing a Bayesian estimation framework. Let A_{Θ_1}, A_{Θ_2}, ... denote the appearances of the typical hand models. Assuming A_{Θ_1}, A_{Θ_2}, ... are exclusive under each Θ_j, the likelihood of Θ_j for I can be expanded as

p(I|\Theta_j) = \sum_k p(I, A_{\Theta_k}|\Theta_j) = \sum_k p(I|A_{\Theta_k}, \Theta_j)\, p(A_{\Theta_k}|\Theta_j).   (7)
Assuming the appearance A_{Θ_k} carries all the information needed to generate the observed image I, the condition Θ_j can be removed:

p(I|\Theta_j) = \sum_k p(I|A_{\Theta_k})\, p(A_{\Theta_k}|\Theta_j).   (8)

In the conventional maximum likelihood estimation method, only the likelihood for the case k = j is considered. In contrast, we additionally consider the MML for the case k ≠ j. Assuming that I^(e) and I^(s) are independent when a certain Θ_j is specified,

p(I|\Theta_j) = p(I^{(e)}|\Theta_j)\, p(I^{(s)}|\Theta_j)   (9)

is derived as the discrimination criterion in stochastic form. The likelihoods p(I^(e)|Θ_j) and p(I^(s)|Θ_j) are respectively derived from the following equations:

p(I^{(e)}|\Theta_j) = \sum_k p(I^{(e)}|A^{(e)}_{\Theta_k})\, p(A^{(e)}_{\Theta_k}|\Theta_j)   (10)

p(I^{(s)}|\Theta_j) = \sum_k p(I^{(s)}|A^{(s)}_{\Theta_k})\, p(A^{(s)}_{\Theta_k}|\Theta_j)   (11)

The probability distributions p(A^(e)_{Θ_k}|Θ_j) and p(I^(e)|A^(e)_{Θ_k}) for edge images are introduced in the following sections. Those for the skin-color silhouette, p(A^(s)_{Θ_k}|Θ_j) and p(I^(s)|A^(s)_{Θ_k}), can be introduced in the same manner as those for edge images.
where θj∗ is the mean value of the interval Θj . p(AΘk |θj∗ ) is derived as follows from the definition of fdist in sec.3.1 and assuming that fdist obeys a gaussian distributionfdist [12][13][17]: (e)
p(AΘk |θj∗ ) = αθ∗ exp(−(dM (k, j))2 ) (e)
(e)
(e)
j
(13)
(e)
where Ir (θ) is the edge image rendered from the posture θ, and (e)
dM (k, j) =
fdist (AΘ ;Ir(e) (θj∗ )) (e)
k (e)
σM
.
(14)
Hand Posture Estimation in Complex Backgrounds (e) 2
σM
603
is the variance of the value of fdist (AΘk ; Ir (θj∗ )). σM is experimentally (e)
(e)
(e)
(e)
determined. αθ∗ is normalization constant, j
(e)
αθ∗ =
k
j
−1 (e) exp(−(dM (k, j))2 ) .
(15)
In the same manner as the above, p(AΘk |θj∗ ) = αθ∗ exp(−(dM (k, j))2 ) (s)
(s)
(s)
(16)
j
(s)
where Ir (θ) is the silhouette generated from θ, and fskin (AΘ ;Ir(s) (θj∗ )) (s)
(s)
dM (k, j) =
k (s)
σM
(s) 2
(17)
.
σM is the variance of the value of fskin (AΘk ; Ir (θj∗ )). αθ∗ is normalization j constant, −1 (s) (s) 2 (18) αθ∗ = . k exp(−(dM (k, j)) ) (s)
(s)
(s)
j
3.5
Likelihood of Appearance
In this section, we explain the evaluation of the likelihood of an appearance (e) p(I (e) |AΘk ). The likelihood is defined based on the definition of fdist as p(I (e) 2
where, σI
(e)
(e) |AΘk )
=
(e) βΘk exp
2 (e) (fdist (AΘ ;I (e) )) k − . (e) 2 σI
(19)
(e)
is the variance of (fdist (AΘk ; I (e) )). It is experimentally determined.
The normalization constant abilistic distributions:
(e) βΘk
is derived from the integral condition of prob(e)
p(i(e) |AΘk )di(e) = 1
(20)
(e)
Assuming that p(i(e) |AΘk ) can be large value only for i(e) of hand images and is 0 for most of other i(e) , (e) (e) (e) p(i(e) |AΘk )di(e) ≈ p(Ir (θl )|AΘk )dθl (e) ∗ (e) ≈ l p(Ir (θl )|AΘk ) · δ (21) (e) (e) = βΘk l exp(−(dM (k, l))2 ) · δ ≡1 where δ is the range of the quantization of Θ. −1 (e) (e) 2 βΘk = l exp(−(dM (k, l)) ) · δ
(22)
604
A. Imai, N. Shimada, and Y. Shirai (e)
(e)
(e)
When AΘk wrongly matches to many of Ir (θl )(l = k), βΘk becomes small. On (e) βΘk
becomes large. It means that ambiguous the other hand when a few of those, appearance model, which is easy to mis-match to other posture’s appearances, are automatically low evaluated.
4
Estimation of More Accurate Posture Parameters
Posture parameters of the best-matched model are slightly different from that of the hand of an input image due to quantization errors of posture parameters. Thus, more accurate parameters must be estimated. The wireframe CG model of the hand is deformed so that the model is matched to an input image, and the 3-D hand shape is reconstructed from the deformed model[19]. In this method, while the curved surface shape of the hand is reconstructed, posture parameters are not estimated. We deform the CG model so that the appearance of the model is matched to those of the input image by using this method. The accurate posture parameters are estimated from coordinates of the vertices of the triangle patches of the deformed wireframe model. Parameters are estimated by the following steps of a procedure. 1. We make correspondences of edges of the best-matched model to those of the input image. 2. The change of the appearance is evaluated from the correspondences so that the edges of the model move toward those of the input image. 3. In order to reduce the huge search region of posture parameters due to the high DOF of human hand, available deformations of surface mesh of the CG model are learned by PCA in advance for each of typical postures, and then the best approximated mesh deformation is estimated by the projection to the PCA subspace. 4. Return to 1. We make correspondences of edges of the CG model deformed at 3. to those of the input image. CG model is deformed by the change of appearance evaluated from the correspondence, again. Repeat these processes. 5. We evaluate the 2-dimensional Euclidean norm of the vertices of the triangle patches between the deformed CG model and CG model generated by the posture parameters. The sum of the norms is minimized using steepest descent method. The posture parameters with minimized sum of the norms are the posture estimate.
5
Experiment
We did the experiment of posture tracking for 250 hand images. The resolution of the images is 320 × 240. The images are captured by 30 fps. In the conventional (e) (s) method, where p(I|Θj ) = p(I|AΘj )p(I|AΘj ) is used as a matching criterion, 70.4% images are correctly matched. In our method, 82.0% images are correctly matched. The success rates show effectiveness of our method.
Hand Posture Estimation in Complex Backgrounds
605
Fig. 7. Experimental result
Fig. 9. Experimental result
Fig. 8. Experimental result
Fig. 10. Experimental result
The example of the image which is correctly matched in our method while mis-matched in the conventional method, is shown in Fig. 7. While the wrong fist hand shape is matched in conventional method, the correct flat shape is matched in our method. Fig. 8 shows the results of an image sequence. In the input images, fingers are partially occluded and the edges of the background are confusingly appeared near the fingers. Such cases causing mismatches are correctly matched in our method. Fig. 9 and Fig. 10 show the tracking results for other hand shapes. These images are also correctly matched in our method.
6
Conclusion and Discussion
The paper introduces a Bayesian form of evaluation of posture candidates for hand tracking in complex backgrounds. The novel concept of Mistakenly
606
A. Imai, N. Shimada, and Y. Shirai
Matching Likelihood (MML) enables to discriminate the true posture candidate from other confusing ones when the mismatch of image features frequently occurs. Experimental results for tracking of the real human hand show the effectiveness of this evaluation method. Additional image features like optical flows or range other than edges and silhouette should be considered on this framework as future work.
Acknowledgment This work is supported in part by Grant-in-Aid for Scientific Research from Ministry of Education, Science, Sports, and Culture, Japanese Government, No.15300058. The 3-D model of the real human hand was provided by courtesy of Prof. F. Kishino and Prof. Y. Kitamura, Osaka University.
References 1. Liu, X., Fujimura, K.: Hand Gesture Recognition using Depth Data. In: Proc. 6th Int. Conf. on Automatic Face and Gesture Recognition, pp. 529–534 (2004) 2. Iwai, Y., Yagi, Y., Yachida, M.: Estimation of Hand Motion and Position from Monocular Image Sequence. In: Li, S., Teoh, E.K., Mital, D., Wang, H. (eds.) ACCV1995. LNCS, vol. 1035, pp. 230–234. Springer, Heidelberg (1996) 3. Lee, S.U., Cohen, I.: 3D Hand Reconstruction from a Monocular View. In: Proc. 17th Int. Conf. on Pattern Recognition, vol. 3, pp. 310–313 (1995) 4. Kameda, Y., Minoh, M., Ikeda, K.: Three Dimensional Pose Estimation of an Articulated Object from its Silhouette Image. In: ACCV 1993, pp. 612–615 (1993) 5. Shimada, N., Kimura, K., Shirai, Y.: Real-time 3-D Hand Posture Estimation based on 2-D Appearance Retrieval Using Monocular Camera. In: Proc. Int. Workshop on RATFG-RTS, pp. 23–30 (2001) 6. Imai, A., Shimada, N., Shirai, Y.: 3-D Hand Posture Recognition by Training Contour Variation. In: Proc. of The 6th Int. Conf. on Automatic Face and Gesture Recognition, pp. 895–900 (2004) 7. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Active Shape Models-Their Training and Application. COMPUTER VISION AND IMAGE UNDERSTANDING 61(1), 38–59 (1995) 8. Jeong, M., Kuno, Y., Shimada, N., Shirai, Y.: Recognition of shape-changing hand gestures. IEICE Trans. inf.Syst. E85-D(10), 1678–1687 (2002) 9. Isard, M., Blake, A.: Visual tracking by stochastic propagation of conditional density. In: Proc. European Conf. Computer Vision, pp. 343–356 (1996) 10. Isard, M., Blake, A.: ICONDENSATION:Unifying low-level and high-level tracking in a stochastic framework. In: Proc. European Conf. Computer Vision, pp. 767–781 (1996) 11. Heap, T., Hogg, D.: Wormholes in Shape Space:Tracking through Discontinuous Changes in Shape. In: 6th Int. Conf. on Computer Vision, pp. 344–349 (1998) 12. Zhou, H., Huand, T.S.: Tracking Articulated Hand Motion with Eigen Dynamics Analysis. 9th Int. Conf. on Computer Vision 2, 1102–1109 (2003)
Hand Posture Estimation in Complex Backgrounds
607
13. Stenger, B., Thayananthan, A., Torr, P.H.S., Cipolla, R.: Model-Based Hand Tracking Using a Hierarchical Bayesian Filter. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(9), 1372–1384 (2006) 14. Wu, Y., Lin, J., Huang, T.S.: Analyzing and Capturing Articulated Hand Motion in Image Sequences. IEEE TRANS. ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 27(12), 1910–1922 (2005) 15. Athitsos, V., Sclaroff, S.: Estimating 3D Hand Pose from a Cluttered Image. In: IEEE Conf. on Computer Vision and Pattern Recognition, vol. II, pp. 432–439. IEEE Computer Society Press, Los Alamitos (2003) 16. Barrow, H.G., Tenenbaum, J.M., Bolles, R.C., Wolf, H.C.: Parametric correspondence and chamfer matching: Two new techniques for image matching. In: Proc. 5th Int. Joint Conf. Artificial Intelligence, pp. 659–663 (1977) 17. Blake, A., Isard, M.: Active Contours. Springer, Heidelberg (1998) 18. Navaratnam, R., Thayananthan, A., Torr, P.H.S., Cipolla, R.: Hierarchical PartBased Human Body Pose Estimation. In: Proc. British machine Vision Conference (2005) 19. Heap, T., Hogg, D.: Towards 3D Hand Tracking using a Deformable Model. In: 2nd Int. Conf. on Automatic Face and Gesture Recognition, pp. 140–145 (1996)
Learning Generative Models for Monocular Body Pose Estimation Tobias Jaeggli1 , Esther Koller-Meier1 , and Luc Van Gool1,2 1
2
ETH Zurich, D-ITET/BIWI, CH-8092 Zurich Katholieke Universiteit Leuven, ESAT/VISICS, B-3001 Leuven
[email protected]
Abstract. We consider the problem of monocular 3d body pose tracking from video sequences. This task is inherently ambiguous. We propose to learn a generative model of the relationship of body pose and image appearance using a sparse kernel regressor. Within a particle filtering framework, the potentially multimodal posterior probability distributions can then be inferred. The 2d bounding box location of the person in the image is estimated along with its body pose. Body poses are modelled on a low-dimensional manifold, obtained by LLE dimensionality reduction. In addition to the appearance model, we learn a prior model of likely body poses and a nonlinear dynamical model, making both pose and bounding box estimation more robust. The approach is evaluated on a number of challenging video sequences, showing the ability of the approach to deal with low-resolution images and noise.
1 Introduction Monocular body pose estimation is difficult, because a certain input image can often be interpreted in different ways. Image features computed from the silhouette of the tracked figure hold rich information about the body pose, but silhouettes are inherently ambiguous, e.g. due to the Necker reversal. Through the use of prior models this problem can be alleviated to a certain degree, but in many cases the interpretation is ambiguous and multi-valued throughout the sequence. Several approaches have been proposed to tackle this problem, they can be divided into discriminative and generative methods. Discriminative approaches directly infer body poses given an appearance descriptor, whereas generative approaches provide a mechanism to predict the appearance features given a pose hypothesis, which is then used in a generative inference framework such as particle filtering or numerical optimisation. Recently, statistical methods have been introduced that can learn the relationship of pose and appearance from a training data set. They often follow a discriminative approach and have to deal explicitly with the nonfunctional nature of the multi-valued mapping from appearance to pose [1,2,3,4]. Generative approaches on the other hand typically use hand crafted geometric body models to predict image appearances (e.g. [5], see [6,7] for an overview). We propose to combine the generative methodology with a learning based statistical approach. The mapping from pose to appearance is single-valued and can thus be seen Y. Yagi et al. (Eds.): ACCV 2007, Part I, LNCS 4843, pp. 608–617, 2007. c Springer-Verlag Berlin Heidelberg 2007
Learning Generative Models for Monocular Body Pose Estimation
609
as a nonlinear regression problem. We approximate the mapping with a RVM kernel regressor [8] that is efficient due to its sparsity. The human body has many degrees of freedom, leading to high dimensional pose parametrisations. In oder to avoid the difficulties of high dimensionality in both the learning and the inference stage, we apply a nonlinear dimensionality reduction algorithm [9] to a set of motion capture data containing walking and running movements. 1.1 Related Work Statistical approaches to the monocular pose estimation problem include [1,2,3,4,10,11]. In [10] the focus lies on the appearance descriptor, and the discriminative mapping from appearance to pose is assumed to be single-valued and thus modelled with a single linear regressor. The one-to-many discriminative mapping is explicitly addressed in [1,2,3,4] by learning multiple mappings in parallel as a mixture of regressors. In order to choose between the different hypotheses that the different regressors deliver, [1,2] use a geometric model that is projected into the image to verify the hypotheses. Inference is performed for each frame independently in [1]. In [2] a temporal model is included using a bank of Kalman filters. In [3,4] gating functions are learned along with the regressors in order to pick the right regressor(s) for a given appearance descriptor. The distribution is propagated analytically in [3], and temporal aspects are included in the learned discriminative mapping, whereas [4] adopts a generative sampling-based tracking algorithm with a first-order autoregressive dynamic model. These discriminative approaches work in a bottom-up fashion, starting with the computation of the image descriptor, which requires the location of the figure in the images to be known beforehand. When including 2d bounding box estimation in the tracking problem, a learned dynamical model might help the bounding box tracking, and avoid loosing the subject when it is temporarily occluded. To this end, [12] learns a subjectspecific dynamic appearance model from a small set of initial frames, consisting of a low-dimensional embedding of the appearances and a motion model. This model is used to predict location and appearance of the figure in future frames, within a CONDENSATION tracking framework. Similarly, low-dimensional embeddings of appearance (silhouette) manifolds are found using LLE in [11], where additionally the mapping from the appearance manifold to 3d pose in body joint space is learned using RBF interpolants, allowing for pose inference from sequences of silhouettes. Instead of modelling manifolds in appearance space, [13,14,15] work with low dimensional embeddings of body poses. In [13], the low-dimensional pose representation, its dynamics, and the mapping back to the original pose space are learned in a unified framework. This approach does not include statistical models of image appearance. In a similar fashion, we also chose to model manifolds in pose space rather than appearance space, because the pose manifold has fewer self-intersections than the appearance manifold, making the dynamics and tracking less ambiguous. In contrast to [13,14,15], our model includes a learned generative likelihood model. When compared to [1,2,3,4,10,11], our approach can simultaneously estimate pose and bounding box, and learning a single regressor is more easily manageable than a mixture of regressors. The paper is structured as follows. 
Section 2 and 3 introduce our learned models and the inference algorithm, and in Section 4 we show experimental results.
610
T. Jaeggli, E. Koller-Meier, and L. Van Gool
2 Learning Figure 1 a) shows an overview of the tracking framework. Central element is the lowdimensional body pose parametrisation, with learned mappings back to the original pose space and into the appearance space. In this section all elements of the framework will be described in detail. Our models were trained on real motion capture data sets of different subjects, running and walking at different speeds. 2.1 Pose and Motion Prior Representations for the full body pose configuration are high dimensional by nature; our current representation is based on 3d joint locations of 20 body locations such as hips, knees and ankles, but any other representation (e.g. based on relative orientations between neighbouring limbs) can easily be plugged into the framework. To alleviate the difficulties of high dimensionality in both the learning and inference stages, a dimensionality reduction step identifies a low dimensional embedding of the body pose representations. We use Locally Linear Embedding (LLE) [9], which approximately maintains the local neighbourhood relationships of each data point and allows for global deformations (e.g. unrolling) of the dataset/manifold. LLE dimensionality reduction is performed on all poses in the data set and expresses each data point in a space of desired low dimensionality. We currently use a 4-dimensional embedding. However, LLE does
Body Pose
X : Body Pose (high dim.)
learn LLE dim. red.
reconstruct pose, eq. (1)
x : Body Pose (low dim.)
generative mapping
dynamic prior eq. (4)
Y : Image (high dim.)
BPCA reconstruction eq. (5)
(b) eq. (6)
y : Appearance Descriptor: (low dim.)
learn BPCA dim. red.
Appearance
(a)
(c)
Fig. 1. a) An overview of the tracking framework. Solid arrows represent signal flow during inference, the dashed arrow stands for LLE resp. BPCA dimensionality reduction during training. The figure refers to equations in Section 2. b) Body pose representation as a number of 3d joint locations. c) Corresponding synthetically generated silhouette, as used for training the appearance model.
However, LLE does not provide explicit mappings between the high-dimensional and the low-dimensional space, which would allow projecting new data points (that were not contained in the original data set) between the two spaces. Therefore, we model the reconstruction projection from the low-dimensional LLE space to the original pose space with a kernel regressor.

X = f_p(x) = W_p Φ_p(x)    (1)

Here, X and x are the body pose representations in the original resp. LLE-reduced spaces, Φ_p is a vector of kernel functions, and W_p is a sparse matrix of weights, which are learned with a Relevance Vector Machine (RVM). We use Gaussian kernel functions, computed at the training data locations.
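As an illustration of how such a regressor is evaluated, the sketch below computes eq. (1) for an already trained model; the RVM training itself is not shown, and the kernel centres, weight matrix and kernel width are placeholders.

```python
# A sketch of evaluating the kernel regressor of eq. (1), X = W_p Phi_p(x),
# with Gaussian kernels centred on the retained (relevance) training points.
import numpy as np

def gaussian_kernel_vector(x, centres, width):
    """Phi_p(x): one Gaussian kernel value per centre (plus a bias term)."""
    d2 = np.sum((centres - x) ** 2, axis=1)
    return np.concatenate(([1.0], np.exp(-d2 / (2.0 * width ** 2))))

def reconstruct_pose(x_low, W_p, centres, width=1.0):
    """Map a low-dimensional pose x_low back to the original pose space."""
    return W_p @ gaussian_kernel_vector(x_low, centres, width)

# Example with made-up shapes: 4-d LLE space, 60-d pose space, 30 centres.
centres = np.random.randn(30, 4)
W_p = np.random.randn(60, 31)               # 30 kernels + bias column
X = reconstruct_pose(np.zeros(4), W_p, centres)
print(X.shape)                               # (60,)
```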
The training examples form a periodic twisted 'ring' in LLE space, with a curvature that varies with the phase within the periodic movement. A linear dynamical model, as often used in tracking applications, is not suitable to predict future poses on this curved manifold. We view the nonlinear dynamics as a regression problem, and model it using another RVM regressor, yielding the following dynamic prior,

p_d(x_t | x_{t-1}) = N(x_t; x_{t-1} + f_d(x_{t-1}) ΔT, Σ_d),    (2)
where f_d(x_{t-1}) = W_d Φ_d(x_{t-1}) is the nonlinear mapping from poses to local velocities in LLE pose space, ΔT is the time interval between the subsequent discrete timesteps t-1 and t, and Σ_d is the variance of the prediction errors of the mapping, computed on a hold-out data set that was not used for the estimation of the mapping itself.

Not all body poses that can be expressed using the LLE pose parameterisation correspond to valid body configurations that can be reached with a human body. The motion model described so far only includes information about the temporal evolution of the pose, but no information about how likely a certain body pose is to occur in general. In other words, it does not yet provide any means to restrict our tracking to feasible body poses. Worse, the learned regressors can produce erroneous outputs when they are applied to unfeasible input poses, since the extrapolation capabilities of kernel regressors to regions without any training data are limited. The additional prior knowledge about feasible body poses is introduced as a static prior that is modelled with a Gaussian Mixture Model (GMM).

p_s(x) = Σ_{c=1}^{C} p_c N(x; μ_c, Σ_c),    (3)

with C the number of mixture components. We obtain the following formulation for the temporal prior by combination with the dynamic prior p_d(x_t | x_{t-1}):

p(x_t | x_{t-1}) ∝ p_d(x_t | x_{t-1}) p_s(x_t)    (4)
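To make the combination of eqs. (2)-(4) concrete, the sketch below evaluates the (unnormalised) temporal prior for a pose hypothesis; the velocity regressor f_d, Σ_d and the GMM parameters are assumed to be already learned, and the helper names are illustrative, not taken from the paper.

```python
# A sketch of evaluating the combined temporal prior of eq. (4): the dynamic
# term p_d of eq. (2) uses a learned velocity regressor f_d, and the static
# term p_s of eq. (3) is a Gaussian mixture over feasible poses.
import numpy as np

def gaussian_pdf(x, mean, cov):
    d = x.shape[0]
    diff = x - mean
    norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

def dynamic_prior(x_t, x_prev, f_d, Sigma_d, dT=1.0 / 30):
    """p_d(x_t | x_{t-1}) from eq. (2)."""
    mean = x_prev + f_d(x_prev) * dT
    return gaussian_pdf(x_t, mean, Sigma_d)

def static_prior(x, weights, means, covs):
    """p_s(x) from eq. (3)."""
    return sum(w * gaussian_pdf(x, m, c)
               for w, m, c in zip(weights, means, covs))

def temporal_prior(x_t, x_prev, f_d, Sigma_d, gmm):
    """Unnormalised p(x_t | x_{t-1}) of eq. (4)."""
    return dynamic_prior(x_t, x_prev, f_d, Sigma_d) * static_prior(x_t, *gmm)
```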
2.2 Likelihood Model
The representation of the subject's image appearance is based on a rough figure-ground segmentation. Under realistic imaging conditions, it is not possible to get a clean silhouette, therefore the image descriptor has to be robust to noisy segmentations to a certain degree. In order to obtain a compact representation of the appearance of a person, we apply Binary PCA [16] to the binary foreground images. The descriptors are
computed from the content of a bounding box around the centroid of the figure, and 10 to 20 BPCA components are kept to yield good reconstructions. The projection of a new bounding box into the BPCA subspace is done in an iterative fashion, as described in [16]. Since we model appearance in a generative top-down fashion, we also consider the inverse operation that projects the low-dimensional image descriptors y back into high dimensional pixel space and transforms it into binary images or foreground probability maps. By linearly projecting y back to the high-dimensional space using the mean μ and basis vectors V of the Binary PCA, we obtain a continuous representation Y_c that is then converted back into a binary image by looking at its signs, or into a foreground probability map via the sigmoid function σ(Y_c).

p(Y = fg | y) ∝ σ(V^T y + μ)    (5)
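A short sketch of this back-projection, eq. (5), is given below, under the assumption that the BPCA basis is stored with one row per component; the 32×32 bounding-box size is an arbitrary illustrative choice.

```python
# A sketch of eq. (5): projecting a low-dimensional appearance descriptor y
# back to the pixel grid of the bounding box and converting it into a
# foreground-probability map via the sigmoid.
import numpy as np

def foreground_probability_map(y, V, mu, box_shape=(32, 32)):
    Yc = V.T @ y + mu                        # continuous reconstruction
    prob = 1.0 / (1.0 + np.exp(-Yc))         # sigmoid -> p(pixel = fg | y)
    return prob.reshape(box_shape)

y = np.zeros(15)                             # 15 BPCA components
V = np.random.randn(15, 32 * 32)             # basis vectors (one row per component)
mu = np.zeros(32 * 32)
Seg = foreground_probability_map(y, V, mu)
print(Seg.shape)                             # (32, 32)
```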
Now we will look at how the image appearance is linked to the LLE body pose representation x. We model the generative mapping f_a from pose x to image descriptors y, which allows us to predict image appearance given pose hypotheses and fits well into generative inference algorithms such as particle filtering. In addition to the local body pose x, the appearance depends on the global body orientation ω relative to the camera, around the vertical axis. First, we map the pose x, ω into the low dimensional appearance space y,

f_a(x, ω) = W_a Φ_a(x, ω)    (6)
where the functional mapping f_a(x, ω) is approximated by a sparse kernel regressor (RVM) with weight matrix W_a and kernel functions Φ_a(x). By plugging (6) into (5), we obtain a discrete 2d probability distribution of foreground probabilities Seg(p) over the pixels p in the bounding box.

Seg(p) = p(p = fg | f_a(x, ω))    (7)
From this pdf, a likelihood measure can then be derived by comparing it to the actually observed segmented image Y_obs, also viewed as a discrete pdf Obs(p), using the Bhattacharyya similarity measure [17], which measures the affinity between distributions.

Obs(p) = p(p = fg | Y_obs)
BC(x, ω, Y_obs) = Σ_p √( Seg(p) Obs(p) )    (8)
We model the likelihood measure as a zero mean Gaussian distribution of the Bhattacharyya distance d_Bh = -ln(BC(x, ω, Y_obs)), and obtain the observation likelihood

p(Y_obs | x, ω) ∝ exp( - ln(BC(x, ω, Y_obs))² / (2 σ_BC²) )    (9)
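The following sketch turns a predicted segmentation and an observed foreground map into the likelihood of eqs. (8)-(9); normalising the two maps to discrete distributions and the value of σ_BC are assumptions made for illustration.

```python
# A sketch of eqs. (8)-(9): comparing the predicted segmentation Seg with the
# observed foreground map Obs via the Bhattacharyya coefficient and turning
# the resulting distance into an observation likelihood.
import numpy as np

def bhattacharyya_coefficient(Seg, Obs, eps=1e-12):
    # Both maps are treated as discrete distributions over the pixels.
    p = Seg.ravel() / (Seg.sum() + eps)
    q = Obs.ravel() / (Obs.sum() + eps)
    return np.sum(np.sqrt(p * q))

def observation_likelihood(Seg, Obs, sigma_BC=0.2):
    bc = max(bhattacharyya_coefficient(Seg, Obs), 1e-12)
    d_bh = -np.log(bc)                       # Bhattacharyya distance
    return np.exp(-(d_bh ** 2) / (2.0 * sigma_BC ** 2))

Seg = np.random.rand(32, 32)
Obs = np.random.rand(32, 32)
print(observation_likelihood(Seg, Obs))
```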
3 Inference
In this section we will show how the 2d image position, body orientation, and body pose of the subject are simultaneously estimated given a video sequence, by using the learned models from the previous section within the framework of particle filtering. The pose estimation as well as the image localisation can benefit from the coupling of pose
and image location. For example, the known current pose and motion pattern can help to distinguish subjects from each other and track them through occlusions. We therefore believe that tracking should happen jointly in the entire state space Θ,

Θ_t = [ω_t, u_t, v_t, w_t, h_t, x_t],    (10)
consisting of the orientation ω, the 2d bounding box parameters (position, width and height) u, v, w, h, and the body pose x. Despite the reduced number of pose dimensions, we face an inference problem in 9-dimensional space. Having a good sample proposal mechanism like our dynamical model is crucial for the Bayesian recursive sampling to run efficiently with a moderate number of samples.

For the monocular sequences we consider, the posteriors can be highly multimodal. For instance a typical walking sequence, e.g. observed from a side view, has two obvious posterior modes, shifted 180 degrees in phase, corresponding to the left resp. the right leg swinging forward. When taking the orientation of the figure into account, the situation gets even worse, and the modes are no longer well separated in state space, but can be close in both pose and orientation. Our experiments have shown that a strong dynamical model is necessary to avoid confusion between these posterior modes and reduce ambiguities. Some posterior multimodalities do however remain, since they correspond to a small number of different interpretations of the images, which are all valid and feasible motion patterns.

The precise inference algorithm is very similar to classical CONDENSATION [18], with normalisation of the weights and resampling at each time step. The prior and likelihood for our inference problem are obtained by extending (4) and (9) to the full state space Θ. In our implementation, the dynamical prior p_d(Θ_t^i | Θ_{t-1}^i) serves as the sample proposal function. It consists of the learned dynamical prior from eq. (2), and a simple motion model for the remaining state variables θ = [ω_t, u_t, v_t, w_t, h_t].

p_d(Θ_t^i | Θ_{t-1}^i) = p_d(x_t^i | x_{t-1}^i) N(θ_t^i; θ_{t-1}^i, Σ_θ)    (11)
In practice, one may want to use a standard autoregressive model for propagating θ, omitted here for notational simplicity. The static prior over likely body poses (3) and the likelihood (9) are then used for assigning weights w^i to the samples.

w_t^i ∝ p(Y_t^i | Θ_t^i) p_s(Θ_t^i) = p(Y_t^i | x_t^i, ω_t^i) p_s(x_t^i)    (12)
Here, i is the sample index, and Y_t^i is the foreground probability map contained in the sampled bounding box (u_t^i, v_t^i, w_t^i, h_t^i) of the actually observed image. Note that our choice for sample proposal and weighting functions differs from CONDENSATION in that we only use one component (p_d) of the prior (4) as a proposal function, whereas the other component (p_s) is incorporated in the weighting function.
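A condensed sketch of one filtering step over the 9-dimensional state is given below. It follows the scheme just described (resample, propagate with the learned pose dynamics plus a random walk on the bounding-box/orientation block, weight according to eq. (12)), but it is a simplified illustration rather than the authors' implementation; all callables and parameters are placeholders.

```python
# One particle-filtering step over Theta = [omega, u, v, w, h, x] (9-d state,
# pose block x in the last 4 dimensions).
import numpy as np

def particle_filter_step(particles, weights, f_d, Sigma_d_chol, sigma_theta,
                         likelihood, static_prior, dT=1.0 / 30, rng=None):
    rng = rng or np.random.default_rng()
    N = len(particles)

    # 1) Resample according to the current weights.
    idx = rng.choice(N, size=N, p=weights / weights.sum())
    particles = particles[idx]

    # 2) Propagate: learned dynamics for the pose block, Gaussian random walk
    #    for [omega, u, v, w, h] (first 5 dimensions).
    new_particles = np.empty_like(particles)
    for i, p in enumerate(particles):
        theta, x = p[:5], p[5:]
        x_new = x + f_d(x) * dT + Sigma_d_chol @ rng.standard_normal(x.size)
        theta_new = theta + sigma_theta * rng.standard_normal(theta.size)
        new_particles[i] = np.concatenate([theta_new, x_new])

    # 3) Weight: w_i proportional to p(Y | x_i, omega_i) * p_s(x_i), eq. (12).
    new_weights = np.array([likelihood(p) * static_prior(p[5:])
                            for p in new_particles])
    new_weights /= new_weights.sum()
    return new_particles, new_weights
```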
4 Experiments
We evaluated our tracking algorithm on a number of different sequences. The main goals were to show its ability to deal with noisy sequences with poor foreground segmentation, very low resolution, and varying viewpoints. Particle filtering was performed using a set of 500 samples, leading to a computation time of approx. 2-3 seconds per image frame in unoptimised Matlab code. The
Fig. 2. Circular walking sequence from [5]. The figure shows full frames (top), and cutouts with bounding box in original or segmented input images and estimated poses. Darker limbs are closer in depth.
Fig. 3. Diagonal walking sequence. Estimated bounding boxes and poses. The intensity of the stick figure limbs encodes depth; lighter limbs are further away.
sample set is initialised in the first frame as follows. Hypotheses for the 2d bounding box locations are either derived from the output of a pedestrian detector that is run on the first image, or from a simple procedure to find connected components in the
Fig. 4. An extract from a soccer game. The figure shows original and segmented images with estimated bounding boxes, and estimated 3d poses.
segmented image. Pose hypotheses xi1 are difficult to initialise, even manually, since the LLE parameterisation is not easily interpretable. Therefore, we randomly sample from the entire space of feasible poses in the reduced LLE space to generate the initial hypotheses. Thanks to the low-dimensional representation, this works well, and the sample set converges to a low number of clusters after a few time steps, as desired. The described models were trained on a database of motion sequences from 6 different subjects, walking and running at different speeds. The data was recorded using an optical motion capture system. The resulting sequences of body poses were normalised for limb lengths and used to animate a realistic computer graphics figure in order to create matching silhouettes for all training poses (see Fig. 1c). The figure was rendered from different view points, located every 10 degrees in a circle around the figure. Due to this choice of training data, our system currently assumes that the camera is in an approximately horizontal position. The training set consists of 4000 body poses in total. All the kernel regressors were trained using the Relevance Vector Machine algorithm (RVM) [8], with Gaussian Kernels. Different kernel widths were tested and compared using a crossvalidation set consisting of 50% of the training data, in order to avoid overfitting. 4 LLE dimensions were used, and 15 BPCA components. The first experiment (Fig. 2) shows tracking on a standard test sequence1 from [5], where a person walks in a circle. We segmented the images using background subtraction, yielding noisy foreground probability maps. The main challenge here is the varying viewing angle that is difficult to estimate from the noisy silhouettes. Tracking through another publicly available sequence from the HumanID corpus is shown in Figure 3. The subject walks in an angle of approx. 35 degrees to the camera plane. In addition it is viewed from a slight top-view and shows limb foreshortening due to the perspective projection. These are violations of the assumptions that are inherent in our 1
http://www.nada.kth.se/∼hedvig/data.html
Fig. 5. Traffic scene with low resolution images and noisy segmentation
choice of training data, where we used horizontal views and orthographic projection. Nevertheless the tracker performs well. Figure 4 shows an extract from a real soccer game with a running player. The sequence was obtained from www.youtube.com, therefore the resolution is low and the quality suffers from compression artefacts. We obtained a foreground segmentation by masking the color of the grass. In Figure 5 we show a real traffic scene that was recorded with a webcam of 320 × 240 pixels. The subjects are as small as 40 pixels in height. Noisy foreground segmentation was carried out by subtracting one of the frames at the beginning of the sequence. Our experiments have shown that the dynamical model is crucial for tracking through these sequences with unreliable segmentations and multimodal per-frame likelihoods.
5 Summary and Conclusion
We have proposed a learning-based approach to the estimation of 3d body pose and image bounding boxes from monocular video sequences. The relationship between body pose and image appearance is learned in a generative manner. Inference is performed with a particle filter that samples in a low-dimensional body pose representation obtained by LLE. A nonlinear dynamical model is learned from training data as well. Our experiments show that the proposed approach can track walking and running persons through video sequences of low resolution and unfavourable image quality. Future work will include several extensions of the current method. We will explicitly consider multiple activity categories and perform action recognition along with the
tracking. Also, we will investigate different image descriptors that extract the relevant image information more efficiently.
Acknowledgements
This work is supported, in part, by the EU Integrated Project DIRAC (IST-027787), the SNF project PICSEL and the SNF NCCR IM2.
References 1. Rosales, R., Sclaroff, S.: Learning body pose via specialized maps. In: NIPS (2001) 2. Thayananthan, A., Navaratnam, R., Stenger, B., Torr, P., Cipolla, R.: Multivariate relevance vector machines for tracking. In: Ninth European Conference on Computer Vision (2006) 3. Sminchisescu, C., Kanaujia, A., Li, Z., Metaxas, D.: Discriminative density propagation for 3d human motion estimation. In: CVPR (2005) 4. Agarwal, A., Triggs, B.: Monocular human motion capture with a mixture of regressors. In: CVPR. IEEE Workshop on Vision for Human-Computer Interaction, IEEE Computer Society Press, Los Alamitos (2005) 5. Sidenbladh, H., Black, M., Fleet, D.: Stochastic tracking of 3d human figures using 2d image motion. In: Vernon, D. (ed.) ECCV 2000. LNCS, vol. 1843, pp. 702–718. Springer, Heidelberg (2000) 6. Forsyth, D.A., Arikan, O., Ikemoto, L., O’Brien, J.D.R.: Computational studies of human motion: Part 1. Computer Graphics and Vision 1(2/3) (2006) 7. Moeslund, T.B., Hilton, A., Kr¨uger, V.: A survey of advances in vision-based human motion capture and analysis. Comput. Vis. Image Underst. 104(2), 90–126 (2006) 8. Tipping, M.: The relevance vector machine. In: NIPS (2000) 9. Roweis, S., Saul, L.: Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500), 2323–2326 (2000) 10. Agarwal, A., Triggs, B.: A local basis representation for estimating human pose from cluttered images. In: Narayanan, P.J., Nayar, S.K., Shum, H.-Y. (eds.) ACCV 2006. LNCS, vol. 3852, Springer, Heidelberg (2006) 11. Elgammal, A., Lee, C.S.: Inferring 3d body pose from silhouettes using activity manifold learning. In: CVPR (2004) 12. Lim, H., Camps, O.I., Sznaier, M., Morariu, V.I.: Dynamic appearance modeling for human tracking. In: Conference on Computer Vision and Pattern Recognition, pp. 751–757 (2006) 13. Wang, J.M., Fleet, D.J., Hertzmann, A.: Gaussian process dynamical models. Advances in Neural Information Processing Systems 18, 1441–1448 (2006) 14. Sminchisescu, C., Jepson, A.: Generative modeling for continuous non-linearly embedded visual inference. In: ICML. International Conference on Machine Learning (2004) 15. Li, R., Yang, M.H., Sclaroff, S., Tian, T.P.: Monocular tracking of 3d human motion with a coordinated mixture of factor analyzers. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 137–150. Springer, Heidelberg (2006) 16. Zivkovic, Z., Verbeek, J.: Transformation invariant component analysis for binary images. In: CVPR, vol. 1, pp. 254–259 (2006) 17. Bhattacharyya, A.: On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math Soc. (1943) 18. Isard, M., Blake, A.: Condensation - conditional density propagation for visual tracking. Int. J. Computer Vision (1998)
Human Pose Estimation from Volume Data and Topological Graph Database Hidenori Tanaka1 , Atsushi Nakazawa2, and Haruo Takemura2 1
Graduate School of Information Science and Technology, Osaka University 1-32 Machikaneyama, Toyonaka-shi, Osaka, 560-0043 Japan
[email protected] 2 Cybermedia Center, Osaka University 1-32 Machikaneyama, Toyonaka-shi, Osaka, 560-0043 Japan {nakazawa,takemura}@cmc.osaka-u.ac.jp
Abstract. This paper proposes a novel volume-based motion capture method using a bottom-up analysis of volume data and an example topology database of the human body. By using a two-step graph matching algorithm with many example topological graphs corresponding to postures that a human body can take, the proposed method does not require any initial parameters or iterative convergence processes, and it can solve the changing topology problem of the human body. First, three-dimensional curved lines (skeleton) are extracted from the captured volume data using the thinning process. The skeleton is then converted into an attributed graph. By using a graph matching algorithm with a large amount of example data, we can identify the body parts from each curved line in the skeleton. The proposed method is evaluated using several video sequences of a single person and multiple people, and we can confirm the validity of our approach.
1 Introduction
Motion capture is widely used in the research fields of computer graphics, human-computer interaction, medical applications, and robotics. However, in conventional motion capture systems, the actors must attach markers or special devices to their bodies. To solve this problem, markerless motion capture methods have been thoroughly studied by computer vision researchers [1,2]. Many studies have introduced articulated human body models and solve the problem through error minimization frameworks. These studies use different types of three-dimensional primitives to express body parts, for example cylinders and ellipsoids [3], colored blobs [4], and so on. Kehl et al. [5] have introduced a dense surface model, where joint parameter estimation is carried out through error minimization between model features and input image features, such as silhouettes, optical flows, and contours. These top-down approaches pose some difficulties with regard to the computation time for the convergence process, initial parameter estimation, and recovery from tracking error. In reality, the initial body parameters are provided manually
in all methods, or users are required to form specific poses at the start frame. Moreover, the estimation is performed by the tracking framework; this implies that the estimation at one frame depends on the results of previous frames. Therefore, the tracking cannot be continued once an error has occurred. To avoid these problems, some methods employ a bottom-up approach. Cheung et al. [6] proposed an algorithm to segment volume data into articulated rigid parts and acquire human kinematics. Other methods directly analyze the input volume data and obtain the medial axes, termed “skeleton”, of the 3D shape. Then, joint positions are estimated from the skeletons. The analysis of each frame is performed by a bottom-up process and is independent of the other frames; this can avoid the problems that occur in the top-down approach. To segment volume data and obtain the skeleton, Chu et al. [7] have used Isomaps. Another approach used by Sundaresan et al. [8] uses the Laplacian eigenspace and fitting of spline curves. During human movement, the topology of the body can change significantly, for example when both hands touch each other or touch the body. This changing topology should be taken into consideration to ensure robust estimation. However, previous studies have only considered a limited number (one or two) of topologies. Our method can be categorized as one of the latter bottom-up methods. Skeletons are extracted from captured volume data and the joint positions are estimated frame-by-frame. In order to consider many topologies in the same manner, example skeletons and a matching framework are used. These steps are explained in the following sections.
2 Proposed Method
Figure 1 shows an overview of our method. First, we capture the time series of the volume data using a visual hull based method (2.1). Then, we apply the volume thinning process and obtain 3D curved lines (2.2). Next, we identify the body parts formed by the curves using the model graph database (2.3). Finally, the joint positions are estimated using the body part information along with time-series curvature analysis of the skeletons (2.4). In contrast to previous methods, our method has the following advantages:
– By adopting a bottom-up approach, we can avoid the difficulty of parameter initialization and having to recover from tracking failure.
– The computational cost of our process is less than that of model-based approaches or subspace projections because we do not use any iterative convergence algorithms.
– The changing topology problem of the human body is solved by developing many possible examples of the human body with a graph matching and decision tree framework. This approach is very general and there are few heuristic rules. We can easily expand this method for identifying many varieties of topologies, noisy data, multiple person tracking, and other articulated objects.
Fig. 1. Algorithm Overview
Fig. 2. Captured volume data and their skeletons
2.1 Volume Reconstruction
We use a voxel-based visual hull algorithm for capturing time-series volume data of the human body [9]. The actor's image regions are detected by using background subtraction. In this step, the CIELab color space is used to remove the shadows on the floor. Here, the weight used for the L value is smaller than that used for a and b. Finally, a visual hull algorithm is applied to reconstruct the target volume data (figure 2).
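A compact sketch of silhouette-based voxel carving is shown below, assuming calibrated 3×4 projection matrices and binary silhouette images are available for all cameras; the voxel-grid layout and the function names are illustrative, not the authors' implementation.

```python
# A sketch of voxel-based visual hull reconstruction: a voxel is kept only if
# it projects inside the silhouette of every camera.
import numpy as np

def visual_hull(voxel_centers, projections, silhouettes):
    """voxel_centers: (N,3) points; projections: list of 3x4 matrices;
    silhouettes: list of binary HxW images. Returns an (N,) occupancy mask."""
    occupied = np.ones(len(voxel_centers), dtype=bool)
    homog = np.hstack([voxel_centers, np.ones((len(voxel_centers), 1))])
    for P, sil in zip(projections, silhouettes):
        uvw = homog @ P.T                    # project all voxels at once
        u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
        v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
        inside = (u >= 0) & (u < sil.shape[1]) & (v >= 0) & (v < sil.shape[0])
        fg = np.zeros(len(voxel_centers), dtype=bool)
        fg[inside] = sil[v[inside], u[inside]] > 0
        occupied &= fg                       # carve away voxels outside this silhouette
    return occupied
```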
2.2 Three Dimensional Volume Thinning
We apply a 3D volume thinning process to the captured data and obtain a skeleton of the human body. Though there are many algorithms to extract skeleton from volume data [10,11], we use Saitoh’s algorithm [10] because it is fast and simple. First, we calculate the depth value (minimum distance between a voxel and the original surface) for each voxel. Then, the surface voxels are sequentially removed beginning from smaller depth voxels. This process is continued until the line width becomes one voxel while preserving the original topology. The resulting skeleton consists of a set of 3D curved lines whose topology is the same as that of the original volume data, passing through the middle of the original shape (figure 2). In this process, some unnecessary short lines are produced. To solve this problem, we first apply a thresholding technique for their removal. In order to remove longer noise lines, we add example graphs into the model graph database as described below.
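The sequential thinning algorithm of [10] is not reproduced here; the sketch below only shows the depth-value computation that drives it (the Euclidean distance from each interior voxel to the nearest surface), using SciPy's distance transform on a binary occupancy volume. The volume contents are placeholders.

```python
# Depth values for the thinning step: the Euclidean distance from each
# foreground voxel to the nearest background voxel.
import numpy as np
from scipy.ndimage import distance_transform_edt

volume = np.zeros((64, 64, 64), dtype=bool)
volume[20:44, 20:44, 10:54] = True          # placeholder occupancy volume

depth = distance_transform_edt(volume)      # 0 outside, grows towards the medial axis
print(depth.max())                          # deepest (most interior) voxel
```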
Fig. 3. Variations in the topology of the human body. (a,b): Two arms are connected at the same position on the body, (c,d): Loop structure, (e,f,g): Some body parts merge into each other.
2.3 Identification of Body Parts Using Model Graph Database
In this step, we identify the body parts formed by curves in the skeleton. Here, we must consider the variation in the topologies of the human body. In this section, we first describe the topology of the human body and then describe a method to identify the topology and body parts based on the input skeleton. The topologies of the skeletons may differ from the normal topology of the human (figure 3-a) in the following ways: 1. arms and feet may not be connected at the same position (figure 3-b), 2. multiple body parts contact with each other (figure 3- c,d), 3. multiple body parts merge into each other (figure 3-e,f). In these cases, we apply a split-and-merge algorithm for some of the curves to determine the joint positions. In addition, multiple skeletons, while having the same topology, may differ in the way body parts touch the body (touch pattern) such as figure 3-b and g. So, we must consider not only the topologies, but also other features of each curved line for proper identification. To solve this problem, we introduce an example based approach (figure 4). We develop several example skeletons with different topologies in advance. They are converted into attributed graphs, and body part IDs are manually assigned to their nodes which are then stored in the model graph database (MGDB). The input skeleton is also converted into an attributed graph and then matched with the example graphs. Based on the node-to-node correspondence, we can identify body parts from the input graph. Skeletons are represented as attributed graphs; the line segments become nodes and the intersection points become links (figure 5). As attribute values for the node, we use the length of the curve, volume of the original 3D volume data, and variance of the depth value among the point in the curve. The length attribute value is the length of the line segment normalized over the whole skeleton. The volume is obtained by summing the squares of the depth values (described in Section 2.2) of line voxels. Normalization is also applied to the volume, resulting in the final volume attribute value. Example graphs are obtained from test sequences and stored into the MGDB. IDs and attribute values are assigned to the nodes in the graph. These graph representations of skeletons are invariant with human posture (joint angles). Thus, we only need to develop a limited number of examples, despite the high number of possible human postures.
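As an illustration of the node attributes described above, the sketch below computes the normalised length, the volume (sum of squared depth values) and the depth variance for one skeleton line segment; the normalisation constants (whole-skeleton totals) are bookkeeping assumptions, not taken from the paper.

```python
# Node attributes for one curved line of the skeleton.
import numpy as np

def node_attributes(segment_depths, total_length, total_volume):
    """segment_depths: depth value of every voxel along one curved line."""
    d = np.asarray(segment_depths, dtype=float)
    length = len(d) / float(total_length)          # length normalised over the skeleton
    volume = np.sum(d ** 2) / float(total_volume)  # normalised volume attribute
    variance = np.var(d)                           # variance of the depth values
    return {"length": length, "volume": volume, "variance": variance}

print(node_attributes([2, 3, 3, 4, 5], total_length=100, total_volume=500.0))
```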
Fig. 4. Left: our body part identification algorithm that uses the MGDB. Right: learning of the decision tree filter.
Two step graph matching algorithm. Graph matching consists of two steps and is performed on the input graph with the example graphs in the MGDB (figure 4-left). First, the topology of the input graph is identified using Messmer’s method [12], which is a combination of graph subdivision and subgraph matching. This algorithm increases the speed of matching by using a network that contains hierarchical relations between subgraphs of the model graphs. However, this matching may produce multiple results and node-to-node correspondences because it compares only the graph topologies. Therefore, we employ decision tree based filters to narrow these down to one correct matching. The Decision Tree. To obtain the correct matching result and node-to-node correspondence, C4.5 decision trees [13] are made for each topology and touch pattern (figure 4-right). During the learning stage of the MGDB, we manually label the nodes in the example graphs and acquire feature vectors using the length and volume of each node. These feature vectors are collected from example graphs with the same touch pattern and used as positive samples (figure 6). We also prepare feature vectors for error patterns, such as the head and a hand being swapped and different touch patterns, and used them as negative samples. Then, C4.5 decision trees are constructed using these samples. During the matching step, several feature vectors are generated from one input graph according to the topology-matching results. After that, they are evaluated by the decision trees of the touch patterns and we can identify the body parts of the input nodes.
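The sketch below illustrates the decision-tree filtering stage, with scikit-learn's CART trees standing in for the C4.5 trees of [13]; the feature layout, tree depth and sample counts are illustrative assumptions rather than the authors' settings.

```python
# Decision-tree filter for one topology / touch pattern: trained on feature
# vectors from correctly labelled example graphs (positive samples) and from
# permuted/incorrect labellings (negative samples); at matching time it keeps
# only the candidate node-to-node correspondences classified as correct.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Each row stacks the (length, volume, variance) attributes of the 7 nodes in
# the body-part order implied by one candidate correspondence.
X_train = np.random.rand(40, 7 * 3)          # placeholder feature vectors
y_train = np.r_[np.ones(20), np.zeros(20)]   # 1 = correct labelling, 0 = error

tree = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)

def filter_candidates(candidate_feature_vectors):
    """Return indices of candidate correspondences accepted by the tree."""
    pred = tree.predict(np.asarray(candidate_feature_vectors))
    return np.nonzero(pred == 1)[0]

print(filter_candidates(np.random.rand(5, 21)))
```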
Fig. 5. Graph representation of skeleton and attribute values
Fig. 6. Making feature vectors from an example graph. Positive samples are enclosed within blue frames.
Fig. 7. Split and merge algorithm in joint estimation. Left: If there are several line segments identified as the body, they are merged together. Right: The loop structure is also removed by the same rule.
2.4 Estimation of Joint Positions
In this step, we estimate the joint positions from the curved lines that are used to identify the body parts. First, we introduce the split and merge algorithm using body part IDs. If a body part is split by connecting points (figure 7-left), they are removed and the line segments of that body part are merged together. If there is a loop structure (figure 7-right), the same rule is applied and we can obtain a one-to-one relation between the body parts and the curved lines. Next, we perform a time-series curvature analysis and determine the joint positions in each curved line. Our algorithm consists of the following steps:
1. The lengths of the curved lines are normalized, and their curvature at each frame is calculated.
2. The curvatures of each body portion are summed up for all frames.
3. The local maxima in the summed curvature graph are determined and considered as joint positions relative to the body part's length. Here, we use the same number of maxima as the number of joints given by the body part attribute value.
Fig. 8. Result of body part identification with subject 1. Top: Input image. Bottom: Skeleton color coded by body part.
4. The absolute joint positions are determined for all the frames according to the relative joint positions in the curved line.
Through these steps, all joints will be determined if they are "bent" at least once within all the video frames. By using the split and merge algorithm and the time-series curvature analysis based on the results of body part identification, we can obtain consistent kinematics of the human body in every frame.
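A sketch of this time-series curvature analysis is given below; the curvature estimate is a simple turning-angle approximation and the peak selection uses SciPy's find_peaks, neither of which is necessarily what the authors implemented.

```python
# Summed curvature over all frames and local-maxima detection for one body part.
import numpy as np
from scipy.signal import find_peaks

def discrete_curvature(points):
    """Turning angle at each interior sample of a 3d polyline (N x 3)."""
    v1 = points[1:-1] - points[:-2]
    v2 = points[2:] - points[1:-1]
    cosang = np.sum(v1 * v2, axis=1) / (
        np.linalg.norm(v1, axis=1) * np.linalg.norm(v2, axis=1) + 1e-12)
    return np.arccos(np.clip(cosang, -1.0, 1.0))

def relative_joint_positions(lines_per_frame, n_joints, n_samples=100):
    """lines_per_frame: list of (n_samples x 3) resampled curves, one per frame."""
    summed = np.zeros(n_samples - 2)
    for pts in lines_per_frame:
        summed += discrete_curvature(pts)
    peaks, props = find_peaks(summed, height=0)
    # keep the n_joints strongest local maxima, as relative positions in [0,1]
    strongest = peaks[np.argsort(props["peak_heights"])[-n_joints:]]
    return np.sort(strongest + 1) / float(n_samples)

frames = [np.cumsum(np.random.rand(100, 3), axis=0) for _ in range(5)]
print(relative_joint_positions(frames, n_joints=3))
```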
3 Experimental Results
We tested our algorithm on 3 subjects (subjects 1, 2, and 3) individually and two subjects (subjects 1 and 4) together. We set up a studio that contained 8 cameras and blue screens. Each camera was connected to a PC and could synchronously capture XGA images at 30 fps. Volume reconstruction was performed at a resolution of 2 cm, and the following processes were done on a single PC. We prepared the MGDB with 190 example graphs (11 topologies) selected from the reconstruction results of several videos and 196 test graphs from these same videos. The body part IDs were manually assigned to the nodes. The results of body part identification with subject 1 are shown in figure 8. Captured images are shown in the top row and the colors of the curves in the bottom row show the identified body parts. The results confirm that our method correctly identifies body parts even for loops and mergers. We can also confirm that this method works well even when unnecessary curves exist, thanks to the prepared example graphs. Figure 9 shows the results of subjects 2 and 3 along with that of subjects 1 and 4 together in the scene. The experimental result also shows that our method correctly identifies the body parts of two people even when their body parts touch each other. 170 graphs out of 196 test graphs were assigned topology and body part IDs matching the manual assignments. The others were assigned incorrect touch patterns. Figure 10 is a curvature graph of four body parts summed up over 291 frames from one video. Local maxima in the curved lines of arms become wrists (relative
Fig. 9. Result of body part identification with subjects 2 and 3, and with 2 subjects together. Top: Input image. Bottom: Color coded skeleton.
Fig. 10. Sum of the curvature of 291 skeletons
position from 9% to 11%), elbows (40% to 43%), and shoulders (84% to 85%), and those found in one of the legs become ankles (7% to 8%), knees (51% to 52%), and thighs (90% to 94%). The estimated joint positions are shown in figure 11. With regard to processing time, it took 5.75 seconds for volume reconstruction, 480 ms for thinning, 94 ms for graph conversion of the skeleton, and 4.2 ms for graph matching and decision tree filtering. Joint estimation needs 40 ms per frame.

3.1 Discussion
The experimental results show that our method can successfully identify body parts even when the topology of the body shape changes in each consecutive frame. Since the algorithm performs the estimation independently for each frame, the estimation in one frame is not affected by errors in previous frames. The system can identify some unnecessary curved lines in the skeletons as noise and ignore them during body part identification. Analyzing the failure cases, we found that sometimes the upper arm is partially merged to the body. Our touch pattern classes did not contain such a
Fig. 11. Result of the pose estimation
pattern; our sample patterns had the upper arm as either completely separate or completely merged with the body. Our example graph-based approach is very easy to expand. As shown in the experiment, we can easily apply our method to multiple people simply by developing new example graphs that express the topologies of multiple human bodies. The time-series curvature analysis effectively obtains consistent kinematics of the human body. Hence, we can obtain joint position data in a manner similar to that of conventional motion capture systems. Our method requires a total processing time of 6.4 s per frame. However, 90% of the total processing time takes place during reconstruction, while the volume thinning process and joint position estimation only require less than 0.7s. Thus, we are attempting to develop faster volume reconstruction and thinning methods to reach real-time estimation. In general, our example based approach is very useful because of its generality and fast processing.
4 Conclusion and Future Study
In this paper, we have proposed a novel markerless motion capture method using volume data. Our method employs a bottom-up analysis of the volume data; therefore, we do not require any initial parameters of the human body, and fast estimation can be achieved by using a 3D volume thinning process. Additionally, we have solved the changing topology problem of the body part identification process using an example-based approach. The experimental results indicate that our method can successfully identify different topologies arising from a single person as well as from multiple people, and can correctly estimate joint positions using time-series curvature analysis. Thus far, it has been difficult to distinguish between left and right body parts because our graph representations do not contain geometric information. To solve this problem, we are currently attempting to incorporate some geometrical attributes into our graph representations.
Acknowledgement
This work is supported by the Ministry of Education, Culture, Sports, Science and Technology under the “Development of fundamental software technologies for digital archives” project.
References 1. Gavrila, D.M.: The visual analysis of human movement: A survey. CVIU 73(1), 82–98 (1999) 2. Moeslund, T.B., Granum, E.: A survey of computer vision-based human motion capture. CVIU 81(3), 231–268 (2001) 3. Miki´c, I., Trivedi, M., Hunter, E., Cosman, P.: Human body model acquisition and tracking using voxel data. International Journal of Computer Vision 53(3), 199–223 (2003) 4. Caillette, F., Howard, T.: Real-Time Markerless Human Body Tracking with MultiView 3-D Voxel Reconstruction. In: Proc. of British Machine Vision Conference, vol. 2, pp. 597–606 (2004) 5. Kehl, M.B.R., Gool, L.V.: Full body tracking from multiple views using stochastic sampling. In: Proc. of CVPR 2005, vol. 2, pp. 129–136 (2005) 6. Cheung, G., Baker, S., Kanade, T.: Shape-from-silhouette of articulated objects and its use for human body kinematics estimation and motion capture. In: Proc. of CVPR 2003, vol. 1, pp. 77–84 (2003) 7. Chu, C.W., Jenkins, O.C., Matari´c, M.J.: Markerless kinematic model and motion capture from volume sequences. Proc. of CVPR 2003 2, 475–482 (2003) 8. Sundaresan, A., Chellappa, R.: Segmentation and probabilistic registration of articulated body models. In: Proc. of ICPR 2006, vol. 2, pp. 92–96 (2006) 9. Laurentini, A.: The visual hull concept for silhouette-based image understanding. IEEE Trans. on PAMI 16(2), 150–162 (1994) 10. Saitoh, T., Mori, K., Toriwaki, J.: A sequential thinning algorithm for three dimensional digital pictures using the euclidean distance transformation and its properties. The transactions of the Institute of Electronics, Information and Communication Engineers. D-II J79-D-II (10), 1675–1685 (1996) 11. Brostow, G.J., Essa, I., Steedly, D., Kwatra, V.: Novel skeletal representation for articulated creatures. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3023, pp. 66–78. Springer, Heidelberg (2004) 12. Messmer, B.T., Bunke, H.: A new algorithm for error-tolerant subgraph isomorphism detection. IEEE Trans. on PAMI 20, 493–504 (1998) 13. Quinlan, R.J.: Programs for Machine Learning. Morgan kaufmann Publishers, San Francisco (1993)
Logical DP Matching for Detecting Similar Subsequence Seiichi Uchida1 , Akihiro Mori2 , Ryo Kurazume1 , Rin-ichiro Taniguchi1 , and Tsutomu Hasegawa1 1
2
Kyushu University, Fukuoka, 819-0395, Japan
[email protected] http://human.is.kyushu-u.ac.jp/ Toshiba Corporation, Tokyo, 198-8710, Japan
Abstract. A logical dynamic programming (DP) matching algorithm is proposed for extracting similar subpatterns from two sequential patterns. In the proposed algorithm, local similarity between two patterns is measured by a logical function, called support. The DP matching with the support can extract all similar subpatterns simultaneously while compensating nonlinear fluctuation. The performance of the proposed algorithm was evaluated qualitatively and quantitatively via an experiment of extracting motion primitives, i.e., common subpatterns in gesture patterns of different classes.
1 Introduction
Detection of similar subsequences between two sequential patterns is one of the fundamental problems for sequential pattern processing. Especially, this problem is very important when extracting frequent subsequences. For example, frequent subsequences of gesture patterns, called motion primitives, are widely used for analyzing human activities. In addition, frequent subsequences of biological sequences (such as DNA sequences and protein sequences), called motifs, also play quite an important role in genome science [1,2]. In this paper, a new algorithm, called logical dynamic programming (DP) matching, is proposed for detecting similar subsequences. The proposed algorithm employs a logical function s(i, j), called support, to evaluate similarity between the ith frame of A = a1, ..., ai, ..., aI and the jth frame of B = b1, ..., bj, ..., bJ. Specifically, if a pair of frames ai and bj are similar, s(i, j) = true. Otherwise, s(i, j) = false. Similar subsequences between A and B can be determined as sets of consecutive frame pairs having true supports. The use of the support allows simultaneous detection of all similar subsequences while compensating nonlinear fluctuations in the subsequences. The remaining part of this paper is organized as follows. After reviewing related work in Section 2, the logical DP matching algorithm is described in Section 3. In addition, the logical DP matching algorithm is applied to actual gesture patterns for evaluating its performance on the detection of similar subsequences among them. In Section 4, the performance of the proposed algorithm
is further evaluated quantitatively via an experiment of extracting motion primitives from gesture patterns.
2 Related Work
The algorithms for detecting similar subsequences are classified into two groups. The first group is based on “rigid comparison” between two sequential patterns. In the algorithms of this group, sequential patterns are firstly described by rough representations, such as sequences of a limited number of symbols, by clustering or other quantization techniques, and then identical subsequences are detected. The rough representation is necessary for compensating fluctuations between two sequences. Tanaka et al. [3] transformed a gesture sequence into a symbol sequence for extracting motion primitives under an MDL criterion. A similar algorithm can be found in [4]. A problem of this group is that the rough representation will lose exact boundaries of similar subsequences. In fact, a refinement step is introduced in [4] for finding exact boundaries. The second group is based on “elastic comparison”, where two sequential patterns are compared under an optimized nonlinear matching to compensate fluctuations. The DP matching algorithm, or dynamic time warping (DTW) algorithm, is probably the most popular algorithm in this group. DP matching possesses several useful properties, such as (i) the global optimality of its matching result, (ii) computational feasibility, (iii) high versatility, etc., and thus has been used in various sequential pattern processing tasks, such as speech recognition [5], character recognition [6], and genome science [1,2]. The difference between the proposed and the conventional DP matching algorithms is the definition of frame (dis)similarity. Specifically, the proposed algorithm employs the logical support function s(i, j), whereas the conventional algorithm employs the Euclidean distance between feature vectors of the ith and jth frames. This modification brings the following novel abilities to the DP matching:
– The proposed algorithm can detect strictly similar subsequences. This means that all the paired frames in the similar subsequences are similar. In contrast, similar subsequences by the conventional DP matching algorithms are averagely similar; that is, several paired frames can be dissimilar if other paired frames are very similar. (This is because the conventional algorithm is based on accumulated Euclidean distance.)
– The proposed algorithm can detect all similar subsequences simultaneously.
3 Detecting Similar Subsequence by Logical DP Matching
This section describes the logical DP matching algorithm and its application to the simultaneous detection of similar subsequences. The performance of the algorithm is also discussed via a detection experiment.
Fig. 1. Similar subsequence detection using support s(i, j)
3.1 Logical DP Matching
Support and Supported Path. Each sequential pattern can be represented as a trajectory in a feature space, as shown in Fig. 1. Similar subsequences between two sequential patterns will be observed as regions where two trajectories are close to each other. This closeness is evaluated by the following logical function, called support:

s(i, j) = true   if d(i, j) < θ1
          false  otherwise,    (1)

where θ1 is a positive constant and d(i, j) is a distance between the ith and the jth frames. In the following, we will use the Euclidean distance d(i, j) = ‖ai − bj‖, unless otherwise noted. If two trajectories are close at the ith and the jth frames (namely, if ai and bj are similar), the support s(i, j) = true. Otherwise, s(i, j) = false. The dashed lines in Fig. 1 indicate frame pairs where the two trajectories are close.

Consider the i-j plane of Fig. 2 and assume that the support s(i, j) has been already calculated at each (i, j) node according to (1). Since similar subsequences between A and B can be determined as sets of consecutive frame pairs having true supports, the problem of finding similar subsequences is treated as the problem of finding paths connecting nodes with true supports. Hereafter, this path is called a supported path. A supported path from (is, js) to (ie, je) indicates that subsequences ais, ..., aie and bjs, ..., bje are similar throughout, that is, strictly similar.

DP Recursion. All the supported paths can be found efficiently and simultaneously by a DP-based algorithm. In the algorithm, the following equation, the so-called DP recursion, is calculated at each j from i = 1 to I:

g(i, j) = s(i, j) ∧ ⋁_{k=1}^{3} g_k    (2)

where
g1 = g(i − 1, j − 1)
g2 = s(i, j − 1) ∧ g(i − 1, j − 2)
g3 = s(i − 1, j) ∧ g(i − 2, j − 1)
Fig. 2. Representation of similar subsequences by supported paths. Here, “1” stands for true and “0” for false.
Fig. 3. DP recursion and starting/ending node
and ∨ and ∧ denote logical OR and AND operations, respectively. The function g(i, j) is a logical function taking true or false and indicates whether there is at least one supported path arriving at the node (i, j) or not. The nodes (a), (b), and (c) of Fig. 3 correspond to the nodes with g1, g2, and g3, respectively. The above recursion represents the attempt to extend the supported paths arriving at the nodes (a), (b), and (c) by connecting them to the node (i, j). The starting and the ending nodes of each supported path are also detected while calculating the DP recursion (2). The node (i, j) is marked as a starting node iff g1 = g2 = g3 = false and s(i, j) = true. Similarly, the node (i, j) is marked as an ending node iff g(i, j) = true and s(i + 1, j + 1) = false1. Backtracking from every ending node provides all the supported paths simultaneously. The supported paths shorter than θ2 are eliminated as noise. As shown in Fig. 2, supported paths often branch and share the same starting/ending node. In the following experiment, those branched supported paths are shrunk into a non-branched one by removing less reliable paths. The reliability of each branch is evaluated by the accumulated distance of d(i, j) along the branch. A branch with a higher accumulated distance is considered as a less reliable branch.
Any supported path passing (i, j) will also pass (i + 1, j + 1). Thus, if s(i + 1, j + 1) = false, no path can be extended from (i, j) and therefore (i, j) is an ending node.
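The sketch below implements a simplified version of this detection: the support matrix of eq. (1), starting nodes as supported nodes without a supported predecessor, a diagonal walk along the support to trace each path, and the length threshold θ2. The time-warping moves g2/g3 of eq. (2) and the pruning of branches by accumulated distance are omitted for brevity, so this is an illustration of the idea rather than the full algorithm.

```python
# Simplified similar-subsequence detection based on the support of eq. (1).
import numpy as np

def support_matrix(A, B, theta1):
    """A: (I,d) frames, B: (J,d) frames; boolean support s(i, j) of eq. (1)."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return d < theta1

def detect_similar_subsequences(A, B, theta1, theta2):
    s = support_matrix(A, B, theta1)
    I, J = s.shape

    def sup(i, j):
        return 0 <= i < I and 0 <= j < J and bool(s[i, j])

    def is_start(i, j):
        # no supported predecessor via the moves of eq. (2)
        return s[i, j] and not (sup(i - 1, j - 1)
                                or (sup(i, j - 1) and sup(i - 1, j - 2))
                                or (sup(i - 1, j) and sup(i - 2, j - 1)))

    paths = []
    for i in range(I):
        for j in range(J):
            if is_start(i, j):
                path = [(i, j)]
                while sup(path[-1][0] + 1, path[-1][1] + 1):
                    path.append((path[-1][0] + 1, path[-1][1] + 1))
                if len(path) >= theta2:          # prune short paths (theta2)
                    paths.append(path)           # one pair of similar subsequences
    return paths

A = np.random.rand(90, 6)                        # e.g. 6-d hand-position features
B = np.random.rand(80, 6)
print(len(detect_similar_subsequences(A, B, theta1=0.5, theta2=8)))
```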
Fig. 4. Snapshot of assumed gestures (hurrah, bye, stop, raise hand, point, circle, shrug)
After this removal, only one supported path will start from each starting node and end in each ending node. The computational complexity of the above algorithm is O(IJ). This is the same complexity as the conventional DP matching algorithms.

3.2 Experiment of Detecting Similar Subpatterns Between Gesture Patterns
Experimental Setup. The performance of the proposed algorithm has been evaluated via an experiment of detecting similar subsequences of gesture patterns. Gestures of 18 classes were used in the experiment. Fig. 4 shows snapshots of 7 of them. Each gesture was performed 6 times by one person and thus 108 gesture patterns were prepared. The average frame length was 89. A six-dimensional feature vector was used for representing the performer's posture at each frame. Specifically, the six elements of the vector were the three-dimensional positions of both hands (relative to the head position), which were obtained by stereo measurement with two IEEE1394 cameras. The two parameters θ1 and θ2 were fixed at 200 and 8, respectively. The condition θ1 = 200 implies that the support s(i, j) becomes true if the difference of two postures, i.e., ‖ai − bj‖, is less than 20 cm. The condition θ2 = 8 implies that similar subsequences whose length is less than 500 ms are eliminated as noise.

Detection Result of Similar Subsequences. Fig. 5 shows five detection results. The images at the leftmost column show d(i, j) on the i–j plane and the images at the middle column show the support s(i, j). The images at the rightmost column show the supported paths detected by the proposed algorithm. Fig. 5 (a) is the result when an identical "(left-hand) bye" pattern was used as both A and B. Since A and B were identical, a long diagonal straight supported path was obtained. The other two, more meaningful, supported paths were obtained because a left-to-right hand motion is repeated twice in "bye" and the first motion and the second motion were correctly detected as similar subsequences. Fig. 5 (b) is the result when two different "bye" patterns were used for A and B. The repeated hand motions were correctly detected, as in (a).
Logical DP Matching for Detecting Similar Subsequence distance map
90 80 70 60 50 40 30 20 10 0
"left-hand bye" no.1
"left-hand bye" no.1
0
10 20 30 40 50 60 70 80 90
"left-hand bye" no.1 100
"left-hand bye" no.2
(b)
detected similar sub-sequences
support map
"left-hand bye" no.1
(a)
633
80 60 40 20 0
"left-hand bye" no.1
"left-hand bye" no.1
0
10 20 30 40 50 60 70 80 90
(c)
"raise right-hand"
"left-hand bye" no.1
"right-hand bye"
0
"hurrah"
10 20 30 40 50 60 70 80 90
"right-hand bye" 90 80 70 60 50 40 30 20 10 0
"circle"
(d)
"right-hand bye"
45 40 35 30 25 20 15 10 5 0
"hurrah"
0
20
40
60
80
100
80
100
"hurrah" 70 60 50 40
"stop"
(e)
30 20 10 0
"hurrah"
"hurrah"
0
20
40
60
"hurrah"
Fig. 5. Detection result of similar subsequences
Fig. 5 (c) is the result when "bye" and "raise (right-)hand" were used for A and B, respectively. Those gestures share two common motions. One is the beginning motion, in which the right hand is raised to shoulder height. The other is the ending motion, in which the right hand is lowered from shoulder height. Those common motions were successfully detected as the two supported paths around the beginning and the end of the gestures. Fig. 5 (d) is the result when "hurrah" and "circle" were used for A and B, respectively. Those gestures may seem similar but, actually, do not have any common motion, as shown in Fig. 4. The proposed algorithm could avoid detecting any false positives, i.e., spurious supported paths.
Fig. 6. Statistical extension of support
Fig. 5 (e) is a failure result. Two different gestures "hurrah" and "stop" were used and their common motion ("raising both hands to shoulder height") around their beginning part was not detected. This failure result indicates that the beginning parts of those gestures fluctuate more strongly than other parts. One possible remedy to deal with this non-uniform fluctuation range is a statistical extension of d(i, j). Specifically, as illustrated in Fig. 6, if d(i, j) is evaluated according to the degree of instability, the region where s(i, j) = true will be defined adaptively to the fluctuation range and thus it will be possible to improve the detection accuracy. In fact, the above missed common motion was detected correctly in the experimental result using the Bhattacharyya distance as d(i, j). (The details of this experiment will be presented elsewhere.)
4 Application to Extraction of Motion Primitives
The logical DP algorithm can be applied to the extraction of motion primitives, which are subpatterns constituting gesture patterns. In this section, the performance of the logical DP matching was evaluated qualitatively and quantitatively via an experiment of extracting motion primitives.
4.1 Extracting Motion Primitives by Logical DP Matching
Gesture patterns are often decomposed as a sequence of motion primitives for analyzing how gesture patterns are composed and share common motions. The extraction of the motion primitives can be done by the logical DP algorithm as follows (a code sketch of the procedure is given at the end of this subsection):
1. Apply the logical DP algorithm to a pair of training gesture patterns of the assumed classes.
2. Decompose those two gesture patterns into similar subsequences and the remaining subsequences. All of those subsequences are candidates of motion primitives.
In this paper, we treat subsequences particular to only a certain gesture class as motion primitives. Thus, any gesture pattern can be decomposed into a sequence of motion primitives.
Fig. 7. Snapshot of extracted motion primitives
3. Repeat the above two steps for all pairs.
4. Unify similar subsequences if their distance is smaller than θ3. The subsequences which survive the unification are considered as motion primitives.
Any gesture pattern (from the assumed classes) will be approximately represented as a sequence of the resulting motion primitives. Most past attempts for extracting motion primitives, such as Nakazawa et al. [7] and Fod et al. [8], have assumed that motion primitives can be obtained by segmenting gesture patterns at physically particular points, such as locally minimum speed points and zero-acceleration points. In contrast, the motion primitives extracted by this procedure have two different properties. First, they are extracted without using any physically particular point; that is, they are free from any assumption (or users' prejudice) about motion primitives. Second, the extracted motion primitives totally depend on the assumed classes. In other words, if the classes of training gesture patterns change, the extracted motion primitives will also change. This property is especially useful for gesture recognition tasks, where the number of assumed gestures is often limited.
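The sketch below outlines this extraction procedure; detect_paths and dtw_distance are assumed to be supplied (e.g. the logical DP sketch of Section 3 and a conventional DP matching distance), and the handling of the remaining, class-specific subsequences is only indicated by a comment.

```python
# Pairwise logical DP matching followed by unification of candidate primitives.
from itertools import combinations

def extract_motion_primitives(gestures, detect_paths, dtw_distance, theta3):
    candidates = []
    for A, B in combinations(gestures, 2):
        for path in detect_paths(A, B):
            ia, ib = [p[0] for p in path], [p[1] for p in path]
            candidates.append(A[min(ia):max(ia) + 1])   # similar subsequence in A
            candidates.append(B[min(ib):max(ib) + 1])   # and its counterpart in B
            # (the remaining, class-specific subsequences would be added here too)

    primitives = []
    for cand in candidates:
        if all(dtw_distance(cand, p) >= theta3 for p in primitives):
            primitives.append(cand)                      # survives the unification
    return primitives
```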
Extraction Results of Motion Primitives
For each of the 18 gesture classes, one of the 6 patterns was used as a training pattern for the extraction of motion primitives. According to the above procedure, 142 subsequences were first detected and then unified into 26 motion primitives. Fig. 7 shows snapshots of several motion primitives. Note that the parameter θ3 was determined via a preliminary experiment. For evaluating the extracted motion primitives, the remaining five gesture patterns of each class were decomposed into the motion primitives. If the five gestures are decomposed into the same motion primitive sequence, the validity of the motion primitives can be shown experimentally. The decomposition was done by a recognition-based segmentation algorithm [5,9], which performs segmentation and assignment of each segment to a motion primitive in an optimization framework. Fig. 8 (a) is the decomposition result of the pattern “hurrah.” The five gestures were correctly decomposed into the same motion primitive sequence,³
³ Two subsequences may have different lengths and therefore their distance is evaluated by the conventional DP matching algorithm with Euclidean distance.
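The subsequence distance described in this footnote (conventional DP matching with a Euclidean frame distance) can be sketched as follows, assuming each subsequence is an array of per-frame feature vectors; the names are illustrative.

    import numpy as np

    def dtw_distance(seq_a, seq_b):
        # DP (DTW) matching distance between two subsequences of per-frame
        # feature vectors, using the Euclidean distance between frames.
        n, m = len(seq_a), len(seq_b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]

    # Two subsequences are unified into one motion primitive if
    # dtw_distance(s1, s2) < theta3 (threshold chosen in a preliminary experiment).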
Fig. 8. Segmentation of gesture patterns by the extracted motion primitives (motion primitive ID over time): (a) “hurrah” no. 2–6, (b) “left-hand bye” no. 2–6
0→12→1→0→12→1. (Those numbers correspond to the motion primitives shown in Fig. 7.) As shown in Fig. 8 (b), the five gestures of “(left-hand) bye” were also correctly decomposed into the same sequence, 7→22→9→22→8. The same decomposition results were obtained for 13 of the 18 classes. This stable result indicates that the motion primitives extracted by the procedure of Section 4.1 validly represent common and particular motions. Thus, this result also shows reasonable accuracy of the logical DP algorithm. The failure results, i.e., different decomposition results within the same class, were mainly due to lax unification in selecting motion primitives. For example, several subsequences of “lowering both hands from shoulder height” survived as motion primitives. This fact indicates that some gesture subsequences fluctuate more than others and thus the distance between them exceeded θ3. The statistical extension pointed out in Section 3.2 will also be useful for tackling this fluctuation problem.
5 Conclusion
A logical DP matching algorithm has been proposed for detecting similar subsequences between two sequential patterns, such as gesture patterns. The algorithm
was examined via an experiment of extracting motion primitives from a set of gesture patterns. The result of the experiment showed that the proposed algorithm could detect the similar subsequences among gesture patterns successfully and provide stable motion primitives. Future work will focus on (i) a statistical extension of d(i, j), (ii) application to sequential patterns other than gesture, and (iii) utilization of the extracted motion primitives for practical tasks.
References 1. Durbin, R., Eddy, S., Krogh, A., Mitchison, G.: Biological sequence analysis. Cambridge University Press, Cambridge (1998) 2. Mount, D.: Bioinformatics: sequence and genome analysis, 2nd edn. Cold Spring Harbor Laboratory Press (2004) 3. Tanaka, Y., Iwamoto, K., Uehara, K.: Discovery of time-series motif from multidimensional data based on MDL principle. Machine Learning 58, 269–300 (2005) 4. Zhao, T., Wang, T., Shum, H.-Y.: Learning a highly structured motion model for 3D human tracking. In: Proc. Asian Conf. Comp. Vis. pp. 144–149 (2002) 5. Ney, H., Ortmanns, S.: Progress in dynamic programming search for LVCSR. Proc. IEEE 88(8), 1224–1240 (2000) 6. Uchida, S., Sakoe, H.: A survey of elastic matching techniques for handwritten character recognition. IEICE Trans. Inf. Syst. E88-D(8), 1781–1790 (2005) 7. Nakazawa, A., Nakaoka, K., Ikeuchi, S., Yokoi, K.: Imitating human dance motions through motion structure analysis. In: Proc. Int. Conf. Intell. Robots Syst. pp. 2539–2544 (2002) 8. Fod, A., Mataric, M.J., Jenkins, O.C.: Automated derivation of primitives for movement classification. Autonomous Robots 12(1), 39–54 (2002) 9. Casey, R.G., Lecolinet, E.: A survey of methods and strategies in character segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 18(7), 690–706 (1996)
Efficient Normalized Cross Correlation Based on Adaptive Multilevel Successive Elimination Shou-Der Wei and Shang-Hong Lai Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan {greco,lai}@cs.nthu.edu.tw
Abstract. In this paper we propose an efficient normalized cross correlation (NCC) algorithm for pattern matching based on adaptive multilevel successive elimination. This successive elimination scheme is applied in conjunction with an upper bound for the cross correlation derived from Cauchy-Schwarz inequality. To apply the successive elimination, we partition the summation of cross correlation into different levels with the partition order determined by the gradient energies of the partitioned regions in the template. Thus, this adaptive multi-level successive elimination scheme can be employed to early reject most candidates to reduce the computational cost. Experimental results show the proposed algorithm is very efficient for pattern matching under different lighting conditions. Keywords: Pattern matching, normalized cross correlation, successive elimination, multi-level successive elimination, fast algorithms.
1 Introduction

Pattern matching is widely used in many applications related to computer vision and image processing, such as object tracking, object detection, pattern recognition and video compression, etc. The pattern matching problem can be formulated as follows: given a source image I and a template image T of size M × N, the pattern matching problem is to find the best match of template T in the source image I with minimum distortion or maximum correlation. The most popular similarity measures are the sum of absolute differences (SAD), the sum of squared differences (SSD) and the normalized cross correlation (NCC). For some applications, such as block motion estimation in video compression, the SAD and SSD measures have been widely used. For practical applications, a number of approximate block matching methods have been proposed [1][2][3] and some optimal block matching solutions have been proposed [4][5][6], which have the same solution as that of full search but with fewer operations by using early termination in the computation of SAD, given by

SAD(x, y) = \sum_{i=1}^{M} \sum_{j=1}^{N} |T(i, j) - C(x + i, y + j)|    (1)
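The early-termination idea mentioned above can be sketched as follows (a minimal illustration, not the exact method of [4][5][6]): the partial SAD is compared against the best value found so far and the candidate is abandoned as soon as it can no longer win.

    import numpy as np

    def sad_early_terminate(T, C_block, sad_min):
        # Accumulate |T - C| row by row and stop as soon as the partial sum
        # already exceeds the best (smallest) SAD found so far.
        sad = 0.0
        for row_t, row_c in zip(T, C_block):
            sad += np.abs(row_t.astype(np.float64) - row_c.astype(np.float64)).sum()
            if sad >= sad_min:          # this candidate can no longer win
                return None
        return sad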
In [7], a coarse-to-fine pruning algorithm with the pruning threshold determined from the lower-resolution search space was presented. This search algorithm can be proved to provide the global solution. Hel-Or and Hel-Or [8] proposed a fast template matching method based on accumulating the distortion in the Walsh-Hadamard domain in the order of the frequency of the Walsh-Hadamard basis. In general, a small number of the first few projections can capture most of the distortion energy. By using a predefined threshold, they can early reject most of the impossible candidates very efficiently. Besides the SAD and SSD, the NCC is also a popular similarity measure. The NCC measure is more robust than SAD and SSD under linear illumination changes, so the NCC measure has been widely used in object recognition and industrial inspection. The definition of NCC is given as follows:
NCC(x, y) = \frac{\sum_{i=1}^{M} \sum_{j=1}^{N} I(x + i, y + j) \cdot T(i, j)}{\sqrt{\sum_{i=1}^{M} \sum_{j=1}^{N} I(x + i, y + j)^2} \cdot \sqrt{\sum_{i=1}^{M} \sum_{j=1}^{N} T(i, j)^2}}    (2)
The traditional NCC needs to compute the numerator and denominator, which is very time-consuming. Later, the sum table scheme [11][12][13] was proposed to reduce the computation in the denominator, and some other methods have been proposed [9][10] to reduce the computation in the numerator. In this paper, we propose an efficient NCC algorithm based on an adaptive MSEA procedure. The adaptive MSEA scheme determines the elimination order based on the sum of the gradient magnitudes of the template and adapts the bound value derived from the Cauchy-Schwarz inequality to early reject the candidates. The rest of this paper is organized as follows: we first briefly review the successive elimination algorithm (SEA)-based methods [4][5] as well as the upper bound for the cross correlation derived from the Cauchy-Schwarz inequality [9][10]. Then, we present the proposed efficient NCC algorithm that performs the adaptive MSEA in Section 3. The experimental results are shown in Section 4. Finally, we conclude this paper in the last section.
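For reference, the sum-table idea used throughout the paper can be sketched as follows: two integral images give the sum and square sum of any M × N block with a handful of additions. The padding convention and names are implementation choices, not the authors' code.

    import numpy as np

    def build_sum_tables(I):
        # Pre-compute integral images of I and I**2 (one extra row/column of zeros).
        I = I.astype(np.float64)
        S = np.zeros((I.shape[0] + 1, I.shape[1] + 1))
        S2 = np.zeros_like(S)
        S[1:, 1:] = I.cumsum(0).cumsum(1)
        S2[1:, 1:] = (I ** 2).cumsum(0).cumsum(1)
        return S, S2

    def block_sum(S, x, y, M, N):
        # Sum of the M x N block whose top-left pixel is (x, y) -- 4 table lookups.
        return S[x + M, y + N] - S[x, y + N] - S[x + M, y] + S[x, y]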
2 Previous Works

The Successive Elimination Algorithm (SEA) [4] used an upper bound for a block sum difference as the criterion to eliminate the impossible candidate blocks and reduce the computation of motion estimation based on the SAD criterion. Suppose (u, v) is the current optimal motion vector in the previous search process, and the corresponding SAD value is denoted by SAD_min. According to the inequality |a + b| ≤ |a| + |b|, where a, b ∈ ℝ, it can easily be shown that the following relation holds:

|\bar{T} - \bar{C}(u, v)| \le SAD(u, v).    (3)
where \bar{T} and \bar{C}(u, v) represent the sums of the image intensities for the template and the window of the source image I at the position (u, v), respectively, and SAD(u, v) denotes the corresponding SAD value computed for the window at the position (u, v). From inequality (3), we can conclude that a candidate corresponding to the source image I at position (u, v) cannot be a better-matching block if |\bar{T} - \bar{C}(u, v)| ≥ SAD_min. If SAD(u, v) is less than SAD_min, then SAD_min and the best motion vector are replaced by SAD(u, v) and (u, v), respectively. The boundary value BV = |\bar{T} - \bar{C}(u, v)| can be considered as the elimination criterion. For each candidate block, we can repeat this procedure to prune out a large portion of candidates. At the end, we can still find the best motion vector. The drawback of SEA is that the difference of block sums is not close enough to the SAD. Gao et al. extended the SEA to a multilevel successive elimination algorithm (MSEA) [5] that provides tighter and tighter boundary values from the lowest level to the highest level, as depicted in Figure 1. The relation of the boundary values at different levels is BV_0 ≤ BV_1 ≤ ⋯ ≤ BV_{\log_2 N} = SAD. MSEA builds an image pyramid structure of the current and reference blocks with L = \log_2 N levels. Using only level zero to eliminate impossible candidates is the same as the SEA, and the boundary value of the final level is the same as the SAD value.
Fig. 1. The levels of elimination order in MSEA. Using only level 0 as the elimination criterion is the same as SEA, and the final level is the same as SAD.
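A compact sketch of the level-0 (SEA) test over all candidate positions, using a sum table to obtain every block sum; the names are illustrative, not the authors' implementation.

    import numpy as np

    def sea_candidate_mask(I, T, sad_min):
        # Level-0 SEA test: keep only candidates (x, y) whose block-sum difference
        # |sum(T) - sum(C(x, y))| is below the current best SAD.
        M, N = T.shape
        S = np.zeros((I.shape[0] + 1, I.shape[1] + 1))
        S[1:, 1:] = I.astype(np.float64).cumsum(0).cumsum(1)
        # block sums of every M x N window, 4 table lookups each (vectorized)
        sums = S[M:, N:] - S[:-M, N:] - S[M:, :-N] + S[:-M, :-N]
        return np.abs(T.sum() - sums) < sad_min        # True = still a candidate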
Although the similarity measure of NCC is more robust than SAD, the computational cost of NCC is very high. The technique of the sum table [11][12][13] can be used to reduce the computation of the denominator: any block sum in the source image can be obtained with 4 operations from the pre-computed sum table. To reduce the computational cost of the numerator, Di Stefano and Mattoccia [9][10] derived upper bounds for the cross correlation based on Jensen's and the Cauchy-Schwarz inequalities to early terminate some impossible search points. Because the bound is not tight enough, they partitioned the image into two blocks and computed the partial cross correlation for the first block (from row 1 to row k), with the other block (from row k+1 to row N) bounded by the upper bound. Then they used the SEA scheme to reject the impossible candidates successively while increasing the size of the first block. From the Cauchy-Schwarz inequality [10] given in equation (4), the upper bound (UB) of the numerator, i.e., the cross correlation, can be derived as in equation (5), and the boundary value of NCC is given in equation (6).

\sum_{i=1}^{N} a_i \cdot b_i \le \sqrt{\sum_{i=1}^{N} a_i^2} \cdot \sqrt{\sum_{i=1}^{N} b_i^2}    (4)

UB(x, y) = \sum_{i=1}^{M} \sum_{j=1}^{k} I(x + i, y + j) \cdot T(i, j) + \sqrt{\sum_{i=1}^{M} \sum_{j=k+1}^{N} I(x + i, y + j)^2} \cdot \sqrt{\sum_{i=1}^{M} \sum_{j=k+1}^{N} T(i, j)^2}
\ge \sum_{i=1}^{M} \sum_{j=1}^{k} I(x + i, y + j) \cdot T(i, j) + \sum_{i=1}^{M} \sum_{j=k+1}^{N} I(x + i, y + j) \cdot T(i, j)
= \sum_{i=1}^{M} \sum_{j=1}^{N} I(x + i, y + j) \cdot T(i, j)    (5)

BV(x, y) = \frac{UB(x, y)}{\|I(x, y)\| \cdot \|T\|}    (6)

where \|I(x, y)\| and \|T\| are the norms of the candidate window and the template (cf. the denominator of Eq. (2)). Similar to the SEA scheme, the candidate at the position (x, y) of image I will be rejected if BV(x, y) < NCC_max, and NCC_max will be updated as NCC(x, y) if NCC(x, y) > NCC_max.
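A sketch of the bound test of equations (4)-(6): the cross correlation is computed exactly over the first k columns (index j in Eq. (5)) and bounded via Cauchy-Schwarz over the rest. For clarity the sums are evaluated directly on the window; a real implementation would take them from pre-computed sum tables.

    import numpy as np

    def bounded_ncc_upper_bound(I, T, x, y, k):
        # Upper bound BV(x, y) on NCC(x, y): exact correlation over columns 1..k,
        # Cauchy-Schwarz bound over columns k+1..N (cf. equations (4)-(6)).
        M, N = T.shape
        C = I[x:x + M, y:y + N].astype(np.float64)
        Tf = T.astype(np.float64)
        partial_cc = (C[:, :k] * Tf[:, :k]).sum()
        ub_rest = np.sqrt((C[:, k:] ** 2).sum()) * np.sqrt((Tf[:, k:] ** 2).sum())
        return (partial_cc + ub_rest) / (np.linalg.norm(C) * np.linalg.norm(Tf))

    # A candidate (x, y) is skipped whenever bounded_ncc_upper_bound(...) < ncc_max.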
3 Adaptive Successive Elimination for NCC

The cross correlation can be bounded by the Cauchy-Schwarz inequality as described above, but the bound is not tight enough. As equation (7) shows, if we divide the block into many sub-blocks and sum the upper bounds of the sub-blocks, we obtain a tighter bound. Following the partitioning scheme of MSEA, we have upper bounds for different partitioning levels, and the relation between the upper bounds of different levels is given in equations (8) and (9). At the final level, the upper bound is equal to the cross correlation, as shown below.

\sqrt{\sum_{i=1}^{N} a_i^2} \cdot \sqrt{\sum_{i=1}^{N} b_i^2} \ge \sqrt{\sum_{i=1}^{k} a_i^2} \cdot \sqrt{\sum_{i=1}^{k} b_i^2} + \sqrt{\sum_{i=k+1}^{N} a_i^2} \cdot \sqrt{\sum_{i=k+1}^{N} b_i^2} \ge \sum_{i=1}^{N} a_i \cdot b_i    (7)

UB_l(x, y) = \sum_{a \in AllSubblocks} \left( \sqrt{\sum_{a_i \in AllPixels} I_{a_i}(x, y)^2} \cdot \sqrt{\sum_{a_i \in AllPixels} T_{a_i}^2} \right)    (8)

UB_0 \ge UB_1 \ge \cdots \ge UB_{L = \log_2 N} = CC    (9)
The MSEA scheme can provide tighter and tighter upper bounds as the partitioning level increases, but there are at most L = log 2 N levels. If we increase the partitioning levels we have better chance to early reject most candidates. The upper bound of a block
is determined by the summation of squared pixel values. If the block is homogeneous, the upper bound is close to the cross correlation value; thus, partitioning within a homogeneous area has little chance of rejecting non-optimal candidates and only increases the operation count for measuring the similarity. These unnecessary operations should be avoided. In other words, blocks with large intensity variance normally contain more details: the block sum cannot represent the details of such a block, so partitioning a block with larger variance may produce a larger decrease in the upper bound value. Consequently, in order to obtain a tighter bound in the early stage, it is reasonable to partition the blocks with large variances into sub-blocks first. We propose the adaptive MSEA algorithm for adaptive block partitioning and successive elimination for NCC. For simplicity, we determine the elimination order by the sum of gradient magnitudes of the sub-blocks in the template instead of their variances. The block with the currently largest sum of gradient magnitudes is divided into 2x2 sub-blocks for consideration of further partitioning. It should be noted that a block will not be further partitioned into 2x2 sub-blocks if its sum of gradient magnitudes is less than a given threshold T. This partitioning process is repeated until the gradient magnitude sums of all sub-blocks are less than the threshold. Figure 2 depicts an example of block partitioning by the proposed algorithm. The adaptive algorithm for determining the block partitioning order is given in Algorithm 1.

Fig. 2. An example of the block partitioning order

Algorithm 1. Algorithm for determining the adaptive block partitioning order
Push the largest block into the queue
REPEAT
1. Select the block with the largest sum of gradient magnitudes from the queue.
2. Divide the selected block into four sub-blocks and calculate their sums of gradient magnitudes.
3. Check the four sub-blocks and push each sub-block into the queue if its sum of gradient magnitudes is greater than a given threshold T.
UNTIL the queue is empty
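Algorithm 1 can be sketched with a max-priority queue keyed by the sum of gradient magnitudes; the gradient map and the threshold are assumed to be given, and the order in which blocks are popped is the partitioning (elimination) order used later. All names are illustrative.

    import heapq

    def adaptive_partition_order(grad_mag, threshold):
        # grad_mag: 2-D array of gradient magnitudes of the template.
        # Returns blocks (top, left, height, width) in the order they are split.
        H, W = grad_mag.shape
        heap = [(-grad_mag.sum(), (0, 0, H, W))]      # max-heap via negated keys
        order = []
        while heap:
            _, (top, left, h, w) = heapq.heappop(heap)
            if h < 2 or w < 2:
                continue                              # cannot be split further
            order.append((top, left, h, w))
            h2, w2 = h // 2, w // 2
            children = [(top, left, h2, w2), (top, left + w2, h2, w - w2),
                        (top + h2, left, h - h2, w2), (top + h2, left + w2, h - h2, w - w2)]
            for (t, l, ch, cw) in children:
                s = grad_mag[t:t + ch, l:l + cw].sum()
                if s > threshold:                     # only promising blocks are queued
                    heapq.heappush(heap, (-s, (t, l, ch, cw)))
        return order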
With the block partitioning order obtained by using the above algorithm, we have the relation of the upper bounds for different levels as UB_0 ≥ UB_1 ≥ ⋯ ≥ UB_{maxL} ≥ CC. We can calculate the boundary values from equation (6) and have the relation of the boundary values of different levels as BV_0 ≥ BV_1 ≥ ⋯ ≥ BV_{maxL} ≥ NCC. The BV_l value is closer to NCC as the level increases. If BV_l(x, y) < NCC_max, the candidate at the position (x, y) is rejected; else, if NCC(x, y) > NCC_max, NCC_max is updated by NCC(x, y). The following is the proposed adaptive multi-level elimination algorithm for quickly finding the position with the optimal NCC in pattern matching.

Algorithm 2. The proposed fast NCC pattern matching algorithm
Step 1: Determine the elimination order by Algorithm 1.
Step 2: Calculate the norm of the template ||T||.
Step 3: Calculate the initial NCC_max = NCC(template, initial candidate).
Step 4: Compute the integral image for the square of the search image I.
For each candidate C(x, y) do
  Step 5: Calculate the norm of the current candidate ||C(x, y)|| from the integral image.
  Step 6: Repeat
    1. Retrieve the next partitioning level l from the queue.
    2. Calculate UB_l for level l and compute BV_l = UB_l / (||T|| ||C(x, y)||).
    3. Reject the candidate if BV_l < NCC_max.
  Until the queue is empty.
  Step 7:
    1. If the candidate passes the criteria of all levels, calculate NCC_new = NCC(T, C(x, y)).
    2. If NCC_new > NCC_max, update NCC_max by NCC_new.
End For
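Putting the pieces together, a simplified sketch of Algorithm 2 is given below; it reuses the partitioning order of the previous sketch, evaluates the level bounds directly on the image instead of from the sum tables of Step 4, and all names are illustrative.

    import numpy as np

    def ncc(I, Tf, x, y):
        C = I[x:x + Tf.shape[0], y:y + Tf.shape[1]].astype(np.float64)
        return (C * Tf).sum() / (np.linalg.norm(C) * np.linalg.norm(Tf))

    def adaptive_msea_ncc(I, T, order):
        # 'order' is the block partitioning order produced by Algorithm 1.
        M, N = T.shape
        Tf = T.astype(np.float64)
        norm_T = np.linalg.norm(Tf)                              # Step 2
        best_pos, ncc_max = (0, 0), ncc(I, Tf, 0, 0)             # Step 3 (initial guess)
        for x in range(I.shape[0] - M + 1):
            for y in range(I.shape[1] - N + 1):
                C = I[x:x + M, y:y + N].astype(np.float64)
                norm_C = np.linalg.norm(C)                       # Step 5
                blocks, rejected = [(0, 0, M, N)], False
                for blk in order:                                # Step 6: refine one block per level
                    blocks.remove(blk)
                    t, l, h, w = blk
                    h2, w2 = h // 2, w // 2
                    blocks += [(t, l, h2, w2), (t, l + w2, h2, w - w2),
                               (t + h2, l, h - h2, w2), (t + h2, l + w2, h - h2, w - w2)]
                    ub = sum(np.linalg.norm(C[a:a + c, b:b + d]) *
                             np.linalg.norm(Tf[a:a + c, b:b + d])
                             for (a, b, c, d) in blocks)         # Eq. (8)
                    if ub / (norm_T * norm_C) < ncc_max:         # BV_l < NCC_max
                        rejected = True
                        break
                if not rejected:                                 # Step 7: exact NCC
                    val = (C * Tf).sum() / (norm_T * norm_C)
                    if val > ncc_max:
                        ncc_max, best_pos = val, (x, y)
        return best_pos, ncc_max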
We can also apply the proposed scheme to the zero mean normalized cross correlation (ZNCC) by rewriting it in the form of equation (10). Note that the terms \sum a_i, \sum a_i^2, \sum b_i and \sum b_i^2 in the equation can be calculated very efficiently from integral images.

ZNCC = \frac{\sum_{i=1}^{n} (a_i - \bar{a}) \cdot (b_i - \bar{b})}{\sqrt{\sum_{i=1}^{n} (a_i - \bar{a})^2} \cdot \sqrt{\sum_{i=1}^{n} (b_i - \bar{b})^2}} = \frac{\sum_{i=1}^{n} a_i \cdot b_i - n \bar{a} \bar{b}}{\sqrt{\left(\sum_{i=1}^{n} a_i^2 - n \bar{a}^2\right) \cdot \left(\sum_{i=1}^{n} b_i^2 - n \bar{b}^2\right)}}    (10)

\bar{a} = \frac{1}{n} \sum_{i=1}^{n} a_i, \qquad \bar{b} = \frac{1}{n} \sum_{i=1}^{n} b_i    (11)
4 Experimental Results In this section, we show the efficiency improvement of the proposed adaptive MSEA algorithm for NCC-based pattern matching. The proposed algorithm adaptively partitioned the image block into many subblocks to obtain tighter upper bounds for the cross correlation. To compare the efficiency of the proposed algorithm termed AdaMSEA_NCC, we also implement the multi-level SEA with fixed partitioning scheme and the results are termed as MSEA_NCC. In our experiment, we use the Lenna
Fig. 3. (a) The original Lenna image and (b) the noisy Lenna image added with Gaussian noise with σ = 8
Fig. 4. (a), (b), (c): The template images. (d), (e), (f): their brighter versions.
Fig. 5. The template images of size 128x128
image of size 512-by-512 as the source image and six template images of size 64x64 inside the Lenna image as shown in Figure 3 and Figure 4, respectively. Figure 4(d)(e)(f) is the brighter version (increase 50% brightness) of Figure 4(a)(b)(c). To compare the robustness and the efficiency of the proposed algorithms, we add random Gaussian noises with σ =8 onto the search image shown in figure (b) and compare the performance of the pattern search on the noisy images. The experimental results of proposed algorithms and the original NCC are shown in Table 1 and 2. All these three algorithms used the sum table to reduce the computation of denominator in equation (2). For efficiently calculating the bound of the numerator, we also used the approach of BSAP[6] to build two block square sum pyramids for intensity image and the gradient map, respectively. The execution time shown in section includes the time of memory allocation for sum table and pyramids, and building sum table, pyramids and the gradient map. Table 3 shows the experimental results of applying different template of size 128x128 as shown in Figure 5. All of these experimental results show the significantly improved efficiency of the proposed adaptive MSEA algorithm for the NCC-based pattern matching compared to the previous MSEA algorithm. Table 1. The execution time of applying traditional NCC, MSEA_NCC and AdaMSEA_NCC on six templates shown in Figure 4(a)~(f) and the source image shown in Figure 3(a). It should be notable the NCC algorithm used the sum table to reduce the computation in the denominator of NCC.
              T(a)       T(b)   T(c)   T(d)   T(e)   T(f)
NCC           3235 (ms)  3235   3235   3235   3235   3235
MSEA_NCC      688        453    203    515    329    203
AdaMSEA_NCC   297        188    219    125    172    141
Table 2. The execution time of applying traditional NCC, MSEA_NCC and AdaMSEA_NCC on the six templates shown in Figure 4(a)~(f) and the noisy source image shown in Figure 3(b). It should be noted that the NCC algorithm used the sum table to reduce the computation in the denominator of NCC.
              T(a)       T(b)   T(c)   T(d)   T(e)   T(f)
NCC           3235 (ms)  3235   3235   3235   3235   3235
MSEA_NCC      1672       1485   265    1687   1515   313
AdaMSEA_NCC   594        359    141    546    438    172
Table 3. The execution time of applying traditional NCC, MSEA_NCC and AdaMSEA_NCC on the three templates shown in Figure 5(a)(b)(c) and the source image shown in Figure 3(a). It should be noted that the NCC algorithm used the sum table to reduce the computation in the denominator of NCC.
              T(a)       T(b)   T(c)
NCC           9813 (ms)  9813   9813
MSEA_NCC      2250       1032   469
AdaMSEA_NCC   844        407    234
5 Conclusion In this paper, we proposed a very efficient adaptive MSEA algorithm for fast pattern matching in an image based on normalized cross correlation. To achieve more effective successive elimination, we partition the summation of cross correlation into different levels with the partition order adaptively determined by the sum of gradient magnitudes for each partitioned regions in the template. The experimental results show the proposed adaptive MSEA algorithm is very efficient and robust for pattern matching under linear illumination change and noisy environments. Acknowledgments. This research work was supported in part by National Science Council, Taiwan, under grant 95-2220-E-007-028.
References 1. Zhu, S., Ma, K.K.: A new diamond search algorithm for fast block-matching motion estimation. Image Processing 9(2), 287–290 (2000) 2. Li, R., Zeng, B., Liou, M.L.: A new three-step search algorithm for block motion estimation. IEEE Trans. Circuits Systems Video Technology 4(4), 438–442 (1994) 3. Po, L.M., Ma, W.C.: A novel four-step search algorithm for fast block motion estimation. IEEE Trans. Circuits Systems Video Technology 6(3), 313–317 (1996) 4. Li, W., Salari, E.: Successive elimination algorithm for motion estimation. IEEE Trans. Image Processing 4(1), 105–107 (1995) 5. Gao, X.Q., Duanmu, C.J., Zou, C.R.: A multilevel successive elimination algorithm for block matching motion estimation. IEEE Trans. Image Processing 9(3), 501–504 (2000) 6. Lee, C.-H., Chen, L.-H.: A fast motion estimation algorithm based on the block sum pyramid. IEEE Trans. Image Processing 6(11), 1587–1591 (1997) 7. Gharavi-Alkhansari, M.: A fast globally optimal algorithm for template matching using low-resolution pruning. IEEE Trans. on Image Processing 10(4), 526–533 (2001) 8. Hel-Or, Y., Hel-Or, H.: Real-time pattern matching using projection kernels. IEEE Trans. Pattern Analysis and Machine Intelligence 27(9), 1430–1445 (2005) 9. Di Stefano, L., Mattoccia, S.: Fast Template Matching using Bounded Partial Correlation. Machine Vision and Applications (JMVA) 13(4), 213–221 (2003) 10. Di Stefano, L., Mattoccia, S.: A sufficient condition based on the Cauchy-Schwarz inequality for efficient template matching. In: Proc. Intern. Conf. Image Processing (2003) 11. Lewis, J.P.: Fast template matching Vision Interface, pp. 120–123 (1995) 12. Mc. Donnel, M.: Box-filtering techniques. Computer Graphics and Image Processing 17, 65–70 (1981) 13. Viola, P., Jones, M.: Robust real-time object detection. In: Proceeding of International Conf. on Computer Vision Workshop Statistical and Computation Theories of Vision (2001)
Exploiting Inter-frame Correlation for Fast Video to Reference Image Alignment Arif Mahmood and Sohaib Khan Department of Computer Science, Lahore University of Management Sciences, Lahore, Pakistan {arifm,sohaib}@lums.edu.pk
Abstract. Strong temporal correlation between adjacent frames of a video signal has been successfully exploited in standard video compression algorithms. In this work, we show that the temporal correlation in a video signal can also be used for fast video to reference image alignment. To this end, we first divide the input video sequence into groups of pictures (GOPs). Then for each GOP, only one frame is completely correlated with the reference image, while for the remaining frames, upper and lower bounds on the correlation coefficient (ρ) are calculated. These newly proposed bounds are significantly tighter than the existing Cauchy-Schwartz inequality based bounds on ρ. These bounds are used to eliminate the majority of the search locations, resulting in significant speedup without affecting the value or location of the global maxima. In our experiments, up to 80% of the search locations are eliminated, and the speedup is up to five times that of the FFT based implementation and up to seven times that of the spatial domain techniques.
1 Introduction
A digital video signal consists of a sequence of frames and is usually characterized by strong temporal correlation between adjacent frames. This correlation has been successfully exploited in standard video codecs to achieve significant compression [1]. We show that the temporal correlation of a video sequence can also be used for fast video to reference image alignment. This results in an efficient, close to real time implementation of a number of applications in computer vision that use video to image alignment as a key component. Such applications include automatic camera tracking, model based landmark extraction, activity monitoring and video geo-registration [2]. Although video to reference image alignment involves pattern matching, it is inherently different from block matching for motion compensation as used in video codecs [3]. This is because block matching algorithms for video codecs are applied to video frames that appear temporally close to each other and are acquired by the same sensor, hence the level of dissimilarity is low. On the other hand, video to reference image alignment requires pattern matching between frames in a video signal and a reference image. These two signals are usually
Fig. 1. Every GOP must have at least one C-Frame while the remaining frames in the GOP are B-Frames. A GOP of length 5 frames is shown.
acquired at different times, under different illumination conditions, and by using sensors with different spectral responses, resulting in high dissimilarity between them. The inherent differences between block matching in video coding and video to reference image alignment renders the standard matching techniques [4], [5] used in the former to be inaccurate for the latter. In contrast, the correlation coefficient, which is usually criticized for its high complexity for block matching applications [6], [7], turns out to be a more accurate and robust similarity measure for video to reference image alignment [8], [9], [10]. While a number of schemes have been investigated to reduce the time complexity of the correlation, elimination of potential matching locations based on correlation bounds, to the best of our knowledge, has not been previously considered. In this work, we propose an elimination algorithm for reducing the number of search locations in correlation coefficient based video to reference image alignment. This reduction results in significant speedup over the currently used correlation coefficient based image alignment techniques. The reduction in search space is based upon the newly proposed bounds on the correlation coefficient, which are significantly tighter than the currently known Cauchy-Schwartz inequality based bounds. During the search space reduction process, elimination of a search location takes place when the upper bound on correlation coefficient at that location turns out to be smaller than the currently known maxima. In order to implement the elimination algorithm, we divide the input video sequence into groups of pictures (GOPs). In each GOP, only one frame is completely correlated with the reference image. For the remaining frames, the proposed bounds are evaluated at all search locations and unsuitable search locations are identified and eliminated from the search space. In the elimination algorithm, the length of GOP is an important parameter. An algorithm for automatic detection of optimal length of GOP is also proposed. In our experiments, the optimal length of GOP is found to be 7 frames. For this optimal length, on the average, 81% search locations are found to be eliminated. The execution time is compared
with the FFT [11], [12] based frequency domain implementation and the spatial domain implementations including the Bounded Partial Correlation (BPC) [13] technique. The execution time speedup is up to 7 times the spatial domain correlation, 5 times the FFT based implementation and 2.5 times the BPC based implementation.
2 Problem Definition
We consider a digital video signal as a sequence of frames F indexed at discrete time k, each of size m × n pixels. Each frame F^k is to be matched at all valid search locations in the reference image R of size s × t pixels. For the purpose of matching, the reference image R is considered to be divided into overlapping rectangular blocks R_{i_o j_o}, each of size m × n pixels, where (i_o, j_o) are the coordinates of the first pixel of the reference block. Each of the reference blocks R_{i_o j_o} is a valid candidate search location. The similarity measure used to match the frame F^k with search location R_{i_o j_o} is the correlation coefficient defined as:

\rho^k_{i_o,j_o} = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} (F^k(i, j) - \bar{F}^k)(R_{i_o j_o}(i, j) - \bar{R}_{i_o j_o})}{\sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} (F^k(i, j) - \bar{F}^k)^2} \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} (R_{i_o j_o}(i, j) - \bar{R}_{i_o j_o})^2}}    (1)
where F k and Rio jo represent the mean of F k and Rio jo . For each frame F k , the maxima of ρkio ,jo will yield the best matching candidate search location defined as (k, imax , jmax ). The primary goal is to minimize overall computations for the full video sequence without changing the value or location of the global maxima for any individual frame. For this purpose, following information can be used: 1. Each frame F k exhibits nonzero correlation ρkk with each of its temporal neighbor F k ∈ {F k−p . . . F k−1 , F k+1 , . . . F k+p }, where p is an integer. 2. For a frame F k , the correlation coefficient values ρkio ,jo at all (io , jo ) are available to be used for each of its temporal neighbor F k . 3. For each F k , an initial guess of the best match location is available from a previously matched frame.
We intend to find exact bounds on ρkio ,jo using ρkio ,jo and ρkk such that complete calculations of ρkio ,jo can be avoided without having any effect on the value or location of global maxima for each frame F k .
3 Related Work
The correlation coefficient, as given by Equation 1, has been computed in the spatial domain with a computational complexity of the order of O(mn(s − m + 1)(t − n + 1)), and in the frequency domain using the FFT with a computational complexity of the order of O((s + m − 1)(t + n − 1) \log_2[(s + m − 1)(t + n − 1)]) [6]. The order
of computational complexity of frequency domain implementation is lesser than the spatial domain implementation, however the basic operations in frequency domain implementation are significantly more complex. In addition to complex basic operations, the over all complexity of the frequency domain implementation remains the same no matter how dissimilar the two images to be matched are. The frequency domain implementation cannot utilize the additional information available in the form of initial guess or inter frame correlation in the video to reference image alignment problem. Recently a fast spatial domain technique has been proposed in [13], the Bounded Partial Correlation (BPC) algorithm. In the BPC algorithm, each frame F k and the search location Rio jo is divided into cross-correlation area Ac and the bound area Ab . Partial cross-correlation (CC p ) is calculated over area Ac only, and using the Cauchy-Schwartz inequality, partial upper bound (Up ) is calculated on the area Ab only. It is shown that Up + CC p ≥ CC, where CC is the complete cross-correlation between frame F k and Rio jo . An upper-bound on ρkio jo is computed using Equation (1). During the match process, further calculations on the current search location are terminated if the upper-bound on ρkio jo evaluates to a value smaller than the yet known best maxima. Since the bound calculations are simpler than the correlation calculations, computations are reduced by selecting a larger bound are, Ab . However, a larger Ab yields the upper bound looser and it becomes harder to obtained elimination. In the limiting case, if the total block area is used as Ab , the Cauchy-Schwartz inequality yields an upper bound of +1, which will never become less than the known maxima and therefore no elimination will be obtained. Moreover, like FFT based algorithms, the BPC algorithm also cannot utilize the inter frame correlation for computational advantage. Many approximate fast block matching schemes have also been proposed in literature, for example: two dimensional logarithmic search, three step search, conjugate direction search, cross search, orthogonal search. In these approximate schemes, global maxima is not guaranteed to be found while the algorithms proposed in this paper are exact and preserve the value and the location of the global maxima.
4 Transitive Elimination Algorithm

4.1 Basic Idea
The basic idea of the Transitive Elimination Algorithm (TEA) is similar to the video compression algorithms [3]. In video compression algorithms, an input video sequence is divided into Group of Pictures (GOP). Each GOP contains at least one intra-coded frame (I-Frame) and the remaining frames are predictive coded (P-Frames) or bidirectional predictive coded (B-Frames) [1]. In video to reference image alignment problem, we also propose to divide the input video sequence into GOPs. In each GOP we have one C-Frame (Correlated-Frame) and the remaining are B-Frames (Bounded-Frames). The C-Frame is correlated with complete reference image. For B-frames, transitive bounds are evaluated
Fig. 2. Vector representation of the images and the angles between their mean subtracted versions
at all search locations. Then, based upon sufficient conditions for elimination, unsuitable search locations are identified and eliminated from the search space. Note that there is no approximation involved in the bound calculations or in the search location elimination, therefore the value and location of the global maxima is exactly preserved. In the following discussion, the C-Frames are denoted by F^k and the B-Frames are denoted by F^{k'}. In each GOP, the middle frame is considered as the C-Frame and the temporal neighbors on each side are considered as B-Frames.

4.2 Transitive Bounds on the Correlation Coefficient
Transitive bounds are derived by considering the video frames and the search locations as vectors in R^{m×n}. Let the vectors F^k and R_{i_o j_o} represent a C-Frame F^k and a search location R_{i_o j_o}. The correlation coefficient between F^k and R_{i_o j_o} can be interpreted as the inner product of their mean subtracted and unit magnitude normalized versions:

\rho^k_{i_o,j_o} = \frac{\tilde{F}^k}{\|\tilde{F}^k\|} \cdot \frac{\tilde{R}_{i_o j_o}}{\|\tilde{R}_{i_o j_o}\|},    (2)

where \tilde{F}^k = F^k − \bar{F}^k, \tilde{R}_{i_o j_o} = R_{i_o j_o} − \bar{R}_{i_o j_o}, and \|\cdot\| denotes the magnitude of each vector. Let θ^k_{i_o j_o} be the magnitude of the smaller angle between \tilde{F}^k and \tilde{R}_{i_o j_o}, measured in the plane Π containing both of these vectors (Figure 2), such that 0° ≤ θ^k_{i_o j_o} ≤ 180°. Then the correlation coefficient between F^k and R_{i_o,j_o} is given by:

\rho^k_{i_o,j_o} = \cos θ^k_{i_o j_o}.    (3)
Let F^{k'} be a vector representing a B-Frame in the temporal neighborhood of F^k, and let θ_{kk'} be the smaller angle between \tilde{F}^k and \tilde{F}^{k'}, measured in the plane Π' containing both of these vectors, as shown in Figure 2. Like θ^k_{i_o j_o}, θ_{kk'} is also constrained between 0° and 180°: 0° ≤ θ_{kk'} ≤ 180°. The correlation coefficient
between vectors F^k and F^{k'} is given by ρ_{kk'} = \cos θ_{kk'}. Using the orientation of the vectors and planes shown in Figure 2, we are interested in finding bounds on θ^{k'}_{i_o j_o} that will be used to bound ρ^{k'}_{i_o j_o}. The planes Π and Π' in Figure 2 are uniquely defined by the vectors \tilde{F}^k, \tilde{F}^{k'} and \tilde{R}_{i_o j_o}. Let φ^{Π'}_{Π} be the magnitude of the smaller angle between the planes Π and Π'. Like θ^k_{i_o j_o} and θ_{kk'}, φ^{Π'}_{Π} is also bounded between 0° and 180°: 0° ≤ φ^{Π'}_{Π} ≤ 180°. The magnitude of θ^{k'}_{i_o j_o} is functionally dependent upon the magnitude of φ^{Π'}_{Π}. If the magnitude of φ^{Π'}_{Π} is zero, then the magnitude of θ^{k'}_{i_o j_o} is equal to |θ_{kk'} − θ^k_{i_o j_o}|. At the other extreme, if the magnitude of φ^{Π'}_{Π} is 180°, then the magnitude of θ^{k'}_{i_o j_o} is equal to θ_{kk'} + θ^k_{i_o j_o}. For all intermediate values of φ^{Π'}_{Π}, the magnitude of θ^{k'}_{i_o j_o} will remain within these bounds:
|θ_{kk'} − θ^k_{i_o j_o}| \le θ^{k'}_{i_o j_o} \le θ_{kk'} + θ^k_{i_o j_o}.    (4)

Since θ^{k'}_{i_o j_o} is also bounded between 0° and 180°, the actual bounds on θ^{k'}_{i_o j_o} turn out to be:

|θ_{kk'} − θ^k_{i_o j_o}| \le θ^{k'}_{i_o j_o} \le \min(θ_{kk'} + θ^k_{i_o j_o}, 360° − (θ_{kk'} + θ^k_{i_o j_o})).    (5)

For θ = 0° to 180°, \cos(θ) is a monotonically decreasing function; therefore, taking the cosine of both sides of Equation 5 changes the direction of the inequalities. Using the cosine properties \cos(−θ) = \cos(θ) and \cos(360° − θ) = \cos(θ), we get:

\cos(θ_{kk'} − θ^k_{i_o j_o}) \ge \cos θ^{k'}_{i_o j_o} \ge \cos(θ_{kk'} + θ^k_{i_o j_o}).    (6)

Using the trigonometric identities, the following expressions for the upper bound α^{k'}_{i_o j_o} and the lower bound β^{k'}_{i_o j_o} on ρ^{k'}_{i_o j_o} can be easily calculated:

α^{k'}_{i_o j_o} = ρ_{kk'} ρ^k_{i_o j_o} + \sqrt{(1 − (ρ_{kk'})^2)(1 − (ρ^k_{i_o j_o})^2)},    (7)

β^{k'}_{i_o j_o} = ρ_{kk'} ρ^k_{i_o j_o} − \sqrt{(1 − (ρ_{kk'})^2)(1 − (ρ^k_{i_o j_o})^2)}.    (8)
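Equations (7) and (8) translate directly into code; the sketch below accepts scalars or NumPy arrays (e.g., the whole correlation surface of the C-Frame at once). Function names are illustrative.

    import numpy as np

    def transitive_bounds(rho_kk, rho_k):
        # Upper and lower bounds (Eqs. 7-8) on the correlation of a B-Frame with a
        # reference block, given rho_kk' (B-Frame vs. C-Frame) and rho^k (C-Frame
        # vs. the same reference block).
        cross = np.sqrt((1.0 - rho_kk ** 2) * (1.0 - rho_k ** 2))
        alpha = rho_kk * rho_k + cross
        beta = rho_kk * rho_k - cross
        return alpha, beta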
We have experimentally studied the characteristics of αkio jo and βiko jo . Figure 3 shows the variation of αkio jo and βiko jo with the variation of ρkk and ρkio jo on a real dataset. In Figure 3, each pair of αkio jo and βiko jo corresponds to a fixed value of ρkk , while the variation along the x-axis is due to the variation in ρkio jo on consecutive pixel positions in the reference image. From Figure 3, it can be observed that if both of the angles, θkk and θiko jo , are small (or both of the correlations, ρkk and ρkio jo , are large), then both of the bounds, αkio jo and βiko jo , become tight. If one of the two angles is small and the other is large, the upper bound remains tight because cos(θiko jo − θkk ) in Equation 6, results in a smaller value, however the lower bound becomes loose because cos(θiko jo + θkk ) also results in a small value. On the other hand, if both of the angles are large i.e both correlations are small, then both upper and lower bounds become loose. Therefore in order to get a useful upper bound on ρkio jo , at least one of the two bounding correlations should be large.
Fig. 3. Variation of upper and lower bounds on the correlation coefficient with the variation of bounding correlations ρkk and ρkio jo . ρkk varies across the curves while k ρio jo varies as the pixel position varies along a row in the reference image. (1) to (5): Upper Bounds for ρkk = 0.306, 0.441, 0.571, 0.722 and 0.896. (6) Actual value of ρkio jo . (7) to (11) Lower bounds for ρkk = 0.896, 0.722, 0.571, 0.441 and 0.306. The Cauchy Schwartz inequality based upper bound is always +1 and the lower bound is always -1.
4.3 Sufficient Elimination Conditions
Using the upper and lower bounds on ρ^{k'}_{i_o j_o}, we can derive three types of sufficient elimination conditions for eliminating candidate search locations from the search space of a B-Frame F^{k'}. Note that this elimination is without any change in the value or location of the global maxima. All locations for which any one of the following conditions is satisfied cannot become the best match search location (a sketch of how these tests can be applied over the whole search space is given after the list):

1. All search locations R_{i_o j_o} can be discarded if there exists a location R_{i'_o j'_o} such that:

α^{k'}_{i_o j_o} < β^{k'}_{i'_o j'_o}    (9)

For this condition to be maximally effective, the maxima of the lower bound should be known. However, it can be shown that the maxima of the lower bound will occur at the peak location of ρ^k_{i_o j_o}. The proof of this fact follows directly from Equation 8: since ρ_{kk'} is constant for all locations, the location at which the maxima of ρ^k_{i_o j_o} occurs will be the location of the maxima of the lower bound.

2. All search locations R_{i_o j_o} can be discarded if their upper bound is less than the yet known maxima of the correlation surface:

α^{k'}_{i_o j_o} < ρ_{max}    (10)

The actual number of search locations discarded due to this condition depends upon the magnitude and location of the currently known maxima (ρ_{max}). The earlier in the search order a high value of the correlation coefficient is found, the larger the number of eliminated search locations will be. Thus this condition takes into account the computational advantage of an initial guess.

3. All locations R_{i_o j_o} can be discarded if their upper bound is less than a known initial threshold:

α^{k'}_{i_o j_o} < ρ_{threshold}    (11)

Selecting a high initial threshold can enhance the elimination capability of this condition. However, a high initial threshold can also discard the actual peak location, in case the actual peak magnitude is lower than the selected threshold.
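A sketch of how these conditions prune the search space of a B-Frame, given the correlation surface of the C-Frame and the inter-frame correlation; the names are illustrative, and condition (10) is noted as an on-the-fly test.

    import numpy as np

    def surviving_locations(rho_k_surface, rho_kk, rho_threshold=None):
        # Apply the sufficient elimination conditions to every candidate location.
        # Returns a boolean mask of locations that still need an exact evaluation.
        alpha, beta = transitive_bounds(rho_kk, rho_k_surface)   # see sketch above
        keep = np.ones_like(alpha, dtype=bool)
        keep &= alpha >= beta.max()          # condition (9): beaten by the best lower bound
        if rho_threshold is not None:
            keep &= alpha >= rho_threshold   # condition (11): below a known threshold
        return keep

    # Condition (10) is applied on the fly: while scanning the surviving locations,
    # a candidate is skipped as soon as alpha[x, y] < rho_max found so far.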
5 Experiments and Results
In order to demonstrate the concept of Transitive Elimination Algorithm, a multi-satellite multi-temporal real image dataset is prepared. The reference image is an 800K pixels satellite image [14] from the University of Central Florida area, having ground sampling distance of approximately 5 m/pixel. The video frames are acquired by modeling a flight simulation on images of the same area but captured at different time of the year and by a different satellite [15]. In the simulation, the scale and orientation is assumed to be approximately same as that of the reference image (Figure 1). The video camera acquires 25 frames per second while moving at a velocity of 450km/hr. In order to reduce the size and temporal redundancy, the video is down-sampled by dropping every second frame. For experimentation, a sequence of 250 frames, each of size 101×101 pixels is selected. The input video sequence is divided into GOPs. For the test sequence, the optimal length of GOP is found to be 7 frames. The detection of optimal length of GOP is discussed later in this section. For the optimal length GOP, the average execution time of the proposed algorithm is 2.24 Sec/Frame, including all required calculations, on a 1.6GHz, 1GB RAM, IBM ThinkPad computer. The C-Frames in each GOP are correlated in frequency domain [11] while the B-Frames are correlated in spatial domain. The average spatial domain time, without elimination, is observed to be 15.64 Sec/Frame and the frequency domain time as 10.53 Sec/Frame. The spatial domain implementation is also modified for the BPC algorithm with correlation area taken as 20% [13]. The average execution time of BPC implementation is observed to be 5.70 Sec/Frame and the average elimination as 65% of the computations. The elimination observed in the proposed algorithm is 81% of the search locations, which is significantly larger than the BPC algorithm (Figure 4b). According to these results, the speedup of the proposed algorithm, over spatial domain implementation is 7 times, over FFT based implementation is 5 times and over BPC implementation is 2.5 times. For maximum speedup performance of the proposed algorithm, detection of the optimal length of GOP is of prime importance. Optimal GOP length is one that minimizes the overall computation time by maximizing the elimination in B-Frames while minimizing the number of C-Frames. Optimal GOP length can be found by starting the matching process with the minimum GOP length and then increasing the length gradually until the optimal performance is achieved (Figure 5c). Optimal GOP length can also be predicted if average value of global
Fig. 4. Comparison of Transitive Elimination Algorithm (TEA) with Spatial (SPT), FFT and BPC based implementations: (a) Average execution time per frame. (b) Average % computation elimination.
maxima and the average inter frame correlation are known for a given dataset. For the optimal GOP length, the inter frame correlation should be such that, at all search locations, the upper bound approaches the currently known maxima ρ_{max} from below:

ρ_{max} − \left[ ρ_{kk'} ρ^k_{i_o j_o} + \sqrt{(1 − ρ_{kk'}^2)(1 − (ρ^k_{i_o j_o})^2)} \right] = Δ^{k'}_{i_o j_o},    (12)

where Δ^{k'}_{i_o j_o} is a very small positive number. By varying the length of the GOP, we can change ρ_{kk'}, while ρ^k_{i_o j_o} is generally very small and we can safely assume its average value to be zero: E[ρ^k_{i_o j_o}] = 0. For the test video sequence the average value of ρ_{max} is 0.72. Assuming Δ^{k'}_{i_o j_o} = 0 in Equation 12, the required ρ_{kk'} turns out to be 0.70 (Figure 5a). From Figure 5b, for ρ_{kk'} = 0.70, the number of B-Frames per GOP turns out to be 6, that is, a GOP size of 7 frames. We also verified this finding experimentally by studying the variation of performance with the length of the GOP (Figure 5c). The length of GOP was varied from 3 to 15 frames and the optimal length of GOP was again found to be 7 frames, which verified the predicted GOP length.
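Under the stated assumptions E[ρ^k] = 0 and Δ = 0, equation (12) gives the required inter-frame correlation in closed form; the small sketch below reproduces the quoted value of roughly 0.70.

    import numpy as np

    # Eq. (12) with E[rho^k] = 0 and Delta = 0 reduces to
    #     rho_max = sqrt(1 - rho_kk'**2)  =>  rho_kk' = sqrt(1 - rho_max**2)
    rho_max = 0.72                      # average peak correlation of the test sequence
    rho_kk_required = np.sqrt(1.0 - rho_max ** 2)
    print(round(rho_kk_required, 2))    # -> 0.69, i.e. approximately 0.70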
Fig. 5. (a) Variation of UB with variation of Inter Frame Correlation (IFC) for E[ρkio jo ] = {0.2, 0.1, 0.0, −0.0, −0.2} for curves 1 to 5 respectively. (b) Variation of IFC, Upper Bound(UB) and Maxima of Lower Bound (MLB) with variation of the length of GOPs. (c)Variation of execution time with the length of GOPs.
6 Conclusion
An elimination algorithm for fast video to reference image alignment is presented. The elimination algorithm is based upon new proposed bounds on the correlation coefficient. Using the proposed algorithm, significant speedup is obtained without any change in the value or location of the global maxima of the correlation coefficient. The proposed algorithm is up to 7 times faster than the spatial domain implementation and 5 times faster than the FFT based implementations of the correlation coefficient.
References 1. ITU-T, ISO/IEC JTC1: Advanced video coding for generic audiovisual services. (JTC 1, Recommendation H.264 and ISO/IEC 14 496-10 (MPEG-4) AVC, 2003 2. Shah, M., Kumar, R.: Video Registration. Kluwer Academic Publishers, Boston (2003) 3. Ghanbari, M.: Standard Codecs: Image compression to advanced video coding. IEE Telecom. Series 49, Institute of Electrical Engineers 49 (2003) 4. Li, W., Salari, E.: Successive elimination algorithm for motion estimation. IEEE Trans. Image Processing 4(1), 105–107 (1995) 5. Montrucchio, Q.B.: New sorting-based lossless motion estimation algorithms and a partial distortion elimination performance analysis. IEEE Trans.CSVT 15, 210–212 (2005) 6. Barnea, D., Silverman, H.: A class of algorithms for fast digital image registration. IEEE Trans. Commun. 21(2), 179–186 (1972) 7. Pratt, W.K.: Digital Image Processing, 3rd edn (2001) 8. Irani, M., Anandan, P.: Robust multi-sensor image alignment. In: ICCV (1998) 9. Sheikh, Y., Khan, S., Shah, M.: Feature-based geo-registration of aerial images. Geo-sensorNetworks (2004) 10. Ziltova, B., Flusser, J.: Image registration methods: A survey. Image and Vision Computing 21, 977–1000 (2003) 11. Lewis, J.: Fast normalized cross-correlation. In: Vision Interface, pp. 120–123 (1995) 12. Press, W., Teukolsky, S., Vetterling, W., Flannery, B.: Numerical Recipes in C, 2nd edn. Cambridge University Press, Cambridge (1992) 13. di Stefano, L., Mattoccia, S., Tombari, F.: ZNCC-based template matching using bounded partial correlation. Pattr. Rec. Ltr. 26(14) (2005) 14. Google Earth, http://earth.google.com/ 15. Microsoft Terra Server, http://terraserver.microsoft.com/
Flea, Do You Remember Me? Michael Grabner, Helmut Grabner, Joachim Pehserl, Petra Korica-Pehserl, and Horst Bischof Institute for Computer Graphics and Vision Graz University of Technology, Austria {mgrabner,hgrabner,pehserl,korica,bischof}@icg.tugraz.at
Abstract. The ability to detect and recognize individuals is essential for an autonomous robot interacting with humans even if computational resources are usually rather limited. In general a small user group can be assumed for interaction. The robot has to distinguish between multiple users and further on between known and unknown persons. For solving this problem we propose an approach which integrates detection, recognition and tracking by formulating all tasks as binary classification problems. Because of its efficiency it is well suited for robots or other systems with limited resources but nevertheless demonstrates robustness and comparable results to state-of-the-art approaches. We use a common over-complete representation which is shared by the different modules. By means of the integral data structure an efficient feature computation is performed enabling the usage of this system for real-time applications such as for our autonomous robot Flea.
1 Introduction
Autonomous robots guiding blinds, cleaning dishes, delivering mail, laundering, entertaining and handling many other daily tasks belong to the future goals of a competition called RoboCup@Home1 . The aim is to develop applications that can assist humans in everyday life. One specific task within this challenge is called Who is Who? and is thought to focus on enhancement of techniques for natural and social human-computer interaction. In specific, the detection and recognition of known vs. unknown individuals should enforce robots usability and make them automatically recognize familiar persons. Real-time capability is essential for interaction. From the computer vision perspective (we do not consider the audio modality in this work) it requires three approaches to successfully handle these tasks, namely (1) Detection (2) Recognition and further on (3) Tracking can be optionally added. For these specific computer vision problems much research has been done and overviews of proposed techniques are given in [1,2,3]. Especially classification techniques have turned out to provide robust results for these tasks and are hard to compete in efficiency. For object detection the probably most widely 1
www.robocupathome.org
used technique is AdaBoost introduced by Viola and Jones [4]. For face recognition/identification several classification methods have been applied [5,6] and also tracking recently has been often solved by fromulating it as a classification problem between object and background [7,8,9]. However, only a few approaches exist considering the problem of detection, recognition and tracking as a common problem [10,11]. At most they are combined using consecutive stages and using totally independent techniques for the specific tasks (i.e. motion detection for tracker initialization). Related is also the work of Zisserman et al. [12,13] on face recognition in feature-length films. Their task is to perform person retrieval from movies, and they use face detection and tracking to perform that task. Their system demonstrates that by closely integrating the tasks an impressive performance can be obtained. However their approach is not applicable to our problem because of the high computational costs. To summarize, there does not exist an efficient combination of detection, recognition and tracking however for all specific tasks classification methods have been successfully applied. In this paper we propose a system which integrates detection, identification and further on tracking. The approach is applied to faces however can be used as well for other objects. All tasks are formulated as binary classification problems allowing to apply well established learning techniques. An integral data structure is shared among all modules allowing very fast and efficient feature extraction. The system is especially suited for any device having limited resources. In specific, we apply the proposed approach to an autonomous robot. The outline of the paper is as follows. In Section 2 we describe detection, recognition and tracking as binary classification problems. Furthermore the procedure how to share low-level computations among the modules as well as the used learning technique is presented. Section 3 shows experimental evaluations on a public database and in addition illustrative sample sequences captured from our mobile robot Flea. Section 4 concludes the paper and gives some outlook of ongoing work.
2 Identifying Familiar Persons and Unknowns
The identification of persons within images requires two steps. First, there is the need of a detection part which is responsible for locating all faces appearing in the image. Second, given the set of faces we want to distinguish between the class of known persons and unknown persons and further on between the individuals of known persons which is accomplished by the recognition step. The problem of identification is formulated in a coarse to fine manner by applying discriminative classification methods at both stages. In addition we are also interested in tracking of detected individuals since it allows our robot to track even if appearance changes (e.g. due to occlusion, view change) occur.
2.1 Detection, Recognition and Tracking as Binary Classification
The key idea of our approach is to formulate the abilities to detect, recognize and to track as classification problems, as depicted in Figure 1. By doing so we can apply the same techniques for all tasks. The major advantage is that low-level computations can be shared and have to be done only once. This will be explained in more detail in Section 2.2.
Fig. 1. Detecting, recognizing and tracking persons are considered as independent binary classification problems. A detector is trained off-line given a large set of positive labeled faces against non-face images. For recognition a specific face is trained against all other faces. For tracking an on-line classifier is used which allows continuously updating the model using the surrounding background as negative samples.
Detection. The task of the detector is to distinguish between the class of faces and the background. This can be formulated as a binary classification problem as proposed by Viola and Jones [14]: a training set X_d = {⟨x_{d,1}, y_{d,1}⟩, . . . , ⟨x_{d,n}, y_{d,n}⟩} is given, where x_{d,i} ∈ ℝ^m is an image patch and the class label y_{d,i} ∈ {+1, −1} indicates face or non-face, respectively. Using this training set a binary classifier is trained by applying an off-line learning algorithm L_off. Positive samples (faces) are hand-labeled and negative samples are drawn from an image database containing no faces. For evaluation, i.e., the detection of faces in an image, the classifier is applied exhaustively to image patches sampled at different locations and scales.
Recognition. Once a detection has been made by the detector, it is handed over to the recognizer module. In the first stage the recognizer classifies the provided sample as known or unknown, and in the second stage it verifies the identity in case of a known face. This can be formulated as a multi-class classification problem. Given a set of samples X_r = {⟨x_{r,1}, y_{r,1}⟩, . . . , ⟨x_{r,n}, y_{r,n}⟩}, where x_{r,i} ∈ ℝ^m represents a face image and y_{r,i} ∈ {0, 1, 2, . . . , M} is its label, each label corresponding to one of the M different individuals and 0 to unknowns. This problem can be rewritten in a one-vs.-all manner, which makes the use of binary classifiers feasible: for each person j = 1, . . . , M
we train a single binary classifier against the other classes. Thus, the training set for the classifier C_j is X_{r,j} = {⟨x_{r,i}, +1⟩ | y_{r,i} = j} ∪ {⟨x_{r,i}, −1⟩ | y_{r,i} ≠ j}. The classifier C_j is created by applying an off-line learning algorithm L_off on this training set. In fact, a model is learned which best discriminates the current person from the other given identities, as shown in Figure 1. In order to obtain robustness against the class unknown, we add arbitrary faces (e.g., from a face database used for training the face detector) as negative samples for the training. In the evaluation step, each classifier C_j(x) evaluates the face image x and provides a confidence value. The classifier with the highest response delivers the class label ŷ:

ŷ = arg max_{j=1,...,M} C_j(x)    (1)
The class unknown is recognized if all classifiers' responses are below a certain threshold. This approach can be extended by using an on-line learning algorithm L_on for updating the classifiers. It allows novel persons to be added by simply using their samples as negative updates to the existing classifiers and training a new classifier as shown above.
Tracking. Tracking allows us to localize the object even if detection fails due to appearance changes (e.g., occlusion). Furthermore, it helps to eliminate possible false detections and to increase recognition accuracy. Following the formulation in [9], we summarize the main steps of formulating tracking as a binary classification problem. Once the target object has been detected at time t, it is assumed to be a positive image sample ⟨x, +1⟩_{t=0} for the classifier. At the same time negative examples {⟨x_1, −1⟩, . . . , ⟨x_n, −1⟩}_{t=0} are extracted by taking regions of the same window size from the surrounding background. Given these examples an initial classifier C_{t=0} is trained. The tracking step is based on the classical approach of template tracking: the current classifier C_t is evaluated at the surrounding region of interest, and we obtain for each sub-patch a confidence value which indicates how well the underlying image patch fits the current model. Afterwards we analyze the obtained confidence map and shift the target window to the new maximum location. Next, the classifier has to be updated in order to adjust to possible changes in the appearance of the target object and to become discriminative to a different background. The current target region is used for a positive update of the classifier, while surrounding regions are again taken as negative samples. This update policy has proved to allow stable tracking in natural scenes. As new frames arrive, the whole procedure is repeated, and the classifier is therefore able to adapt to possible appearance changes and in addition becomes robust against background clutter. Note that, in order to formulate tracking as a classification task, we need an on-line learning algorithm L_on. The binary classifier updates the model (decision boundary) by using the information from a single new sample ⟨x, y⟩, with x ∈ ℝ^m and y ∈ {+1, −1}.
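As an illustration of this decision rule, the following minimal sketch implements Eq. (1) together with the rejection of unknowns. It assumes a hypothetical per-person classifier object exposing a confidence(patch) method returning a value in [−1, +1]; the names and the threshold value are ours, not part of the original system.

```python
import numpy as np

def identify(face_patch, classifiers, threshold=0.0):
    """One-vs-all identification with rejection of unknowns.

    classifiers : list of per-person binary classifiers; each is assumed
                  to expose a confidence(patch) method returning a value
                  in [-1, +1] (hypothetical interface).
    Returns the index of the recognised person, or -1 for 'unknown'.
    """
    confidences = np.array([c.confidence(face_patch) for c in classifiers])
    best = int(np.argmax(confidences))     # Eq. (1): y_hat = argmax_j C_j(x)
    if confidences[best] < threshold:      # all responses too low -> unknown
        return -1
    return best
```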
2.2 Efficient Features and a Single Shared Data Structure
An overview of the proposed system is given in Figure 2. For each frame the integral representation needs to be computed only once and is then used by all three modules for feature computation. Note that each unit selects features appropriate for its specific task; the computation time of the features is nevertheless negligible.
Fig. 2. Each module (detector, tracker and recognizer) is based on the same classification method, allowing the use of the same feature types (Haar-like wavelets). These features can be computed very efficiently using a shared integral data structure.
For the binary classification of image patches we propose to use the classical approach of Viola and Jones [14]. Their main assumption is that a small set of simple image features can separate the two classes. The selection of the features is done by a machine learning algorithm. In the following we briefly summarize the applied techniques.
Features. As features we use simple Haar-like features². We spend some time on the pre-computation of an efficient data structure, namely the integral image, which can be used for fast feature evaluation. This pre-computation has to be done only once, since all three modules share this information.
Boosting for Feature Selection. For training a classifier we apply boosting for feature selection [17,14]. The core of the technique is the machine learning algorithm AdaBoost [18]. Given a set of training samples X = {⟨x_1, y_1⟩, . . . , ⟨x_n, y_n⟩}, with x_i ∈ ℝ^m and y_i ∈ {−1, +1}, boosting builds an additive model of weak classifiers in the training stage. At each iteration a weak classifier is trained using a weight distribution over the training samples. In order to perform feature selection, each weak classifier corresponds to a simple image feature. Afterwards the samples are re-weighted. The result is a strong classifier
² Note that other kinds of features, such as edge orientation histograms [15] or Local Binary Patterns [16], can also be built using integral data structures.
H(x) = sign(conf(x)),    conf(x) = (1 / ∑_{i=1}^{N} α_i) ∑_{i=1}^{N} α_i · h_i(x)    (2)
which consists of a weighted linear combination of N weak classifiers h_i. The value conf(x), bounded to the interval [−1, +1], denotes how confident the classifier is about its decision. This fulfills the requirements on the classifier C from the previous section. Boosting, and especially boosting for feature selection as described above, runs off-line, meaning that all the training data is given at once. Primarily for tracking we need an on-line learning algorithm. For the on-line adaptation of the classifier we make use of an on-line version [19]. The key idea is to introduce so-called selectors, which hold a set of weak classifiers and of which each can choose exactly one. An on-line boosting algorithm [20] is performed on the selectors and not on the weak classifiers directly. Updating can be done efficiently. After updating the classifier, the evaluation is similar to the off-line case, because each selector has chosen one specific weak classifier which again corresponds to a single feature.
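To make the shared data structure concrete, here is a small CPU sketch of an integral image, a two-rectangle Haar-like feature evaluated from it, and the normalized weighted vote of Eq. (2). The feature parametrization and the weak-classifier representation are our own simplifications, not the authors' implementation.

```python
import numpy as np

def integral_image(img):
    # Cumulative sum over rows and columns; shared by detector, recognizer and tracker.
    return img.cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, x, y, w, h):
    # Sum of pixels in the rectangle [x, x+w) x [y, y+h) using four lookups.
    A = ii[y - 1, x - 1] if x > 0 and y > 0 else 0
    B = ii[y - 1, x + w - 1] if y > 0 else 0
    C = ii[y + h - 1, x - 1] if x > 0 else 0
    D = ii[y + h - 1, x + w - 1]
    return D - B - C + A

def haar_two_rect(ii, x, y, w, h):
    # Simple two-rectangle feature: left half minus right half.
    return box_sum(ii, x, y, w // 2, h) - box_sum(ii, x + w // 2, y, w - w // 2, h)

def strong_classifier(ii, weak_classifiers):
    """weak_classifiers: list of (feature_params, theta, polarity, alpha) tuples
    (a hypothetical representation). Returns conf(x) in [-1, +1] as in Eq. (2)."""
    num = sum(a * p * np.sign(haar_two_rect(ii, *f) - t)
              for f, t, p, a in weak_classifiers)
    den = sum(a for _, _, _, a in weak_classifiers)
    return num / den
```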
3 Results
First we introduce our autonomous robot and give some relevant details regarding the hardware setup. We then present a performance evaluation of our proposed system, focusing on the recognition accuracy as well as on the ability to distinguish between known and unknown individuals. Finally, an illustrative experiment is shown, which is also available at www.flea.at.
3.1 Robot Flea
The hardware setup consists of an ActiveMedia Peoplebot platform including diverse sensors (e.g., sonar, IR). The robot's head has thirteen degrees of freedom and can move its eyes, mouth, eyebrows, forehead, chin and neck. A Dual Core Centrino with 2 GHz and 1024 MB RAM serves as the main
Fig. 3. Our robot Flea consists of a humanoid head. Visual information is obtained through a camera which is included in the artificial eye. A captured image from the view of the robot is depicted on the right.
Fig. 4. Performance evaluation of the recognition system: (a) scalability: the recognition rate when increasing the number of persons; (b) performance: the trade-off of recognizing persons vs. unknowns.
processing unit. A stereo camera from Videre Design (STH-MDCS2-VAR, max. 1280 × 960, used at 640 × 480) is used for capturing images, and about 12 frames per second are processed with a non-optimized C++ implementation. Training of the robot is done in a fully autonomous way. When Flea meets an unknown person, she focuses on him or her and asks for the name and other relevant information. During the conversation she starts collecting training samples of the person and trains a classifier for identification. When meeting Flea somewhere and asking the robot "Flea, do you remember me?", she will reply "Sure, ...", adding your name in case she knows you, and otherwise ask you for your name. This is exactly the task that has to be fulfilled within the "Who is Who?" competition of RoboCup@Home.
3.2 Recognition Performance
For the evaluation of the recognition accuracy, and in particular the recognition of the class unknown, we use the AT&T database³, which includes 40 different persons with 10 samples per class. This dataset is well suited since in our application it is not important to distinguish a huge number of individuals, but it is important to accurately recognize the class of unknown individuals. In the first experiment we want to demonstrate the recognition accuracy with respect to the number of classes. The dataset has been randomly split into a training and a test set (70% training and 30% testing). The result, depicted in Figure 4, has been obtained by running the experiment 10 times. The second evaluation illustrates on the one hand the performance of distinguishing between unknown and known faces, and on the other hand the trade-off of recognizing the correct class in case of a known person. We use the same training procedure as in the previous evaluation. We randomly selected 20
³ www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html
Fig. 5. Subset of training samples for three different persons (from upper left to bottom right: Chris, Joe, Ann) and samples taken from the training database (bottom right set) as additional negative samples
(a) Frame 215  (b) Frame 220  (c) Frame 355  (d) Frame 357  (e) Frame 431  (f) Frame 839
Fig. 6. Sample sequence from the perspective of Flea. Learned faces are detected, robustly tracked (d) and correctly identified if they are known (e-f). Different individuals are marked by different colored rectangles. The approach runs on the robot at about 12 frames per second.
individuals from the dataset and applied cross-validation to obtain statistically significant results. As depicted in Figure 4, the overall task of distinguishing between unknown and known faces is handled properly. The difference in performance between recognizing the correct identity and recognizing just the class known is marginal.
3.3 Sequences
We also want to demonstrate a typical "Who is Who?" scenario. Three persons introduce themselves to the robot, while the robot collects samples of each
individual, as depicted in Figure 5. Training each individual is done within a few seconds, including the capturing of the faces. Figure 6 illustrates the applied system on a sequence from our autonomous robot. As can be seen, the approach handles the recognition of known and unknown faces and shows the benefit of combining detection, recognition and tracking.
4 Conclusion
We have combined detection, recognition and tracking by formulating all tasks as binary classification problems. As a result, low-level computations can be shared among all modules. Due to the efficient feature computation, the approach can be used within real-time applications such as autonomous robots. Note that the approach is not limited to faces, since all modules are generic, and it can therefore be applied to any other type of object. The common formulation opens several new avenues, such as improving (specializing) detectors and recognizers for images taken from a static camera, as is the case in video surveillance applications.
Acknowledgement. This work has been sponsored by the Austrian Joint Research Project Cognitive Vision under projects S9103-N04 and S9104-N04, and by the EC-funded NoE MUSCLE IST-507572.
References 1. Yang, M.H., Kriegman, D.J., Ahuja, N.: Detecting faces in images: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(1), 34–58 (2002) 2. Tan, X., Chen, S., Zhou, Z.H., Zhang, F.: Face recognition from a single image per person: A survey. Pattern Recognition 39(9), 1725–1745 (2006) 3. Yilmaz, A., Javed, O., Shah, M.: Object tracking: A survey. ACM Computing Surveys 38(4) (2006) 4. Viola, P., Jones, M.: Robust real-time object detection. International Journal of Computer Vision (2002) 5. Jonsson, K., Kittler, J., Li, Y.P., Matas, J.: Learning support vectors for face verification and recognition. In: Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition, pp. 208–213. IEEE Computer Society Press, Los Alamitos (2000) 6. Yang, P., Shan, S., Gao, W., Li, S.: Face recognition using Ada-boosted Gabor features. In: Proceedings Conference on Automatic Face and Gesture Recognition, pp. 356–361 (2004) 7. Avidan, S.: Ensemble tracking. In: Proceedings IEEE Conference Computer Vision and Pattern Recognition, vol. 2, pp. 494–501 (2005) 8. Avidan, S.: Support vector tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 26, 1064–1072 (2004)
9. Grabner, H., Grabner, M., Bischof, H.: Real-time tracking via on-line boosting. In: Proceedings British Machine Vision Conference, vol. 1, pp. 47–56 (2006) 10. Ebbecke, M., Ali, M., Dengel, A.: Real time object detection, tracking and classification in monocular image sequences of road traffic scenes. In: Proceedings International Conference on Image Processing, vol. 2, pp. 402–405 (1997) 11. Hernández, M., Cabrera, J., Dominguez, A., Santana, M.C., Guerra, C., Hernández, D., Isern, J.: Deseo: An active vision system for detection, tracking and recognition, pp. 376–391 (1999) 12. Arandjelovic, O., Zisserman, A.: Automatic face recognition for film character retrieval in feature-length films. In: Proceedings IEEE Conference Computer Vision and Pattern Recognition, vol. 1, pp. 860–867 (2005) 13. Sivic, J., Everingham, M., Zisserman, A.: Person spotting: Video shot retrieval for face sets. In: Proceedings International Conference on Image and Video Retrieval, pp. 226–236 (2005) 14. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Proceedings IEEE Conference Computer Vision and Pattern Recognition, vol. 1, pp. 511–518 (2001) 15. Porikli, F.: Integral histogram: A fast way to extract histograms in cartesian spaces. In: Proceedings IEEE Conference Computer Vision and Pattern Recognition, vol. 1, pp. 829–836 (2005) 16. Ojala, T., Pietikäinen, M., Mäenpää, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(7), 971–987 (2002) 17. Tieu, K., Viola, P.: Boosting image retrieval. In: Proceedings IEEE Conference Computer Vision and Pattern Recognition, vol. 1, pp. 228–235 (2000) 18. Freund, Y., Schapire, R.: A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1), 119–139 (1997) 19. Grabner, H., Bischof, H.: On-line boosting and vision. In: Proceedings IEEE Conference Computer Vision and Pattern Recognition, vol. 1, pp. 260–267 (2006) 20. Oza, N., Russell, S.: Online bagging and boosting. In: Proceedings Artificial Intelligence and Statistics, pp. 105–112 (2001)
Multi-view Gymnastic Activity Recognition with Fused HMM Ying Wang, Kaiqi Huang, and Tieniu Tan National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing {wangying,kqhuang,tnt}@nlpr.ia.ac.cn
Abstract. More and more researchers focus their studies on multi-view activity recognition, because a fixed view cannot provide enough information for recognition. In this paper, we use multi-view features to recognize six kinds of gymnastic activities. Firstly, shape-based features are extracted from two orthogonal cameras in the form of the R transform. Then a multi-view approach based on the Fused HMM is proposed to combine the different features for the recognition of similar gymnastic activities. Compared with other activity models, our method achieves better performance, even in the case of frame loss.
1 Introduction
Human activity recognition is a hot topic in the domain of computer vision. There is a wide range of open questions in this field, such as dynamic background modelling, object tracking under occlusion, activity recognition and so on [1]. Most previous activity recognition methods are dependent on the view direction. In these works, there is a strong assumption that the low-level features for the subsequent activity recognition are obtained without any ambiguity. However, recognizing actions from a single camera suffers from the unavoidable fact that parts of the action are not visible to the camera because of self-occlusions. Moreover, an action looks different from different views, and some activities may not be captured because of the loss of depth information. In [2], Madabhushi and Aggarwal recognized 12 different actions in the frontal or lateral view using the movement of the head, but they were not able to model and test all the actions due to the problem of self-occlusion for some actions in the frontal view. Therefore great efforts have been made to find robust and accurate approaches to solve this problem. In fact, while performing an action, the object essentially generates a view-independent 3D trajectory or shape in (X, Y, Z) space with respect to time. Thus 3D methods can recognize activities efficiently without the trouble of self-occlusion and depth information loss. In [3], the authors extracted the 3D shape for recognizing human postures using support vector machines. In [4], Chellappa et al. chose six joint points of the body and calculated the 3D invariants of each posture. In [5], the Motion History Volume (MHV) was proposed to extract view-invariant features in Fourier space for recognizing actions from a variety of viewpoints. Alignment and comparisons were performed using the Fourier transform in
Fig. 1. The flowchart of multi-view activity recognition
cylindrical coordinates around the vertical axis [5]. For all 3D methods, in order to use affine transformations in activity learning and recognition, point correspondences are needed, which have a high computational cost. To avoid this, a simple mechanism is to use 2D data from several views, which can describe the activities integrally at a low computational cost. Some attempts combine the features from different views when constructing the activity model. In [6], Bui et al. constructed an Abstract Hidden Markov Model (AHMM) to hierarchically encode the wide spatial locations from different views and describe activities at different levels of detail. Some researchers try to directly fuse multi-view information at the feature level. In [7], Bobick et al. used two cameras at orthogonal views to recognize activities by temporal templates. Motion history information (MHI) was proposed to represent activity, which has temporal information but no spatial information of the motion. In [8], Huang proposed a representation, "Envelop Shape", obtained from the silhouettes of objects using two orthogonal cameras for view-insensitive action recognition. However, "Envelop Shape" simply overlaps the silhouettes from the two views, which inevitably destroys the correspondence between consecutive frames of the respective views. 2D data from multiple views are easy to acquire, but how to use them efficiently deserves further research. In this paper, six kinds of gymnastic activities are recognized. From a single view, many movements look so similar that they cannot be classified correctly. We therefore use two cameras with orthogonal views to capture more silhouette features. Different from previous work, we use the R transform, a novel shape descriptor, to represent gymnastic activities. Then activity models, Fused HMMs (FHMMs) based on features extracted by the R transform, are trained for
six kinds of gymnastic activities; these models merge the different activity features captured from the different views. The overall system architecture is illustrated in Fig. 1. The remainder of this paper is organized as follows. In Section 2, we describe the six kinds of gymnastic activities analysed in this paper. In Sections 3 and 4, we provide a detailed description of the R transform and the FHMM. Section 5 demonstrates the effectiveness of the proposed method by comparison with other activity models. Finally, some conclusions are drawn in Section 6.
2 Activity Description
In this study, we focus our attention on gymnastic activities. Gymnastics is rhythmical, and each activity starts by standing with one's arms down without any motion and ends with the same stance. In this framework, we divide the gymnastic video of each person into six kinds of activities. Table 1 describes these activities with their respective numbers and abbreviations. Some silhouette examples sampled from the video sequences of each activity are shown in Fig. 2. In each sub-figure, the first row shows the silhouettes from the frontal view and the second row shows the silhouettes from the lateral view.

Table 1. Description of the six activity types
1. AtoS: Raise arms out to the side with elbows straight to shoulder height, keep standing with the arms held at that height, and then put the arms down to the sides of the body.
2. AtoC: Lift both arms up towards the ceiling and then put the arms back to the starting position.
3. Aup: Raise one arm forward with elbow straight towards the ceiling and then down.
4. AupLup: Lift both arms up towards the ceiling while one leg moves backwards, then the arms backwards while one leg moves forwards, and finally put arm and leg down to the starting position.
5. Lup: Lift both hands up to shoulder height while raising the lap until it is parallel to the floor and the knee points to the ceiling, and finally put arm and leg down to the starting position.
6. Still: Keep the body still (this occurs at the end of each activity).
These activities are so similar that they are difficult to discriminate from a single view. Because of the loss of depth information, fore-and-aft movements towards the camera cannot be captured on the 2D image plane. As for the shape sequences, there is only a little variation, which cannot represent the detailed activity information, as shown in Fig. 2.4, which is hard to recognize. For example, from the frontal view, activity 1 and activity 5 are different movements but have quite similar shape variations, as shown in Fig. 2.1 and 2.5; so do activities 2 and 4 (Fig. 2.2 and 2.4). From the lateral view, activities 1 and 6 have the same shape variation (Fig. 2.1 and 2.6). In order to discriminate these seemingly
(1. Hand upwards to shoulder height)  (2. Two hands upwards to ceiling)  (3. One hand upwards to ceiling)  (4. Two hands upwards to ceiling, leg forwards and backwards alternately)  (5. One leg upwards till parallel to floor)  (6. Body keeping still)
Fig. 2. Examples of extracted silhouettes in video sequences from two views
similar but actually different activities, two views are needed to provide richer information. As shown in Fig. 2, some activity sequences have similar variations from one view, but can be discriminated from the other view. These activity sequences, easily misclassified from a single view, can thus be recognized correctly from two views.
3 Low-Level Feature Representation by the R Transform
Feature representation is a key step of human activity recognition because it is an abstraction of the original data into a compact and reliable format for later
processing. In this paper, we adopt a novel feature descriptor, the R transform, which is an extended Radon transform [10]. The two-dimensional Radon transform is the integral of a function over the set of lines in all directions, which is roughly equivalent to finding the projection of a shape onto any given line. For a discrete binary image f(x, y), its Radon transform is defined by [11]:

T_R f(ρ, θ) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) δ(x cos θ + y sin θ − ρ) dx dy = Radon{f(x, y)}    (1)

where θ ∈ [0, π], ρ ∈ (−∞, ∞) and δ(·) is the Dirac delta function,

δ(x) = 1 if x = 0, and 0 otherwise    (2)

However, the Radon transform is sensitive to scaling, translation and rotation, and hence an improved representation, called the R transform, is introduced [9,10]:

Rf(θ) = ∫_{−∞}^{∞} T_R^2 f(ρ, θ) dρ    (3)

The R transform has several useful properties for shape representation in activity recognition [9,10]:

Translation of the image by a vector μ = (x_0, y_0):

∫_{−∞}^{∞} T_R^2 f(ρ − x_0 cos θ − y_0 sin θ, θ) dρ = ∫_{−∞}^{∞} T_R^2 f(ν, θ) dν = Rf(θ)    (4)

Scaling of the image by a factor α:

(1/α^2) ∫_{−∞}^{∞} T_R^2 f(αρ, θ) dρ = (1/α^3) ∫_{−∞}^{∞} T_R^2 f(ν, θ) dν = (1/α^3) Rf(θ)    (5)

Rotation of the image by an angle θ_0:

∫_{−∞}^{∞} T_R^2 f(ρ, θ + θ_0) dρ = Rf(θ + θ_0)    (6)

According to the symmetry property of the Radon transform, and letting ν = −ρ,

∫_{−∞}^{∞} T_R^2 f(−ρ, θ ± π) dρ = −∫_{∞}^{−∞} T_R^2 f(ν, θ ± π) dν = ∫_{−∞}^{∞} T_R^2 f(ν, θ ± π) dν = Rf(θ ± π)    (7)
From equations (4)-(7), one can see that:
1. Translation in the plane does not change the result of the R transform.
2. A scaling of the original image only induces a change of amplitude. In order to remove the influence of body size, the result of the R transform is therefore normalized to the range [0, 1].
3. A rotation of θ_0 in the original image leads to a phase shift of θ_0 in the R transform. In this paper, the recognized activities rarely exhibit such rotations.
4. Considering equation (7), the period of the R transform is π. Thus a shape vector of dimension 180 is sufficient to represent the spatial information of a silhouette.
Therefore, the R transform is robust to geometric transformations, which makes it appropriate for activity representation. According to [9], the R transform outperforms other moment-based descriptors, such as Wavelet moments, Zernike moments and Invariant moments, on similar but actually different shape sequences, even in the case of noisy data.
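To make the descriptor concrete, a discrete approximation of Eqs. (1)-(3) can be computed by rotating the silhouette and summing along one axis to obtain the Radon projections, then integrating their squares over ρ for each θ. This rotation-based sketch (1° sampling, amplitude normalization to [0, 1]) is only illustrative and is not the authors' implementation.

```python
import numpy as np
from scipy.ndimage import rotate

def r_transform(silhouette, angles=np.arange(0, 180)):
    """R transform of a binary silhouette image.
    Returns a vector R(theta) normalised to [0, 1] (Eq. (3) plus amplitude normalisation)."""
    sil = silhouette.astype(float)
    r = np.empty(len(angles))
    for k, theta in enumerate(angles):
        # Radon projection at angle theta: rotate the image and integrate along columns.
        rotated = rotate(sil, angle=float(theta), reshape=True, order=1)
        projection = rotated.sum(axis=0)     # T_R f(rho, theta) over all rho
        r[k] = np.sum(projection ** 2)       # integral of the squared projection
    return r / (r.max() + 1e-12)             # remove the influence of body size
```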
(Arm up and leg up)  (Arm back, leg forwards)  (Arm up, leg backwards)
Fig. 3. R transform of key frames for different activities from two views
Fig. 3 shows silhouette examples extracted from different activities. Each row shows the same frame from two views, and the sub-figure following each silhouette is its R transform. The R transform curves of the different activities from the two views show different variations. For example, from the frontal view, the R transform curve of the first row has two peaks, one at about 90° and the other at about 170°; that of the second row has no peak, and that of the last row has a peak close to 10°. This shows that the R transform can describe the spatial information sufficiently and characterize the different activity silhouettes effectively.
4 Fused Hidden Markov Model
The next step is to combine the features obtained from the different views. Here we employ the FHMM, which was proposed by Pan et al. for bimodal speech processing [12]. Like the Coupled HMM (CHMM) [13], the FHMM consists of two HMMs, as shown in Fig. 4 (where circles represent observations and rectangles hidden states; each red rectangle is one HMM component). However, unlike the CHMM's connections between hidden states, the FHMM's connections are between the hidden state nodes of one HMM and the observation nodes of the other, as shown in Fig. 4.
(1) CHMM  (2) FHMM
Fig. 4. The graphical structure of CHMM and FHMM
Assume O_1 and O_2 are two different observation sequences. In the FHMM's parameter training, the focus is to estimate the joint probability function P(O_1, O_2). However, a straightforward estimation of the joint likelihood P(O_1, O_2) is not desirable because the computation is inefficient and large amounts of training data are required. To solve this problem, Pan et al. train the two HMMs separately, and then use their respective parameters to estimate an optimal solution for P(O_1, O_2) [12,14]. In the maximum entropy sense, an optimal (if less precise) solution P̂(O_1, O_2) is given by [15]:

P̂(O_1, O_2) = P(O_1) P(O_2) · P(w, v) / (P(w) P(v))    (8)

where w = f(O_1), v = g(O_2), and f(·) and g(·) are mapping functions which must satisfy the following requirements: 1. the dependencies between w and v describe the dependencies between O_1 and O_2 to some extent; 2. P(w, v) is easy to estimate. In other words, f(·) and g(·) should maximize the mutual information of w and v [16]. This is an ill-posed inverse problem with more than one solution. Specifically, we choose w = Û_1 = arg max_{U_1} (log p(O_1, U_1)) and v = O_2 according to the maximum mutual information criterion [16]. Then equation (8) is expressed as:

P̂(O_1, O_2) = P(O_1) P(O_2) · P(Û_1, O_2) / (P(Û_1) P(O_2)) = P(O_1) P(O_2 | Û_1)    (9)

Finally, the computation of the joint probability P(O_1, O_2) is converted into estimating P(O_1) and P(O_2 | Û_1). According to the process mentioned above, the learning algorithm of the FHMM includes three steps [12,14]:
1) Learn the parameters of the two individual HMMs independently by the EM algorithm: (Π_1, A_1, B_1) and (Π_2, A_2, B_2).
2) Determine the optimal hidden states of the HMMs using the Viterbi algorithm with the obtained parameters: Û_1 and Û_2.
3) Estimate the coupling parameters P(O_2 | Û_1) using the known parameters:

P(O_2 = k | Û_1 = i) = ∑_{t=0}^{T−1} δ(O_2^t − k) δ(Û_1^t − i) / ∑_{t=0}^{T−1} δ(Û_1^t − i)    (10)

where k ranges over the observation values of O_2 and i over the states of HMM 1.
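The three-step training procedure can be sketched as follows. Steps 1 and 2 (EM training of the two HMMs and Viterbi decoding) are assumed to come from any standard HMM library; the sketch only shows step 3, the coupling matrix of Eq. (10), and the fused log-likelihood of Eq. (9), under the simplifying assumption of discretized observations for O_2. All names are ours.

```python
import numpy as np

def coupling_matrix(viterbi_states_1, obs_symbols_2, n_states_1, n_symbols_2):
    """Estimate P(O2 = k | U1_hat = i) from a paired training sequence (Eq. (10))."""
    counts = np.zeros((n_states_1, n_symbols_2))
    for i, k in zip(viterbi_states_1, obs_symbols_2):
        counts[i, k] += 1.0
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0      # avoid division by zero for unvisited states
    return counts / row_sums

def fused_log_likelihood(loglik_1, viterbi_states_1, obs_symbols_2, coupling):
    """log P_hat(O1, O2) = log P(O1) + sum_t log P(O2_t | U1_hat_t)   (Eq. (9))."""
    cond = coupling[viterbi_states_1, obs_symbols_2]
    return loglik_1 + np.sum(np.log(cond + 1e-12))
```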
5 Experimental Analysis
5.1 Experimental Data and Feature Extraction
Experimental data are synchronized videos (320×240, 25 fps) obtained by two cameras placed roughly orthogonally. The experiments are based on 300 low-resolution video sequences of 50 different people, each performing the six gymnastic activities described in Table 1. The resulting silhouettes contain holes and shadows due to imperfect background segmentation. To train the activity models, holes, shadows and other noise are removed manually to form ground truth data. 180 of the 300 sequences (30 of the 50 people) are used for training, while 120 of the 300 sequences (20 of the 50 people) are used for recognition. Then the R transform is used to extract the spatial information of the postures in the videos. Because the R transform is non-orthogonal, the 180-dimensional shape vector is redundant. In general, PCA is employed to obtain compact and accurate information for each video sequence. According to a preliminary analysis of each activity, we find that 10 principal components are enough to represent 98% of the variance.

Table 2. Recognition results based on FHMM (20 test samples per activity)
1. AtoS: 16 correct, recognition rate 80%
2. AtoC: 17 correct, recognition rate 85%
3. Aup: 15 correct, recognition rate 75%
4. AupLup: 19 correct, recognition rate 95%
5. Lup: 18 correct, recognition rate 90%
6. Still: 16 correct, recognition rate 80%

Then six
FHMMs, each consisting of a 2-state HMM for the frontal view and a 3-state HMM for the lateral view, are constructed to combine the features of the two views and to model the six kinds of activities in Table 1. The FHMM achieves good recognition results, as shown in Table 2 (each activity has 20 testing samples). Activity 4 achieves the best recognition rate, 95%. Even the poorest result, 75% for activity 3, is promising.
5.2 Comparison with Other Graphical Models
In order to evaluate the robustness and coupling ability, the FHMM is compared with the CHMM and an Independent HMM (IHMM). The two component HMMs in the CHMM and the IHMM have the same structures as those of the FHMM, i.e., a 2-state HMM for the frontal view and a 3-state HMM for the lateral view. Both of them use the same training and testing data as the FHMM. The structure of the CHMM is shown in Fig. 4.1; more details on parameter training and inference can be found in [13]. The IHMM assumes O_1 and O_2 to be independent, so the joint probability of the two observations is computed by P(O_1, O_2) = P(O_1)P(O_2). This means the IHMM simply multiplies the observation probabilities of two independent HMMs. As shown in Fig. 5.1, although the three methods achieve different recognition rates for each activity, the overall performance ranking is: FHMM > CHMM > IHMM. The IHMM obtains the worst recognition performance because it does not consider the correlations between the observations from the two views. As described in Section 2, some activities look very similar from a single view, and thus misrecognition is hard to avoid; this misrecognition increases linearly with the product of P(O_1) and P(O_2). The recognition performance of the CHMM is better than that of the IHMM but worse than that of the FHMM, because the CHMM optimizes all parameters globally by iteratively updating the component HMMs' parameters and the coupling parameters. Therefore more training data and more iterations are needed for it to converge. Given the same training data as the FHMM, but a higher demand for training data, the parameters of the CHMM may
(1. Ground truth data)  (2. Frame loss data)  (3. Frame loss data)
Fig. 5. Recognition rates for CHMM, FHMM and IHMM in the case of different data
not be robust. Moreover, the component HMMs of the CHMM are linked through their hidden states, which cannot fully represent the statistical relationship between the observations extracted from the different cameras: the dependence between the hidden states is too weak to represent the coupled observation videos accurately. Compared with them, the FHMM links the hidden states of one HMM to the observations of the other, which gives it a stronger coupling ability than the CHMM.
5.3 Comparison Experiments in the Case of Frame Loss
In order to compare the robustness of the CHMM, FHMM and IHMM, we simulate 120 sequences (20 samples per activity type) with frame loss by removing 10 frames (frames 26 to 35; each activity has about 50–90 frames). Fig. 5.2 illustrates the recognition results for the three models. In spite of lower recognition rates than on the ground truth data, the FHMM still outperforms the CHMM and the IHMM. This shows that the FHMM is more robust to frame loss in the video than the other two models. In order to test the coupling ability of the three models, we also simulate 120 frame-loss sequences (20 samples per type), but remove 10 frames from the frontal view (frames 26 to 35) and a different 10 frames from the lateral view (frames 16 to 25). Fig. 5.3 illustrates the recognition results for the three models. Compared with Fig. 5.2, the performance of the FHMM and the IHMM does not change much, but that of the CHMM decreases noticeably (compare the blue parts of Fig. 5.2 and 5.3, respectively). Note that the CHMM does not always perform better than the IHMM: for activities 1 and 4, the CHMM gets even worse results than the IHMM. Because the frame loss in the two views does not happen at the same time, the state relationship coupled in the CHMM is destroyed, which leads to a lower recognition rate.
6 Conclusion
From the theoretical and experimental analysis of our proposed approach, we can see that the FHMM based on the R transform descriptor has many advantages for multi-view activity recognition. Firstly, only the silhouette is taken as input, which is easier to obtain than meaningful feature points that need tracking and correspondence. Secondly, the R transform descriptor captures both the boundary and the internal content of the shape. The computation of the 2D R transform is linear, so the computation cost is low. Moreover, the R transform performs well for similar but actually different shape sequences, e.g., gymnastic activities. Thirdly, activity features based on multiple views are easy to acquire and carry abundant information for discriminating similar activities. Finally, compared with the CHMM and the IHMM, the FHMM achieves the best performance with lower model complexity and computational cost. Even in the case of frame loss, the FHMM shows strong robustness, with a strong ability to couple the inputs from the two views.
Acknowledgment The work reported in this paper was funded by research grants from the National Basic Research Program of China (No. 2004CB318110), the National Natural Science Foundation of China (No. 60605014, No. 60335010 and No. 2004DFA06900) and CASIA Innovation Fund for Young Scientists.
References 1. Hu, W., Tan, T., Wang, L., Maybank, S.: A survey on visual surveillance of object motion and behaviors. IEEE Trans. on Systems, Man and Cybernetics, Part C: Applications and Reviews 34, 334–352 (2004) 2. Madabhushi, A.R., Aggarwal, J.K.: Using movement to recognize human activity. In: ICIP, vol. 4, pp. 698–701 (2000) 3. Cohen, I., Li, H.: Inference of human postures by classification of 3D human body shape. In: IEEE Internal Workshop on FG, pp. 74–81. IEEE Computer Society Press, Los Alamitos (2003) 4. Parameswaren, V.V., Chellappa, R.: Human Action-Recognition Using Mutual Invariants. Computer Vision and Image Understanding 98, 295–325 (2005) 5. Weinland, D., Ronfard, R., Boyer, E.: Free Viewpoint Action Recognition using Motion History Volumes. In: CVIU (2006) 6. Bui, H., Venkatesh, S., West, G.: Policy Recognition in the Abstract Hidden Markov Model. Journal of Artificial Intelligence Research 17, 451–499 (2002) 7. Bobick, A., Davis, J.: The recognition of human movement using temporal templates. In: PAMI, vol. 23, pp. 257–267 (2001) 8. Huang, F., Di, H., Xu, G.: Viewpoint Insensitive Posture representation for action recognition (2006) 9. Wang, Y., Huang, K., Tan, T.: Human Activity Recognition based on Transform. In: The 7th IEEE International Workshop on Visual Surveillance, IEEE Computer Society Press, Los Alamitos (2007) 10. Tabbone, S., Wendling, L., Salmon, J.-P.: A new shape descriptor defined on the Radon transform. Computer Vision and Image Understanding 102 (2006) 11. Deans, S.R.: Applications of the Radon Transform. Wiley Interscience Publications, Chichester (1983) 12. Pan, H., Levinson, S.E., Huang, T.S., Liang, Z.-P.: A Fused Hidden Markov Model with Application to Bimodal Speech Processing. IEEE Transactions on Signal Processing 52, 573–581 (2004) 13. Brand, M., Oliver, N., Pentland, A.: Coupled hidden Markov models for complex action recognition. In: CVPR, pp. 994–999 (1997) 14. Rabiner, L.R.: A Tutorial On Hidden Markov Models and Selected Applications in Speech. Proceedings of the IEEE 77(2), 257–286 (1989) 15. Luttrell, S.P. (ed.): The use of Bayesian and entropic methods in neural network theory. Maximum Entropy and Bayesian Methods, pp. 363–370. Kluwer, Boston (1989) 16. Pan, H., Liang, Z.-P., Huang, T.S.: Estimation of the joint probability of multisensory signals. Pattern Recogn. Letter 22, 1431–1437 (2001)
Real-Time and Marker-Free 3D Motion Capture for Home Entertainment Oriented Applications
Brice Michoud, Erwan Guillou, Hector Briceño, and Saïda Bouakaz
Laboratory LIRIS - CNRS, UMR 5205, University of Lyon, France
{brice.michoud,erwan.guillou,saida.bouakaz}@liris.cnrs.fr
Abstract. We present an automated system for real-time, marker-free motion capture from two calibrated webcams. For fast 3D shape and skin reconstruction, we extend Shape-From-Silhouette algorithms. The motion capture system is based on simple and fast heuristics to increase efficiency. A multi-modal scheme using shape and skin-part analysis, temporal coherence, and human anthropometric constraints is adopted to increase robustness. Thanks to fast algorithms, low-cost cameras and the fact that the system runs on a single computer, our system is well suited for home entertainment devices. Results on real video sequences demonstrate the efficiency of our approach.
1 Introduction
In this paper we propose a real-time method for markerless 3D human motion capture suitable for home entertainment (see Fig. 2(a)). While commercial real-time products using markers are already available, online marker-free systems remain an open issue because many real-time algorithms still lack robustness or require expensive devices. While most popular techniques run on PC clusters, our system requires two low-cost cameras (e.g., webcams) and a laptop computer. Our system works in real time (at least 30 fps), without markers (active or passive) or any particular sensors. Several techniques have been proposed to tackle the marker-free motion capture problem; they differ in the number of cameras and the analysis method used. We now review techniques related to our work. Motion capture systems vary in the number of cameras used. Single-camera systems [1] encounter several limitations; in some cases they suffer from ambiguous responses, as different positions can yield the same image. Concerning multi-view approaches, most techniques are based on silhouette analysis [2,3]. These techniques provide good results if the reconstructed shape topology complies with the human topology, i.e., each body part is unambiguously mapped to the 3D shape estimation. In cases of self-occlusion or large contacts between limbs and body, these techniques frequently fail. The method of Caillette et al. [4] involves shape and color cues: colored blobs are linked to a kinematic model to track an individual's body parts. This technique requires contrasting clothing between body parts for tracking, thus adding a usability constraint. Few methods provide real-time motion
Fig. 1. (a) System overview: Reconstruction algorithms and Pose estimation algorithms. Body parts labeling (b) and joint naming (c).
capture from multiple views; most of them run at interactive frame rates (10 fps for [4]). We therefore propose a fully automated system for practical real-time motion capture from two calibrated webcams. Our process is based on simple heuristics, driven by shape and skin-part topology analysis and temporal coherence. It runs at 30 fps. This article is organized as follows. Section 2 presents our work on real-time 3D reconstruction. Section 3 gives an overview of the motion tracking. Section 4 details the motion tracking and Section 5 presents the initialization step. Experimental results are presented in Section 6; they show the validity and robustness of our process. In Section 7 we conclude on our contributions and outline perspectives for this work.
2 3D Shape and Skin-Parts Estimation
We propose extensions of Shape-From-Silhouette (SFS) algorithms that reconstruct, in real time, the 3D shape and the 3D skin-colored parts of a person. SFS methods compute in real time an estimation of the 3D shape of an object from its silhouette images. Silhouette images are binary masks corresponding to the captured images, where 0 corresponds to background and 1 to the (interesting) features of the object. The formalism of SFS was introduced by A. Laurentini [5]. By definition, an object lies inside the volume generated by back-projecting its silhouette through the camera center (called the silhouette's cone). With multiple views of the same object at the same time, the intersection of all the silhouette cones builds a volume called the "Visual Hull", which is guaranteed to contain the real object. There are mainly two ways to compute an object's Visual Hull: surface-based and volumetric-based methods. Surface-based approaches compute the intersection of the silhouette cone surfaces (see Fig. 2(b)). Silhouette edges are first converted into polygons, then back-projected to form silhouette cones. The intersection of these cones approximates the object shape ([6]). Because of their high computation time, these methods are
Fig. 2. (a) Interaction setup. (b) Object reconstruction by surface and volumetric approaches. (c) SFS using “projective texture mapping”. (d) Example of ghost objects.
not well suited for real-time reconstruction on a single computer. Volumetric-based approaches [2] usually estimate the shape by processing a set of voxels. The object's acquisition area is split up into a 3D grid of voxels (volume elements). Each voxel remains part of the estimated shape if its projection in all images lies in all silhouettes (see Fig. 2(b)). This volumetric approach is suitable for real-time pose estimation, due to its fast computation and robustness to noisy silhouettes. We propose a new framework which simultaneously computes a 3D volumetric shape estimation and a skin estimation from "hybrid" silhouettes. Our GPU implementation provides real-time reconstruction on a laptop computer.
2.1 Image Processing
Our 3D reconstruction system consists of three tasks. First, the webcams are calibrated using the popular algorithm proposed by Zhang et al. [7]. To enforce coherency between the two webcams, color calibration is done using the method proposed by N. Joshi [8]. The second step consists of silhouette segmentation. We assume that the background is static and the subject moves. We use the method proposed in [9]: first, we acquire images of the background (without the user); the user is then detected at the pixels whose value has changed. The third step consists of extracting skin parts from the silhouette masks and color images. The "Normalized Look-up Table" method [10] provides fast skin-color segmentation. This segmentation is applied to each image, restricted to the silhouette mask.
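A minimal CPU sketch of the two segmentation steps (background subtraction, then skin detection restricted to the silhouette) is given below. The simple per-pixel difference threshold and the normalised-rg lookup table are stand-ins for the cited methods [9,10], chosen only to illustrate the data flow; all names are ours.

```python
import numpy as np

def silhouette_mask(frame, background, diff_threshold=30):
    # Foreground where the colour differs sufficiently from the learnt background image.
    diff = np.abs(frame.astype(int) - background.astype(int)).sum(axis=2)
    return diff > diff_threshold

def skin_mask(frame, silhouette, skin_lut, bins=32):
    # skin_lut: (bins, bins) boolean table indexed by normalised (r, g) chromaticity,
    # learnt offline from labelled skin pixels (stand-in for the normalised LUT of [10]).
    rgb = frame.astype(float)
    s = rgb.sum(axis=2) + 1e-6
    r_idx = np.clip((rgb[..., 0] / s * bins).astype(int), 0, bins - 1)
    g_idx = np.clip((rgb[..., 1] / s * bins).astype(int), 0, bins - 1)
    return silhouette & skin_lut[r_idx, g_idx]
```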
2.2 Extended GPU SFS Implementation
Volumetric SFS is usually based on voxel projection: a voxel remains part of the estimated shape if it projects into every silhouette. To better fit a GPU implementation we choose the opposite: we project each silhouette into the 3D voxel grid, as proposed in [9]. If a voxel is intersected by all the silhouette projections, then it represents the original object. This provides 3D shape estimation in real time. We extend this method in order to compute 3D shape and skin estimations in parallel. The classical N³ voxel cube can be considered as a stack of N images of resolution N × N. We stack the N images in screen-parallel planes. For each webcam, let HS (hybrid silhouette) be a two-channel image which contains the silhouette mask in the first channel and the skin mask
in the second. For each camera view, its HS is projected onto each slice using the "projective texture mapping" technique. The per-channel intersection of all HS projections on all slices provides voxel-based 3D shape (in the first channel) and 3D skin (in the second channel) estimations. A single-channel HS projection is illustrated in Fig. 2(c). Our implementation provides 100 reconstructions per second. Our fast method assumes that the skin parts are visible in all the cameras. In our context this holds, as the user looks at the screen and thus faces the two webcams (Fig. 2(a)).
Ghost Objects Removal. One limitation of SFS is the construction of ghost objects. They appear when there are visual ambiguities, as in the example shown in Fig. 2(d): a single person is filmed, which should correspond to one 3D connected component in the estimated shape. To remove ghost parts we keep the voxels which are in the biggest (in volume, i.e., voxel count) connected component.
Data Simplification. To reduce the computation time of pose estimation we use a subset of each set of voxels (i.e., shape voxels and skin voxels). Using a surface normal estimation for each surface voxel (i.e., voxels having fewer than 26 neighbors), we keep the voxels that face the webcams. Let V_shape be the selected voxels from the shape voxel set, V_skin be the selected voxels from the skin voxel set, and V_all be their union.
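For reference, a plain CPU formulation of the voxel test (keep a voxel if it projects into every hybrid silhouette) can be written as follows. This is the classical carving loop, not the GPU slice-projection scheme described above; the camera projection matrices and the voxel grid layout are assumed inputs.

```python
import numpy as np

def carve(voxel_centers, projections, hybrid_silhouettes):
    """voxel_centers: (V, 3) world coordinates.
    projections: list of 3x4 camera matrices.
    hybrid_silhouettes: list of (H, W, 2) masks, channel 0 = silhouette, channel 1 = skin.
    Returns boolean arrays (shape_voxels, skin_voxels)."""
    V = voxel_centers.shape[0]
    shape_keep = np.ones(V, dtype=bool)
    skin_keep = np.ones(V, dtype=bool)
    homog = np.hstack([voxel_centers, np.ones((V, 1))])
    for P, hs in zip(projections, hybrid_silhouettes):
        uvw = homog @ P.T
        u = (uvw[:, 0] / uvw[:, 2]).round().astype(int)
        v = (uvw[:, 1] / uvw[:, 2]).round().astype(int)
        inside = (u >= 0) & (u < hs.shape[1]) & (v >= 0) & (v < hs.shape[0])
        sil = np.zeros(V, dtype=bool)
        skin = np.zeros(V, dtype=bool)
        sil[inside] = hs[v[inside], u[inside], 0] > 0
        skin[inside] = hs[v[inside], u[inside], 1] > 0
        shape_keep &= sil          # a voxel must lie inside every silhouette
        skin_keep &= skin          # a skin voxel must lie inside every skin mask
    return shape_keep, skin_keep
```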
3 Motion Capture
The goal of motion capture is to determine the pose of the body over time. We can determine the pose of the body if we can associate each voxel with a body part. The joint labeling is presented in Figure 1(c). We propose a system based on simple and fast heuristics. This approach is less accurate than registration-based methods, but it runs in real time. Robustness is increased by using a multi-modal scheme composed of shape and skin-part analysis, temporal coherence and human anthropometric constraints. Our motion capture runs in two steps, initialization and tracking, which use the same algorithm with different initial conditions. The initialization step (see Section 5) estimates the anthropometric values and the initial pose. Using this information, the tracking step then tracks the joint positions (see Section 4). Our premises are that both the hands and the person's face are partially uncovered, that the torso is dressed, and that the clothing has a non-skin color. We introduce some common notations: L_x denotes the length of body part x (see Fig. 1(b)), D_x its orientation and R_x its radius (of a sphere or cylinder). J^n denotes the value of a quantity J (joint position, voxel set, ...) at frame n. When dealing with sided joints, indices l and r denote the left and right side, respectively. V_x denotes a set of voxels, E_{V_x} its inertia ellipsoid and Cog(V_x) its center of gravity. For iterative algorithms, J(i) denotes the value of J at step i.
4 Body Parts Tracking
To track the body parts, we assume that the previous body pose and the anthropometric values are known. Using the 3D shape estimation and the 3D skin parts, we track the human body parts in real time. The tracking process works on the active voxels V_act. This set of voxels is initialized to all voxels V_all and updated at each step by removing the voxels used to estimate body parts. First we estimate the head joints, next we track the torso joints, and finally we compute the limb joints.
4.1 Head Tracking
This step aims at finding T^n and B^n, respectively the positions of the top of the head and of the connection point between head and neck at frame n. Let V^n_face be the face voxels at the current frame. By hypothesis, V^n_skin contains face and hand voxels. Using a temporal coherence criterion, V^n_face is the connected component of V^n_skin nearest to the previous set of face voxels V^{n-1}_face. The center of the head C^n is computed by fitting a sphere S(i) in V^n_act (see Figure 3). The sphere S(i) is defined by its center C^n(i) and radius R_head.
Head Fitting Algorithm. C^n(0) is initialized as the centroid of V^n_face. At step i of the algorithm, C^n(i) is the centroid of the set V^n_head(i) of active voxels that lie inside the sphere S(i-1) defined by its center C^n(i-1) and its radius R_head (see Fig. 3(a)). The algorithm iterates until a step k at which the position of C^n stabilizes, i.e., the distance between C^n(k-1) and C^n(k) falls below a threshold ε_head.
Head Joints Estimation. Knowing the position of C^n, B^n (respectively T^n) is computed as the lower (resp. upper) intersection between S(k) and the principal axis of E_{V^n_head} (see Fig. 3(b)). The back-to-front direction D^n_b2f is defined as the direction from C^n towards the centroid of V^n_face (note that the voxels from the back of the head are not in V_skin). At this point, we remove from V^n_act the set of elements that belong to V^n_head.
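The iterative sphere fit can be summarised as below; voxel sets are arrays of 3D points, and the stopping threshold eps_head is a free parameter. This is an illustrative sketch with our own names, not the authors' code.

```python
import numpy as np

def fit_head_sphere(active_voxels, face_voxels, r_head, eps_head=5.0, max_iter=20):
    """Iteratively re-centre a sphere of fixed radius r_head on the active voxels.
    active_voxels, face_voxels: (N, 3) arrays of voxel centres (e.g. in mm)."""
    c = face_voxels.mean(axis=0)                     # C(0): centroid of the face voxels
    inside = np.zeros(len(active_voxels), dtype=bool)
    for _ in range(max_iter):
        inside = np.linalg.norm(active_voxels - c, axis=1) <= r_head
        if not inside.any():
            break
        c_new = active_voxels[inside].mean(axis=0)   # C(i): centroid of voxels in the sphere
        converged = np.linalg.norm(c_new - c) < eps_head
        c = c_new
        if converged:                                # convergence test on the centre motion
            break
    return c, inside                                 # head centre and (approximate) head voxels
```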
Fig. 3. (a) Head sphere fitting (light gray denotes V^n_face, dark gray denotes V^n_head(i)), (b) head joints estimation, (c) torso segmentation by cylinder fitting, (d) the "binding" step of leg tracking and (e) leg articulation estimation.
4.2 Torso Tracking
This step aims at finding P^n, the pelvis position, by fitting a cylinder in V^n_act. Estimating the torso shape by a cylinder provides a simple and fast way to localize the pelvis. Let V^n_torso be the set of voxels that describes the torso; it is initialized using the voxels V^n_act. At step i, the algorithm estimates D^n_torso by fitting a cylinder CYL(i-1) in V^n_torso(i) (see Fig. 3(c)). CYL(i) has a cap anchored at B^n, its radius is R_torso, its length is L_torso and its axis is D^n_torso(i).
Torso Fitting Algorithm. V^n_torso(0) is initialized with V^n_act, and the vector from B^n to P^{n-1} defines the initial value of D^n_torso(0). At step i, V^n_torso(i) is computed as the set of elements from V^n_torso(i-1) that lie in CYL(i-1). D^n_torso(i) is then the principal axis of E_{V^n_torso(i)} (see Fig. 3(c)). The algorithm iterates until a step k at which the distance between the axis of CYL(k) and the centroid of V^n_torso(k) falls below a threshold ε_torso. The position P^n is defined as the center of the lower cap of CYL(k).
Global Body Orientation. The top-down orientation D^n_t2d of the acquired subject is given by P^n - B^n. D^n_b2f was computed in Section 4.1. The left-to-right orientation D^n_l2r of the acquired subject is given by D^n_l2r = D^n_t2d × D^n_b2f. V^n_act is then updated by removing its elements that belong to V^n_torso.
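The torso fit follows the same fixed-point pattern as the head fit; a compact sketch is given below, where the cylinder membership test checks both the distance to the axis and the position of the projection along it. Parameter names are ours and this is only an illustrative approximation of the procedure described above.

```python
import numpy as np

def fit_torso_cylinder(voxels, neck, pelvis_prev, r_torso, l_torso,
                       eps_torso=5.0, max_iter=20):
    """voxels: (N, 3) active voxels; neck = B^n; pelvis_prev = P^{n-1}."""
    axis = pelvis_prev - neck
    axis = axis / np.linalg.norm(axis)                   # D_torso(0)
    members = np.zeros(len(voxels), dtype=bool)
    for _ in range(max_iter):
        rel = voxels - neck
        t = rel @ axis                                    # coordinate along the cylinder axis
        radial = np.linalg.norm(rel - np.outer(t, axis), axis=1)
        members = (t >= 0) & (t <= l_torso) & (radial <= r_torso)
        pts = voxels[members]
        if len(pts) < 3:
            break
        centred = pts - pts.mean(axis=0)
        _, _, vt = np.linalg.svd(centred, full_matrices=False)
        axis = vt[0] if vt[0] @ (pelvis_prev - neck) > 0 else -vt[0]
        # Stop when the member centroid lies (almost) on the current axis.
        if np.linalg.norm(np.cross(pts.mean(axis=0) - neck, axis)) < eps_torso:
            break
    pelvis = neck + l_torso * axis                        # centre of the lower cylinder cap
    return pelvis, axis, members
```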
4.3 Arms Tracking
We propose a simple and robust algorithm to compute the forearm joint positions. First, we compute hand positions from skin voxels. Using the forearm length, we determine the elbow positions. Temporal coherence is used to compute their sides. Let V^n_hand be the set of potential hand voxels and L_height be the acquired human body length; L_height/2 is an upper bound of the arm length. V^n_hand is defined by the voxels of V^n_skin - V^n_face that lie within a sphere of center B^n and radius L_height/2. By hypothesis, V^n_skin contains the hand and face voxels. The different forearm configurations are:
Two Distinct Hands. V^n_hand contains several connected components. Let V^n_hand0 and V^n_hand1 be the two biggest ones, corresponding to the two hands, with H^n_x = Cog(V^n_handx), x in {0, 1}. Forearms have a constant length L_farm across time. The potential voxels for forearm x are the voxels from V^n_act which lie within a sphere of radius L_farm centered at H^n_x. The connected component of these voxels which contains H^n_x represents forearm x. Let V^n_farmx be this connected component; there are two possible cases to identify the elbows.
If V^n_farm0 and V^n_farm1 do not intersect, then we use the principal axis of EV^n_farmx and L_farm to compute the elbow position E^n_x. The sides are computed using a temporal coherence criterion: the side of forearm x is the same as that of the closest forearm computed at the previous frame.
Otherwise V^n_farm0 and V^n_farm1 intersect and the forearms are touching each other. In that case we first identify the hand sides using the property of constant forearm length: H^n_x is right-sided if

||d(H^n_x, E^{n-1}_r) - L_farm|| < ||d(H^n_x, E^{n-1}_l) - L_farm||,
or H^n_x is left-sided otherwise. The voxels v_i of V^n_farm0 and V^n_farm1 are segmented into two parts V^n_farmr and V^n_farml using the "point to line mapping" algorithm (see Section 4.4). If v_i is closer to [H^n_r E^{n-1}_r] than to [H^n_l E^{n-1}_l], v_i is added to V^n_farmr; else v_i is added to V^n_farml. We compute E^n_r and E^n_l with the principal axes of EV^n_farmr, EV^n_farml and L_farm.
One Hand or Joined Hands. V^n_hand contains only one connected component, which corresponds to joined hands or to only one hand (the other is not visible). We use temporal coherence to disambiguate these two cases. If H^{n-1}_r and H^{n-1}_l are close to V^n_hand, then the hands are joined, H^n_r = H^n_l = Cog(V^n_hand), and we compute V^n_farm as proposed previously. We segment V^n_farm into two parts V^n_farmr and V^n_farml by the plane orthogonal to [E^{n-1}_r E^{n-1}_l] containing H^n_l. The principal axes of EV^n_farmr, EV^n_farml and L_farm are used to compute E^n_r and E^n_l. Otherwise, the closest hand H^{n-1}_x to V^n_hand is used to determine the side of H^n_x, and H^n_x = Cog(V^n_hand). We compute V^n_farm as proposed previously and its principal axis of inertia is used to compute E^n_x.
No Visible Hand. V^n_hand is empty, so no hand is visible. We carry the positions computed at frame n - 1 over to the current frame. V^n_act is updated by removing the elements that belong to each forearm.
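A minimal sketch of the two ingredients used above, assuming the forearm voxels are given as an N x 3 NumPy array; the elbow is placed on the principal axis of the forearm voxels at the constant forearm length from the hand, and hand sides are assigned by temporal coherence against the previous elbows. All names are illustrative:

import numpy as np

def estimate_elbow(hand, v_farm, l_farm):
    """Place the elbow on the principal axis of the forearm voxels,
    at the constant forearm length l_farm from the hand center."""
    centered = v_farm - v_farm.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    axis = vt[0]                              # principal axis of inertia
    # Orient the axis so that it points from the hand into the forearm.
    if np.dot(v_farm.mean(axis=0) - hand, axis) < 0:
        axis = -axis
    return hand + l_farm * axis

def assign_sides(hands, prev_elbows, l_farm):
    """Temporal coherence: a hand is labelled 'right' when its distance to the
    previous right elbow is the one closest to the forearm length."""
    sides = []
    for h in hands:
        err_r = abs(np.linalg.norm(h - prev_elbows['right']) - l_farm)
        err_l = abs(np.linalg.norm(h - prev_elbows['left']) - l_farm)
        sides.append('right' if err_r < err_l else 'left')
    return sides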
4.4 Legs Tracking
Until now all body parts but the legs have been estimated; hence V^n_act contains only the leg voxels. Our leg joint extraction is inspired by the "point to line mapping" process used to bind an animation skeleton to a 3D mesh [11]. The elements of V^n_act are split up into four sets V^n_thighl, V^n_calfl, V^n_thighr and V^n_calfr depending on their Euclidean distance to the segments [P^n, K^{n-1}_l], [K^{n-1}_l, F^{n-1}_l], [P^n, K^{n-1}_r] and [K^{n-1}_r, F^{n-1}_r] (see Fig. 3(d)). For the left/right side x, we compute the inertia ellipsoid EV^n_calfx and its extremal points P0 and P1. The knee is the intersection point between thigh and calf (Fig. 3(e)); hence the foot position F^n_x is the EV^n_calfx extremal point farthest from EV^n_thighx (say it is P1). The knee is then aligned on [P0 P1], on the P0 side, at a distance L_calf from F^n_x.
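The "point to line mapping" step can be sketched as follows (hypothetical helper names, NumPy assumed): each remaining voxel is assigned to the closest of the four segments built from the pelvis and the previous-frame knee and foot joints:

import numpy as np

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment [a, b]."""
    ab = b - a
    t = np.clip(np.dot(p - a, ab) / np.dot(ab, ab), 0.0, 1.0)
    return np.linalg.norm(p - (a + t * ab))

def split_leg_voxels(v_act, pelvis, knees_prev, feet_prev):
    """Assign every remaining voxel to the closest limb segment."""
    segments = {
        'thigh_l': (pelvis, knees_prev['l']),
        'calf_l':  (knees_prev['l'], feet_prev['l']),
        'thigh_r': (pelvis, knees_prev['r']),
        'calf_r':  (knees_prev['r'], feet_prev['r']),
    }
    parts = {name: [] for name in segments}
    for p in v_act:
        dists = {name: point_segment_distance(p, a, b)
                 for name, (a, b) in segments.items()}
        parts[min(dists, key=dists.get)].append(p)
    return {name: np.array(pts) for name, pts in parts.items()}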
5 Body Parts Initialization
In this section we present our techniques to estimate the anthropometric measurements and the initial pose of the body. The methods presented in the literature for initial pose estimation can be classified into three categories. In the first kind [6], the anthropometric measurements and the initial pose are entered manually. Another class of methods needs a fixed pose, like the T-pose [12]; these methods work in real time. The last class of methods is fully automatic [2] and does not need a specific pose, but is not real-time. Our approach is real-time and fully automated for any movement, as long as the filmed person is standing up, with the hands below the level of the head and the feet not joined.
Anthropometric Measurements. They correspond to the lengths of each body part [13]. We estimate some anthropometric measures as average ratios of the human body length. Let L_height be the acquired human body length, estimated as the maximum distance from the foreground voxels to the floor plane. Knowing L_height, the anthropometric measures can be approximated by the ratios R_head ~ L_height/16, L_torso ~ L_height/8, L_farm ~ L_height/6 and L_calf ~ L_height/4. The active set of voxels V_act is initialized with V_all.
Head Initialization. This step aims at finding T^0 and B^0. From our initialization hypothesis, the face voxels V^0_face are defined by the topmost connected component of V^0_skin. We compute T^0 and B^0 with the head tracking algorithm (Section 4.1). V^0_act is updated by removing the elements that belong to V^0_head.
Torso Initialization. The torso fitting algorithm (Section 4.2) is applied using V^0_act as the initial value of V^0_torso(0). D^0_torso(0) is initialized as the vector from B^0 toward the centroid of EV^0_act. The pelvis position P^0, D^0_t2d and D^0_l2r are then computed. V^0_act is updated by removing the voxels that belong to V^0_torso.
Legs Initialization. The tracking algorithm outlined in Section 4.4 needs the legs' previous positions. We simulate them by a coarse estimation, then compute more precise positions using the legs tracking algorithm. V^0_act contains the voxels that have not been used for any other part of the body. First, we compute the set of connected components of V^0_act whose height is below L_height/8. If there are fewer than two connected components, we assume that the feet are joined and cannot be distinguished. Otherwise we use the two major connected components V^0_footl and V^0_footr. The left and right assignment of the voxel sets is done using the left-to-right vector D_l2r. For the left/right side x, let v_x be the vector from P^0 to the centroid of V^0_footx. The knee and foot joints are estimated with

K^{-1}_x = T^0_x + v_x \frac{L_{thigh}}{|v_x|},   F^{-1}_x = T^0_x + v_x \frac{L_{thigh} + L_{calf}}{|v_x|}.

Finally we compute F^0_r, K^0_r, F^0_l and K^0_l using the legs tracking algorithm.
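A small sketch of the initialization arithmetic described above, with illustrative names; the coarse leg joints are anchored here at the pelvis, which the surrounding text uses as the origin of the vector v_x (an assumption of this sketch):

import numpy as np

def anthropometric_measures(l_height):
    """Approximate body-part lengths as fixed ratios of the body height."""
    return {
        'r_head':  l_height / 16.0,   # head sphere radius
        'l_torso': l_height / 8.0,    # torso length
        'l_farm':  l_height / 6.0,    # forearm length
        'l_calf':  l_height / 4.0,    # calf length
    }

def init_leg_joints(pelvis, foot_centroid, l_thigh, l_calf):
    """Coarse previous-frame knee and foot used to bootstrap the legs tracker,
    placed along the pelvis-to-foot-blob direction."""
    v = foot_centroid - pelvis
    u = v / np.linalg.norm(v)
    knee = pelvis + u * l_thigh
    foot = pelvis + u * (l_thigh + l_calf)
    return knee, foot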
6 Results
We now present the results of our system. Fig. 2(a) outlines the system configuration. The acquisition infrastructure is composed of two Philips webcams (SPC900NC) connected to a single PC (CPU: P4 3.2 GHz, GPU: NVIDIA Quadro 3450). The webcams produce images of resolution 320 x 240 at 30 fps. Our method has been applied to different persons performing fast and challenging motions. Thanks to shape analysis and the knowledge of skin parts, our system is able to acquire the joint positions for the challenging pose outlined in Fig. 4(a). This pose is difficult because the topology of the reconstructed shape is not coherent with the human shape topology. Temporal coherence is the key to
Fig. 4. (a), (b) and (c) show results for challenging poses. (d) represents the user's recovered pose with shape and skin voxels. (e-h) show results for wide-range motions.
success for the pose presented in Fig. 4(b). It shows the case of joined hands (Section 4.3), which is successfully recognized. The difficult pose shown in Fig. 4(c) is also successfully recovered by our system. Fig. 4(d) shows the shape, the skin voxels and the recovered joints. The images of Fig. 4(e), 4(f), 4(g) and 4(h) demonstrate that our system tracks a large range of movements. Additional results are included in the supplementary video, which shows the robustness of our approach. Our current experimental implementation can track more than 30 poses per second on a single computer, which is faster than the webcams' acquisition frame rate. An optimized implementation could run on the current generation of home entertainment computers. As our algorithm is based on 3D reconstruction, it depends on the voxel grid resolution. Experiments are made on a grid of 64^3 voxels with a resolution of 2.7 x 2.7 x 2.7 cm per voxel. This resolution is sufficient for human-machine interfaces in the field of entertainment.
7 Conclusion
In this paper, we have described a new marker-free human motion capture system using two webcams connected to a single computer. Fully automated and working under real-time constraints, the system is based on 3D shape analysis, human morphology constraints, and skin-color segmentation. By combining different 3D cues, the approach is robust to self-occlusion and to the coarse 3D shape approximation provided by the voxel estimation sub-system. We are able to estimate the eleven main human body joints at more than 30 frames per second. This frame rate is well suited for home entertainment applications. The system provides real-time motion capture for one person. Future work aims at providing motion capture of several persons filmed together in the same
area, even when they are in contact. For home entertainment applications, the major limitation is the silhouette processing, because the background cannot be guaranteed to be static at home. We are working on a new segmentation algorithm based on a statistical background model assisted by an optical flow algorithm.
References
1. Agarwal, A., Triggs, B.: Monocular human motion capture with a mixture of regressors. In: CVPR 2005, p. 72. IEEE Computer Society, Los Alamitos (2005)
2. Mikic, I., Trivedi, M., Hunter, E., Cosman, P.: Human body model acquisition and tracking using voxel data. Int. J. Comput. Vision 53(3), 199–223 (2003)
3. Tangkuampien, T., Suter, D.: Human motion de-noising via greedy kernel principal component analysis filtering. In: ICPR 3, pp. 457–460. IEEE Computer Society Press, Los Alamitos (2006)
4. Caillette, F., Galata, A., Howard, T.: Real-Time 3-D Human Body Tracking using Variable Length Markov Models. In: Proceedings BMVC 2005, vol. 1 (2005)
5. Laurentini, A.: The visual hull concept for silhouette-based image understanding. IEEE Trans. Pattern Anal. Mach. Intell. 16(2), 150–162 (1994)
6. Ménier, C., Boyer, E., Raffin, B.: 3D skeleton-based body pose recovery. In: Proceedings of the 3rd International Symposium on 3D Data Processing, Visualization and Transmission, Chapel Hill (USA) (2006)
7. Zhang, Z.: Flexible camera calibration by viewing a plane from unknown orientations. In: ICCV, pp. 666–673 (1999)
8. Joshi, N.: Color calibration for arrays of inexpensive image sensors. Technical report, Stanford University (2004)
9. Hasenfratz, J.M., Lapierre, M., Sillion, F.: A real-time system for full body interaction with virtual worlds. In: Eurographics Symposium on Virtual Environments, pp. 147–156 (2004)
10. Vezhnevets, V., Sazonov, V., Andreeva, A.: A survey on pixel-based skin color detection techniques. In: Proceedings of Graphicon-2003 (2003)
11. Sun, W., Hilton, A., Smith, R., Illingworth, J.: Layered animation of captured data. The Visual Computer 17(8), 457–474 (2001)
12. Fua, P., Gruen, A., D'Apuzzo, N., Plankers, R.: Markerless Full Body Shape and Motion Capture from Video Sequences. In: Symposium on Close Range Imaging, International Society for Photogrammetry and Remote Sensing, Corfu, Greece (2002)
13. Dreyfuss, H., Tilley, A.R.: The Measure of Man and Woman: Human Factors in Design. John Wiley & Sons, Chichester (2001)
Tracking Iris Contour with a 3D Eye-Model for Gaze Estimation
Haiyuan Wu, Yosuke Kitagawa, Toshikazu Wada, Takekazu Kato, and Qian Chen
Faculty of Systems Engineering, Wakayama University, Japan
Abstract. This paper describes a sophisticated method to track iris contours and to estimate the eye gaze of blinking eyes with a monocular camera. A 3D eye-model that consists of eyeballs, iris contours and eyelids is designed to describe the geometrical properties and the movements of eyes. Both the iris contours and the eyelid contours are tracked by using this eye-model and a particle filter. This algorithm is able to detect "pure" iris contours because it can distinguish iris contours from eyelid contours. The eye gaze is described by the movement parameters of the 3D eye model, which are estimated by the particle filter during tracking. Other distinctive features of this algorithm are: 1) it does not require any special light sources (e.g. an infrared illuminator) and 2) it can operate at video rate. Through extensive experiments on real video sequences we confirmed the robustness and the effectiveness of our method.
1 Introduction
The goal of this work is to realize video-rate tracking of iris contours and eye gaze estimation with a monocular camera in a normal indoor environment without any special lighting. Since blinking is a physiological necessity for humans, it is also a goal of this work to be able to cope with blinking eyes. To detect or to track eyes robustly, many popular systems use infrared light (IR) [1][2] and stereo cameras [3][4]. Recently, mean-shift filtering [5], particle filtering [6] and K-means clustering [15] have also been used for tracking eyes. The majority of the proposed methods assume open eyes, and most of them neglect the eyelids and assume that iris contours appear as circles in the image. Therefore, they cannot detect pure iris contours, which are important for eye gaze estimation. The information about eyes used for eye gaze estimation can roughly be divided into three categories: (a) global information, such as active appearance models (AAM) [5][7], (b) local information, such as eye corners, irises, and mouth corners [8][9][10], and (c) the shape of ellipses fitted to the iris contours [11]. There are also some methods based on combinations of them [12]. D. Hansen et al. [6] employ an active-contour method to track the iris. It is based on a combination of a particle filter and the EM algorithm. The method is robust against changes of the lighting condition and camera defocusing. A gaze calibration is necessary for eye gaze estimation.
Y. Tian et al. [13] propose a method for tracking the eye locations, detecting the eye states, and estimating the eye parameters. They develop a dual-state eye model and use it to detect whether an eye is open or closed. However, it cannot estimate the eye gaze correctly. It is difficult to track iris contours consistently and reliably because they are often partly occluded by the upper and lower eyelids and confused with the eyelid contours. Most conventional methods lack the ability to distinguish iris contours from eyelid contours. Some methods separate the iris contours from the eyelid contours heuristically. In order to solve this problem, we design a 3D eye-model that consists of eyeballs, irises and eyelids. We use this model together with a particle filter to track both the iris contours and the eyelid contours. With this approach, the iris contours can be distinguished from the eyelid contours and both of them can be tracked simultaneously. The shape of the iris contours can then be estimated correctly by using only the edge points between the upper and lower eyelids, which are then used to estimate the eye gaze. In this method, we assume that the movements of the two eyes of a person are synchronized and use this constraint to restrict the movement of the eyeballs in the 3D eye-model during tracking. This increases the robustness of the tracking and the reliability of the eye gaze estimation of our method. Implemented on a PC with a Pentium 4 3 GHz CPU, the processing speed is about 30 frames/second.
2 3D Eye-Model for Tracking Iris Contours
Our 3D eye-model consists of two eyes; each eye consists of an eyeball, an iris, an upper and a lower eyelid. As shown in Fig. 1(a), the two eyeballs are assumed to be spheres of equal radius r_w. The irises are defined as circles of equal radius r_b on the surface of each eyeball. Then the distance between the eyeball center and the iris plane is

z_{pi} = \sqrt{r_w^2 - r_b^2}.    (1)
The centers of the two eyeballs (c_l and c_r) are placed on the X-axis of the eye-model coordinate system symmetrically about the origin, and the distance between each center and the origin is w_x:

c_l = (-w_x, 0, 0)^T;   c_r = (w_x, 0, 0)^T.    (2)
We define the direction of the visual lines when people look forward to be the same as the Z-axis. The plane in which the iris resides is then parallel to the X-Y plane. The iris contours of the left and the right eye (p_l and p_r) can be expressed as

p_j(alpha) = p(alpha) + c_j,   j = l, r,    (3)

where p(alpha) = (r_b cos(alpha), r_b sin(alpha), z_{pi})^T, alpha in [0, 2*pi].
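For illustration, the canonical iris contour of Eqs. (1)-(3) can be sampled as follows (NumPy assumed, names illustrative):

import numpy as np

def iris_contour(r_w, r_b, w_x, side='left', n=64):
    """Sample the iris contour of one eye in eye-model coordinates (Eqs. 1-3)."""
    z_pi = np.sqrt(r_w ** 2 - r_b ** 2)       # eyeball center to iris plane
    c = np.array([-w_x if side == 'left' else w_x, 0.0, 0.0])
    alpha = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    p = np.stack([r_b * np.cos(alpha),
                  r_b * np.sin(alpha),
                  np.full_like(alpha, z_pi)], axis=1)
    return p + c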
The upper and lower eyelids are defined as B-spline curves located on the plane z = r_w, which is a vertical plane in front of the eyeballs. As shown in Fig. 1(a), each eyelid has three control points. We let the upper and the lower eyelid of the same eye share the same inner eye corner (E_hl and E_hr) and the same outside eye corner (E_tl and E_tr). The eight control points (E_hl, E_ul, E_tl, E_dl, E_hr, E_ur, E_tr and E_dr) describe the shape of the eyelids when the two eyes are open. Since the values of these eight control points, w_x, r_w and r_b depend on each individual person, we call them personal parameters; they are estimated at the beginning of tracking.
Fig. 1. (a) The structure of the 3D eye-model. (b) The gaze vector and the movement parameters of an eyeball. (c) The movement parameter (m) of an upper eyelid.
In general, when people look at a far place, the visual lines of both eyes will be approximately parallel. Also, when people blink, in most cases the lower eyelids do not move and the upper eyelids of the two eyes move in the same way. In this paper, in order to keep the number of eye movement parameters small so that the particle filter can work efficiently while not losing generality, we assume that the lower eyelids keep still and that the movements of the eyeballs and the upper eyelids of the two eyes are the same and synchronized. Therefore, only the parameters describing the movements of the eyeball and the two eyelids of one eye are necessary, because the movement of the other eye can be described with the same parameters. The movement of the left (or right) eyeball can be described by a unit vector v_e that indicates the gaze direction (see Fig. 1(b)):

v_e = (cos(theta_t) sin(theta_p), sin(theta_t), cos(theta_t) cos(theta_p))^T,    (4)

where theta_t is the angle between v_e and the X-Z plane (tilt), and theta_p is the angle between the projection of v_e onto the X-Z plane and the Z-axis (pan). In this paper, theta_t and theta_p are used as the movement parameters of the left and the right eyeball. The iris contours of the left and the right eye can then be expressed by

p^M_j(alpha, theta_t, theta_p) = R(theta_t, theta_p) p(alpha) + c_j,   j = l, r,    (5)

where

R(theta_t, theta_p) = \begin{pmatrix} \cos\theta_p & 0 & \sin\theta_p \\ -\sin\theta_t \sin\theta_p & \cos\theta_t & \sin\theta_t \cos\theta_p \\ -\cos\theta_t \sin\theta_p & -\sin\theta_t & \cos\theta_t \cos\theta_p \end{pmatrix}.    (6)
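The following sketch evaluates Eqs. (4)-(6) as reconstructed above, i.e. a tilt rotation about X composed with a pan rotation about Y applied to the canonical contour; it is an illustration under that reading of the equations, not the authors' implementation:

import numpy as np

def gaze_vector(theta_t, theta_p):
    """Unit gaze direction of Eq. (4): theta_t = tilt, theta_p = pan (radians)."""
    return np.array([np.cos(theta_t) * np.sin(theta_p),
                     np.sin(theta_t),
                     np.cos(theta_t) * np.cos(theta_p)])

def eye_rotation(theta_t, theta_p):
    """Rotation matrix of Eq. (6) as reconstructed here."""
    ct, st = np.cos(theta_t), np.sin(theta_t)
    cp, sp = np.cos(theta_p), np.sin(theta_p)
    return np.array([[cp,        0.0, sp],
                     [-st * sp,  ct,  st * cp],
                     [-ct * sp, -st,  ct * cp]])

def rotated_iris_contour(p, c, theta_t, theta_p):
    """Eq. (5): rotate the canonical contour p (N x 3) and re-attach it to the
    eyeball center c."""
    return p @ eye_rotation(theta_t, theta_p).T + c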
The movement of opening or closing the eyes can be expressed by changing the shape of the upper eyelids. This movement is expressed by the movement of the middle control point of an upper eyelid (E_ul and E_ur) along the line connecting the middle control points of the upper and the lower eyelids of an opened eye, as shown in Fig. 1(c). The displaced middle control points of the upper eyelids can be expressed as

E'_{uj} = m E_{uj} + (1 - m) E_{dj},   j = l, r,  m in [0, 1],    (7)
where m is a parameter that describes the movement of both upper eyelids. The parameters theta_t, theta_p and m are the movement parameters of our 3D eye-model. In order to track the iris contours and estimate the eye gaze with a particle filter, it is necessary to project the iris contours and the eyelid contours onto the image plane. This can be done with the following equation:

i_p = M_c (R_h p + T_h).    (8)
Here p is a point on an iris contour or an eyelid contour in the 3D eye-model and i_p is its projection onto the image plane. M_c is the projection matrix of the camera, which is assumed known. R_h and T_h are the rotation matrix and the translation vector of the eye-model coordinate system relative to the camera; they are the movement parameters of the head and are also estimated with the particle filter during tracking.
3 Likelihood Function for Tracking Iris
In many applications using the particle filter (also known as Condensation [14]), the distance between a model and the edges detected from an input image is used as the likelihood. However, in the case of iris contour tracking, this definition of the likelihood often leads particles to converge on a wrong place (such as the inner eyes or the eyelashes), because there are many edges at those places and the likelihood therefore also becomes high there. In order to track the iris contours with the particle filter, we define a likelihood function that considers both the image intensity and the image gradient.

3.1 The Likelihood Function of Irises
In most cases, the brightness of the iris in an input image is lower than that of its surroundings. In order to make use of this fact, the average brightness of the iris area is introduced into the likelihood function of the iris. In this paper, we let the average brightness of the iris area of the 3D eye model be 0. Then the values (E_l and E_r) indicating the likelihood of an iris candidate area in an image, considering the image intensity, can be calculated as

E_j = e^{-Y_j^2 / k},   j = l, r.    (9)
Here, Y_l and Y_r are the average brightness of the left and right iris candidate areas, and k is a constant. The higher the average brightness in those areas is, the lower E_l and E_r will be. In order to reduce the influence of non-iris-contour edges when estimating the likelihood for irises, we consider the direction of the edges as well as their strength. Since an iris area is darker than its surroundings, the direction of the image gradient at the iris contour points outward from the iris center. In this paper, we define the direction of the normal vector of the iris contour in our 3D eye-model to be outward from the iris center. Therefore, if the iris contour of the 3D eye-model and the iris contour in the image overlap, the direction of the normal vector and the image gradient will be the same. We pick n points from the iris contours of the 3D eye-model at fixed intervals as follows.
p^s_{jk} = p^M_j(2*pi*k/n),   j = l, r,  k = 0, 1, ..., n - 1.    (10)

These points are called iris contour points (ICPs). Using the hypothesis generated by the particle filter, which is a set of parameters describing the movements of the eyes and the head, the projections of the ICPs (i^s_{jk}) and their normal vectors (h_{jk}) on the image plane can be calculated. Let g(i^s_{jk}) denote the image gradient at each projected ICP. The likelihood pi_I of the iris candidates in the image is computed as

pi_I = E_l \frac{\sum_{k=0}^{n-1} B(i^s_{lk}) D(i^s_{lk})}{\sum_{k=0}^{n-1} B(i^s_{lk})} + E_r \frac{\sum_{k=0}^{n-1} B(i^s_{rk}) D(i^s_{rk})}{\sum_{k=0}^{n-1} B(i^s_{rk})},    (11)

where B(i^s_{jk}) removes the influence of the eyelid edges by ignoring the ICPs outside the region enclosed by the lower and the upper eyelids,

B(i^s_{jk}) = 1 if i^s_{jk} is between the lower and the upper eyelids, 0 otherwise,    (12)

and

D(i^s_{jk}) = h_{jk} . g(i^s_{jk}) if h_{jk} . g(i^s_{jk}) > 0, 0 otherwise,   j = l, r.    (13)

Here, . denotes the inner product.
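A compact sketch of one term of the iris likelihood, assuming the ICPs have already been projected and that their outward normals, the image gradients, the eyelid mask of Eq. (12) and the average candidate brightness are available as NumPy arrays (all parameter names, and the value of k, are illustrative):

import numpy as np

def iris_likelihood_term(mean_brightness, icp_normals, icp_gradients,
                         between_eyelids, k=50.0):
    """One (left or right) term of Eq. (11), built from Eqs. (9), (12), (13)."""
    e_j = np.exp(-(mean_brightness ** 2) / k)          # Eq. (9)
    dot = np.sum(icp_normals * icp_gradients, axis=1)
    d = np.where(dot > 0.0, dot, 0.0)                  # Eq. (13)
    b = between_eyelids.astype(float)                  # Eq. (12)
    if b.sum() == 0.0:
        return 0.0
    return e_j * np.sum(b * d) / b.sum()

def pi_I(left_term, right_term):
    # Eq. (11): the iris likelihood is the sum of the left and right terms.
    return left_term + right_term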
3.2 The Likelihood Function of Eyelids
In order to calculate the likelihood of the iris more correctly, it is also necessary to track the eyelids. Since the difference between the brightness of the two sides of an eyelid is not as large as in the case of an iris contour, we only use the image gradient to estimate the likelihood of the eyelids (pi_E):

pi_E = \frac{\sum_{k=0}^{N-1} D(i^d_{lk})}{N} + \frac{\sum_{k=0}^{N-1} D(i^d_{rk})}{N},    (14)

where

D(i^d_{jk}) = h^d_{jk} . g(i^d_{jk}) if h^d_{jk} . g(i^d_{jk}) > 0, 0 otherwise,   j = l, r,

i^d_{jk} is the projection of each point on the eyelids and h^d_{jk} is its normal vector. The likelihood function pi of the whole 3D eye-model, including the irises and the eyelids, is defined as

pi = pi_I pi_E.    (15)

4 Eye Gaze Estimation
In order to estimate the eye gaze with the 3D eye-model, it is necessary to determine the personal parameters described in Section 2 for the testee's eyes. At the beginning of tracking, we assign the positions of the inner eye corners (E_hl and E_hr) and the outside eye corners (E_tl and E_tr) on the image manually. The other eyelid parameters are derived from these four points. The personal parameters of the eyeballs are set to average human values. After this, all personal parameters except the four manually assigned points are estimated with a particle filter using several frames of the image sequence. During tracking, the movement parameters of the eyes and the head that give the maximum likelihood are estimated by the particle filter. From these movement parameters the eye gaze is calculated.
5 Experimental Results
We have tested our method of tracking iris contours and estimating eye gaze using a PC with a 3 GHz Pentium 4 CPU. The input image sequences (640 x 480 pixels) were taken by a normal video camera. The experimental environment is shown in Fig. 2(a).

5.1 Tracking the Iris Contour
First, we used our method to track iris contours. The parameters of the particle filter are theta_p, theta_t, m, T = (t_x, t_y, t_z), and psi, which is the rotation angle of the head (eye-model) around the z-axis of the camera coordinate system. For the random sampling, the standard deviations of the normal distribution were taken as theta_p: 5 [degree], theta_t: 5 [degree], m: 0.05, t_x: 0.1 [cm], t_y: 0.03 [cm], t_z: 0.01 [cm], and psi: 0.001 [rad], respectively.
Fig. 2. (a): The experimental environment. (b): The environment for evaluating the accuracy of eye gaze.
Fig. 3. Some tracking results (frames 1, 35, 45, 74, 80, 140, 170 and 200) with random sampling performed twice for each frame using 150 samples
The processing time of the tracking was 9.6 ms/frame for 100 samples and 30.8 ms/frame for 400 samples, respectively. When the eye moved quickly, large delays occurred during tracking in the case of 100 samples, and only small ones in the case of 400 samples. In order to increase the tracking accuracy, we carried out the random sampling twice for each frame with 150 samples. Some tracking results are shown in Fig. 3. In this case, the processing time was 29 ms/frame and the iris contour and the eyelid contour could be tracked without delay even when the eyes moved quickly. Moreover, the tracking accuracy was much improved. All experiments described hereafter were carried out in this way. When a person was blinking the eyes, as shown in Fig. 4, the eyelids moved quickly and thus could not be tracked perfectly. Also, since the irises were not visible when the eyes were closed, the iris contours could not be tracked exactly. When the system detects closed eyes (from the movement parameter m), it holds the former state of the irises just before the eyes were closed. When the irises become visible after blinking, the tracking of the irises will
Fig. 4. Some tracking results of blinking eyes (frames 263-270)
Fig. 5. The convergence of personal parameters: (a) radius of the eyeball r_w, (b) radius of the iris r_b, (c) the eyeball center w_x, (d) the first frame, (e) the 15th frame
be restarted. From this experimental result we confirmed that our method can track iris contours even when people blink their eyes.

5.2 Initialization of Personal Parameters
After the four eye corners had been given manually, the rest of the personal parameters of the 3D eye-model were estimated with a particle filter. For the random sampling, the standard deviations of the normal distribution were taken as r_b: 0.02 [cm], r_w: 0.01 [cm], w_x: 0.03 [cm], and for the x and y coordinates of the eyelid control points: 0.2 [cm] and 0.05 [cm], respectively. Random sampling with 1500 samples was performed for each frame, and the likelihood was evaluated twice. A known eye gaze is necessary for estimating the personal parameters. In the experiments, this estimation was carried out using the images taken when the person looked forward.
Figure 5 shows the behavior of the estimated values of some personal parameters. The horizontal axis indicates the frame number and the vertical axis indicates the estimated value. From Fig. 5, we confirmed that the personal parameters converge within several frames. Figure 5(d) and (e) show the estimated results for the initial frame and the 15th frame. The processing time for estimating the personal parameters was 430 ms/frame. After that, our algorithm could work at video rate (30.1 ms/frame) for tracking the iris contours and estimating the eye gaze.

5.3 Accuracy Evaluation of Eye Gaze Estimation
In order to evaluate the accuracy of the eye gaze estimated with the proposed method, we put some markers on a wall 4 meters away from the testee (see Fig. 2(b)). The number of testees was five. We let each testee gaze at each marker for 2 seconds and estimated the eye gaze with our system. Table 1 shows the difference between the mean value of the estimated eye gaze and the true value of each marker.

Table 1. The difference between the mean value of the estimated eye gaze and the true value of each marker (x-direction, y-direction) [unit: degree]

Person A        1st col.      2nd col.      3rd col.
Upper row       (0.6, -0.5)   (1.5, 1.9)    (2.5, 1.5)
Middle row      (2.6, 2.4)    (2.5, 1.6)    (1.8, 0.6)
Bottom row      (4.9, 5.1)    (1.9, 3.6)    (1.7, 4.3)

Person B        1st col.      2nd col.      3rd col.
Upper row       (-1.8, -0.5)  (-2.9, -0.1)  (-4.1, 1.4)
Middle row      (-1.6, 1.7)   (-1.6, 2.1)   (-4.8, 1.6)
Bottom row      (-0.9, 5.4)   (-2.4, 6.0)   (-3.4, 6.0)

Person C        1st col.      2nd col.      3rd col.
Upper row       (5.1, 3.7)    (2.5, 3.3)    (2.6, 1.3)
Middle row      (2.4, -1.8)   (2.9, -1.4)   (0.9, -0.2)
Bottom row      (1.9, 1.2)    (-2.0, 1.9)   (-3.2, 2.9)

Person D        1st col.      2nd col.      3rd col.
Upper row       (-3.3, -1.8)  (1.1, 0.0)    (-1.0, 0.0)
Middle row      (-4.5, -1.8)  (-2.4, -1.3)  (1.6, -0.1)
Bottom row      (0.9, 0.4)    (-0.9, 0.2)   (0.1, 0.8)

Person E        1st col.      2nd col.      3rd col.
Upper row       (3.5, 0.8)    (2.4, 1.4)    (4.5, 1.5)
Middle row      (-0.9, -5.3)  (1.2, -6.4)   (3.5, -3.5)
Bottom row      (3.8, -1.6)   (-2.6, -1.9)  (-1.1, -2.2)

6 Conclusion
This paper aims at tracking iris contours and estimating eye gaze from monocular video images. In order to suppress the influence of eyelid edges, we have proposed a 3D eye-model for tracking iris and eyelid contours. Using this eye-model, the eyelid contours and iris contours can be distinguished. By using only the edge points between the upper and lower eyelids, the shape of the iris can be estimated and then the eye gaze can be measured. From the experimental results, we confirmed that the proposed algorithm can track iris contours and eyelid contours robustly and can estimate the eye gaze at video rate. The proposed algorithm can be applied
to various applications, such as checking whether a driver is looking aside. Acknowledgments. This research was partially supported by the Ministry of Education, Science, Sports and Culture, Grant-in-Aid for Scientific Research (C), 18200131 and 19500150.
References
1. Zhu, Z., Ji, Q.: Eye Gaze Tracking Under Natural Head Movements. In: CVPR, vol. 1, pp. 918–923 (2005)
2. Hennessey, C., Noureddin, B., Lawrence, P.: A Single Camera Eye-Gaze Tracking System with Free Head Motion. In: Symposium on Eye Tracking Research & Applications, pp. 87–94 (2006)
3. Matsumoto, Y., Zelinsky, A.: An Algorithm for Real-time Stereo Vision Implementation of Head Pose and Gaze Direction Measurement. In: FG, pp. 499–504 (2000)
4. Beymer, D., Flickner, M.: Eye Gaze Tracking Using an Active Stereo Head. In: CVPR, vol. 2, pp. 451–458 (2003)
5. Hansen, D., et al.: Tracking Eyes using Shape and Appearance. In: IAPR Workshop on Machine Vision Applications, pp. 201–204 (2002)
6. Hansen, D.W., Pece, A.: Eye Typing off the Shelf. In: CVPR, vol. 2, pp. 159–164 (2004)
7. Ishikawa, T., et al.: Passive Driver Gaze Tracking with Active Appearance. In: 11th World Congress on ITS in Nagoya, pp. 100–109 (2004)
8. Gee, A., Cipolla, R.: Estimating Gaze from a Single View of a Face. In: ICPR, pp. 758–760 (1994)
9. Zhu, J., Yang, J.: Subpixel Eye Gaze Tracking. In: FG (2002)
10. Smith, P., Shah, M., Lobo, N.: Determining Driver Visual Attention with One Camera. IEEE Trans. on Intelligent Transportation Systems 4(4), 205–218 (2003)
11. Wu, H., Chen, Q., Wada, T.: Visual Direction Estimation from a Monocular Image. IEICE E88-D(10), 2277–2285 (2005)
12. Wang, J.G., Sung, E., Venkateswarlu, R.: Eye Gaze Estimation from a Single Image of One Eye. In: ICCV (2003)
13. Tian, Y.l., Kanade, K., Cohn, J.F.: Dual-state Parametric Eye Tracking. In: FG, pp. 110–115 (2000)
14. Isard, M., Blake, A.: Condensation - conditional density propagation for visual tracking. IJCV 29(1), 5–28 (1998)
15. Hua, C., Wu, H., Chen, Q., Wada, T.: A General Framework For Tracking People. In: FG, pp. 511–516 (2006)
16. Duchowski, A.T.: A Breadth-First Survey of Eye Tracking Applications. Behavior Research Methods, Instruments, and Computers (2002)
17. Criminisi, A., Shotton, J., Blake, A., Torr, P.H.S.: Gaze Manipulation for One-to-one Teleconferencing. In: ICCV (2003)
18. Yoo, D.H., et al.: Non-contact Eye Gaze Tracking System by Mapping of Corneal Reflections. In: FG (2002)
19. Schubert, A.: Detection and Tracking of Facial Features in Real Time Using a Synergistic Approach of Spatial-Temporal Models and Generalized Hough-Transform Techniques. In: FG, pp. 116–121 (2000)
Eye Correction Using Correlation Information
Inho Choi and Daijin Kim
Department of Computer Science and Engineering, Pohang University of Science and Technology (POSTECH)
{ihchoi,dkim}@postech.ac.kr
Abstract. This paper proposes a novel eye detection method using the MCT-based pattern correlation. The proposed method detects the face by the MCT-based AdaBoost face detector over the input image and then detects two eyes by the MCT-based AdaBoost eye detector over the eye regions. Sometimes, we have some incorrectly detected eyes due to the limited detection capability of the eye detector. To reduce the falsely detected eyes, we propose a novel eye verification method that employs the MCT-based pattern correlation map. We verify whether the detected eye patch is eye or non-eye depending on the existence of a noticeable peak. When one eye is correctly detected and the other eye is falsely detected, we can correct the falsely detected eye using the peak position of the correlation map of the correctly detected eye. Experimental results show that the eye detection rate of the proposed method is 98.7% and 98.8% on the Bern images and AR-564 images.
1 Introduction
Face analysis and authentication problems are solved by three different kinds of methods [1]: holistic methods [2], local methods [2,3], and hybrid methods [4]. The holistic method identifies a face using the whole face image and needs alignment of the images and normalization using the facial features. The local method uses local facial features such as the eyes, nose, and mouth, and needs to localize the fiducial points to analyze the face image. The hybrid method uses both holistic and local features. Because the eyes are stable features of the face, they are used as reliable features for face normalization. So, it is very important to detect and localize the eyes in face authentication and/or recognition applications. Brunelli and Poggio [2] and Beymer [5] detected the eyes using template matching. It uses the similarity between the template image and the input image and is largely dependent on the initial position of the template. Pentland et al. [6] used the eigenspace method to detect the eyes. The eigenspace method showed better eye detection performance than the template matching method, but its detection performance is largely dependent on the choice of training images. Kawaguchi and Rizon [7] detected the iris using intensity and edge information. Song and Liu [8] use binary edge images. Their method includes many techniques, such as the feature template, template matching, a separability filter, binary valley extraction, and so on, and needs different parameters for different databases. So, these methods are not intuitive and not simple. We therefore consider a more natural and intuitive method that detects the face region and then searches for the eyes in subregions of the detected face. Freund and Schapire introduced the AdaBoost algorithm [9] and showed that it has a good generalization capability. Viola and Jones [10] proposed a robust face detection method using AdaBoost with simple features, and it provided a good performance for locating the face region. Fröba and Ernst [11] used the AdaBoost algorithm with a modified version of the census transform (MCT). This method is very robust to illumination changes and very fast at finding the face. Among these AdaBoost methods, we choose the MCT-based AdaBoost training method to detect the face and the eyes, due to its simplicity of learning and high speed of detection. However, sometimes we fail to detect the eyes, detecting instead the eyebrows or hair. To reduce this problem, we propose a novel eye verification process that decides whether the detected eye is a true or a false eye. The proposed eye verification method employs the MCT-based pattern correlation map. We can verify whether the detected eye is true or false depending on the existence of a noticeable peak in the correlation map. Using this property of the correlation map, we can correct the falsely detected eye using the peak position in the correlation map of the opposite eye. Assume that one eye is correctly detected and the other eye is falsely detected. Then the correlation map of the correctly detected eye provides a noticeable peak that corresponds to the true location of the falsely detected eye.
2 Eye Detection Using MCT + AdaBoost
The Modified Census Transform (MCT) is a non-parametric local transform, proposed by Fröba and Ernst [11], which modifies the census transform. It is an ordered set of comparisons of pixel intensities in a local neighborhood, representing which pixels have an intensity greater than the mean of the pixel intensities. We present an eye detection method using AdaBoost training with MCT-based eye features. In the AdaBoost training, we construct weak classifiers which classify eye and non-eye patterns and then construct the strong classifier as a linear combination of the weak classifiers. In the detection, we scan the eye region by moving a 12 x 8 scanning window and obtain the confidence value of the strong classifier at each window location. Then, we determine the window location whose confidence value is maximum as the location of the detected eye. Our MCT-based AdaBoost training has been performed using only left-eye and non-eye training images. So, when we are trying to detect the right eye, we need to flip the right subregion of the face image.
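The scanning step can be sketched as follows, assuming a grayscale eye subregion stored as a 2-D NumPy array and an already trained strong classifier exposed as a callable returning a confidence value (the callable itself is not defined here and is an assumption of this sketch):

import numpy as np

def detect_eye(eye_region, strong_classifier, win_w=12, win_h=8):
    """Slide a 12x8 window over the eye subregion and return the location with
    the maximum confidence of the boosted strong classifier."""
    best_conf, best_xy = -np.inf, (0, 0)
    h, w = eye_region.shape
    for y in range(h - win_h + 1):
        for x in range(w - win_w + 1):
            patch = eye_region[y:y + win_h, x:x + win_w]
            conf = strong_classifier(patch)
            if conf > best_conf:
                best_conf, best_xy = conf, (x, y)
    return best_xy, best_conf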
3 Eye Verification
To remove false detections, we devise an eye verification step that decides whether the detected eye is true or false, using the MCT-based pattern correlation and the symmetry of the human face [12].

3.1 MCT-Based Pattern Correlation
The MCT represents the local intensity variation of several neighboring pixels around a given pixel as an ordered bit pattern (see Fig. 1). Because the MCT value is non-linear, the decoded value of the MCT is not appropriate for measuring the difference between two MCT patterns. To solve this problem, we propose the MCT-based pattern and the MCT-based pattern correlation based on the Hamming distance, which measures the difference between two MCT-based patterns. The MCT-based pattern P(x, y) at the pixel position (x, y) is a binary representation of the 3 x 3 pixels in a fixed order, from the upper left pixel to the lower right pixel:

P(x, y) = [b_0, b_1, ..., b_8],    (1)

where b_i is a binary value obtained by the comparison function

b_{3(y'-y+1)+(x'-x+1)} = C(\bar{I}(x, y) + \alpha, I(x', y')),    (2)

where x' in {x - 1, x, x + 1}, y' in {y - 1, y, y + 1}, \bar{I}(x, y) is the mean of the neighborhood pixels and I(x', y') is the intensity of each pixel.
Fig. 1. Examples of the MCT-based patterns: (a) the bit ordering b_0 ... b_8 over the 3 x 3 neighborhood; (b) a 3 x 3 patch with intensities (15, 70, 15; 15, 70, 15; 15, 70, 15) and mean 33.3 is mapped to the MCT-based pattern 010010010 (a bit is set where I(x, y) > 33.3)
We propose the MCT-based pattern correlation to compute the similarity between two different MCT-based patterns. It is based on the Hamming distance, which counts the number of positions whose binary values differ between two MCT-based patterns. The MCT-based pattern correlation between image A and image B is defined as

\rho = \frac{1}{N} \sum_{x,y} \rho_{x,y},    (3)

where N is the number of pixels in the image and \rho_{x,y} is the MCT-based pattern correlation at the pixel position (x, y):

\rho_{x,y} = \frac{1}{9} (9 - HammingDistance(P_A(x, y), P_B(x, y))),    (4)

where P_A(x, y) and P_B(x, y) are the MCT-based patterns of images A and B, respectively.
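A small sketch of Eqs. (1)-(4), following the convention of Fig. 1 that a bit is set where the pixel is brighter than the neighborhood mean; NumPy is assumed and the function names are illustrative:

import numpy as np

def mct_pattern(image, x, y, alpha=0.0):
    """MCT-based pattern P(x, y) of Eq. (1): 9 bits comparing each pixel of the
    3x3 neighborhood against the neighborhood mean plus a small offset alpha."""
    patch = image[y - 1:y + 2, x - 1:x + 2].astype(float)
    return (patch > patch.mean() + alpha).astype(np.uint8).ravel()

def pattern_correlation(pattern_a, pattern_b):
    """rho_{x,y} of Eq. (4): similarity from the Hamming distance."""
    hamming = np.count_nonzero(pattern_a != pattern_b)
    return (9 - hamming) / 9.0

def image_correlation(img_a, img_b):
    """rho of Eq. (3): average pattern correlation over the image interior."""
    h, w = img_a.shape
    vals = [pattern_correlation(mct_pattern(img_a, x, y), mct_pattern(img_b, x, y))
            for y in range(1, h - 1) for x in range(1, w - 1)]
    return float(np.mean(vals))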
Fig. 2. Five face images with different illuminations

Table 1. Comparison between the conventional image correlation with histogram equalization and the MCT-based pattern correlation

Image pair                       Conventional image correlation with hist. eq.   MCT-based pattern correlation
Face image 1 and face image 2    0.873                                            0.896
Face image 1 and face image 3    0.902                                            0.862
Face image 1 and face image 4    0.889                                            0.883
Face image 1 and face image 5    0.856                                            0.839
Face image 2 and face image 3    0.659                                            0.827
Face image 2 and face image 4    0.795                                            0.890
Face image 2 and face image 5    0.788                                            0.849
Face image 3 and face image 4    0.846                                            0.870
Face image 3 and face image 5    0.794                                            0.865
Face image 4 and face image 5    0.627                                            0.808
Mean                             0.803                                            0.859
Variance                         0.094                                            0.028
Fig. 2 shows five different face images with different illuminations. Table 1 compares the conventional image correlation with histogram equalization and the MCT-based pattern correlation between two image pairs. The table shows that (1) the mean of the MCT-based pattern correlation is higher than that of the conventional image correlation and (2) the variance of the MCT-based pattern correlation is much smaller than that of the conventional image correlation. This implies that the MCT-based pattern is more robust to illumination changes than the conventional image correlation. The MCT-based pattern correlation map is built by sliding a detected eye over the eye region of the opposite side and computing the correlation value in terms of the Hamming distance.

3.2 Eye/Non-eye Classification
The detected left and right eye patches can each be either an eye or a non-eye. In this work, they are classified into eye or non-eye depending on the existence of a noticeable peak in the MCT-based correlation map, as follows. If there is a noticeable peak in the MCT-based correlation map, the detected eye patch is an eye; otherwise, the detected eye patch is a non-eye. Since the eye detector produces two detected eye patches, on the left and right eye subregions respectively, we build two different MCT-based pattern correlation maps:

- Case 1(2): Left(right) eye correlation map, which is the MCT-based pattern correlation map between the detected left(right) eye patch and the right(left) subregion of the face image.

We want to show how the MCT-based pattern correlation maps of a correctly detected eye and a falsely detected eye differ from each other. Three images in Fig. 3 are used to build the left eye correlation map (Case 1): (a) a right eye subregion, (b) a flipped image patch of the correctly detected left eye, and (c) a flipped image patch of the falsely detected left eye (in this case, an eyebrow). Fig. 3-(d),(e) show the correlation maps of the correctly detected left eye patch and the falsely detected left eye patch, respectively. As can be seen, the two correlation maps look very different from each other: the true eye patch produces a noticeable peak at the right eye position while the non-eye patch (eyebrow) does not produce any noticeable peak over the entire right eye subregion. From this fact, we need an effective way of finding a noticeable peak in the correlation map in order to decide whether the detected eye patch is an eye or a non-eye. In this work, we consider a simple way of peak finding based on two predetermined correlation values.
Fig. 3. A typical example of the right eye subregion, the detected eye and non-eye in the left eye subregion
The proposed eye/non-eye classification method is given below. First, we rescale the correlation map so that its highest peak value becomes 1. Second, we overlay a peak finding window W_peak with a size of w x h at the position with the highest value in the correlation map, where w and h are the width and the height of the detected eye patch. Third, we classify whether the detected eye patch E_detected is eye or non-eye according to the following rule:

E_detected = eye if R < \tau, non-eye otherwise,    (5)

where \tau is a given threshold and R is the high correlation ratio, defined as the ratio of the number of pixel positions whose correlation value is greater than a given threshold \rho_t to the total number of pixel positions within the peak finding window W_peak:

R = \frac{1}{N} \sum_{u'=u-w/2}^{u+w/2} \sum_{v'=v-h/2}^{v+h/2} C(\rho(u', v'), \rho_t),    (6)

where N is the total number of pixel positions of W_peak and C is a comparison function:

C(\rho(u', v'), \rho_t) = 1 if \rho(u', v') > \rho_t, 0 otherwise.    (7)
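A sketch of the decision rule of Eqs. (5)-(7); the threshold values rho_t and tau below are illustrative placeholders, not values taken from the paper:

import numpy as np

def classify_eye(corr_map, peak_xy, win_w, win_h, rho_t=0.8, tau=0.3):
    """Eye/non-eye decision of Eqs. (5)-(7) on an MCT-based correlation map."""
    corr = corr_map / corr_map.max()            # rescale so the peak is 1
    x, y = peak_xy
    y0, y1 = max(0, y - win_h // 2), y + win_h // 2 + 1
    x0, x1 = max(0, x - win_w // 2), x + win_w // 2 + 1
    window = corr[y0:y1, x0:x1]                 # peak finding window W_peak
    r = np.count_nonzero(window > rho_t) / window.size   # Eq. (6)
    # A true eye gives one sharp, isolated peak, so few positions exceed rho_t.
    return ('eye' if r < tau else 'non-eye'), r           # Eq. (5)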
4 Falsely Detected Eye Correction
After eye verification, we have four different classification results on the left and right eye regions: 1) eye and eye, 2) eye and non-eye, 3) non-eye and eye, and 4) non-eye and non-eye. In the first and fourth cases, we succeed and fail to detect the two eyes, respectively; in the fourth case, there is no way to detect the eyes. However, in the second and third cases, we can correct the falsely detected eye as follows. In the second case, we can locate the falsely detected right eye using the peak position of the correlation map of the correctly detected left eye. Similarly, in the third case, we can locate the falsely detected left eye using the peak position of the correlation map of the correctly detected right eye. Fig. 4 shows an example of falsely detected eye correction, where (a) and (b) show the eye region images before and after the correction, respectively. In Fig. 4-(a), A, B, and C represent the correctly detected left eye, the falsely detected right eye, and the true right eye, respectively. As can be seen in Fig. 4-(b), the falsely detected right eye is corrected well.
Fig. 4. An example of the falsely detected eye correction
5 Experimental Results
For the AdaBoost training with MCT-based eye features, we used two face databases, the Asian Face Image Database PF01 [13] and the XM2VTS Database (XM2VTSDB) [14], and prepared 3,400 eye images and 220,000 non-eye images of size 12 x 8. For evaluating the proposed eye detection method, we used the Bern database [15] and the AR face database [16]. Because the accuracy of our face detection method is 100% on the Bern images and the AR images, we consider only the performance of the proposed eye detection method. As a measure of eye detection, we define the eye detection rate as

r_eye = \frac{1}{N} \sum_{i=1}^{N} d_i,    (8)

where N is the total number of test eye images and d_i is an indicator function of successful detection:

d_i = 1 if max(\delta_l, \delta_r) < R_iris, 0 otherwise,    (9)

where \delta_l and \delta_r are the distances between the center of the detected left eye and the center of the real left eye, and between the center of the detected right eye and the center of the real right eye, respectively, and R_iris is the radius of the eye's iris.
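The evaluation metric of Eqs. (8)-(9) can be computed as in the following sketch (illustrative names, NumPy assumed):

import numpy as np

def eye_detection_rate(detected, ground_truth, r_iris):
    """Eqs. (8)-(9): a detection counts as correct when both eye-center errors
    are below the iris radius r_iris. detected and ground_truth are lists of
    ((lx, ly), (rx, ry)) eye-center pairs."""
    hits = 0
    for (dl, dr), (gl, gr) in zip(detected, ground_truth):
        delta_l = np.linalg.norm(np.array(dl) - np.array(gl))
        delta_r = np.linalg.norm(np.array(dr) - np.array(gr))
        if max(delta_l, delta_r) < r_iris:
            hits += 1
    return hits / len(detected)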
Experiments in the Bern Images
Fig. 5-(a) and Fig. 5-(b) show some Bern images whose eyes are correctly and falsely detected by the strong classifier that is obtained by the AdaBoost Training with MCT-based eye features, respectively, where the boxes represent the
(a) Some examples of the (b) Some examples of the (c) An example of falsely correctly detected results falsely detected results detected eye correction Fig. 5. Some examples of results in Bern face database Table 2. Comparisons of various eye detection methods using the Bern face database Algorithms Eye detection rate (%) Proposed method 98.7% Kawaguchi and Rizon [7] 95.3% Template matching [7] 77.9% Eigenface method using 50 training samples [7] 90.7% Eigenface method using 100 training samples [7] 93.3%
Eye Correction Using Correlation Information
705
detected eye patches and the white circles represent the center of the detected eye patches. Fig. 5-(c) shows one example of falsely detected eye correction by the proposed eye correction method, where the left and right figures represent the eye detection results before and after falsely detected eye correction, respectively. Table 2 compares the detection performance of various eye detection methods. 5.2
Experiments in the AR Images
The AR-63 face database contains 63 images (twenty-one people × three different facial expressions) and the AR-564 face database includes 564 images (94 peoples × 6 conditions (3 different facial expressions and 3 different illuminations)). Fig. 6-(a) and Fig. 6-(b) show some AR images whose eyes are correctly and falsely detected by the strong classifier that is obtained by the AdaBoost Training with MCT-based eye features, respectively, where the boxes represent the detected eye patches and the white circles represent the center of the detected eye patches.
(a) Some examples of the correctly de- (b) Some examples of the falsely detected tected results results Fig. 6. Some examples of results in AR-564 face database
(a) Results of the falsely detected
(b) Results of the correction
Fig. 7. Three examples of the falsely detected eye correction
Table 3. Comparisons of various eye detection methods using the AR face database Algorithms Proposed method Song and Liu [8] Kawaguchi and Rizon [7]
AR-63 AR-564 98.4% 98.8% 96.8% 96.8% 96.8% -
706
I. Choi and D. Kim
Fig. 7 shows four examples of falsely detected eye correction by the proposed eye correction method, where (a) and (b) represent the eye detection results before and after falsely detected eye correction, respectively. Table 3 compares the detection performance of various eye detection methods. As you see, the proposed eye detection method shows better eye detection rate than other existing methods and we the improvement of eye detection rate in the case of AR-564 face database is bigger than that in the case of AR-63 face database. This implies that the proposed eye detection method works well under various conditions than other existing eye detection methods.
6
Conclusion
We proposed a eye detection method using the MCT-based pattern correlation. The eye detection method can produce the false detection near the eyebrows or the boundary of hair and forehead in particular. When the existing eye detection method detects the eye in just one subregion, then it does not improve the eye detection rate. To overcome this limitation, we proposed the eye verification and falsely detected eye correction method based on the MCT-based pattern correlation. The MCT-based pattern correlation is based on the Hamming distance that measures the difference between two MCT-based patterns ,where the MCT-based pattern is a binary representation of the MCT. Also, the proposed MCT-based pattern is robust to the illumination changes. To verify detected eye, we proposed the eye/non-eye classification method which classifies into eye or non-eye depending on the existence of a noticeable peak in the MCT-based pattern correlation map. The proposed falsely detected eye correction method uses the peak position in the MCT-based pattern correlation map to correction of the falsely detected eye which is verified by the proposed eye/non-eye classification method. It improves the eye detection rate of the proposed eye detection method. The experimental results show that a eye detection rate of 98.7% and 98.8% can be achieve on the Bern images and AR-564 database. The proposed eye detection method works well under various conditions than other existing eye detection methods.
Acknowledgements This work was partially supported by the Korea Science and Engineering Foundation (KOSEF) through the Biometrics Engineering Research Center (BERC) at Yonsei University. Also, it is financially supported by the Ministry of Education and Human Resources Development (MOE), the Ministry of Commerce, Industry and Energy (MOCIE) and the Ministry of Labor (MOLAB) through the fostering project of the Lab of Excellency.
Eye Correction Using Correlation Information
707
References 1. Tan, X., Chen, S., Zhou, Z., Zhang, F.: Face recognition from a single image per person: A survey. Pattern Recognition 39, 1725–1745 (2006) 2. Brunelli, R., Poggio, T.: Face recognition: features versus templates. IEEE Transaction on Pattern Analysis and Machine Intelligence 15, 1042–1052 (1993) 3. Lawrence, S., Giles, C., Tsoi, A., Back, A.: Face recognition: a convolutional neuralnetwork approach. IEEE Transaction on Neural Networks 8, 98–113 (1997) 4. Martinez, A.: Recognizing imprecisely localized, partially occluded, and expression variant faces from a single sample per class. IEEE Transaction on Pattern Analysis and Machine Intelligence 24, 748–768 (2002) 5. Beymer, D.: Face recognition under varying pose. In: IEEE International Conference on Computer Vision and Pattern Recognition, pp. 756–761. IEEE Computer Society Press, Los Alamitos (1994) 6. Pentland, A., Moghaddam, B., Starner, T.: View-based and modular eigenspaces for face recognition. In: IEEE International Conference on Computer Vision and Pattern Recognition, pp. 84–91. IEEE Computer Society Press, Los Alamitos (1994) 7. Kawaguchi, T., Rizon, M.: Iris detection using intensity and edge information. Pattern Recognition 36, 549–562 (2003) 8. Song, J., Chi, Z., Li, J.: A robust eye detection method using combined binary edge and intensity information. Pattern Recognition 39, 1110–1125 (2006) 9. Freund, Y., Schapire, R.: A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence 14, 771–780 (1999) 10. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: IEEE International Conference on Computer Vision and Pattern Recognition, pp. 511–518. IEEE Computer Society Press, Los Alamitos (2001) 11. Froba, B., Ernst, A.: Face detection with the modified census transform. In: IEEE International Conference on Automatic Face and Gesture Recognition, pp. 91–96 (2004) 12. Song, Y.J., Kim, Y.G., Chang, U.D., Kwon, H.B.: Face recognition robust to left/right shadows; facial symmetry. Pattern Recognition 39, 1542–1545 (2006) 13. Je, H., Kim, S., Jun, B., Kim, D., Kim, H., Sung, J., Bang, S.: Asian Face Image Database PF01, Technical Report. Intelligent Multimedia Lab, Dept. of CSE, POSTECH (2001) 14. Luettin, J., Maˆıtre, G.: Evaluation Protocol for the Extended M2VTS database (XM2VTSDB), IDIAP Communication 98-05. In: IDIAP, Martigny, Switzerland, pp. 98–95 (1998) 15. Achermann, B.: The face database of University of Bern. Institute of Computer Science and Applied Mathematics, University of Bern (1995) 16. Martinez, A., Benavente, R.: The AR Face Database, CVC Technical Report #24 (1998)
Eye-Gaze Detection from Monocular Camera Image Using Parametric Template Matching Ryo Ohtera, Takahiko Horiuchi, and Shoji Tominaga Graduate School of Science and Technology, Chiba University, 1-33 Yayoi-cho, Inage-ku, Chiba, 263-8522, Japan
[email protected], {horiuchi,shoji}@faculty.chiba-u.jp
Abstract. In the coming ubiquitous-computing society, an eyegaze interface will be one of the key technologies as an input device. Most of the conventional eyegaze tracking algorithms require specific light sources, equipments, devices, etc. In a previous work, the authors developed a simple eye-gaze detection system using a monocular video camera. This paper proposes a fast eye-gaze detection algorithm using the parametric template matching. In our algorithm, the iris extraction by the parametric template matching is applied to the eye-gaze detection based on physiological eyeball model. The parametric template matching can carry out an accurate sub-pixel matching by interpolating a few template images of a user’s eye captured in the calibration process for personal error. So, a fast calculation can be realized with keeping the detection accuracy. We construct an eye-gaze communication interface using the proposed algorithm, and verified the performance through key typing experiments using visual keyboard on display. Keywords: Eyegaze Detection, Parametric template matching, Eyeball model, Eyegaze Keyboard.
1 Introduction Human eyes always chase an interesting object. Gaze determines a user’s current line of sight or point of fixation. So, the direction of the eyegaze can express the interests of the user, and the gaze may be used to interpret the user’s intention for non-command interactions. Incorporating eye movements into the interaction between humans and computers provides numerous potential benefits. Moreover, the eyegaze communication interface is very important for not only users in normal health but also severely handicapped users such as quadriplegic users. Although the keyboard and the mouse are used as an interface of the computer, it is necessary for us to move the input device with the hand. Therefore, the substituted input device is necessary for the person who owed the handicap. The eyegaze detection has progressed through measurements of infrared irradiation and myoelectric potential around eyes. Gips et al. proposed a detection algorithm based on EOG method [1] in which the myoelectric potential following the motion of eyeball can be measured by electrode on the face. However, the influence of the electric noise Y. Yagi et al. (Eds.): ACCV 2007, Part I, LNCS 4843, pp. 708–717, 2007. © Springer-Verlag Berlin Heidelberg 2007
However, the influence of the electric noise embedded in these minute potentials is not negligible. Moreover, specific instruments are required for the measurements, and the load on the user is also not negligible. Cornea-reflection-based systems were also proposed in Refs. [2]-[4]; these systems require that the ambient illumination suppress extraneous reflections. Cornsweet and Crane presented a simple, high-precision eyegaze detection system using a jaw board and a headrest, in which head motion is cancelled by exploiting the fact that the first and fourth Purkinje images are affected by head motion in opposite ways [5]. However, methods using corneal reflection images require specific optical devices, such as infrared illuminators, to capture the iris as a high-luminance or low-brightness area.
Recently, video-based techniques have been studied. Compared with the non-video-based gaze tracking methods mentioned above, video-based methods have the advantage of being unobtrusive and comfortable during gaze detection. Kawato et al. presented a gaze detection method using four reference points placed on the face and three calibration images [6]. Matsumoto et al. presented a real-time stereo vision system to estimate the gaze direction [7]. Wang et al. presented a method for estimating the eyegaze by measuring the change of the contour of the iris [8]. In general, the "big-eye" approaches based on precise measurements of the eye are expensive [8],[9], because they require a pan-tilt/zoom-in stereo camera with sufficient resolution to accurately measure the iris contour or the pupil. Methods that use the movement of the iris captured by a single low-resolution camera with a fixed head pose are described in Refs. [10],[11]. These methods use a luminance gradient to extract the iris semicircle and the eye corners, so the rough position of the eyes must be specified beforehand. Iris extraction is very important in any algorithm that detects the gaze from the movement of the iris. The authors previously proposed a precise iris extraction algorithm based on the Hough transform [12]; however, it cannot detect sub-pixel movement, and the Hough transform requires a heavy computation time.
In this study, we concentrate on a video-based approach and develop a simple eyegaze detection system using only a monocular video camera. The proposed method does not require any specific equipment other than the monocular video camera and places only a light psychological burden on the user. In our algorithm, the rotation model for the eyeball is constructed from traditional physiological models, namely Emsley's eyeball model [13] and Gullstrand's model No. 2 [14]. In the system, the eyegaze angle with respect to the optical axis is calculated from the amount of movement of the iris center after calibration. Then, a coarse-to-fine parametric template matching method [15] is performed to extract the user's iris robustly, quickly, and with sub-pixel accuracy.
The remaining sections are organized as follows. In Sec. 2, a rotation model for the eyeball is defined and an eyegaze estimation algorithm is described. In Sec. 3, an iris detection algorithm, which is the key technology for estimating the eyegaze, is proposed. In Sec. 4, experimental results show the performance of the proposed gaze detection system. In order to verify the performance, in Sec. 5 the proposed gaze detection algorithm is applied to an eyegaze communication interface.
2 Gaze Estimation Model
2.1 Gullstrand's Schematic Eye No. 2 and Emsley's Reduced Schematic Eye
In order to estimate the eyegaze, we first consider physiological eyeball models. Schematic eyes are standard models in ocular optics whose parameters are given by observed values, or approximations thereof, of the optical parameters in dioptrics. Several schematic eyes have been proposed; examples include LeGrand's schematic eye, Donders' reduced eye, Lawrence's reduced eye, and Listing's reduced eye. In this paper, we use Gullstrand's No. 2 schematic eye and Emsley's schematic eye, which express the size of the eyeball in a simple way. Gullstrand's model, which consists of a precise model (No. 1) and a non-precise one (No. 2), and Emsley's model are well-known eyeball models. The Gullstrand (No. 2)-Emsley reduced eye consists of a one-surface cornea and a two-surface lens with spherical, rotationally symmetric surfaces. Values for the accommodation-stop and the super-accommodation of the eye can be represented.
2.2 Gaze Detection Algorithm
The eyegaze can be defined as a vector directed from the center of the iris to the center of the eyeball. In this study, the movement of the center of the iris is detected, and the eyegaze angle is calculated using a rotation model based on Gullstrand's reduced schematic eye (No. 2) and Emsley's reduced schematic eye. Figure 1 shows the proposed rotation model. The center of the eyeball lies 13 mm behind the cornea. The gonioscope width, i.e., the distance from the cornea to the iris, is set to 3.4 mm, the average over the reported range (3.2-3.6 mm) from unaccommodated to strongly accommodated eyes. The length from the cornea to the fundus is 23.9 mm in Emsley's reduced schematic eye. Therefore, the length from the eyeball rotation center to the bottom of the iris becomes 9.6 mm. The eyeball rotation is defined as 0 deg when the user gazes at the reference point, and the movement of the iris center calculated in Sec. 3 is denoted B. Assuming that the distance to the camera is much larger than the eyeball, the eyegaze angle θ with respect to the frontal baseline is given by

θ = sin⁻¹(B / S)    (1)
where, for oblique gaze directions, the vertical and horizontal movements B of the iris center are calculated separately and each is applied to the model. The symbol S in Fig. 1 denotes the personal size of the eyeball, which depends on the user and is calibrated in Sec. 4.2. However, a conversion is needed before applying the model, because the movement of the iris center is obtained in pixels rather than as a physical measurement. When the user fixates on the reference point, the iris appears as a circle because the eyeball is not rotated. The diameter of the iris is assumed to be 11.5 mm from the schematic eyes. Therefore, the conversion from the obtained image to physical units assumes a corneal diameter of 11.5 mm
and computes the physical length (in mm) corresponding to one pixel. Since time-sequential processing between frames is not considered, the proposed process is performed independently for each frame.
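As a minimal sketch (ours, not the authors' code) of how Eq. (1) and the pixel-to-millimetre conversion just described combine to give a gaze angle; the function and variable names are illustrative, and the default S is the schematic-eye value before personal calibration.

```python
import math

def gaze_angle_deg(delta_px, iris_diameter_px, S_mm=9.6, iris_diameter_mm=11.5):
    """Eyegaze angle (Eq. 1) from the iris-centre displacement.

    delta_px         -- displacement of the iris centre from the reference
                        position, in pixels (horizontal or vertical component)
    iris_diameter_px -- iris diameter measured while the user fixates the
                        reference point
    S_mm             -- eyeball parameter S (schematic-eye default, calibrated later)
    iris_diameter_mm -- assumed physical iris diameter from the schematic eyes
    """
    mm_per_pixel = iris_diameter_mm / iris_diameter_px
    B_mm = delta_px * mm_per_pixel                # physical displacement B
    ratio = max(-1.0, min(1.0, B_mm / S_mm))      # guard against noise
    return math.degrees(math.asin(ratio))         # theta = asin(B / S)

# Example: a 12-pixel shift with a 70-pixel iris diameter
print(gaze_angle_deg(12, 70))   # approximately 11.9 degrees
```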
Fig. 1. The eye rotation model used in our algorithm
3 Iris Detection Algorithm
3.1 Overview of the Parametric Template Matching
As described in Sec. 2.2, the eyegaze can be detected by extracting the center of the iris, so precise extraction of the iris center is essential for accurate eyegaze detection. One of the biggest problems in iris extraction is how to handle the change of iris shape caused by rotation of the eyeball. In addition, high-speed processing and accurate iris detection are required for real-time operation. In this paper, we use the parametric template matching method [15] for iris extraction. The parametric template space is defined as a template space expressed by a linear interpolation of two or more given templates called "vertex templates"; high-speed and high-accuracy matching can then be realized by coarse-to-fine matching. The vertex templates are described in detail in Sec. 3.2. The zero-mean normalized cross-correlation is calculated between a constructed parametric template and an object image as follows:
ρ(x, y) ≡ [ Σ_{(k,s)∈T} Δt(x, y, k, s) · Δg(k, s) ] / [ √( Σ_{(k,s)∈T} Δt(x, y, k, s)² ) · √( Σ_{(k,s)∈T} Δg(k, s)² ) ]    (2)

Δt(x, y, k, s) ≡ t(x + k, y + s) − t̄,   Δg(k, s) ≡ g(k, s) − ḡ
T ≡ { (k, s) | 1 ≤ k ≤ l_x, 1 ≤ s ≤ l_y },   l_x, l_y : template size
where t and g denote the luminance values of the template image and the object image, respectively. In this paper, an iris image captured while the user gazes at the center position is used as the object image, and t̄ and ḡ are the mean values of the respective images. The correlation is calculated for every position within the input image, and the position (x*, y*) with the maximum correlation ρ(x*, y*) is detected as the iris position. To reduce processing time, we use a coarse-to-fine searching algorithm, which is explained in the next subsection.
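To make Eq. (2) concrete, the following Python sketch (our own illustration, not the authors' implementation) computes a plain zero-mean normalized cross-correlation between a single fixed template and every position of a search image; the interpolation of vertex templates and the coarse-to-fine schedule are omitted here.

```python
import numpy as np

def zncc(template, image):
    """Zero-mean normalized cross-correlation score map (cf. Eq. (2)).

    template, image -- 2-D float arrays, template smaller than image.
    The maximum of the returned map indicates the best-matching position.
    """
    ly, lx = template.shape
    H, W = image.shape
    dt = template - template.mean()
    norm_t = np.sqrt((dt ** 2).sum())
    scores = np.full((H - ly + 1, W - lx + 1), -np.inf)
    for y in range(H - ly + 1):
        for x in range(W - lx + 1):
            patch = image[y:y + ly, x:x + lx]
            dg = patch - patch.mean()
            denom = norm_t * np.sqrt((dg ** 2).sum())
            if denom > 0:
                scores[y, x] = (dt * dg).sum() / denom
    return scores

# Usage sketch:
# scores = zncc(iris_template, eye_image)
# y_star, x_star = np.unravel_index(np.argmax(scores), scores.shape)
```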
3.2 Construction of the Parametric Template
In this section, we define the vertex templates and construct the parametric template. We use the following seven images as the vertex templates: (a) three local images within the input image, and (b) four iris images captured while the user gazes at each corner of the display. Images (a) are used as vertex templates for detecting the iris movement. As shown in Fig. 2, the three vertex templates t̂1, t̂2, t̂3 are local images at the positions (x, y), (x + Δx, y), and (x, y + Δy) in the input image, respectively, where Δx and Δy are the sampling intervals. Images (b) are used as vertex templates for robustness to the geometric transformation of the iris, because the iris shape is transformed into an ellipse when the eyeball rotates.
Fig. 2. Concept of the parametric template matching
Fig. 3. Vertex template images for iris transformation
In Fig. 2, these vertex templates are denoted t̂4, t̂5, t̂6, and t̂7. Figure 3 shows an example of a set of vertex images for the iris transformation. When the rotation angle is large, the iris is partially concealed by the eyelid. In general multi-template matching, all transformed iris images must be prepared beforehand; in this study, we instead obtain four vertex images by having the user gaze at the corner positions and express the variation of the iris shape by interpolating those four images continuously. Using the seven vertex images, the parametric template t(ω) is constructed by the linear interpolation proposed in Ref. [15]. With the constructed parametric template, the correlation ρ(x, y) in Eq. (2) is calculated for every position in the input image, and the best-matched position (x*, y*) with the highest ρ(x*, y*) becomes a candidate position of the iris. The sampling intervals Δx and Δy are then decreased, and more precise matching is performed around the candidate position. This procedure is repeated, decreasing the interval until Δx = Δy = 1, at which point sub-pixel matching can be realized. Finally, the center of the template at the best-matched position is detected as the center of the iris.
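The coarse-to-fine schedule just described can be sketched as follows (our own illustration, not the authors' code): `match_score` stands for the correlation of Eq. (2) evaluated with the parametric template at an integer position, the initial interval follows the 8-15 range used later in the experiments, and the halving schedule, the neighbourhood size, and the omission of the final sub-pixel refinement are our simplifying assumptions.

```python
def coarse_to_fine_search(match_score, search_h, search_w, init_interval=8):
    """Coarse-to-fine template search down to pixel resolution.

    match_score(x, y) -- correlation score at integer position (x, y)
    search_h, search_w -- size of the search region in pixels
    init_interval      -- initial sampling interval
    """
    interval = init_interval
    # coarse pass over the whole search region
    best = max((match_score(x, y), x, y)
               for y in range(0, search_h, interval)
               for x in range(0, search_w, interval))
    while interval > 1:
        interval = max(1, interval // 2)
        _, bx, by = best
        # refine only around the current candidate position
        candidates = [(match_score(x, y), x, y)
                      for y in range(max(0, by - 2 * interval),
                                     min(search_h, by + 2 * interval + 1), interval)
                      for x in range(max(0, bx - 2 * interval),
                                     min(search_w, bx + 2 * interval + 1), interval)]
        best = max(candidates + [best])
    return best[1], best[2]   # (x*, y*) at pixel resolution
```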
3.3 Adjustment for Center of Iris
The shape of the iris becomes an ellipse when the eyeball rotates, so the center of the iris projected onto the two-dimensional image may shift from the center of the detected template. Figure 4 shows the eyeball viewed from the top: the line A-B represents the iris, and the line A'-B' a rotated iris. The algorithm in the previous section extracts point C as the center of the iris, because it takes the geometric midpoint between A' and B'; however, the actual center of the iris is E in Fig. 4. For more accurate eyegaze detection, we therefore have to adjust the center of the iris. We assume that the correct center of the iris is the iris center observed when the user gazes at the front. Let O be the center of the eyeball in Fig. 4, and let S be the calibrated radius of the rotation locus (the detailed calibration process is described in Sec. 4.2). Then A and B can be expressed as (−I, −√(S² − I²)) and (I, −√(S² − I²)), respectively, and the points A' and B' obtained by rotating through θ degrees are

A' : (A'_x, A'_y)ᵀ = [cos θ  −sin θ; sin θ  cos θ] (−I, −√(S² − I²))ᵀ    (3)

B' : (B'_x, B'_y)ᵀ = [cos θ  −sin θ; sin θ  cos θ] (I, −√(S² − I²))ᵀ    (4)

In Fig. 4, f, the x-coordinate of the point D, can be calculated as

f = (A'_x + B'_x) / 2 = √(S² − I²) sin θ    (5)

Here D is the center of the iris in the camera image, so the actual angle can be derived as

θ = sin⁻¹( f / √(S² − I²) )    (6)

Then the accurate eyegaze vector O-E is detected.
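In code, the correction of Eqs. (5)-(6) reduces to a few lines (a sketch with our own names; all quantities are in millimetres):

```python
import math

def corrected_gaze_angle(f_mm, S_mm, I_mm):
    """Corrected eyegaze angle from Eqs. (5)-(6).

    f_mm -- x-coordinate of the detected template centre D, which by Eq. (5)
            equals sqrt(S^2 - I^2) * sin(theta)
    S_mm -- calibrated radius S of the iris-centre rotation locus
    I_mm -- half-width I of the iris (see Fig. 4)
    """
    r = math.sqrt(S_mm ** 2 - I_mm ** 2)      # distance from O to the iris plane
    ratio = max(-1.0, min(1.0, f_mm / r))
    return math.asin(ratio)                   # Eq. (6), in radians
```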
Fig. 4. Adjustment for center of iris
4 Experimental Results of Eyegaze Detection
4.1 Experimental Environment
The proposed method is demonstrated here for a display interface. In the experiment, a reference point is set at the center of the display, and an observer with naked eyes sits in front of the reference point. The observer keeps his face directed to the front, so that the effect of face direction is suppressed. The observers were three males, and each observer was tested twice. As the observer sits in front of the eyegaze monitor display, a monocular video camera mounted below the monitor observes one of the observer's eyes. The distance from the display to the eyeball of the observer sitting on the chair is set to 400 mm, which is a widely usable distance. The only light source is an overhead fluorescent lamp arranged slightly behind the observer, so the observer is illuminated from above and behind by the fluorescent lamp alone. The face image is taken with a digital video camera, a Panasonic NV-GS200K (640×480), placed 120 mm from the display. The observer's jaw and forehead are fixed on a plate. In the experiment, 20 indices, excluding the center index, were displayed at 10-degree intervals of visual angle; the center index was used as the reference point. Each index was displayed at five-second intervals, and the observers gazed at the displayed index. The face image was captured four seconds after the index was displayed. The direction of the eyegaze from the reference point was calculated for each index using Eq. (1). With the head pose fixed, the iris does not move much, so for high-speed processing the search region was limited to about 2.5 times the iris size around the initial position.
4.2 Calibration Method
In general, before the eyegaze is detected, individual calibration for each observer is performed by having the observer gaze at two or more markers, typically 5-20 markers, on the display. Ideally, individual calibration would be unnecessary for eyegaze detection; however, because deviations result from several factors, individual calibration is
actually required. The conceivable factors in the proposed method are as follows:
(1) errors due to the optical system, such as the position of the camera;
(2) refraction at the surface of the cornea;
(3) personal error due to the shape of the eyeball;
(4) the degree of asphericity of the corneal surface;
(5) refraction through glasses or contact lenses.
This paper focuses on Error (3). Considering the personal eyeball size in Emsley's model eye and Gullstrand's model eye No. 2, we propose a simple 4-point calibration using the corner points of the display screen. The points are (vertical view angle [deg], horizontal view angle [deg]) = (−25, 20), (−25, −20), (25, −20), (25, 20). First, the display is divided into four blocks around the reference point. Next, B in Eq. (1) is calculated for a calibration point. Then, the parameter S in Eq. (1) is personalized so that the computed rotation of the eyeball matches the known view angle of that calibration point. The same adjustment procedure is applied to the other calibration points, and an adjusted parameter S is thus obtained for each of the four blocks.
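One way to realize this personalization of S is sketched below (our interpretation of the calibration step, not the authors' procedure); it is applied separately to the vertical and horizontal components, and the helper name and the usage comment are ours.

```python
import math

def calibrate_S(B_mm, known_angle_deg):
    """Personalize the eyeball parameter S of Eq. (1) for one calibration point.

    B_mm            -- measured iris-centre displacement (mm) while the user
                       gazes at the calibration corner
    known_angle_deg -- known view angle of that corner (e.g. 20 or 25 deg)

    From theta = asin(B / S) it follows that S = B / sin(theta).
    """
    return B_mm / math.sin(math.radians(known_angle_deg))

# One S per display block, e.g.:
# S_blocks = {block: calibrate_S(B, angle)
#             for block, (B, angle) in corner_measurements.items()}
```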
4.3 Results of Eyegaze Detection
We compared the proposed method with the conventional algorithm of Ref. [12] under the same system conditions. Table 1 shows the average error of gaze detection and the processing times. The average error of the proposed method is 0.92 deg in the horizontal direction and 2.09 deg in the vertical direction. The average error of the proposed method is slightly larger than that of the conventional method, but its maximum error is smaller; the proposed method therefore realizes stable eyegaze detection. Next, we compared the processing times. In the parametric template matching, the initial sampling intervals were set to 8-15, and the processing times were averaged. As shown in Table 1, the processing time is drastically reduced. The proposed method is inferior to Ref. [12] in accuracy, but it has a clear advantage in processing time, which strongly influences user comfort when the eyegaze is applied to a human interface. Further speed-up of the proposed method is expected by using information between frames.

Table 1. Average error of gaze detection and processing times
Method               Horizontal   Vertical   Processing time
Ref. [12]            0.66 deg     1.05 deg   86.3 sec
Proposed method      0.92 deg     2.09 deg   0.69 sec
5 Application for Eyegaze Keyboard
We developed an eyegaze communication interface using the proposed method. A user operates the system by looking at rectangular keys displayed on the control screen. The Japanese syllabary is written on the visual keyboard, so simple word processing can be realized by looking at each key in turn. The size of a key is 5 deg; when the error exceeds 2.5 degrees, the detection fails. However, because a letter is detected quickly, the user can correct mistakes immediately and proceed with the work without stress. Figure 5 shows the eyegaze keyboard system.
Fig. 5. Eyegaze keyboard system
6 Conclusion
This paper has proposed a simple method for eyegaze detection. It is contact-free for the observer, requires no specific devices other than a monocular video camera, and needs neither a reference light nor infrared illumination. A rotation model of the eyeball was constructed, and we devised a simple, fast eyegaze detection algorithm based on iris extraction using the parametric template matching method. To verify the performance of the proposed method, an eyegaze detection experiment was performed. The average horizontal error was 0.92 deg and the average vertical error was 2.09 deg. Although the accuracy was slightly worse than that of the conventional algorithm, the processing time was reduced drastically, to about 1/125. Improving the accuracy is future work, and a head-free condition is required for a more comfortable system.
References
1. Gips, J., Olivieri, C.P., Tecce, J.J.: Direct control of the computer through electrodes placed around the eyes. In: Smith, M.J., Salvendy, G. (eds.) Proc. 5th Int. Conf. on Human Computer Interaction, Orlando, FL. Published in Human-Computer Interaction: Applications and Case Studies, pp. 630–635. Elsevier, Amsterdam (1993)
2. Talmi, K., Liu, J.: Eye and gaze tracking for visually controlled interactive stereoscopic displays. Signal Processing: Image Communication 14, 799–810 (1999)
3. Hutchinson, T.E., White, K.P., Martin, W.N., Reichert, K.C., Frey, L.A.: Human-computer interaction using eyegaze input. IEEE Trans. Systems, Man & Cybernetics 19(6), 1527–1534 (1989)
4. Ohno, T., Mukawa, N., Kawato, S.: Just Blink Your Eyes: A Head-Free Gaze Tracking System. In: Int. Conf. for Human-Computer Interaction, Florida, USA, pp. 950–951 (2003)
5. Cornsweet, T.N., Crane, H.D.: Accurate two-dimensional eye tracker using first and fourth Purkinje images. J. Opt. Soc. Am. 63(8), 921–928 (1973)
6. Kawato, S., Tetsutani, N.: Gaze Direction Estimation with a Single Camera Based on Four Reference Points and Three Calibration Images. In: Narayanan, P.J., Nayar, S.K., Shum, H-Y. (eds.) ACCV 2006. LNCS, vol. 3852, pp. 419–428. Springer, Heidelberg (2006)
7. Matsumoto, Y., Zelinsky, A.: An algorithm for real-time stereo vision implementation of head pose and gaze direction measurement. In: Proceedings of IEEE fourth Int. Conf. on Face and Gesture Recognition, pp. 499–505 (2000)
8. Wang, J., Sung, E.: Gaze determination via images of irises. Image and Vision Computing 19(12), 891–911 (2001)
9. Kim, K.-N., Ramakrishna, R.S.: Vision-based Eyegaze Tracking for Human Computer Interface. In: IEEE Int. Conf. on Systems, Man, and Cybernetics, vol. 2, pp. 324–329 (1999)
10. Hammal, Z., Massot, C., Bedoya, G., Caplier, A.: Eyes Segmentation Applied to Gaze Direction and Vigilance Estimation. In: Singh, S., Singh, M., Apte, C., Perner, P. (eds.) ICAPR 2005. LNCS, vol. 3687, pp. 236–246. Springer, Heidelberg (2005)
11. Benoit, A., Caplier, A., Bonnaud, L.: Gaze direction estimation tool based on head motion analysis or iris position estimation. In: Proc. EUSIPCO 2005, Antalya, Turkey (September 2005)
12. Ohtera, R., Horiuchi, T., Kotera, H.: Eye-gaze Detection from Monocular Camera Image Based on Physiological Eyeball Models. In: IWAIT2006. Proc. International Workshop on Advanced Image Technology, pp. 639–664 (2006)
13. Emsley, H.H.: Visual Optics, 5th edn. Hatton Press Ltd, London (1952)
14. Gullstrand, A.: Appendix II.3 The optical system of the eye, von Helmholtz H, Handbuch der physiologischen Optik (1909)
15. Tanaka, K., Sano, M., Ohara, S., Okudaira, M.: A parametric template method and its application to robust matching. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 620–627 (2000)
An FPGA-Based Smart Camera for Gesture Recognition in HCI Applications Yu Shi and Timothy Tsui National ICT Australia, Bay 15 Australian Technology Park, Sydney, NSW 1430, Australia
[email protected]
Abstract. A smart camera is a camera that can not only see but also think and act. It is an embedded vision system that captures and processes images to extract application-specific information in real time. The brain of a smart camera is a special processing module that performs application-specific information processing. Designing a smart camera as an embedded system is challenging because video processing has an insatiable demand for performance and power, while at the same time embedded systems place considerable constraints on the design. We present our work to develop GestureCam, an FPGA-based smart camera built from scratch that can recognize simple hand gestures. The first completed version of GestureCam has shown promising real-time performance and is being tested in several desktop HCI (Human-Computer Interface) applications.
Keywords: Intelligent Systems, Human-Computer Interaction, Embedded Systems, Computer Vision, Pattern Recognition.
1 Introduction
Broadly speaking, a smart camera can be defined as a vision system whose primary function is to produce a high-level understanding of the imaged scene and generate application-specific data to be used in an autonomous and intelligent system. A smart camera is 'smart' because it contains a processing unit that performs application-specific information processing (ASIP). The primary goal of ASIP is to extract information from the captured images that is useful to an application. For example, a motion-triggered surveillance camera captures video of a scene, detects motion in the region of interest, and raises an alarm when the detected motion satisfies certain criteria; in this case, the ASIP is motion detection and alarm generation. Strictly speaking, a smart camera is a stand-alone, self-contained embedded system that integrates image sensing, ASIP, and communications in one single box. However, there are other types of vision systems that are often referred to as smart cameras as well, such as PC-based smart cameras. In a PC-based smart camera, the camera is a general-purpose camera such as a webcam or CCTV camera, with the video output connected to a PC through USB, Ethernet, FireWire, or another protocol. This kind of configuration has a few disadvantages. For example, a general-purpose PC is usually not suited to intensive image processing of high-resolution, high-frame-rate video streams. In addition, the bandwidth
requirement for the camera-PC link is very high. A smart camera can greatly simplify application system design because there is no need for an extra PC to perform image processing tasks. What is more, the embedded processing unit inside the smart camera is a much better way to process images at high resolution and high frame rate in real time. The output from the smart camera requires very low bandwidth, because only image features or a high-level description of the imaged scene needs to be transferred to a central control computer. Smart cameras have many applications, such as video surveillance, security, machine vision, and human-computer interaction.
A multimodal user interface (MMUI) allows a user to interact with a computer using his or her natural communication modalities, such as speech, pen, touch, gestures, eye gaze, and facial expression, just as in human-to-human communication. Gesture recognition is an important part of an MMUI system. Compared to glove- and tracker-based gesture recognition, vision-based gesture recognition, which uses cameras and computer vision techniques, is more flexible, portable, and affordable. However, vision-based gesture recognition is not a trivial task, especially when built as an embedded system.
Building smart cameras as embedded systems has been a hot R&D topic in recent years, and in particular there has been research on building smart cameras that can recognize gestures. Wolf et al. in [1] described a VLSI-based smart camera for gesture recognition; they used a commercial general-purpose camera that provides analogue output to a VLSI video processing board inserted into a PC. Bonato et al. in [2] presented the design of an FPGA-based smart camera that can perform gesture recognition in real time for a mobile-robot application; they used a CMOS camera as the capture device, which provides a processed digital image output to their FPGA. Wilson et al. in [3] designed a system allowing a user to control Windows applications using gestures; their system uses a pair of general-purpose cameras.
In this paper, we present the design and development of a smart camera, called GestureCam, which can perform simple hand gesture recognition. GestureCam is built from scratch; that is, it is not based on a commercial camera that provides processed analogue or digital outputs. Rather, the image capture part of GestureCam is custom-built around a CMOS image sensor chip, which allows us to apply our own color and image pre-processing algorithms to the raw video output (Bayer pattern) of the image sensor. This gives us the opportunity to feed low-noise, better-quality data into gesture recognition. In the following sections, we describe the GestureCam design process and architecture, followed by discussions on algorithm development and implementation issues, especially for the contour tracing and gesture classification algorithms. Finally, we present our work in progress on applying GestureCam to GestureBrowser, a tool that enables a person to use hand gestures to control a web browser.
2 GestureCam Design and Implementation
2.1 Design Process
To design GestureCam, we followed the process below, which we believe is appropriate to the design of smart cameras as embedded systems [4].
• Step one: Application Requirements Specification. Correct specifications can shorten the design and development cycle, provide clear targets for algorithm and hardware performance, and reduce total cost.
• Step two: System Architecture Design. Software and hardware architectures are defined based on performance, time-to-delivery, and cost criteria. Algorithmic design and timing design suitable to the targeted hardware platform also need to be defined. The mapping between algorithm requirements and hardware resources is an important issue. For the hardware architecture, a heterogeneous, multiple-processor architecture can be ideal for smart camera development.
• Step three: Proof-of-Concept. This may use a PC platform for research and algorithm development. Usually a general-purpose camera (e.g., a webcam) is used at this stage. In the next section we describe our proof-of-concept work using a webcam and a PC.
• Step four: Algorithm Conversion. This is necessary because algorithm development for embedded systems is quite different from that for PC-based platforms. It can be far more demanding and challenging, especially if FPGA or ASIC processors are targeted. Converting floating-point arithmetic to fixed-point, eliminating divisions as much as possible (e.g., by using hardware multipliers and look-up tables), and taking low-power and low-complexity requirements into account are other design considerations for algorithm conversion.
• Step five: Integration and Debugging. This results in a prototype smart camera using an embedded hardware platform running embedded versions of the algorithms. It can be a time-consuming process and sometimes requires adjustments to the algorithms, software, and hardware architecture.
• Step six: Test and Evaluation. Test camera performance in a realistic environment, identify potential problems and possible improvements, and benchmark camera performance against the initial application requirement specifications.
2.2 Vision Based Gesture Recognition
The main steps typically involved in vision-based gesture recognition are image pre-processing, object segmentation, feature extraction, tracking, and gesture classification. Before we built GestureCam, we went through a proof-of-concept stage (Step 3). Specifically, we used a Logitech QuickCam Pro 4000 and a PC to implement and test core modules such as object segmentation, feature extraction, and gesture classification. For object segmentation, we applied skin color detection and contour tracing techniques to segment the hand from the background. For feature extraction, we calculated the center of mass of the segmented hand. For gesture classification, we applied a simple non-trainable neural network (with hard-coded weightings); trajectory-based classification was also tested. The QuickCam- and PC-based proof-of-concept successfully validated all core algorithms for GestureCam and was used in a speech- and gesture-based multimodal interface built for a traffic incident management application [5]. However, building a smart camera as an embedded system is very different from building a PC-based system: there are many design considerations and implementation issues pertinent to the chosen hardware processing architecture and design specifications. These are discussed below.
2.3 GestureCam System Architecture
GestureCam consists mainly of three parts: an image capture unit (ICU), an FPGA-based gesture recognition unit (GRU), and a host and display unit (HDU). The ICU includes a small in-house-built PCB on which sits a megapixel CMOS color image sensor, the OmniVision OV9620. The PCB is fitted into a dummy camera casing, which provides easy connection to a 2/3" format video lens from Computar. The OV9620 provides full SXGA (1280×1024) Bayer-pattern video output at 15 frames per second, and VGA (640×480) resolution at 30 frames per second. A Xilinx Virtex-II Pro FPGA development kit from Memec was chosen to form the GRU. This kit is a powerful yet flexible development platform for imaging applications: the Virtex-II Pro 2VP30 FPGA comes with over 2 million system gates, 2 on-chip embedded PowerPC cores, and over 2 MB of on-chip RAM. All image and video processing algorithms, in addition to the image sensor interface and display interface, are implemented in the Virtex-II Pro FPGA without using off-chip memory. All implementation was done in VHDL; the main programming tools are Xilinx ISE 7.1, EDK 7.1, and ChipScope Pro 7.1. The motivation for adopting an FPGA as a development
Fig. 1. (a) GestureCam development platform. (b) A programmer working on the platform.
Fig. 2. Processing chain of GestureCam implemented in the FPGA
platform is that an FPGA is a far better computing platform than a PC for performing data-intensive video processing tasks with real-time requirements. In addition, no high-bandwidth link between camera and PC is needed, because the output from the camera can be as simple as an index of a gesture in a pre-defined gesture database. A picture of the development platform is shown in Figure 1(a); Figure 1(b) shows one of the authors working on the development platform.
Figure 2 shows the software architecture of GestureCam. The CMOS image sensor outputs Bayer-pattern images, meaning that each pixel carries only one 8-bit color value. For each frame, a color interpolation algorithm developed in-house produces a real-world color image of the scene. A skin color detection algorithm is then applied to the color image to extract skin-colored pixels from the background, and a low-pass filter smooths out isolated in-color noise to produce a cleaner skin image. A contour tracing algorithm is then called to extract the hand contour coordinates for feature extraction, which calculates the center of mass (CoM) of the hand contour. Lastly, a neural-network-based gesture recognition stage analyzes the temporal trajectory of the hand CoM and classifies the hand gesture based on probability. The contour tracing and gesture classification are described in more detail in the following section. For the first version of GestureCam, we decided to work at half-VGA resolution, that is, 320×240 pixels, so that the RAM on the FPGA chip is big enough for all frame and line buffering requirements and there is no need to use off-chip SDRAM. Later the design can be scaled up to accommodate the full capture resolution.
2.4 Contour Tracing and Gesture Classification for GestureCam
2.4.1 Contour Tracing
There are two well-known families of contour tracing methods: chain code and row-wise representations. In chain code methods, the immediate pixels surrounding the current edge pixel, often called the 8-neighbourhood, are searched. In contrast, row-wise representations analyze the interface between two adjacent rows of pixels and infer a contour from it by looking at the connection between two edge pixels. Row-wise representations offer better performance [6] because tracing can begin as soon as two rows of pixel data are available to the algorithm, but this comes at the cost of increased design complexity. With chain code methods, the entire frame needs to be available before tracing can occur, giving rise to poorer performance since tracing starts much later. The initial algorithms researched for GestureCam involved chain code: given the small resolution of 320×240 pixels at which GestureCam operates, it was not necessary at this stage to design a complex contour tracing module such as a row-wise scheme. The chain code algorithms investigated were the square-tracing algorithm [7], the Moore-neighbourhood algorithm [7], Rhee et al.'s algorithm [12] based on the External Boundary Tracing Algorithm, and the Inner Boundary Tracing Algorithm (IBTA) [8]. Of these, IBTA was believed to be the most suitable. One shortcoming of square-tracing is its restriction to strictly 8-connected images. Although similar to IBTA, the Moore-neighbourhood method searches for the next edge pixel in a brute-force manner, whereas IBTA searches for the next edge pixel based on a mathematical
rule, optimizing the chances of detecting an edge pixel; Rhee et al.'s algorithm has a similar shortfall. The tracing algorithm used by GestureCam is based on IBTA with several notable modifications, which involve the method of finding the first pixel to begin tracing and the stopping criterion. In the IBTA algorithm, the first pixel is found by searching from the top left of the frame and continuing in a raster direction, left to right and top to bottom. The implementation in GestureCam differs by locating the brightest pixel of a frame and hopping in the negative x-direction until the pixel value falls below a certain threshold; this is the left-most edge of the object, and it is where tracing begins. This method decreases the chance of noisy pixels being traced and hence improves the chance of tracing the hand, which is the object of interest. The IBTA algorithm includes a stopping criterion that can trace objects with one-pixel-wide segments. This may be a critical feature for applications such as medical imaging, in which the utmost precision is required to ensure proper diagnosis, but GestureCam does not require such fine-tuned precision, so we have replaced it with a simpler criterion: stop tracing when the pixel at which tracing began is encountered a second time. Figure 3 shows the flow diagram for tracing one frame using the modified IBTA algorithm. Data from the filtering module are passed pixel by pixel to the pre-processing state, where the brightest pixel is found by updating its location until the whole frame has been traversed. Because the data from the filtering module are constantly updating, the pixel data are copied to a memory buffer so that the IBTA algorithm can trace a static frame. The algorithm itself is robust enough to handle any shape of the hand. Inner boundaries are not traced, which is a property inherent in the IBTA algorithm, but since the image moments require only the outer boundary, this does not present any problems.
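A software sketch of the modified starting rule and stopping criterion described above is given below (our own illustration, not the GestureCam VHDL); for brevity the next-edge-pixel search is a plain Moore-style 8-neighbourhood sweep rather than the IBTA rule, and all names are ours.

```python
import numpy as np

def trace_contour(mask, threshold=1):
    """Trace the outer contour of the bright object in a (skin) mask.

    Starting rule: take the brightest pixel of the frame and hop in the
    negative x-direction until the pixel falls below `threshold`; tracing
    begins at the left-most edge pixel found this way.
    Stopping rule: stop when the starting pixel is reached a second time.
    """
    ys, xs = np.nonzero(mask == mask.max())
    y0, x0 = int(ys[0]), int(xs[0])
    while x0 > 0 and mask[y0, x0 - 1] >= threshold:
        x0 -= 1                                   # hop to the left-most edge
    # 8-neighbourhood in clockwise order, starting from "west"
    nbrs = [(0, -1), (-1, -1), (-1, 0), (-1, 1),
            (0, 1), (1, 1), (1, 0), (1, -1)]
    contour = [(y0, x0)]
    y, x = y0, x0
    prev_dir = 4   # pretend we entered moving east, from the background on the west
    while True:
        for k in range(8):
            d = (prev_dir + 5 + k) % 8            # resume just past the backtrack
            dy, dx = nbrs[d]
            ny, nx = y + dy, x + dx
            if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                    and mask[ny, nx] >= threshold):
                y, x, prev_dir = ny, nx, d
                break
        else:
            break                                 # isolated pixel, nothing to trace
        if (y, x) == (y0, x0):
            break                                 # start pixel met a second time
        contour.append((y, x))
    return contour
```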
Fig. 3. The flow diagram for tracing one frame using the modified IBTA algorithm
2.4.2 Gesture Classification
In GestureCam's first prototype, classification utilizes the image moments from the feature extraction module. In any given time interval, we can collect a group of image moments and hence build up the trajectory of a user executing a moving hand gesture. GestureCam collects 15 image moments, which corresponds to approximately 1 second of input data. A total of 8 gestures are recognizable; they are shown in Figure 4.
Fig. 4. Gestures (“left”, “right”, “up”, “down”, “curved right”, “curved left”, “curved down”, “curved up”) that are recognized by GestureCam – arrow tail indicates start of gesture and arrow head indicates end of gesture
The field of classification has been extensively researched over the years and has provided sound techniques such as neural networks, Hidden Markov Models, and various other statistical methods. We achieved classification by implementing a neural network as well as another technique which we call 'trajectory-based'. While the concept is not new, 'trajectory-based methods' is a term coined by the authors; it reflects the fact that rules are inferred from the trajectory itself. These rules may be based on the general shape, curve, or features of the trajectory, but they should be distinguishable from one another so that they do not cause ambiguity. The rules derived from the trajectory are hard-coded in VHDL. A preliminary literature review revealed that neural network designs with a hardware focus predominantly classify 'stationary' gestures, in which the positions of fingers and hands are recognized. Bonato et al. used a RAM-based neural net to control the direction and speed of a mobile robot using stationary gestures [2]. Gruenstein compared two methods, the condensation algorithm alongside a time-delayed neural network [9]; the neural net they investigated was essentially based on the back-propagation algorithm, in which each layer of inputs corresponds to a different time interval.
In GestureCam, the input data to classification are the image moments obtained from the feature extraction module. To build up a trajectory, we collect 15 successive image moments and store them in block RAM for display on a monitor; gesture recognition itself is processed as soon as new image moments are received at the input, so no additional storage is necessary. As each address of the memory buffer corresponds directly to a pixel on the screen, the image moments are normally stored as 17-bit addresses. The conversion from memory address to input data appropriate for the gesture recognition techniques is based on the work of Kinder et al. [11]. They argue that by separating a trajectory into segments of 2 or more pieces, called tuples, position-invariant and error-tolerant recognition can be achieved. For each tuple, a set of features is obtained, such as x-component, y-component, gradient, and angle. In GestureCam, we define a tuple to be any 2 successive image moments and obtain 3 features: x-component, y-component, and gradient; these form the input for both the neural network and the trajectory-based methods. GestureCam analyses 14 tuples individually to build up a collective decision when classifying gestures. We use the neural network to detect the "UP", "DOWN", "LEFT" and "RIGHT" gestures, whilst for the curved "UP", "DOWN", "LEFT" and "RIGHT" gestures we use the trajectory-based methods.
The neural network is composed of 3 inputs, a hidden layer of 2 nodes, and an output node. To understand the operation of this neural network, let us consider the
“LEFT” or “RIGHT” gesture in which the user waves his hand from right to the left and vice versa. Given the feature set derived from the tuple, we can infer that the ycomponent will not deviate from start to finish, thus we impose a threshold of 10 pixels. Another inference is that the gradient will be mostly level, thus we impose another threshold that the gradient can not exceed 0.5. Since both properties have to be met, there is a third and final node to check that both rules are met. This is illustrated in Figure 5. Next, we can determine whether the gesture is “LEFT” or “RIGHT” by noting the sign of the x-component - if it is negative, then the direction is “LEFT”, otherwise it is “RIGHT”. For gestures “UP” and “DOWN”, we apply similar principles. The result of the neural network consists of whether the gesture was detected, in which case the output would be ‘1’, or not detected where the output would be ‘0’. Using this neural network as a basis, we can alter the weights to suit the requirements for “LEFT”, “RIGHT”, “UP” and “DOWN” gestures, making it highly adaptable and easy to adjust.
Fig. 5. Neural network for “left” and “right” gesture recognition. Fig 5a shows the decision boundaries for the “LEFT” or “RIGHT” gesture. Fig 5b shows the neural network with hard-coded weights. Inputs of 1 denote bias weights.
3 Results and Application Development
The first version of GestureCam has been completed. The prototype has shown good real-time performance, recognizing all 8 planned hand gestures, and the implemented algorithm is robust enough to handle any shape of the hand. We applied GestureCam in an application called GestureBrowser, which allows common web browsing operations, such as following hyperlinks, traversing the browsing history, or scrolling up and down pages, to be performed using hand or head gestures. GestureBrowser is implemented as a regular extension to the Mozilla Firefox web browser, visible as a new toolbar; it connects to the GestureCam through a UDP connection. The GestureCam acts as a gesture event server, delivering gesture and tracking information according to a proprietary protocol. The toolbar allows the user to enter the address and port of the GestureCam and then initiates the connection with a handshake. The server then sends packets containing information about the GestureCam identity (potentially allowing multiple cameras to be used at the same time), a timestamp
(mainly for late-packet ordering if required), and a set of parameters about the current gesture: the 2D tracking position and the type of gesture (simple tracking versus user-defined gesture).
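The actual GestureCam wire format is proprietary and not specified here, so the following sketch only illustrates a plausible packet carrying the fields listed above (camera identity, timestamp, 2D position, gesture type); every field name, field size, and the port number are our own assumptions.

```python
import socket
import struct

# Hypothetical packet layout for the fields listed above:
#   camera_id : uint16   timestamp_ms : uint32
#   x, y      : int16    gesture_type : uint8 (0 = tracking, >0 = gesture id)
PACKET_FMT = "!HIhhB"

def parse_gesture_packet(data: bytes):
    cam_id, ts_ms, x, y, gesture = struct.unpack(PACKET_FMT, data)
    return {"camera": cam_id, "timestamp_ms": ts_ms,
            "position": (x, y), "gesture": gesture}

def listen(port=5005):
    """Receive gesture events over UDP, as the browser toolbar might."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    while True:
        data, _addr = sock.recvfrom(64)
        if len(data) == struct.calcsize(PACKET_FMT):
            yield parse_gesture_packet(data)

# for event in listen():
#     print(event)
```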
4 Conclusion and Future Work
We have developed a smart camera that can recognize simple hand gestures. The camera was built using a single FPGA chip as the processing device. The first prototype of GestureCam has shown very promising real-time performance and recognition rates, and it is being tested in a real-world application, GestureBrowser. In future work, we plan to continue improving various aspects of GestureCam through the GestureBrowser application; in particular, we will use an off-chip SDRAM to allow us to process images of higher resolution, which should improve the performance of skin color detection, a step that is critical to good hand tracking. Initially, a trainable neural network using the back-propagation algorithm was intended for GestureCam. However, robust classification success rates normally require training data on the order of hundreds of samples, and because GestureCam had no training data of gestures, it was difficult to achieve acceptable performance with the back-propagation algorithm. In its present state, the neural network is based on thresholding, which limits the complexity a gesture can have. Using the back-propagation algorithm and given ample training data, GestureCam could recognize a plethora of non-linear gestures, which would replace the current classification scheme. A major advantage for the programmer is that they need not define rules for every gesture and convert them to VHDL; instead, the back-propagation algorithm automatically adjusts the decision boundaries until all training data are correctly classified. The benefits include classification of a wide range of complex gestures and improved accuracy.
References 1. Wolf, W., Ozer, B., Lv, T.: Smart Cameras as Embedded Systems. IEEE Computer 35(9), 48–53 (2002) 2. Bonato, V., Sanches, A., Fernandes, M., Cardoso, J., Simoes, E., Marques, E.: A Real Time Gesture Recognition System for Mobile Robots. In: International Conference on Informatics in Control, Automation, and Robotics, August 25-28 2004, Setúbal, Portugal, pp. 207–214. INSTICC (2004) 3. Wilson, A., Oliver, N.: Gwindows: Robust Stereo Vision for Gesture-Based Control of Windows. In: Proceedings of the International Conference on Multimodal Interaction, November 5–7, 2003, Vancouver, British Columbia, Canada (2003) 4. Shi, Y., Taib, R., Lichman, S.: GestureCam: A Smart Camera for Gesture Recognition and Gesture-Controlled Web Navigation. In: Proc. of ICARCV 2006, ICARCV, Singapore (December 2006) 5. Chen, F., Choi, E., Epps, J., Lichman, S., Ruiz, N., Shi, Y., Taib, R., Wu, M.: A Study of Manual Gesture-Based Selection for the PEMMI Multimodal Transport Management Interface. In: Proc. ICMI 2005, pp. 274–281 (2005)
6. Miyatake, T., Matsushima, H., Ejiri, M.: Contour representation of binary images using run-type direction codes. Machine Vision and Applications 70(2), 239–284 (1997) 7. Ghuneim: Contour Tracing (August 2006), http://www.imageprocessingplace.com/DIP/ dip_downloads/tutorials/contour_tracing_Abeer_George_Ghuneim/index.html 8. Sonka, M., Hlavac, V., Boyle, R.: Image processing, analysis and machine vision, 2nd edn. Brooks Cole (1998) 9. Gruenstein, A.: Two Methods of Gesture Recognition (March 2002) 10. Gose, E., Johnsonbaugh, R., Jost, S.: Pattern recognition and image analysis. Prentice Hall, PTR (1996) 11. Kinder, M., Brauer, W.: Classification of Trajectories – Extracting Invariants with a Neural Network. Neural Networks 7, 1011–1017 (1993) 12. Rhee, P.K., La, C.W.: Boundary Extraction of Moving Objects From Image Sequence. In: IEEE TENCON (1999)
Color Constancy Via Convex Kernel Optimization Xiaotong Yuan, Stan Z. Li, and Ran He Center for Biometrics and Security Research, Institute of Automation, Chinese Academy of Sciences, Beijing 100080, China
Abstract. This paper introduces a novel convex kernel based method for color constancy computation with explicit illuminant parameter estimation. A simple linear render model is adopted, and the illuminants in a new scene that contains some of the color surfaces seen in the training image are sequentially estimated in a global optimization framework. The proposed method is fully data-driven and initialization invariant. Nonlinear color constancy can also be solved approximately in this kernel optimization framework under a piecewise linear assumption. Extensive experiments on real-scene images validate the practical performance of our method.
1 Introduction
Color is an important feature for many machine vision tasks such as segmentation [8], object recognition [13], and surveillance [4]. However, light sources, shadows, transducer non-linearities, and camera processing (such as auto-gain-control and color balancing) can all affect the apparent color of a surface. Color constancy algorithms attempt to estimate these photic parameters and compensate for their contribution to image appearance. There is a large body of work in the color constancy literature. A common approach is to use linear models of reflectance and illuminant spectra [9]. The gray world algorithm [1] assumes that the average reflectance of all the surfaces in a scene is gray. The white world algorithm [5] assumes that the brightest pixel corresponds to a scene point with maximal reflectance. Another widely used technique is to estimate the relative illuminant, or mapping of colors under an unknown illuminant to a canonical one. Color gamut mapping [3] uses the convex hull of all achievable RGB values to represent an illuminant, and the intersection of the mappings for each pixel in an image is used to choose a "best" mapping. In [14], a back-propagation multi-layer neural network is trained to estimate the parameters of a linear color mapping. In [6], a Bayesian estimation scheme is introduced to integrate prior knowledge, e.g., lighting and object classes, into a bilinear likelihood model motivated by the physics of image formation and sensor error. Linear subspace learning is used in [12] to develop the color eigenflows method for modeling joint illuminant change. This linear model uses no prior knowledge of lighting conditions and surface reflectance and does not need to be re-estimated for new objects or scenes; however, the demand for a large training set and a rigorous pixel-wise correspondence between training and test images limits its application.
In this work, we build our color constancy study on linear transformation parameter estimation. Recently, [8] presented a diagonal rendering model for an outdoor color classification problem. Only one image containing the color samples under a certain "canonical" illuminant is needed to train the Gaussian classifiers, and the trained colors seen under different illuminations can then be robustly recognized via MAP estimation. Because it requires less training data, we adopt this diagonal render model as the base model for our study. The main difference between our solution and that of [8] lies in the definition of the objective function and the associated optimization method. In [8], the image likelihood and model priors are integrated into a MAP formulation and locally optimized with the EM algorithm. This algorithm works well when all the render matrices are properly initialized; however, such initializations are not always available or accurate in practice. In this paper, we propose a novel convex kernel based criterion function to measure the color compensation accuracy in a new scene. A sequential global mode-seeking framework is then developed for parameter estimation. The optimization procedure includes the following three key steps:
– A two-step iterative algorithm derived by half-quadratic optimization is used to find a local maximum.
– A multi-bandwidth method is then used to locate the global maximum by gradually decreasing the bandwidth from an estimated uni-mode-promising bandwidth.
– A well-designed adaptive re-sampling mechanism is adopted, and the above multi-bandwidth method is repeated until the desired number of peak modes is found.
The peak modes obtained in this procedure may be naturally viewed as transformation vectors for the apparent illuminants in the scene. Our convex kernel based method is fully data-driven and initialization invariant. These good numerical properties also lead to our solution for the nonlinear color constancy problem based on the current linear model: we make a piecewise linear assumption to approximate the general nonlinear case, and our method can automatically find the transformation vectors for each linear piece. Local optimization methods, such as the EM-based method in [8], can hardly achieve this goal in practice because of their initialization dependency. Some results achieved by our method will be reported.
The remainder of this paper is organized as follows. In Section 2, we model color constancy as a linear mapping and estimate the parameters via multi-bandwidth kernel optimization in a fully data-driven way. In Section 3, we show experimental results that validate the numerical superiority of our method over that of [8]. We conclude the paper in Section 4.
2 Problem Formulation
Because it requires fewer training samples, we adopt the linear render model stated in [8] as the base model for our color constancy study. The key assumptions are:
– One hand-labeled image is available for training the class-conditional color distributions under the "canonical" illuminant.
– The class-conditional color surface likelihood under the canonical illuminant is a Gaussian density with mean μ_j and covariance Σ_j.
– The illuminant-induced color transformation from the test image to the training image can be modeled as F(C_i) = C_i d, where d = (d_1, d_2, d_3)ᵀ is the color render vector to be estimated and C_i = diag(r_i, g_i, b_i) is a diagonal matrix that stores the observed RGB colors of pixel i in the test image.
Suppose we have trained S color surfaces with distributions y_j ∼ N(μ_j, Σ_j), j = 1, ..., S. Assume also that we are given a test image with N pixels C_i, i = 1, ..., N, which contains L illuminants linearly parameterized by vectors d_l, l = 1, ..., L. Our goal is to estimate the optimal d_l from the image data and then obtain the assignments of surface class labels j(i) and illuminant type labels l(i) for each pixel i according to

(j(i), l(i)) = arg min_{j,l} dist(C_i d_l, y_j)    (1)
where dist(·) is a properly selected distance metric (the Mahalanobis distance in this work).
2.1 Kernel Based Objective Function
To estimate the optimal transformation vectors d_l, we propose to find the L peak modes of the following kernel sum function:

f̂_k(d) = Σ_{i=1}^{N} Σ_{j=1}^{S} w_ij k( M²(C_i d, μ_j, η² Σ_j) )    (2)
where k(·) is the kernel profile function [2] (see Sect. 2.2 for a detailed description), M²(C_i d, μ_j, η² Σ_j) = (C_i d − μ_j)ᵀ (η² Σ_j)⁻¹ (C_i d − μ_j) is the Mahalanobis distance from the compensated color C_i d to the training color surface mean y_j, and w_ij is the prior weight for pixel i belonging to color surface j. The larger function (2) is, the better the test image is compensated by the vector d. In the following subsections 2.2 to 2.4, we focus on the optimization issues and develop a highly efficient sequential mechanism to find the desired L peak modes of (2) as the optimal d_l.
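For concreteness, the objective of Eq. (2) can be evaluated as in the sketch below (our own code, not the authors'); the Gaussian profile k(x) = exp(−x/2) used later in Example 1 is taken as the default, and all variable names are ours.

```python
import numpy as np

def kernel_objective(d, C, mu, Sigma_inv, w, eta=1.0, k=lambda x: np.exp(-x / 2)):
    """Evaluate the kernel sum objective f_k(d) of Eq. (2).

    d         -- candidate render vector, shape (3,)
    C         -- per-pixel diagonal colour matrices, shape (N, 3, 3)
    mu        -- trained surface means, shape (S, 3)
    Sigma_inv -- inverses of the trained covariances, shape (S, 3, 3)
    w         -- prior weights w_ij, shape (N, S)
    eta       -- kernel bandwidth
    k         -- kernel profile (Gaussian profile by default)
    """
    x = C @ d                               # compensated colours C_i d, shape (N, 3)
    total = 0.0
    for j in range(mu.shape[0]):
        diff = x - mu[j]                    # (N, 3)
        m2 = np.einsum("ni,ij,nj->n", diff, Sigma_inv[j], diff) / eta ** 2
        total += np.sum(w[:, j] * k(m2))
    return total
```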
2.2 Half Quadratic Optimization
In this section, we use the half-quadratic technique [10] to optimize objective function (2). The results follow directly from standard material in convex analysis (e.g., [10]), and we omit the technical proofs for lack of space. The conditions we impose on the kernel profile k(·) are summarized below:
1. k(x) is a continuous, monotonically decreasing, and strictly convex function.
2. lim_{x→0+} k(x) = β > 0 and lim_{x→+∞} k(x) = 0.
3. lim_{x→0+} k′(x) = −γ < 0, lim_{x→+∞} k′(x) = 0, and lim_{x→+∞} (−x k′(x)) = α < β.
4. k″(x) is continuous except at finitely many points.
The following Theorem 1 forms the basis for optimizing function (2) in a half-quadratic way.

Theorem 1. Let k(·) be a profile satisfying all the above conditions. Then there exists a strictly monotonically increasing concave function ϕ : (0, γ) → (α, β) such that

k(M²(C_i d, μ_j, η²Σ_j)) = sup_p (−p M²(C_i d, μ_j, η²Σ_j) + ϕ(p))
and for a fixed d, the supremum is reached at p = −k′(M²(C_i d, μ_j, η²Σ_j)). To further study criterion (2), we introduce a new objective function F̂_η : R³ × (0, γ)^{N×S} → (0, +∞):

F̂_η(d, p) = Σ_{i=1}^{N} Σ_{j=1}^{S} w_{ij} (−p_{ij} M²(C_i d, μ_j, η²Σ_j) + ϕ(p_{ij}))    (3)

where p = (p_{11}, ..., p_{NS}). According to Theorem 1, we get f̂_k(d) = sup_p F̂_η(d, p).
It is straightforward to see that

max_d f̂_k(d) = max_d sup_p F̂_η(d, p)    (4)
From (4) we see that maximizing f̂_k(d) is equivalent to maximizing F̂_η(d, p), which is quadratic w.r.t. d when p is fixed. We propose a strategy based on alternate maximization over d and p as follows (the superscript l denotes the time stamp):

p_{ij}^l = −k′(M²(C_i d^{l−1}, μ_j, η²Σ_j))    (5)

d^l = [ Σ_{i=1}^{N} Σ_{j=1}^{S} w_{ij} p_{ij}^l C_i^T Σ_j^{-1} C_i ]^{-1} [ Σ_{i=1}^{N} Σ_{j=1}^{S} w_{ij} p_{ij}^l C_i^T Σ_j^{-1} μ_j ]    (6)
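A minimal sketch (ours) of one alternate-maximization sweep, again assuming the Gaussian profile for which −k′(x) = ½ e^{−x/2}; since C_i = diag(c_i) with c_i the observed RGB color of pixel i, the matrix products in (6) reduce to elementwise operations.

```python
import numpy as np

def half_quadratic_step(d, pixels, w, mus, sigmas, eta):
    """One sweep of the alternate maximization (5)-(6) under the Gaussian profile."""
    inv_eta = np.linalg.inv(eta**2 * sigmas)        # (eta^2 Sigma_j)^{-1}
    inv_sig = np.linalg.inv(sigmas)                 # Sigma_j^{-1}
    comp = pixels * d                               # compensated colors C_i d
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for j in range(len(mus)):
        diff = comp - mus[j]
        m2 = np.einsum('ni,ij,nj->n', diff, inv_eta[j], diff)
        coef = w[:, j] * 0.5 * np.exp(-0.5 * m2)    # w_ij * p_ij, update (5)
        # C_i^T S^-1 C_i = S^-1 * (c_i c_i^T) elementwise; C_i^T S^-1 mu = c_i * (S^-1 mu)
        A += inv_sig[j] * np.einsum('n,na,nb->ab', coef, pixels, pixels)
        b += (inv_sig[j] @ mus[j]) * (coef[:, None] * pixels).sum(axis=0)
    return np.linalg.solve(A, b)                    # update (6)
```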
2.3 Global Mode-Seeking
Since the above two-step iterations (5) and (6) are essentially a gradient ascent method, they are only guaranteed to converge to a local maximum. In this section, we first derive Proposition 1, which indicates that if the bandwidth parameter η is large enough, then the criterion function (2) is strictly concave and hence uni-modal. We then develop a global peak-mode-seeking method based on this proposition to find the transformation vector d that best compensates the illuminant in the test image.
Proposition 1. A sufficient condition for F̂_η(d, p) to be uni-modal is

η > Const · ( 2 sup_v (−k″(v)/k′(v)) )^{1/2}    (7)

where Const = max{ √(M²(x, μ_j, Σ_j)) | x ∈ [0, 255]³, j = 1, ..., S }. The proof follows from straightforward derivative calculations. We give below an example profile to further clarify Proposition 1.
Example 1 (Gaussian profile). k(x) = e^{−x/2}. Then k′(x) = −(1/2) e^{−x/2}, k″(x) = (1/4) e^{−x/2}, and sup_x (−k″(x)/k′(x)) = 1/2. By Proposition 1, the uni-mode-promising bandwidth can be selected according to η > Const. In addition, the dual variable function in Theorem 1 is ϕ(p) = 2p − 2p ln 2p.
From Proposition 1 we can tell that if η is large enough, then from any initial estimation the two-step iteration algorithm presented in (5) and (6) will converge to the unique maximizer of the over-smoothed density function. When this unique maximizer is reached, we may decrease the value of η and run the same iterations again, taking the previous maximizers as initializations. This procedure is repeated until a certain termination condition is met (e.g., the convergence error is small enough). The final maximizer is very likely to be the global peak mode of the criterion function, since such a numerical procedure is actually deterministic annealing [7]. See Algorithm 1 for a formal description of this optimization procedure. We have noticed that this global peak-mode-seeking mechanism is similar to what is called annealed mean shift in [11], which aims to find the global kernel density mode. The key improvement is that we give an explicit bound on the uni-mode-promising bandwidth, which makes the algorithm more practical to use.
Algorithm 1. Global Transformation Vector Seeking
1: m ← 0, initialize η_m satisfying the condition of Proposition 1
2: Randomly initialize d
3: while the termination condition is not met do
4:   Run iterations (5) and (6) until convergence.
5:   m ← m + 1
6:   η_m ← η_{m−1} · ρ
7:   Initialize d and p with the maximizers obtained in step 4.
8: end while
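The following sketch (ours) wires the half-quadratic sweep into the bandwidth-annealing loop of Algorithm 1; `half_quadratic_step` refers to the sketch given after update (6), and the number of annealing levels, inner iterations and tolerance are arbitrary choices of ours.

```python
import numpy as np

def global_transformation_vector(pixels, w, mus, sigmas, eta0, rho=0.5,
                                 n_anneal=10, n_inner=50, tol=1e-6):
    """Sketch of Algorithm 1: anneal the bandwidth from a uni-mode-promising
    value eta0 (Proposition 1) down by factor rho, re-running the
    half-quadratic iterations at each level from the previous maximizer."""
    rng = np.random.default_rng(0)
    d = rng.uniform(0.2, 3.0, size=3)             # arbitrary random initialization
    eta = eta0
    for _ in range(n_anneal):
        for _ in range(n_inner):                   # run (5)-(6) until convergence
            d_new = half_quadratic_step(d, pixels, w, mus, sigmas, eta)
            if np.linalg.norm(d_new - d) < tol:
                d = d_new
                break
            d = d_new
        eta *= rho                                 # eta_m <- eta_{m-1} * rho
    return d
```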
In the following subsections, we let d* and p* denote the convergent points reached in Algorithm 1, and η* the corresponding bandwidth. We call the global maximizer d* reached in Algorithm 1 the global transformation vector (GTV) (associated with the current prior weights w).
2.4 Multiple Mode-Seeking
In this section, as an extension of Algorithm 1, we develop an adaptive and sequential method, named Ada-GTV, for multiple transformation vector mode-seeking. The core idea is to find the GTVs one after another by adaptively changing the prior weight vector w and finding the corresponding GTV d* via Algorithm 1. Suppose the current GTV has been estimated; we then search around it for a local maximizer d* of the criterion function (2) estimated under equal prior weights (this is because our purpose is to find the peak modes of (2) estimated on the original training and test data). The dual variables are calculated as p_{ij} = −k′(M²(C_i d*, μ_j, η*²Σ_j)), i = 1, ..., N, j = 1, ..., S. We then reweight all the terms in (2), giving higher weight to the cases that are "worse" compensated (those with lower p_{ij}), and repeat the GTV-seeking procedure with Algorithm 1. This leads to a sequential global mode-seeking algorithm. The formal description of Ada-GTV is given in Algorithm 2. The GTVs found in this way can be naturally viewed as transformation parameters for the different illuminations in the scene. Compensation and color classification can then easily be done according to (1), as stated in [8]. The running time of Ada-GTV is obviously O(L·N·S) (L, S ≪ N), hence it is a linear-complexity algorithm w.r.t. the pixel number N.

Algorithm 2. Ada-GTV
1: Initialization: Start with weights w_{ij}^0 = 1/(NS), i = 1, ..., N, j = 1, ..., S
2: for l = 0 to L − 1 do
3:   GTV Estimation: Find the GTV d* by Algorithm 1 with the current prior weights w^l.
4:   Mode Refinement: Starting from d*, find the local maximum d* of f̂_k(d) estimated under η* and w^0.
5:   Dual Variables: Compute p_{ij} = −k′(M²(C_i d*, μ_j, η*²Σ_j)).
6:   Sample Re-weighting: Set w_{ij}^{l+1} ← w_{ij}^l / (1 + p_{ij}). Normalize w_{ij}^{l+1} ← w_{ij}^{l+1} / Σ_{ij} w_{ij}^{l+1}.
7: end for
8: Color and Illuminant Classification: Each pixel's illuminant and color labels are determined as (j(i), l(i)) = arg min_{j,l} M²(C_i d_l, μ_j, Σ_j).
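A short sketch (ours) of the dual-variable and re-weighting steps 5–6 of Ada-GTV under the Gaussian profile; terms that are poorly compensated (small p_ij) receive a larger weight in the next round.

```python
import numpy as np

def reweight(w, pixels, d_star, mus, sigmas, eta_star):
    """Steps 5-6 of Ada-GTV: dual variables for the current GTV d*, then
    up-weight the 'worse' compensated terms (Gaussian profile assumed)."""
    inv = np.linalg.inv(eta_star**2 * sigmas)
    comp = pixels * d_star
    p = np.empty_like(w)
    for j in range(len(mus)):
        diff = comp - mus[j]
        m2 = np.einsum('ni,ij,nj->n', diff, inv[j], diff)
        p[:, j] = 0.5 * np.exp(-0.5 * m2)           # p_ij = -k'(M^2)
    w_new = w / (1.0 + p)                           # lower p_ij -> higher weight
    return w_new / w_new.sum()                      # normalize as in step 6
```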
3 Experiments
We present several groups of experiments on color compensation and classification of real scenes to show the performance of our method. The first experiment shows the global optimization property of our algorithm. For comparison purposes, we adopt one set of image data used in [8]. The training image under the "canonical" light (with the manually selected sample colors) and the test image are shown in Fig. 1(a) and 1(b). The compensation and color classification results of [8] are shown in Fig. 1(c)–1(f). It is obvious that result R1 from starting point P1 (Fig. 1(c) and 1(d)) is much more satisfying than R2 from starting point P2 (Fig. 1(e) and 1(f)); hence the EM-based algorithm is highly dependent on its initialization.
Fig. 1. A comparison example with EM based method [8]. (a): training image (with selected color) under “canonical” light (b) test image. (c) ∼ (f): compensation and color classification results by [8]. (c) and (d): R1 from starting point P1 ; (e) and (f): R2 from starting point P2 . (g) ∼ (i): the compensation, color classification and illuminant classification results by Ada-GTV from the starting point either P1 or P2 .
Table 1. Numerical results, Ada-GTV vs. EM
Starting point P1:       d1 = (1.0, 1.0, 1.0)         d2 = (2.0, 2.0, 2.0)
Result R1 by EM [8]:     d1 = (0.693, 0.773, 0.914)   d2 = (2.005, 1.636, 1.456)
Result by Ada-GTV:       d1 = (0.916, 0.990, 1.053)   d2 = (2.123, 1.614, 1.402)
Starting point P2:       d1 = (0.5, 0.5, 0.5)         d2 = (1.0, 2.0, 1.0)
Result R2 by EM [8]:     d1 = (0.493, 0.748, 0.502)   d2 = (1.873, 1.557, 1.487)
Result by Ada-GTV:       d1 = (0.916, 0.990, 1.053)   d2 = (2.123, 1.614, 1.402)
The compensation, color classification and illuminant classification results of our Ada-GTV algorithm initialized with either P1 or P2 are shown in Fig. 1(g)–1(i). Detailed numerical results are given in Table 1, which clearly indicates the initialization-invariant property of our method. The second experiment shows the ability of our method to handle nonlinear illuminant changes based on the current linear render model. To do this, we make a piecewise linear assumption to approximate the general nonlinear case; our method can automatically find the transformation vectors for each linear piece. We give here one experiment on a pair of "map" images to validate this interesting property. We used a Canon A550 DC with automatic exposure, taking care to compensate for the camera's gamma setting. The training image Fig. 2(a) and the test image Fig. 2(b) were shot under two very different camera settings. The selected 6 sample colors from the training image and their ground truth counterparts in the test image are shown in Fig. 2(c) (left part).
Fig. 2. Piecewise linear color constancy. (a) Training image; (b) Test image; (c) left: 6 selected sample colors and their ground truth counterparts in the test image; right: the ground truth transformation vectors for the 6 sample colors; (e)∼(g) color compensation, color classification and piecewise linear illuminant classification results. The black part in (e) and (f) represents unseen colors in the test image. (h): color compensation result by render vector d1 only.
Table 2. Initializations and iteration results for the render vectors

                     d1                       d2
Initializations      (1, 1, 1)                (1, 1, 1)
Iteration results    (0.649, 0.845, 1.661)    (0.788, 1.008, 3.014)
Initializations      (0.5, 0.5, 0.5)          (0.5, 0.5, 0.5)
Iteration results    (0.648, 0.843, 1.661)    (0.788, 1.008, 3.014)
Initializations      (5, 5, 5)                (5, 5, 5)
Iteration results    (0.655, 0.852, 1.661)    (0.788, 1.008, 3.014)
To test whether the illuminant change in the test image is linear or not, we calculate the ground truth transformation vectors for the samples and plot them in Fig. 2(c) (right part). Obviously, two clusters (bounded by dotted ellipses) appear among these vectors; thus the illuminant change is highly nonlinear. One reasonable assumption is that such a change is piecewise linear, and we may simply feed the image data into Ada-GTV to let it find the transformation vector modes sequentially for each piece, from arbitrary initializations. The EM-based method [8] can hardly achieve this goal because an accurate initialization for each linear piece is required, which is not always available beforehand. Here, we set the mode number L = 2 in Ada-GTV and initialize both render vectors d1 and d2 with three different starting points. The convergent points are the same under these initializations, as shown in Table 2 (parameters are set to η0 = 1.934 and ρ0 = 0.5). The image results are shown in Fig. 2(d)–2(f). Fig. 2(g) shows the color compensation result by render vector d1 alone, which visually introduces a very large compensation error.
Fig. 3. Some other experimental results. From left to right: training image, test image and color compensated image. (a)“Casia” image pairs, (b)“Comic” image pairs, (c) and (d): “face” image pairs.
Thus, we can see that the adopted piecewise linear assumption greatly improves the performance of color constancy. We have also extensively evaluated our Ada-GTV method on other real-scene image pairs; selected results are given in Fig. 3.
4 Conclusion
We have introduced in this paper a novel convex kernel based method for color constancy computation with explicit illuminant parameter estimation. A convex kernel sum function is defined to measure the illuminant compensation accuracy in a new scene that contains some of the color surfaces seen in the training image. The render vector parameters are estimated by sequentially locating the peak modes of this objective function. The proposed method is fully data-driven and initialization invariant. Nonlinear color constancy can also be approximately solved in our framework under a piecewise linear assumption. The experimental results clearly show the advantage of our method over local optimization frameworks, e.g., the MAP formulation with EM solution stated in [8].
Acknowledgement This work was supported by the following funding resources: National Natural Science Foundation Project #60518002, National Science and Technology Supporting Platform Project #2006BAK08B06, National 863 Program Projects #2006AA01Z192 and #2006AA01Z193, and Chinese Academy of Sciences “100 people project”.
References
1. Buchsbaum, G.: A spatial processor model for object color perception. Journal of the Franklin Institute 310(1), 1–26 (1980)
2. Comaniciu, D., Meer, P.: Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(5), 603–619 (2002)
3. Forsyth, D.A.: A novel algorithm for color constancy. International Journal of Computer Vision 5(1), 5–36 (1990)
4. Gilbert, A., Bowden, R.: Tracking objects across cameras by incrementally learning inter-camera colour calibration and patterns of activity. In: European Conference on Computer Vision, vol. 2, pp. 125–136 (2006)
5. Hall, J., McGann, J., Land, E.: Color mondrian experiments: the study of average spectral distributions. J. Opt. Soc. Amer. A(67), 1380 (1977)
6. Finlayson, G., Barnard, K., Funt, B.: Color constancy for scenes with varying illumination. Computer Vision and Image Understanding 65(2), 311–321 (1997)
7. Li, S.Z.: Robustizing robust m-estimation using deterministic annealing. Pattern Recognition 29(1), 159–166 (1996)
8. Manduchi, R.: Learning outdoor color classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(11), 1713–1723 (2006)
9. Marimont, D.H., Wandell, B.A.: Linear models of surface and illuminant spectra. J. Opt. Soc. Amer. 9(11), 1905–1913 (1992)
10. Rockafellar, R.T.: Convex Analysis. Princeton University Press (1970)
11. Shen, C., Brooks, M.J., van den Hengel, A.: Fast global kernel density mode seeking with application to localization and tracking. In: IEEE International Conference on Computer Vision, vol. 2, pp. 1516–1523. IEEE, Los Alamitos (2005)
12. Tieu, K., Miller, E.G.: Unsupervised color constancy. In: Thrun, S., Becker, S., Obermayer, K. (eds.) Advances in Neural Information Processing Systems 15. MIT Press, Cambridge
13. Tsin, Y., Ramesh, V., Collins, R., Kanade, T.: Bayesian color constancy for outdoor object recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 1132–1139 (2001)
14. Funt, B.V., Cardei, V.C., Barnard, K.: Modeling color constancy with neural networks. In: International Conference on Visual Recognition and Action: Neural Models of Mind and Machine (1997)
User-Guided Shape from Shading to Reconstruct Fine Details from a Single Photograph

Alexandre Meyer, Hector M. Briceño, and Saïda Bouakaz

Université de Lyon, LIRIS, France
Abstract. Many real objects, such as faces, sculptures, or low-reliefs, are composed of many detailed parts that cannot easily be modeled by an artist or by 3D scanning. In this paper, we propose a new shape from shading (SfS) approach to rapidly model details of these objects, such as wrinkles and surface reliefs, from one photograph. The method first determines the surface's flat areas in the photograph. Then, it constructs a graph of relative altitudes between each of these flat areas. We circumvent the ill-posed problem of shape from shading by having the user indicate whether some of these flat areas are a local maximum or a local minimum; additional points can be added by the user (e.g. at discontinuous creases) – this is the only user input. We use an intuitive mass-spring based minimization to determine the final position of these flat areas and a fast-marching method to generate the surface. This process can be iterated until the user is satisfied with the resulting surface. We illustrate our approach on real faces and low-relief photographs.
1 Introduction

Despite recent advances in surface modeling and deformation, creating photorealistic 3D models remains a difficult and time-consuming task. Many real objects, such as people, faces, sculptures, masks or low-reliefs, are composed of many detailed parts that cannot be easily modeled by an artist. Alternatively, 3D scanning technology is still an expensive process. While much work has been devoted to using several photographs to build 3D models, or to rendering new views from many photographs, little work has addressed the problem of modeling objects from a single photograph. The fine aspects of these surfaces appear in photographs as variations of shading. The methods that recover these features from the shading are called Shape from Shading (SfS). Nevertheless, it has been shown that this is an ill-posed problem [1,2]: a solution does not necessarily exist, and when it exists it is not unique, meaning that different surfaces may have produced a given image. Figure 1 illustrates this point: the image on the left may have been produced by both objects on the right. In this paper, we propose a new practical Shape from Shading method which can be applied to a real photograph to help with the difficult problem of modeling fine aspects of surfaces such as wrinkles and reliefs. The ill-posedness of shape from shading is resolved by asking a user to decide whether some areas orthogonal to the viewing direction are a
Laboratoire d'InfoRmatique en Images et Systèmes d'information, UMR 5205 CNRS / INSA de Lyon / Université Claude Bernard Lyon 1 / Université Lumière Lyon 2 / École Centrale de Lyon.
Fig. 1. Shape from Shading is an ill-posed problem: two different surfaces may produce the same shaded image. The left image may have been produced by several surfaces shown on the right. The highlights correspond to the flat areas. The upper right image was produced considering highlights A and C as peaks, and the lower right image considering only highlight B as a peak. Other combinations are possible.
local extrema (a local maximum or a local minimum) or not. The user information is propagated during a minimization process based on a mass-spring simulation. This mass-spring minimization has the advantage of providing a graphical visualization which allows the user to interact during the computation according to his knowledge. Indeed, we have noticed that many fully automatic minimization approaches, like simulated annealing [3], do not offer any convenient way of correcting the reconstructed surface if it is incorrect. With our approach, the user can visually follow the intuitive mass-spring minimization and correct any errors in the subsequent reconstruction. Moreover, each time he wants to see the potential reconstructed surface, the fast marching method [4,5] generates it in a few seconds.
2 Related Work

The problem of surface reconstruction from a single image can vary in the degree of user interaction, from fully automatic approaches to interactive modeling. Hoiem et al. [6] proposed a fully automatic system that creates a 3D model of the scene made up of several texture-mapped planar billboards. Their approach captures the global geometry of the scene as planar. On a single image without any underlying lines (or planes), for example a face, it seems difficult to obtain better results without using shading information. The Shape from Shading (SfS) problem has been widely studied in the computer vision area; see [7,8,9] for surveys of SfS methods. The SfS problem is known to be difficult because of its ill-posedness [2]. Consequently, few approaches have been tried on real photographs. Courteille et al. [10] propose a method to flatten a photograph of a curved sheet of paper in order to facilitate character recognition. Prados et al. [2,11] proposed a method that takes light attenuation into account, with relatively good results on face photographs. Zhu et al. [12] tackle the ambiguities of shape from shading by a semi-definite programming relaxation process which flips patches and adjusts heights until the resulting surface has no kinks.
To our knowledge, besides Zeng et al. [13], SfS approaches are mostly fully automatic in spite of the ill-posed nature of the problem. Zeng's method asks the user to enter some normals in order to determine in which direction the slope goes up to a local maximum. Once all local maxima are computed, they compute the relative altitude between each of them, and the fast-marching algorithm generates the surface. Similarly to Zeng et al., we believe that the ill-posed aspect can only be tackled with user interaction. Compared to Zeng's approach, we differ in the user input. In Zeng's approach, the user has to enter enough normals to capture small surface variations such as wrinkles on a forehead. Our approach automatically computes all flat areas, and our mass-spring minimization is more visual. Thus, the user may interact more directly with the data (the mass-spring graph in our case) and may explore different solutions, as illustrated in Fig. 1 with the two plausible configurations.
3 Formulation and Overview of Our Approach

Our technique takes as input a color image and produces as output a heightmap, which is an image where each pixel stores a distance to the underlying plane. Input images are obtained with a camera using a flash as the only light source. The first step of our approach is to compute the shading image from the RGB colored photograph. For that, we convert each RGB pixel into the YUV color space and keep the luminance Y as shading. A more elaborate solution, based on an assumption of shading continuity, is the one proposed by Funt et al. [14]: reflectance changes are located and removed from the luminance image by thresholding the gradient at locations of abrupt chromaticity change. More recently, Tappen et al. [15] proposed a solution based on both color information and a classifier trained to recognize gray-scale patterns. Once the shading image is computed, we have the classical SfS problem. In this part, we assume that the shading image is photographed orthographically, and that the scene is composed of Lambertian surfaces which exhibit single-bounce reflections and are illuminated from the camera direction by a point light source at infinity. Our approach is decomposed as follows:
1. For each pixel of the RGB photograph, extract the shading value by converting it into YUV and taking the Y value (or with the methods of [14] or [15]).
2. Detect the flat regions (highlights) in the image, which will become the vertices of our relative-altitude graph that guides the user interaction and the reconstruction (Sect. 4).
3. Using fast marching (for a recap see Sect. 3.1), compute the relative altitude difference between the vertices of the relative-altitude graph (Sect. 4).
4. The user defines a few vertices as peaks or saddles.
5. The position of the remaining vertices is computed by solving a spring-mass system over the vertices (Sect. 5), and a new surface is quickly computed.
6. If features on the surface do not have a vertex associated with them, the user can add new vertices to the relative-altitude graph.
7. These last three steps can be re-iterated until the user is satisfied with the reconstructed surface.
3.1 Shading Image Formation Model and Fast-Marching

We first review the formation model of the shading image I(x, y) for a 3D Lambertian object. In our approach the camera and the light source have the same position, which we consider at infinity from the object; we then define the light source direction as L = (0, 0, 1). The surface normal direction is given by N_{x,y} = (∂z_{x,y}/∂x, ∂z_{x,y}/∂y, 1). Notice that N is not normalized. The shading image is the dot product of the light and the normalized surface normal; it is computed by:

I_{x,y} = L · N_{x,y}/||N_{x,y}|| = 1 / sqrt( (∂z_{x,y}/∂x)² + (∂z_{x,y}/∂y)² + 1 )

With ∇z_{x,y} = (∂z_{x,y}/∂x, ∂z_{x,y}/∂y) we get ||∇z_{x,y}|| = sqrt( I_{x,y}^{-2} − 1 ). This equation is known as the Eikonal equation, which can be solved by the numerical fast marching algorithm, which we recap coarsely here; more detailed information can be found in [4,5,16]. At initialization, all pixel altitudes z_{x,y} are set to ∞ except for a few pixels whose altitudes are known. All known pixels are put into a priority queue ordered by their altitude, smallest altitude first. The algorithm extracts pixels from the priority queue until it is empty. Starting from a known pixel, the altitude of each of its four-connected neighbors is updated and added to the queue. The altitude z_{x,y} of pixel (x, y) is updated as follows:
• Let z1 = min(z_{x−1,y}, z_{x+1,y}) and z2 = min(z_{x,y−1}, z_{x,y+1}).
• If |z1 − z2| < ||∇z_{x,y}||, then z_{x,y} = ( z1 + z2 + sqrt( 2||∇z_{x,y}||² − (z1 − z2)² ) ) / 2; else z_{x,y} = min(z1, z2) + ||∇z_{x,y}||.
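A minimal Python sketch (ours, not the authors' implementation) of the fast-marching update just described: seeds are pixels with known altitude, the Eikonal right-hand side is ||∇z|| = sqrt(I^{−2} − 1), and a heap stands in for the priority queue.

```python
import heapq
import numpy as np

def fast_march(I, seeds, eps=1e-6):
    """Fast marching on a shading image I in (0, 1].

    seeds maps (x, y) pixels to known altitudes; returns the altitude map z.
    """
    h, w = I.shape
    grad = np.sqrt(np.maximum(I, eps) ** -2 - 1.0)     # ||grad z|| per pixel
    z = np.full((h, w), np.inf)
    heap = []
    for (x, y), alt in seeds.items():
        z[x, y] = alt
        heapq.heappush(heap, (alt, x, y))
    while heap:
        alt, x, y = heapq.heappop(heap)
        if alt > z[x, y]:
            continue                                   # stale queue entry
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if not (0 <= nx < h and 0 <= ny < w):
                continue
            g = grad[nx, ny]
            z1 = min(z[nx - 1, ny] if nx > 0 else np.inf,
                     z[nx + 1, ny] if nx < h - 1 else np.inf)
            z2 = min(z[nx, ny - 1] if ny > 0 else np.inf,
                     z[nx, ny + 1] if ny < w - 1 else np.inf)
            if np.isfinite(z1) and np.isfinite(z2) and abs(z1 - z2) < g:
                cand = 0.5 * (z1 + z2 + np.sqrt(2 * g**2 - (z1 - z2)**2))
            else:
                cand = min(z1, z2) + g                 # one-sided update
            if cand < z[nx, ny]:
                z[nx, ny] = cand
                heapq.heappush(heap, (cand, nx, ny))
    return z
```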
4 Relative Altitude Graph

Our minimization process, described in the next section, is based on a relative-altitude graph mapped onto the shading image. This section is dedicated to the computation of this graph.

Vertex Detection. We define a graph over the photograph where the vertices correspond to highlights (high intensity values) in the shading image. These points correspond to singular points of the surface where the gradient is 0 (e.g. local minima, local maxima, saddles). In the shading image, several adjacent pixels may have the same intensity; we thus consider only one area, and thus one vertex in the graph. Since we assume an orthographic camera, a vertex represents a flat area of the surface orthogonal to the viewing direction. We name them OVD areas, for Orthogonal to the Viewing Direction. All these OVD areas are parallel because of the orthographic assumption. Theoretically, these OVD areas have a maximal shading value. Since light attenuation may appear in a real photograph, we define them by a threshold T_shading. To compute these OVD areas we consider all regions of 4-connected pixels with equivalent values. The OVD areas are computed by a depth-first search on the shading image interpreted as a graph: two pixels are neighbors if their shading values are equal. Among all these areas, we keep as OVD areas only those with a shading value greater than T_shading and whose neighboring areas are less bright.
Fig. 2. (a) Each highlight of the shading image gives a vertex in the relative-distance graph. The fast marching algorithm computes the relative distance between each pair of vertices. The graph is simplified into a minimum spanning tree. (b) An iterative process of mass-spring simulation/user inputs on the graph (c) runs until the user is satisfied by the reconstructed surface (d). Notice that, since our graph represents relative altitude (and not Euclidean distance), each vertex can move only in a column (change altitude) during the mass-spring simulation.
On the vase image of Fig. 2, this algorithm finds four OVD areas, which intuitively are the four highlight spots.

Edges and Relative Altitude Computation. We now determine the edges and their weights, which are the relative altitudes between vertices. For each vertex v_i of the graph, we compute the relative altitude to all other vertices by fast marching: we set to zero the altitude of the pixel p_i under the considered vertex v_i. The altitude of all other pixels is set to ∞. We run the fast marching algorithm as described in Sect. 3.1. The altitude of all other pixels will be lower than that of v_i because we only descend from this pixel. We then look at the altitude of the pixels that correspond to the other vertices v_j, j ≠ i, and use this difference to set the weight of the edge e_ij to be the relative altitude difference between vertices v_i and v_j. Note that the relative altitude difference between two vertices might be wrong if there is an inflection point between them; this situation will be addressed in the next section by a simplification of the graph. We iterate this process for each vertex until we get the weights of all the edges between all the vertices. After this process, we do not know if a vertex is, for example, a local maximum, a local minimum, or a saddle; we only know its relative altitude to its neighbors.
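The edge-weight computation can be sketched as follows (ours): run the fast-marching sketch of Sect. 3.1 once per vertex, read off the altitudes at the other vertices, and — anticipating the simplification of Sect. 5 — reduce the complete graph to its minimum spanning tree with SciPy; the symmetrization of the weight matrix is an arbitrary choice of ours.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def relative_altitude_tree(I, vertex_pixels):
    """Relative-altitude graph over the OVD vertices, reduced to its MST.

    vertex_pixels : one (x, y) pixel per OVD area, in the same convention as
    the fast_march sketch of Sect. 3.1 (assumed available).
    """
    n = len(vertex_pixels)
    W = np.zeros((n, n))
    for i, p in enumerate(vertex_pixels):
        z = fast_march(I, {p: 0.0})               # descend from vertex v_i
        for j, q in enumerate(vertex_pixels):
            W[i, j] = z[q]                        # relative altitude to v_j
    W = 0.5 * (W + W.T)                           # symmetrize (our choice)
    np.fill_diagonal(W, 0.0)
    return minimum_spanning_tree(W).toarray()     # spanning tree of the graph
```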
5 Mass-Spring Simulation and User Interaction

The relative altitude graph is mapped onto the shading image, then it is simplified and converted into a mass-spring network which serves as a visual aid and as a way for the user to correct the minimization process. Our initial complete graph is composed of C(n, 2) = n(n − 1)/2 edges, n being the number of vertices (OVD areas). However, the relative altitude between two distant vertices (in the sense of Euclidean distance) is probably incorrect and should be discarded: the monotonic descent assumption used for the relative altitude calculation does not
hold if the path between the two vertices crosses a valley, a saddle or a ridge. Thus, a relative altitude is only meaningful for adjacent vertices, and the graph can be reduced to a subset of edges. For the same reasons as Zeng et al. [13], we simplify the graph into its minimum spanning tree using Prim's algorithm [17]. The number of edges is thus reduced to n − 1. Since all vertices are directly or indirectly connected to each other by the tree, the user can seed a minimization process to find their absolute altitudes. Notice that our approach of computing a complete graph which is then simplified is simple to set up, whereas determining directly which vertices are neighbors would have been error-prone. The weight of each edge is the (relative) altitude difference between its two vertices, but the sign of this relative altitude is unknown: we do not know which vertex is above the other. In other words, we do not know which vertices are local maxima, which are local minima, and which ones are saddles. Thus, we ask the user to select some vertices and to move them up or down according to his knowledge of the target surface; this serves as the initial condition to the minimization process. For example, on a face, the user will move up the vertex corresponding to the nose. In order to respect the relative altitude constraints between vertices and to propagate the user's information to the remaining vertices, we build a mass-spring network. This saves the user from having to adjust the altitude of all vertices. Each vertex becomes a mass which is able to move only in the z direction, as its (x, y) position is fixed. Indeed, since we consider only relative altitude between vertices (and not Euclidean distance), this simulation is like a single column of masses linked by springs, as illustrated in Fig. 2. All vertices have a mass of 1, meaning that no vertex is more important than another. Each edge of the spanning tree becomes a spring with a rest length equal to the relative altitude between its two vertices. This mass-spring network is animated by an explicit Euler integration [18] until it stabilizes. This simulation has two advantages: propagating user inputs and being visually intuitive for the user. Indeed, even before stabilization, the user may want to interact by moving vertices according to his knowledge of the surface. Moreover, each time he wants to see the potential reconstructed surface, the fast marching method generates it in less than a second.
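A minimal sketch (ours) of the single-column mass-spring relaxation with explicit Euler integration; stiffness, damping, step size and the way user-pinned vertices are enforced are our own choices, not the authors'.

```python
import numpy as np

def relax(edges, rest, z0, pinned, k=1.0, damping=0.9, dt=0.05, n_steps=2000):
    """Explicit-Euler relaxation of the single-column mass-spring network.

    edges  : list of (i, j) vertex index pairs (the spanning-tree edges)
    rest   : rest length of each edge = relative altitude between its vertices
    z0     : initial altitudes, after the user moved a few vertices up or down
    pinned : indices of vertices whose altitude the user keeps fixed
    """
    z_init = np.asarray(z0, dtype=float)
    z = z_init.copy()
    v = np.zeros_like(z)                       # unit mass for every vertex
    pinned = list(pinned)
    for _ in range(n_steps):
        f = np.zeros_like(z)
        for (i, j), length in zip(edges, rest):
            d = z[i] - z[j]
            direction = 1.0 if d >= 0 else -1.0
            stretch = abs(d) - length          # spring wants |z_i - z_j| = length
            f[i] -= k * stretch * direction
            f[j] += k * stretch * direction
        v = damping * (v + dt * f)             # explicit Euler step
        z += dt * v
        z[pinned] = z_init[pinned]             # enforce the user constraints
    return z
```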
Fig. 3. A surface may have sharp edges corresponding to local minimum or maximum without producing highlights. For example, at the line of the junction of the lips, the surface changes its orientation without forming a flat area. To correct this, the user can add vertices to the graph which will allow a change in the orientation. On the left, the graph without user intervention produces incoherent lips whereas on the right, after adding a vertex (blue), our method produces correct lips.
This iterative process of mass-spring simulation and user interaction on the vertices runs until the reconstructed surface satisfies the user. Our interactive method can also deal with surfaces with sharp edges, meaning local minima or maxima without a highlight. Sharp edges are points where the surface is C0 continuous but only piecewise C1 continuous, i.e., where the gradient is not smooth. For instance, at the line of the junction of the lips, the surface changes its orientation without producing a flat area with a highlight. Thus, our method allows the user to add a vertex to the relative-altitude graph, which allows a change in the surface orientation. Fig. 3 illustrates this feature: on the left, the graph without user intervention produces incoherent lips, whereas on the right, after the addition of the blue vertex, our method produces correct lips.
6 Results

In Fig. 5 we show results from real photographs to demonstrate that our technique is well suited to image-based modeling. The surface is reconstructed from one input photograph in a few minutes by a user: graph generation takes around 30 seconds on an Intel Centrino laptop for approximately 70 vertices, and surface generation by fast marching takes less than a second for an image of 300 × 200. For the faces, the user has to interact with fewer vertices, between 2 and 10. We illustrate our concept on face photographs to show the capability of our technique to capture fine wrinkles of the skin, which are difficult to obtain with multi-view approaches. Once the surface is reconstructed as a heightmap, we use the surface normal to extract a pseudo-intrinsic color for each pixel by solving the diffuse equation (R, G, B)_image = (R, G, B)_intrinsic × (N · L) with L = (0, 0, 1). Thus, a textured image of the surface is obtained by combining the intrinsic color and the shading computed from the surface normal. Notice that if the light is similar to the original photograph (position and color), we should obtain similar results.
Fig. 4. On the left, the original photograph and the computed shading image. On the right, the reconstructed surface representing a low-relief hand. After an empirical test on this image, we do not perform any particular process to manage the specular aspect, nor to take into account that the wall behind the hand has probably a different albedo than the hand.
Fig. 5. Fine details of facial expressions are hard to capture because of the wrinkles of the skin. Using shading information, our technique allows us to capture them from a single photograph. We show the original image, the computed shading image used to reconstruct the surface, a rendering of the reconstructed surface with only the shading computed using the normals, and some results with the color texture. Top left is a photo downloaded from the web; the others are extracted from a 640 × 480 video sequence. Notice that Zeng et al. [13] propose a minimization method to fix the kinks of the surface due to fast marching imprecision, like the ones present near the eyebrows (bottom).
The heightmap produced by our system is easily triangulated into a mesh. Nevertheless, it can also be directly used as a displacement map, for instance to produce realistic scenes of low-relief walls, as illustrated in Fig. 4.

Limitations. Our method allows the reconstruction of fine details; nevertheless, there are some limitations to our system. First, the reconstructed surface might have some kinks at surface junctions produced during the fast marching process. In Fig. 5, kinks near the eyebrows are present due to the difference of albedo between the skin and the eyebrows. At the end of [13], Zeng et al. propose a method to fix this kind of surface incoherency by a minimization process. Second, if there are too many fine details, the amount of user input can become important. It is conceivable to add heuristics, to modify the mass-spring simulation or to use a hierarchical approach to alleviate this limitation. Additionally, the algorithm supposes that the surface is C1 continuous; thus discontinuities in the surface can be hard to capture. This problem is partially mitigated by allowing the user to add vertices to the relative altitude graph.
7 Conclusion and Future Work

Starting from a single color image (a photograph), we have presented an intuitive method for user-guided reconstruction of surfaces which may have produced the image. Our method is interactive: guided by the user, it may reconstruct different surfaces for the same input image. It allows the user to explore different SfS solutions in case of doubt, for example for the image on the left of Fig. 1, which may have been produced by the two surfaces on the right. This exploration facility allows the user to interactively, quickly, and easily reconstruct the surface of a given object from only one photograph. The ambiguity around the global shape of the photographed object is hard to resolve automatically without any a priori knowledge, so we ask the user to specify a few local extrema (maxima or minima). Since the reconstructed surface is computed in a few seconds, it is easy for the user to converge to a surface. Manual intervention is only needed to reconstruct the global shape, whereas the fine parts of the surface are automatically extracted. Finally, we believe that a little user interaction can help to reconstruct many real objects. Thus, SfS approaches may be practically included in 3D mesh modelers¹ by defining a shape-by-example paradigm. In the future, we would also like to combine SfS approaches with global surface reconstruction based on multiple views.
References
1. Durou, J.D., Mascarilla, L., Piau, D.: Non-Visible Deformations. In: Del Bimbo, A. (ed.) ICIAP 1997. LNCS, vol. 1311. Springer, Heidelberg (1997)
2. Prados, E., Faugeras, O.: Shape from shading: a well-posed problem? In: CVPR 2005. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 870–877. IEEE, Los Alamitos (2005)
3. Courteille, F., Durou, J.D., Morin, G.: A global solution to the SfS problem using B-spline and simulated annealing. In: ICPR (2006)
¹ Such as Maya (Alias Wavefront), 3D Studio Max (Discreet) or Image Modeler (Realviz).
4. Sethian, J.A.: A Fast Marching Level Set Method for Monotonically Advancing Fronts. Proceedings of the National Academy of Sciences of the United States of America 93(4), 1591–1595 (1996)
5. Kimmel, R., Sethian, J.A.: Optimal Algorithm for Shape from Shading and Path Planning. Journal of Mathematical Imaging and Vision 14(3), 237–244 (2001)
6. Hoiem, D., Efros, A.A., Hebert, M.: Automatic photo pop-up. In: SIGGRAPH 2005. ACM SIGGRAPH 2005 Papers, pp. 577–584. ACM Press, New York (2005)
7. Kozera, R.: An overview of the shape from shading problem. Machine Graphics and Vision (1998)
8. Zhang, R., Tsai, P.S., Cryer, J.E., Shah, M.: Shape from Shading: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 21(8), 690–706 (1999)
9. Durou, J.D., Falcone, M., Sagona, M.: A Survey of Numerical Methods for Shape from Shading. Rapport de Recherche 2004-2-R, Institut de Recherche en Informatique de Toulouse, Toulouse, France (2004)
10. Courteille, F., Crouzil, A., Durou, J.D., Gurdjos, P.: Shape from shading for the digitization of curved documents. Machine Vision and Applications (2006)
11. Prados, E., Camilli, F., Faugeras, O.: A unifying and rigorous shape from shading method adapted to realistic data and applications. Journal of Mathematical Imaging and Vision 25(3), 307–328 (2006)
12. Zhu, Q., Shi, J.: Shape from shading: Recognizing the mountains through a global view. In: CVPR 2006. Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE Computer Society Press, Los Alamitos (2006)
13. Zeng, G., Matsushita, Y., Quan, L., Shum, H.Y.: Interactive Shape from Shading. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. I, pp. 343–350. IEEE Computer Society Press, Los Alamitos (2005)
14. Funt, B.V., Drew, M.S., Brockington, M.: Recovering shading from color images. In: Sandini, G. (ed.) ECCV 1992. LNCS, vol. 588, pp. 124–132. Springer, Heidelberg (1992)
15. Tappen, M.F.: Recovering intrinsic images from a single image. IEEE Trans. Pattern Anal. Mach. Intell. 27(9), 1459–1472 (2005)
16. Ho, J., Lim, J., Yang, M.H.: Integrating Surface Normal Vectors Using Fast Marching Method. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3953, pp. 239–250. Springer, Heidelberg (2006)
17. Prim, R.C.: Shortest connection networks and some generalizations. Bell Syst. Tech. J. 36, 1389–1401 (1957)
18. Desbrun, M., Schröder, P., Barr, A.: Interactive animation of structured deformable objects. Graphics Interface, 1–8 (1999)
A Theoretical Approach to Construct Highly Discriminative Features with Application in AdaBoost

Yuxin Jin, Linmi Tao, Guangyou Xu, and Yuxin Peng

Computer Science and Technology Department, Tsinghua University, Beijing, China
[email protected]
Abstract. AdaBoost is a practical method for real-time face detection, but it suffers from overfitting because of the large number of features used in a trained classifier, a consequence of the weak discriminative abilities of these features. This paper proposes a theoretical approach to construct highly discriminative features, named composed features, from Haar-like features. Both the composed and the Haar-like features are employed to train a multi-view face detector. Preliminary experiments show promising results in reducing the number of features used in a classifier, which increases the generalization ability of the classifier.
1 Introduction
In 1995, Freund and Schapire [1] introduced the AdaBoost algorithm based on the traditional boosting method. Thanks to their efforts, theoretical analyses of AdaBoost were proposed in the following years. They proved in [1] that the generalization error should be smaller if fewer training rounds are involved. Later, they gave a new theory of generalization in terms of margins [2]: greater margins contribute to better results. In early years, AdaBoost was, however, inapplicable in real-time cases due to its great computational cost. Fortunately, the breakthrough occurred in 2001 when Viola and Jones proposed a novel real-time AdaBoost for face detection [4]. The keys to making real-time detection possible are the use of the Integral Image, Haar-like features and a cascade hierarchy. Based on this approach, two kinds of extensions focused on improving the hierarchy and the features. [12] extended the cascade hierarchy into the multi-view case with the Detector Pyramid Architecture AdaBoost (DPAA). Later, [7] adopted a Width-First-Search (WFS) tree structure to balance high speed and robustness. [11] extended the Haar-like features with 45° rotated features. [8] proposed Asymmetric Rectangle Features, and experiments showed improved performance. However, the above methods are based on Haar-like features, which are so weak that a large number of weak classifiers are needed to train a strong classifier. As proven in [1], such a burdensome strong classifier increases the risk of overfitting. Some efforts were made to overcome the disadvantage of Haar-like features (their poor discriminative abilities). [9] used a PCA approach to generate the
global features which are included in the feature set in later layers of the cascade. [10] used Gabor features instead of Haar-like features. Although these features show superior discriminative abilities to Haar-like features, they are time-consuming to compute, which may preclude real-time application. [14] used EOH (Edge Oriented Histogram) features and obtained good results. In this paper, we propose a novel theoretical approach to construct highly discriminative features, named composed features, from Haar-like features; their computational load is small enough for real-time tasks such as face detection. Thus, we can not only efficiently compute the highly discriminative features but also decrease overfitting. In Section 2, we discuss features and their discriminative abilities. In Section 3, highly discriminative features that are efficient to compute are constructed from Haar-like features. Section 4 shows promising experimental results.
2 Features
For AdaBoost, a strong classifier combining many weak classifiers, each only slightly better than random guessing, can achieve an arbitrarily small training error after sufficiently many rounds, as proven in [1]. Each weak classifier contains two parts: a feature and a classification function. AdaBoost systematically chooses different features, builds weak classifiers based on them, and finally outputs a strong classifier. We focus on the features. To be real-time, Viola and Jones [4] introduced Haar-like features. With the help of the Integral Image, they are efficient to compute: only about six additions are needed to compute their value. Moreover, they are in essence linear features because their value can also be calculated as the dot product of the feature vector and the image vector. Fig. 1 shows the vector representation.
Fig. 1. Vector representation of Haar-like feature
The image can be represented as a vector:

image = [a_{11}, a_{21}, a_{31}, a_{41}, ..., a_{34}, a_{44}]^T

Similarly, the Haar-like feature can also be represented in vector form:

w = [0, −1, −1, 0, 0, −1, −1, 0, 0, +1, +1, 0, 0, +1, +1, 0]^T

The feature value x_w is just the dot product of these two vectors (the magnitude of w does not affect the discriminative ability):

x_w = w · image = w^T image    (1)
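For illustration, a small sketch (ours, not the authors' code) of how a two-rectangle Haar-like feature value — i.e. the dot product w^T·image in (1) for a ±1 pattern — is obtained from an integral image with a handful of additions; the rectangle layout is a hypothetical example.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with an extra zero row/column for easy indexing."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, y, x, h, w):
    """Sum over the h x w rectangle with top-left corner (y, x): 4 lookups."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def haar_two_rect(ii, y, x, h, w):
    """Two-rectangle feature: right half weighted +1, left half weighted -1."""
    return (rect_sum(ii, y, x + w // 2, h, w // 2)
            - rect_sum(ii, y, x, h, w // 2))
```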
Unlike many known linear features which are highly discriminative, such as PCA (Principal Component Analysis), LDA (Linear Discriminant Analysis) or Gabor features, a single Haar-like feature is so simple that its discrimination ability is weak. With respect to discriminative ability, features can be categorized into two classes, weak features and strong features, as shown in Fig. 2. This categorization is not strictly defined: strong features simply means features with relatively higher discrimination abilities (or whose classification error is lower than some threshold).
Fig. 2. A 2-class case. Left: the directions of 2 weak feature vectors and 1 strong feature vector. Right: dashed lines: the separation planes in the weak feature space; solid line: the separation plane in the strong feature space.
In this example, two classes are drawn as circles and crosses. Suppose there are only two weak features, whose directions are horizontal and vertical. In order to fully separate these two classes, about eight weak classifiers would have to be trained by AdaBoost. However, a single weak classifier with the strong feature can fully separate them, as shown in Fig. 2. According to the generalization theory in [1], a strong classifier combining fewer weak classifiers performs better. Although strong features are superior to weak features in discriminative ability, they are always time-consuming to compute, requiring n multiplications for an n-dimensional image vector. That is exactly what keeps them from being used in real-time cases. In the example illustrated by Fig. 2, all features are vectors. The strong feature can be constructed from the two weak features f_W1 and f_W2: f_S = α f_W1 + β f_W2. More generally, any linear feature vector in R^d can be constructed from at most d linearly independent vectors (linear features) in R^d. In light of this, we propose a novel approach to reduce the computational load of strong features by constructing them from Haar-like features.
3 Composed Features

3.1 Definition of Composed Features
The computational cost of a strong feature value arises from the numerous multiplications in the dot product between the feature vector and the image vector. Thus, in order to reduce this cost, the number of multiplications should be decreased. We can use Haar-like features to construct a strong feature. Due to the low computation
cost of Haar-like feature values, a strong feature's value can be computed quickly. A strong feature is denoted as a vector f_j in R^d. For a 16*16 image, there exist more than 50,000 Haar-like features. These features are necessarily linearly dependent because the dimension is just 256, far fewer than the number of features. In fact, there are just M × N linearly independent feature vectors for an M × N image. We define one group of M × N linearly independent feature vectors from the set of Haar-like features as base features:

w_0, w_1, w_2, ..., w_{d−1},  d = M × N

By using these base features, f_j can be constructed as follows:

f_j = Σ_{i=0}^{d−1} p_{ij} w_i    (2)
Then the feature value can be computed as:

x_j = Σ_{i=0}^{d−1} p_{ij} w_i^T image    (3)
In this way, we can compute the strong feature's value. However, such a representation still requires d multiplications. To reduce the computing time, we want to use fewer Haar-like features to construct a strong feature. We choose k linearly independent Haar-like features w_0, ..., w_{k−1} ∈ W, k ≪ d, to construct a feature q_j, where W is the complete set of Haar-like features. Then:

q_j = Σ_{i=0}^{k−1} q_{ij} w_i = W α_q,  with W = [w_0, ..., w_{k−1}],  α_q = [q_{0j}, ..., q_{k−1,j}]^T    (4)
We define such features as Composed Features:

Definition 1 (Composed features). Linear features constructed from some linearly independent features – the base features – so that the computation of the feature's value can be carried out indirectly by calculating the base features' values. Usually, the computational cost is reduced.
3.2 Approximation Measurement
Composed features are not necessarily strong features: some of them may be strong features while others may not. We have two ways to find strong features. First, we can exhaustively search for a proper set of Haar-like features to construct a composed feature and then check its discrimination ability: if the classification error is less than some predetermined threshold, it can be viewed as a strong feature. Second, since some strong features (PCA, LDA, or Gabor) are known, we can construct a composed feature to approximate these features. The first way would be feasible only if k is very small (just 2–3). An exhaustive exploration is equivalent to comparing all possible combinations of Haar-like features. Even fixing the coefficients, e.g. α_q = [1, 1]^T,
the search space would still be C(|W|, k), where |W| is the number of Haar-like features. When k > 3, the computational cost is too large: C(50000, 4) ≈ 2.6 × 10^17. Such a naïve approach breaks down when k is even slightly larger. The second way is more practical. In this case, f_j is assumed to be known (it can be a PCA, LDA, or Gabor feature). Our task is to find a proper set of w_0, ..., w_{k−1} and construct a composed feature which approximates f_j. Thus, a measurement to evaluate the approximation is introduced as follows:

Definition 2 (Approximation Measurement). The smaller the angle θ between the vectors f_j and q_j is, the better q_j approximates f_j.

Thus, given the set W, the coefficients α_q can be uniquely determined according to the Approximation Measurement. This is equivalent to maximizing cos θ:

cos θ = (q_j · f_j) / (|q_j| |f_j|) = ( Σ_{i=0}^{k−1} q_{ij} w_i^T f_j ) / ( ||Σ_{i=0}^{k−1} q_{ij} w_i|| |f_j| ) = (α_q^T F_W) / ( sqrt(α_q^T W^T W α_q) |f_j| )    (5)

where F_W = [w_0^T f_j, ..., w_{k−1}^T f_j]^T. Then the problem becomes solving the maximization problem given below:

max_{α_q} α_q^T F_W   s.t.  α_q^T W^T W α_q = 1    (6)
Here, we restrict the composed vector q_j to be a unit vector without loss of generality. Such a problem can easily be solved with the Lagrangian method:

f(p_{0j}, p_{1j}, ..., p_{k−1,j}, λ) = α_q^T F_W − λ (α_q^T W^T W α_q − 1)    (7)

λ = sqrt( F_W^T (W^T W)^{−1} F_W ) / 2,   α_q = (W^T W)^{−1} F_W / sqrt( F_W^T (W^T W)^{−1} F_W )    (8)
As the selected Haar-like features are linearly independent, W^T W is invertible. The remaining problem is how to find a proper set W. It is infeasible to do this by exhaustive search; therefore, we introduce a novel algorithm based on Simulated Annealing to achieve it.
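A direct transcription (ours) of the closed form (8): given a target feature f_j and a matrix W whose columns are the k chosen Haar-like features, it returns α_q and the resulting cos θ from (5).

```python
import numpy as np

def compose(f, W):
    """Coefficients (8) for approximating target feature f with the columns
    of W (k linearly independent Haar-like features). Returns (alpha_q, cos_theta)."""
    F = W.T @ f                           # F_W = [w_0^T f, ..., w_{k-1}^T f]^T
    G = W.T @ W                           # Gram matrix, invertible by assumption
    Ginv_F = np.linalg.solve(G, F)        # (W^T W)^{-1} F_W
    denom = np.sqrt(F @ Ginv_F)
    alpha = Ginv_F / denom                # eq. (8); makes q_j a unit vector
    q = W @ alpha                         # the composed feature q_j
    cos_theta = (q @ f) / (np.linalg.norm(q) * np.linalg.norm(f))
    return alpha, cos_theta
```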
3.3 Proper W Searching Algorithm
Finding a proper set W is an optimization problem described as follows:

min_W θ = min_W f(W) for a given k,  s.t. W ∈ D    (9)
D is the configuration space. Each configuration is a set of k Haar-like features chosen from W. Because the number of Haar-like features, denoted |W|, is finite, the number of configurations in D, denoted |D|, is also finite. However, the problem is computationally hard because |D| = C(|W|, k), k ≪ d. In order to find a proper set of Haar-like features, a Simulated Annealing algorithm is implemented. We take θ as the energy function: if the current θ_i is larger than the θ_j calculated for configuration j through (5), i is transited into j; otherwise, the transition occurs with some probability. From the current configuration i to the next one j, i, j ∈ D, we only exchange one Haar-like feature of i, with transition probability

p_{ij} = G_{ij}(t) A_{ij}(t),  ∀j ∈ D    (10)

where t is the temperature, and G_{ij}(t) and A_{ij}(t) are the generation and acceptance probabilities, respectively:

G_{ij}(t) = { 1/|N(i)|,  if j ∈ N(i);   0,  if j ∉ N(i) }    (11)

A_{ij}(t) = { 1,  if f(i) ≥ f(j);   exp(−(f(j) − f(i))/t),  if f(i) < f(j) }    (12)
N(i) is the neighborhood of configuration i: one randomly selected feature in i can be exchanged with any other feature in W that is linearly independent of the remaining features. Thus, the number of neighbors of i is at most k|W| ≪ |D|. In this way, the problem becomes feasible to solve. The Proper W Searching Algorithm is given in Fig. 3. Research in feature extraction has lasted for several decades. Thanks to these efforts, PCA, LDA, Gabor features or other possible linear features can be used here as f_j.
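A compact sketch (ours) of the Proper W Searching Algorithm: one randomly chosen base feature is swapped at each step and the move is accepted according to (12), with θ from (5) as the energy; `compose` is the sketch given after (8), `haar_bank` is a hypothetical matrix of candidate Haar-like feature vectors, and the cooling schedule and parameters are our own choices.

```python
import numpy as np

def search_W(f, haar_bank, k, t0=1.0, cooling=0.95, n_iter=2000, seed=0):
    """Simulated-annealing search for a proper set W of k Haar-like features."""
    rng = np.random.default_rng(seed)
    n = haar_bank.shape[1]
    idx = list(rng.choice(n, size=k, replace=False))   # initial configuration

    def energy(ids):
        _, cos_t = compose(f, haar_bank[:, ids])
        return np.arccos(np.clip(cos_t, -1.0, 1.0))    # theta from (5)

    e = energy(idx)
    t = t0
    for _ in range(n_iter):
        cand = idx.copy()
        pos = int(rng.integers(k))                     # feature of i to exchange
        repl = int(rng.choice([c for c in range(n) if c not in idx]))
        cand[pos] = repl
        try:
            e_new = energy(cand)                       # fails if features dependent
        except np.linalg.LinAlgError:
            t *= cooling
            continue
        # acceptance rule (12): always accept a lower energy, otherwise with prob.
        if e_new <= e or rng.random() < np.exp(-(e_new - e) / t):
            idx, e = cand, e_new
        t *= cooling
    return idx, e
```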
Fig. 3. Proper Searching Algorithm
With this construction, complicated strong features are composed of several simple Haar-like features. As a result, the computation of strong feature values is faster, with an insignificant loss in accuracy. The composed features can then be used in AdaBoost for real-time applications.
4 Experiment
A total of 10,000 faces were collected from various sources and categorized into 5 views, [-90, -60], [-60, -15], [-15, 15], [15, 60] and [60, 90] (views 1–5), with 2,000 faces per view. Each face example's size is 16*16. We adopt DPAA ([12], [8]) as the training structure under the AdaBoost scheme. The ratio of the number of positive examples to the number of negative examples is 1.0 for each layer. We ran our experiments on a P4 3.0 GHz computer with 512 MB RAM.
Fig. 4. Left: face data in view 4; Center: extracted PCA features of group 4; Right: approximated PCA features
Fig. 5. Distributions of feature values. Top-Left: PCA feature. Top-Right: PCA-CF. Bottom-Left: 1st selected Haar-like feature. Bottom-Right: 5th selected Haar-like feature.
In our experiments, we only construct PCA features. We extract PCA features from 9 groups, each of which may include the data of one or more views. Groups 1–5 include the data of the 5 individual views, respectively; group 6 includes views 1 and 2; group 7 includes views 2, 3 and 4; group 8 includes views 4 and 5; group 9 includes all views. We choose the first 100 PCA features from each group to be approximated. Fig. 4 shows the PCA features and the PCA-CFs (PCA Composed Features), each of which is composed of 15 Haar-like features, for view [30, 60]. The angles between the PCA features and the PCA-CFs are less than 20° (cos 20° ≈ 0.939693). In Fig. 5, the top row shows that in layer 7 the distribution of feature values is similar for the PCA feature and the PCA-CF. Their error rates are 26.75% and 27.58%, respectively. Obviously, although a PCA-CF is not an exact PCA feature, their discrimination abilities are similar. In layer 7, it is difficult to separate faces from non-faces. The bottom row shows the distributions for the 1st and 5th selected Haar-like features, with errors of 30.17% and 38.17%. They are worse than the PCA-CF.
Fig. 6. Left: Training Error Curve; Right: Margin distribution
Fig. 7. Some results on CMU profile test set
Fig. 8. ROC comparison on CMU profile test set
We compare the ROC curve of our approach with those of Viola and Jones [5] and Schneiderman [6], which report results on the CMU profile data set; the curves are shown in Fig. 8. Our approach performs best among them while running in real time. (The results in [8] were reported on an unspecified dataset; [9] and [10] report results on other datasets; [11] only gives the false alarm rate instead of the number of false alarms.)
5 Conclusion
In this paper, we propose a theoretical approach to constructing strong linear features that improves both generalization ability and efficiency. Haar-like features are too weak to discriminate the classes, which results in serious overfitting; strong features offer high discriminative ability but are expensive to compute. The composed features proposed in this paper inherit the advantages of both Haar-like features and strong features. Using the Proper W Searching Algorithm, composed features can be constructed that approximate strong features, so that both efficiency and better generalization ability are achieved. Experiments show that our method outperforms those of Viola and Jones, Schneiderman, and Levi and Weiss, and that a real-time AdaBoost system can be built on composed features. Our approach can be extended to construct any linear strong feature, and its application is not limited to AdaBoost: it can be used wherever fast computation of strong features is needed.
Acknowledgements. This work is supported by the National Science Foundation of China under grants No. 60673189 and No. 60433030.
References
1. Freund, Y., Schapire, R.E.: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences 55, 119–139 (1997)
2. Schapire, R., Freund, Y., Bartlett, P., Lee, W.S.: Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods. The Annals of Statistics 26(5), 1651–1686 (1998)
3. Schapire, R., Singer, Y.: Improved Boosting Algorithms Using Confidence-rated Predictions. Machine Learning 37, 297–336 (1999)
4. Viola, P., Jones, M.: Rapid Object Detection Using a Boosted Cascade of Simple Features. In: IEEE Conf. Computer Vision and Pattern Recognition, IEEE Computer Society Press, Los Alamitos (2001)
5. Jones, M., Viola, P.: Fast Multi-view Face Detection. In: TR (2003)
6. Schneiderman, H., Kanade, T.: A statistical method for 3D object detection applied to faces and cars. In: IEEE Conf. Computer Vision and Pattern Recognition, IEEE Computer Society Press, Los Alamitos (2000)
7. Huang, C., Ai, H., et al.: Vector Boosting for Rotation Invariant Multi-view Face Detection. In: IEEE International Conf. on Computer Vision (2005)
8. Wang, Y., Liu, Y., Tao, L., Xu, G.: Real-Time Multi-View Face Detection and Pose Estimation in Video Stream. In: IEEE International Conf. on Pattern Recognition (2006)
9. Zhang, D., Li, S.Z., et al.: Real-Time Face Detection Using Boosting in Hierarchical Feature Spaces. In: IEEE International Conf. on Pattern Recognition, IEEE Computer Society Press, Los Alamitos (2004)
10. Yang, P., Shan, S., Zhang, D., et al.: Face Recognition Using Ada-Boosted Gabor Features. In: IEEE International Conf. on Automatic Face and Gesture Recognition (2004)
11. Lienhart, R., Maydt, J.: An Extended Set of Haar-like Features for Rapid Object Detection. In: IEEE International Conf. on Image Processing (2002)
12. Li, S.Z., Zhu, L., et al.: Statistical Learning of Multi-View Face Detection. In: Tistarelli, M., Bigun, J., Jain, A.K. (eds.) ECCV 2002. LNCS, vol. 2359, Springer, Heidelberg (2002)
13. Fleuret, F.: Fast binary feature selection with conditional mutual information. Journal of Machine Learning Research 5, 1531–1555 (2004)
14. Levi, K., Weiss, Y.: Learning object detection from a small number of examples: the importance of good features. In: IEEE Conf. Computer Vision and Pattern Recognition, IEEE Computer Society Press, Los Alamitos (2004)
Robust Foreground Extraction Technique Using Gaussian Family Model and Multiple Thresholds

Hansung Kim1, Ryuuki Sakamoto1, Itaru Kitahara1,2, Tomoji Toriyama1, and Kiyoshi Kogure1
1 Knowledge Science Lab, ATR, Kyoto, Japan
2 Dept. of Intelligent Interaction Technologies, Univ. of Tsukuba, Japan
{hskim,skmt,toriyama,kogure}@atr.jp, [email protected]
Abstract. We propose a robust method to extract silhouettes of foreground objects from color video sequences. To cope with various changes in the background, the background is modeled with generalized Gaussian family (GGF) distributions and updated by a selective running average and static pixel observation. All pixels in the input video image are classified into four initial regions using background subtraction with multiple thresholds, after which shadow regions are eliminated using color components. The final foreground silhouette is extracted by refining the initial region using morphological processes. Experiments verify that the proposed algorithm works very well in various background and foreground situations. Keywords: Foreground segmentation, Silhouette extraction, Background subtraction, Generalized Gaussian Family model.
1 Introduction
The background subtraction technique is one of the most common approaches for extracting foreground objects from video sequences [1,2]. This technique subtracts the current image from a static background image acquired in advance from multiple images over a period of time. Since it works very quickly and distinguishes semantic object regions from static backgrounds, it has been used for years in many vision systems such as video surveillance, teleconferencing, video editing, and human-computer interfaces. Conventional approaches assume that the background is static; therefore, they cannot adapt to changes in illumination or geometry [3,4,5]. Several algorithms have been developed to overcome this problem by modeling and updating the background statistics. They can be classified into two categories: parametric and non-parametric approaches. The parametric approaches assume a form of the background distribution in advance and estimate the parameters of the model. Earlier methods used a single Gaussian distribution to model the probability distribution of the pixel intensity [6,7]. Recently, the Gaussian mixture model has become the most representative
approach [8] and has been widely combined with Bayesian frameworks [9], color and gradient information [10], mean-shift analysis [11], and region information [12]. However, these approaches have high computational complexity and involve a trade-off in the learning rate [13]. Tuzel et al. proposed using 3D multivariate Gaussians instead of the Gaussian mixture model to improve computational efficiency [14]. The non-parametric approaches estimate density functions directly from sample data. Elgammal et al. used kernel density estimators (KDE) to adapt quickly to changes in the background [13], and several advanced approaches using KDE have been proposed [15,16]. However, these KDE-based approaches consume a lot of memory to maintain recent background statistics. Kim et al. proposed a codebook algorithm to construct a background model from long image sequences [17]. Other approaches represent various environmental conditions: [18,19] use Hidden Markov Models (HMMs) to switch background states according to observations, and [20,21] aim to segment objects in dynamic textured backgrounds such as water and waving trees. In this paper, we propose a robust foreground silhouette extraction algorithm for color video sequences. Each pixel of the background is modeled as a generalized Gaussian family (GGF) distribution. The form of the distribution is chosen within the family by calculating the kurtosis of the data, and the model is updated by two methods in order to adapt to changes in both illumination and geometry in the background. We classify the initial mask into four categories according to their reliability and refine them with color information. The final foreground silhouette is extracted by morphological processes.
2 Background Model
2.1 Modeling Background
Formerly, the variance of a pixel in a static scene over time was modeled with a Gaussian distribution N(μ, σ) because the image noise over time was modeled by a zero-mean Gaussian distribution N(0, σ) [4,5,6,7,8]. However, the latest digital video cameras provide clean and steady images with noise reduction. Moreover, in stable scenes such as indoor ones, pixel variations are smaller than in outdoor scenes due to less light dispersion and illumination change, and fewer of the small motions that occur frequently in nature. We extracted distributions of the deviation from the mean of each pixel in indoor and outdoor scenes over a short time interval, and then compared these distributions with two Gaussian family distributions: a Gaussian distribution and a Laplace distribution.

Table 1. Average difference of distributions

          Outdoor   Indoor
Gaussian  0.4165    0.0452
Laplace   0.4923    0.0161
Fig. 1. Intensity histograms of pixels in an image
Table 1 shows the average difference from each model within the range of 3σ. Clearly, the indoor scene is much closer to a Laplace model than to a Gaussian one. A more serious problem is that the distributions of different pixels over time show different shapes within the same image. Figure 1 shows intensity histograms of several pixels in an image together with their excess kurtosis g2. Excess kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution and is calculated as Eq. (1), where n is the number of samples and μ is their mean. The excess kurtosis of the Gaussian and Laplace distributions is 0 and 3, respectively. From Fig. 1 we can see that the background is hard to model with a Gaussian distribution alone.

$$g_2 = \frac{m_4}{\sigma^4} - 3 = \frac{n \sum_{i=1}^{n} (x_i - \mu)^4}{\left( \sum_{i=1}^{n} (x_i - \mu)^2 \right)^2} - 3 \qquad (1)$$

Therefore, we propose to use the GGF distributions to model the background in this research. The GGF model is defined as

$$p(x : \rho) = \frac{\rho\,\gamma}{2\,\Gamma(1/\rho)} \exp\left( -\gamma^{\rho}\, |x - \mu|^{\rho} \right) \qquad (2)$$

with

$$\gamma = \frac{1}{\sigma} \left( \frac{\Gamma(3/\rho)}{\Gamma(1/\rho)} \right)^{1/2},$$

where Γ(·) is the gamma function and σ² is the variance of the distribution.
In Eq. (2), ρ = 2 represents a Gaussian distribution while ρ = 1 represents a Laplace distribution. In this research, we restrict the GGF model to the Laplace and Gaussian cases. The model for each pixel of the background is decided by calculating the excess kurtosis over the first N frames. Optimized parameters of the models can be estimated by maximizing the likelihood of the observed data [22,23]. The background is modeled in two distinct parts: a luminance component and a color component. Normal RGB components are very sensitive to noise and changes in lighting conditions, so we use the luminance component for initial object segmentation. However, the luminance component may change drastically due to shadows of objects in the background and reflections from lighting in the foreground. We therefore construct a second background model with the color component of the image to remove false segmentation.
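A minimal sketch of the per-pixel model selection is given below (Python with NumPy). The decision threshold of 1.5, halfway between the Gaussian and Laplace excess-kurtosis values, is our assumption; the paper only states that the choice is based on the excess kurtosis of the first N frames.

```python
import numpy as np

def excess_kurtosis(samples):
    """Excess kurtosis g2 of Eq. (1); 0 for a Gaussian, 3 for a Laplace distribution."""
    x = np.asarray(samples, dtype=np.float64)
    d = x - x.mean()
    m2 = np.sum(d ** 2)
    m4 = np.sum(d ** 4)
    return len(x) * m4 / (m2 ** 2) - 3.0

def choose_pixel_model(luminance_history):
    """Pick the GGF member (rho) for one pixel from its first N frames.

    The 1.5 threshold is an assumption of this sketch, not the authors' stated rule.
    """
    g2 = excess_kurtosis(luminance_history)
    rho = 2.0 if g2 < 1.5 else 1.0   # rho = 2: Gaussian, rho = 1: Laplace
    return rho, float(np.mean(luminance_history)), float(np.var(luminance_history))
```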
The color component H is extracted from the HSI model as follows:

$$I = \max(R, G, B)$$

$$S = \begin{cases} (I - \min(R, G, B))/I & \text{if } I \neq 0 \\ 0 & \text{otherwise} \end{cases}$$

$$H = \begin{cases} (G - B) \times 60/S & \text{if } I = R \\ 180 + (B - R) \times 60/S & \text{if } I = G \\ 240 + (R - G) \times 60/S & \text{if } I = B \end{cases} \qquad (3)$$

$$\text{if } H < 0 \text{ then } H = H + 360$$
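For reference, a direct transcription of Eq. (3) into code might look like the following sketch (pure Python). The guard for S = 0 (achromatic pixels) is an added assumption not spelled out in the paper.

```python
def hue_component(r, g, b):
    """Color component H of Eq. (3) from the HSI model."""
    i = max(r, g, b)
    s = (i - min(r, g, b)) / i if i != 0 else 0.0
    if s == 0:            # achromatic pixel: hue undefined, return 0 by convention (assumption)
        return 0.0
    if i == r:
        h = (g - b) * 60.0 / s
    elif i == g:
        h = 180.0 + (b - r) * 60.0 / s
    else:                 # i == b
        h = 240.0 + (r - g) * 60.0 / s
    return h + 360.0 if h < 0 else h
```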
2.2 Updating the Background
The background model should be adapted to changes in the background statistics. There are two types of change in the background, with different characteristics: gradual changes due to lighting conditions, and sudden changes due to shifts in background geometry. To cope with gradual changes, we update the background model of each pixel using the running average of Eq. (4):

$$\mu_{t+1} = \alpha x_t + (1 - \alpha)\mu_t, \qquad \sigma^2_{t+1} = \alpha (x_t - \mu_t)^2 + (1 - \alpha)\sigma^2_t, \qquad \alpha = \begin{cases} 0 & \text{if } x_t \notin \text{background} \\ 0.05 & \text{if } x_t \in \text{background} \end{cases} \qquad (4)$$

However, this background update process cannot handle sudden and permanent changes in the background. For example, if an object in the background is moved and stays fixed at the new location for a long time, the system will detect both the new position and the old position as foreground objects permanently. Therefore, the background model is also updated using static pixel observation. If a region is determined to be a foreground region and assigned the same label by the labeling process described in Section 3, successive frame differences of the pixels in the region are observed. If the pixels have been stationary for the past Th_Bg frames, the old background models of the pixels in the region are replaced with new models. However, if there is any non-stationary area bigger than the smallest region size Th_RG, all observation processes in the region with the same label are reset to avoid partial disappearance of local stationary pixels in the foreground. That is, background models are updated per unit with the same label. In the experiment, Th_RG was set to 0.1% of the image size.
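A minimal sketch of the selective running average of Eq. (4) for a single pixel is shown below; the function and argument names are illustrative assumptions.

```python
ALPHA = 0.05  # learning rate for background pixels, from Eq. (4)

def update_running_average(mu, var, x, is_background):
    """Selective running average of Eq. (4) for one pixel.

    mu, var : current background mean and variance of the pixel
    x       : current luminance observation
    is_background : True if the pixel was classified as background in this frame
    """
    a = ALPHA if is_background else 0.0
    new_mu = a * x + (1.0 - a) * mu
    new_var = a * (x - mu) ** 2 + (1.0 - a) * var
    return new_mu, new_var
```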
3 Foreground Extraction
Based on the constructed background models, the silhouettes of foreground objects are extracted from the video sequences. First, initial region classification is performed by subtracting the luminance component of the current frame from the background model. A fixed single
threshold, however, can lead to serious over-segmentation or under-segmentation errors in ambiguous regions such as shadows or foreground regions whose brightness is similar to the background. Therefore, we classify the initial object region into four categories using multiple thresholds based on their reliability, as in Eq. (5). L_I and L_B denote the luminance components of the current frame and the background model, respectively, and σ is the standard deviation of the background model.

$$BD(p) = |L_I(p) - L_B(p)|$$

$$\begin{cases} BD(p) < K_1\sigma(p) & \Rightarrow \text{(a) Reliable Background} \\ K_1\sigma(p) \le BD(p) \le K_2\sigma(p) & \Rightarrow \text{(b) Suspicious Background} \\ K_2\sigma(p) \le BD(p) \le K_3\sigma(p) & \Rightarrow \text{(c) Suspicious Foreground} \\ K_3\sigma(p) \le BD(p) & \Rightarrow \text{(d) Reliable Foreground} \end{cases} \qquad (5)$$
Thresholds K1 ∼ K3 used in Eq. (5) were determined by training data. We used around 100 images with ground-truth foreground masks taken from different environments. The following condition is used to decide the parameters, where β was set to 3 because false negative errors are generally more critical than false positive errors in foreground extraction.
$$(K_1, K_2, K_3) = \arg\min_{K_1, K_2, K_3} \left( \beta \times \mathit{FalseNegativeError} + \mathit{FalsePositiveError} \right) \qquad (6)$$

However, a large amount of background can be assimilated into the suspicious foreground region, caused by the shadow of an object changing the background brightness. We eliminate shadows from the initial object region by using the color component, because a shadow does not change the color property of the background but only its brightness. With Eq. (7), shadows in the suspicious foreground region are merged into the suspicious background region. H denotes the color component of the image, and σ_H is the standard deviation of the color component in the background model.
$$\text{if } p \in \text{region(c)} \;\text{and}\; |H_I(p) - H_B(p)| < K_1 \sigma_H(p) \;\text{then}\; p \Rightarrow \text{region(b)} \qquad (7)$$

In the labeling step, all foreground regions (c) and (d) of Eq. (5) are labeled with their own identification numbers. All connected foreground pixels under the 8-neighbor rule are assigned the same label using a region-growing technique [24]. However, there are also small noise regions in the initial object regions. A conventional way of eliminating noise regions is to use a morphological operation to filter out small regions; therefore, we refine the initial mask by closing and opening operations [24]. Then, we sort and relabel all labeled regions in descending order of size. In the relabeling process, regions smaller than Th_RG are eliminated. Finally, we use a silhouette extraction technique, which is an improvement of Kumar's profile extraction technique [3], to smooth the boundaries of the foreground and eliminate holes inside the regions.
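Putting Eq. (5) and Eq. (7) together, the initial classification and shadow elimination can be sketched as follows (NumPy, whole-frame arrays). The integer region codes are our own encoding, and hue wrap-around is ignored for brevity.

```python
import numpy as np

# Region codes (our own encoding): 0 = reliable background, 1 = suspicious background,
# 2 = suspicious foreground, 3 = reliable foreground.
def classify_regions(L_I, L_B, sigma, H_I, H_B, sigma_H, K1, K2, K3):
    """Initial classification of Eq. (5) followed by the shadow merge of Eq. (7)."""
    BD = np.abs(L_I - L_B)
    region = np.zeros(BD.shape, dtype=np.uint8)
    region[BD >= K1 * sigma] = 1
    region[BD >= K2 * sigma] = 2
    region[BD >= K3 * sigma] = 3
    # Eq. (7): suspicious foreground pixels whose hue matches the background are shadows
    # (circular hue distance is ignored here for brevity).
    shadow = (region == 2) & (np.abs(H_I - H_B) < K1 * sigma_H)
    region[shadow] = 1
    return region
```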
Fig. 2. Segmentation results at each step: (a) original image; (b) classification; (c) shadow elimination; (d) labeling; (e) silhouette extraction; (f) final result
A weighted one-pixel-thick drape is moved from one side to the opposite side. The pixels adjacent to the drape are connected by an elastic spring that covers the object without infiltrating gaps whose widths are smaller than a threshold M. This process is performed from all four sides, and the region wrapped by the four drapes denotes the final foreground region. We apply the profile extraction technique independently to each labeled region to avoid errors between multiple foreground objects. However, the silhouette extraction algorithm also covers real holes inside the object; we therefore perform region growing from reliable background regions inside the silhouette if such a region is bigger than the small-region threshold Th_RG. Figure 2 shows a test image and the results of silhouette extraction at each step.
4 Experimental Results
We applied the proposed algorithm to various video streams, including indoor and outdoor scenes taken with an IEEE-1394 camera and a normal camcorder. The IEEE-1394 camera provides 1024 × 768 RGB video streams, and the camcorder provides 720 × 480 interlaced DV streams. We ran the algorithm on a PC with a Pentium IV 3.2-GHz CPU, 1.0 GByte of memory, and Visual C++ on a Windows XP operating system. The parameters used in the simulation were selected experimentally as Th_BG = 100 for stationary objects in updating the background model and M = 12 for the maximum gap width in silhouette extraction. We set Th_BG very short to show the effect of the background update within a short time, but it should be much longer in real applications. Figure 3 shows the segmentation results for various scenes: in each pair, the left image shows the captured image and the right image shows the extracted foreground. Figure 4 shows objective evaluations of the proposed algorithm. We randomly selected 14 frames from seven different scenes (i.e., 98 images in total) and created ground-truth segmentation masks by manual segmentation.
Fig. 3. Results of foreground extraction
Fig. 4. Segmentation errors relative to ground truth (%): (a) false negative (FN) error; (b) false positive (FP) error
Then we compared the segmentation error of the proposed algorithm with a Gaussian-based algorithm with a single threshold [7] and a KDE-based algorithm [13]. We applied the same morphological processes in all experiments to isolate the effect of the proposed model. We compared the results by calculating the percentage of erroneous pixels as in Eq. (8):

$$\mathit{error} = \frac{\text{Number of erroneous pixels}}{\text{Number of real foreground pixels}} \times 100\ (\%) \qquad (8)$$
In Fig. 4, FN denotes the false negative error, whereby a foreground region is falsely classified as background, and FP denotes the false positive error, whereby background is falsely classified as foreground. In all results, FP errors are much larger than FN errors due to blurring from fast motion and errors around object boundaries.
Table 2. Processing speed analysis (msec)

Stage                     Time
Background subtraction      15
Shadow elimination          46
Object labeling             16
Silhouette extraction      250
Background update           15
Total                      342
Generally, FN error is more uncomfortable to the eye and less acceptable to many vision systems than FP. The average error rates of the proposed algorithm are lower than those of the conventional methods in most scenes. Table 2 shows a runtime analysis of the proposed system. The times listed are the average processing times when one person is in the scene; the resolution of the video is 1024 × 768. Considering the image resolution, the processing speed is quite high. In order to evaluate the effect of the background update algorithm, we created an artificial environment in which the lighting condition changes gradually over a short time. Some rigid objects in the background are also shifted to different positions by an actor. In this experiment, we assumed that a rigid object attached to an actor is a foreground object, but it is considered background once it is separated from the actor.
Fig. 5. Results of background update: (a) original scene; (b) without background update; (c) with background update
Fig. 6. Errors relative to ground truth for the sequence in Fig. 5 (%)
Figure 5 shows snapshots of the results of this experiment. We also manually created ground-truth foreground masks for every third frame of the 1200-frame sequence and plotted the error rate of the segmentation results in Fig. 6. In this graph, the error rates were calculated as a percentage of errors not against the foreground size, as in Fig. 4, but against the image size, because the error rate diverges under background changes when there is no real object in the scene. The curve with background update in Fig. 6 shows that the error rate increased temporarily when objects parted from the actors but soon became low again, and that the change in brightness in the room hardly affected the error rate.
5 Conclusion
In this paper, we proposed a powerful silhouette extraction algorithm that is robust against variations in the background. The background is modeled with GGF distributions and updated by selective running averages and static pixel observation, while the foreground is segmented using multiple thresholds and morphological processes. Experimental results indicate that the proposed algorithm works very well in various background and foreground situations. Future work on this topic will take two main directions. First, although the proposed algorithm is fast, it does not run in real time on XGA image sequences; real-time performance could be achieved by using hardware accelerators such as a Graphics Processing Unit (GPU) and by further optimizing the implementation. Second, the proposed method can cope with gradual or long-term changes in the background but not with high-frequency repetitive changes such as flickering monitors or branches shaking in the wind. We plan to develop a multi-modal GGF model to overcome this problem.
Acknowledgements This research was supported by the National Institute of Information and Communications Technology.
References
1. Gelasca, E.D., Ebrahimi, T., Karaman, M., Sikora, T.: A Framework for Evaluating Video Object Segmentation Algorithms. In: Proc. CVPR Workshop, pp. 198–198 (2006)
2. Piccardi, M.: Background subtraction techniques: a review. In: Proc. IEEE SMC, vol. 4, pp. 3099–3104 (2004)
3. Kumar, P., Sengupta, K., Ranganath, S.: Real time detection and recognition of human profiles using inexpensive desktop cameras. In: Proc. ICPR, pp. 1096–1099 (2000)
4. Jabri, S., Duric, Z., Rosenfeld, A., Wechsler, H.: Detection and location of people in video images using adaptive fusion of color and edge information. In: Proc. ICPR, pp. 627–630 (2000)
5. Horprasert, T., Harwood, D., Davis, L.S.: A robust background subtraction and shadow detection. In: Proc. ACCV (2000)
6. McKenna, S.J., Jabri, S., Duric, Z., Rosenfeld, A., Wechsler, H.: Tracking Groups of People. Computer Vision and Image Understanding 80(1), 42–56 (2000)
7. Wren, C., Azarbayejani, A., Darrell, T., Pentland, A.P.: Pfinder: Real-Time Tracking of the Human Body. IEEE Trans. PAMI 19(7), 780–785 (1997)
8. Stauffer, C., Grimson, W.E.L.: Adaptive background mixture models for real-time tracking. In: Proc. CVPR, pp. 246–252 (1999)
9. Lee, D.S., Hull, J.J., Erol, B.: A Bayesian framework for Gaussian mixture background modeling. In: Proc. ICIP, vol. 3, pp. 973–976 (2003)
10. Javed, O., Shafique, K., Shah, M.: A hierarchical approach to robust background subtraction using color and gradient information. In: Proc. IEEE Motion and Video Computing, pp. 22–27. IEEE Computer Society Press, Los Alamitos (2002)
11. Porikli, F., Tuzel, O.: Human body tracking by adaptive background models and mean-shift analysis. In: Proc. PETS-ICVS (2003)
12. Cristani, M., Bicego, M., Murino, V.: Integrated region- and pixel-based approach to background modeling. In: Proc. IEEE MVC, pp. 3–8. IEEE Computer Society Press, Los Alamitos (2002)
13. Elgammal, A., Harwood, D., Davis, L.S.: Non-parametric model for background subtraction. In: Proc. ECCV, vol. 2, pp. 751–767 (2000)
14. Tuzel, O., Porikli, F., Meer, P.: A Bayesian Approach to Background Modeling. In: Proc. IEEE MVIV, vol. 3, pp. 58–63 (2005)
15. Han, B., Comaniciu, D., Davis, L.: Sequential kernel density approximation through mode propagation: applications to background modeling. In: Proc. ACCV (2004)
16. Mittal, A., Paragios, N.: Motion-based background subtraction using adaptive kernel density estimation. In: Proc. CVPR, pp. 302–309 (2004)
17. Kim, K., Chalidabhongse, T.H., Harwood, D., Davis, L.S.: Real-time foreground-background segmentation using codebook model. Real-Time Imaging 11, 172–185 (2005)
18. Wang, D., Feng, T., Shum, H., Ma, S.: Novel probability model for background maintenance and subtraction. In: Proc. ICVI (2002)
19. Stenger, B., Ramesh, V., Paragios, N., Coetzee, F., Buhmann, J.: Topology free hidden Markov models: Application to background modeling. In: Proc. ICCV, pp. 294–301 (2001)
20. Zhong, J., Sclaroff, S.: Segmenting foreground objects from a dynamic textured background via a robust Kalman filter. In: Proc. ICCV, pp. 44–50 (2003)
21. Monnet, A., Mittal, A., Paragios, N., Ramesh, V.: Background modeling and subtraction of dynamic scenes. In: Proc. ICCV, pp. 1305–1312 (2003)
22. Lee, J.Y., Nandi, A.K.: Maximum Likelihood Parameter Estimation of the Asymmetric Generalized Gaussian Family of Distribution. In: Proc. SPW-HOS (1999)
23. Kotz, S., Kozubowski, T.J., Podgorski, K.: Maximum likelihood estimation of asymmetric Laplace parameters. Ann. Inst. Statist. Math. 54, 816–826 (2002)
24. Shapiro, L.G., Stockman, G.C.: Computer Vision. Prentice Hall, New Jersey (2001)
Feature Management for Efficient Camera Tracking

Harald Wuest1,2, Alain Pagani2, and Didier Stricker2
1 Centre for Advanced Media Technology (CAMTech), Nanyang Technological University (NTU), 50 Nanyang Avenue, Singapore 649812
2 Department of Virtual and Augmented Reality, Fraunhofer IGD, TU Darmstadt, GRIS, Germany
[email protected]
Abstract. In dynamic scenes with occluding objects, many features need to be tracked for robust real-time camera pose estimation. An open problem is that tracking too many features has a negative effect on the real-time capability of a tracking approach. This paper proposes a feature management method that performs a statistical analysis of the ability to track a feature and then uses only those features which are very likely to be tracked from the current camera position. A large set of features at different scales is created, where every feature holds a probability distribution of camera positions from which the feature can be tracked successfully. As only the feature points with the highest probability are used in the tracking step, the method can handle a large number of features at different scales without losing real-time performance. Both the statistical analysis and the reconstruction of the features' 3D coordinates are performed online during tracking, and no preprocessing step is needed.
1 Introduction
Tracking point-based features is a widely used technique for camera pose estimation. Either reference features are taken from pre-calibrated images with a given 3D model [1,2], or the feature points are reconstructed online during tracking [3,4,5]. These approaches are very promising if the feature points are located on well-textured planar regions. However, in industrial scenarios objects often consist of reflecting materials and poorly textured surfaces. Because of spotlights or occluding objects, the range of camera positions from which a feature point has the same visual appearance can be very limited. Increasing the number of features can help to ensure a robust camera pose estimation, but as the 2D feature tracking step accounts for a large share of the computation time, the overall tracking performance becomes very poor. Using only a subset of those features which are visible from a given viewpoint can avoid this problem. Najafi et al. [1] present a statistical analysis of the appearance and shape of features from possible viewpoints. In an offline training phase they coarsely
sample the viewing space at discrete camera positions and create clusters of viewpoints for every model feature according to similar feature descriptors. Thereby a map is created which, for every feature, gives information about the detection repeatability, accuracy, and visibility from different viewpoints. During the online phase this information is used to select good features. In this paper we present a feature management method that does not rely on any preprocessing but performs an online estimation of the tracking probability of every feature. The ability to track a feature is observed at runtime, and distributions of the camera positions of tracking successes and tracking failures are created. These distributions are represented by mixture models with a constant number of Gaussians; a merge operation is used to keep the number of Gaussians fixed. The resulting tracking probability, which models not only the visibility but also the robustness of a feature, is then used to decide which features are most suitable to be tracked at a given camera position. Robust camera pose estimation is solved by using Levenberg-Marquardt minimization and RANSAC outlier rejection.
2 Feature Tracking and Reconstruction
For robust reconstruction and pose estimation, a feature point must be tracked as long as possible; therefore it should be invariant to deformations, illumination, and scale. The well-known Shi-Tomasi-Kanade tracker is a widely used technique for tracking 2D feature points [6]. It is based on the iterative minimization of the sum of squared differences with a gradient descent method. In [7], illumination compensation was added to the minimization procedure. The problem of updating a template patch was addressed in [8]. Another promising approach for reliable 2D feature tracking was presented by Zinßer et al. [9], where a brightness-corrected, affinely warped template patch is used to track a feature point. They proposed a two-stage approach where pure translation from frame to frame is estimated first on several levels of the image pyramid, and then the template patch is iteratively aligned at the resulting image position of the first stage. The alignment of the patch T in the image I is based on minimizing the following squared intensity difference:

$$\epsilon = \sum_{x} \left( I(x) - \left( \lambda\, T(g_{\alpha}(x)) + \delta \right) \right)^2 \qquad (1)$$

where λ and δ are the parameters for adjusting the contrast and the brightness, and g_α is the affine transformation function. We extend this method by extracting a template patch at different resolution levels of the image pyramid and always selecting the patch whose resolution is most similar to that of the predicted affine-transformed patch. If the desired resolution of the patch does not exist, it is extracted from the current image after a successful tracking step. A feature is regarded as tracked successfully if the iterations of the alignment converge and the error of Eq. (1) is smaller than a given threshold. Successfully tracked features are reconstructed by triangulation and further refined by an Extended Kalman Filter. More details can be found in [5].
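A minimal sketch of the residual of Eq. (1) follows (NumPy). The function and argument names, and the bilinear sampling helper `sample`, are illustrative assumptions; the summation over the patch pixels is how we read the (garbled) original equation.

```python
import numpy as np

def alignment_error(image_patch, template, warp_a, warp_b, lam, delta, sample):
    """Squared intensity difference of Eq. (1) for one patch position.

    image_patch : intensities I(x) at the patch pixels x (2D array)
    g_alpha(x) = A x + b maps patch coordinates to template coordinates;
    warp_a is the 2x2 matrix A, warp_b the 2-vector b.
    `sample(template, pts)` is an assumed bilinear sampling helper that returns
    template intensities at sub-pixel positions.
    """
    h, w = image_patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float64)
    warped = pts @ warp_a.T + warp_b                     # g_alpha(x) for every patch pixel
    residual = image_patch.ravel() - (lam * sample(template, warped) + delta)
    return float(np.sum(residual ** 2))
```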
3 Feature Management
The functions of the feature management are the extraction of new features, the estimation of the feature tracking probability, the selection of good features for a given camera position, and the removal of features which are of no further use for tracking. The whole management shall be an incremental process which runs in real time and uses only a limited amount of memory. The tracking probability of a feature denotes the probability that the feature can be tracked successfully at a given camera position. The sequential estimation of this probability is described in the following section.
3.1 Tracking Probability
As rotation around the camera center does not have any influence on the visibility of a point feature (as long as the feature is located inside the image), only the position of the camera in world coordinates is regarded as useful information to decide whether a feature is worth tracking. What is known about the ability to track a feature at a given camera position are the observations of its tracking success in previous frames. The problem of modeling a probability distribution p(x) of a random variable x, given a finite set x₁, ..., x_N of observations, is known as density estimation. A widely used non-parametric method for creating probability distributions are kernel density estimators. To obtain a smooth density model we choose a Gaussian kernel function. For a D-dimensional vector x the probability density can be written as

$$p(x) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi\sigma^2)^{D/2}} \exp\left( -\frac{\|x - x_n\|^2}{2\sigma^2} \right) \qquad (2)$$

where N is the number of observation points x_n, and σ represents the variance of the Gaussian kernel function in one dimension. Every observation of a feature belongs to one element of the class C = {s, f}, which simply records whether the tracking step was successful (s) or failed (f). The probability density of the camera position is estimated for every element of the class C separately. Let p(x|C = s) be the conditional probability density of the camera position for successfully tracked features and p(x|C = f) the conditional density for unsuccessfully tracked features. The marginal probability of tracking successes is given by p(C = s) = N_s/N and that of tracking failures by p(C = f) = N_f/N, where N_s and N_f are the numbers of successful and unsuccessful tracking steps respectively, and N is the total number of observations. The probability p_t(x) that a feature can be tracked at a given camera position x is estimated as

$$p_t(x) = p(C = s \mid x) \qquad (3)$$
Applying Bayes' theorem, the tracking probability can be written as

$$p_t(x) = \frac{p(x|C{=}s)\,p(C{=}s)}{p(x)} = \frac{p(x|C{=}s)\,p(C{=}s)}{p(x|C{=}s)\,p(C{=}s) + p(x|C{=}f)\,p(C{=}f)} = \frac{p(x|C{=}s)\,N_s}{p(x|C{=}s)\,N_s + p(x|C{=}f)\,N_f} \qquad (4)$$
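To make Eqs. (2)–(4) concrete, a small sketch using per-class Gaussian kernel density estimates is given below (NumPy). Storing all observations is exactly the memory problem addressed next with a fixed-size mixture, so this is only the naive baseline; the default σ = 5 echoes the example value given later for centimetre world coordinates.

```python
import numpy as np

def gaussian_kde(x, observations, sigma):
    """Kernel density estimate of Eq. (2) at camera position x.

    observations : (N, D) array of camera positions of previous observations.
    """
    obs = np.asarray(observations, dtype=np.float64)
    d = obs.shape[1]
    sq_dist = np.sum((obs - x) ** 2, axis=1)
    norm = (2.0 * np.pi * sigma ** 2) ** (d / 2.0)
    return np.mean(np.exp(-sq_dist / (2.0 * sigma ** 2)) / norm)

def tracking_probability(x, successes, failures, sigma=5.0):
    """p_t(x) of Eq. (4) from the success/failure camera positions."""
    ns, nf = len(successes), len(failures)
    ps = gaussian_kde(x, successes, sigma) * ns if ns else 0.0
    pf = gaussian_kde(x, failures, sigma) * nf if nf else 0.0
    return ps / (ps + pf) if (ps + pf) > 0 else 0.0
```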
The estimation of probability densities using Eq. (2), however, has the major drawback that the complexity of storage and computation grows linearly with the number of observations, which is not feasible for an online application. Our approach to density estimation is therefore based on a finite set of Gaussian mixtures. The use of mixture models for efficient clustering of huge data sets has already been addressed. In [10] the Iterative Pairwise Replacement Algorithm (IPRA) is proposed, a computationally efficient method for conditional density estimation on very large data sets where kernel estimates are approximated by much smaller mixtures. Goldberger [11] uses a hierarchical approach to reduce large Gaussian mixtures to smaller mixtures by minimizing a KL-based distance between them. Zhang [12] presents another efficient approach for simplifying mixture models by using an L2 norm as the distance measure between the mixtures. Zivkovic [13] presents a recursive solution for estimating the parameters of a mixture with simultaneous selection of the number of components. We use a method similar to [10], but instead of clustering a large data set we use it for online density estimation with a finite mixture model. A mixture with a finite number of Gaussians is maintained for both the successfully and the unsuccessfully tracked features. Consider the multivariate Gaussian mixture distribution of the successfully tracked features, which can be written as

$$p(x|C{=}s) = \sum_{k=1}^{K} \omega_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) \quad \text{with} \quad \sum_{k=1}^{K} \omega_k = 1 \qquad (5)$$
where μ_k is the D-dimensional mean vector and Σ_k the D×D covariance matrix. The mixing coefficient ω_k = N_k/N_s records how many observations N_k have contributed to Gaussian k. The probability distribution p(x|C = f) is defined in the same way. Together with Eq. (4), the tracking probability for a given camera position can then be estimated. The mixture model is built and maintained as follows. Depending on the tracking success, an observation is assigned to a class C, which means that either the distribution p(x|C = s) or the distribution p(x|C = f) is updated. First, for every observation a Gaussian kernel function is created, where every kernel can be regarded as a Gaussian of the mixture model. If the maximum number of mixture components K is reached, the two most similar Gaussians are merged and a new Gaussian is created by taking the kernel function of the incoming observation.
3.2 Similarity Measure
A similarity matrix is maintained which stores the pairwise similarity of all Gaussians. Scott [10] defined the similarity measure between two density functions p₁ and p₂ as

$$\mathrm{sim}(p_1, p_2) = \frac{\int_{-\infty}^{\infty} p_1(x)\,p_2(x)\,dx}{\left( \int_{-\infty}^{\infty} p_1^2(x)\,dx \int_{-\infty}^{\infty} p_2^2(x)\,dx \right)^{1/2}} \qquad (6)$$

Equation (6) can be considered as a correlation between the two densities. If p₁(x) = N(x|μ₁, Σ₁) and p₂(x) = N(x|μ₂, Σ₂) are normal distributions, the similarity measure can be calculated as

$$\mathrm{sim}(p_1, p_2) = \frac{\left( 2^D |\Sigma_1 \Sigma_2|^{1/2} \right)^{1/2}}{|\Sigma_1 + \Sigma_2|^{1/2}} \exp(\Delta) \qquad (7)$$

with

$$\Delta = -\frac{1}{2} (\mu_1 - \mu_2)^T (\Sigma_1 + \Sigma_2)^{-1} (\mu_1 - \mu_2). \qquad (8)$$

This follows from the fact that

$$\int_{-\infty}^{\infty} \mathcal{N}(x|\mu_1, \Sigma_1)\,\mathcal{N}(x|\mu_2, \Sigma_2)\,dx = \mathcal{N}(0 \mid \mu_1 - \mu_2, \Sigma_1 + \Sigma_2). \qquad (9)$$
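A direct sketch of the closed form of Eqs. (7)–(8) follows (NumPy); numerical safeguards for near-singular covariances are omitted.

```python
import numpy as np

def gaussian_similarity(mu1, cov1, mu2, cov2):
    """Normalized correlation of two Gaussians, Eqs. (7)-(8)."""
    d = len(mu1)
    diff = np.asarray(mu1, dtype=np.float64) - np.asarray(mu2, dtype=np.float64)
    cov_sum = cov1 + cov2
    delta = -0.5 * diff @ np.linalg.solve(cov_sum, diff)          # Eq. (8)
    num = np.sqrt(2.0 ** d * np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return num / np.sqrt(np.linalg.det(cov_sum)) * np.exp(delta)  # Eq. (7)
```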
The two most similar Gaussians according to the similarity measure of Eq. (6) are used for the merging step, which is described in the next section.

3.3 Merging Gaussian Distributions
The merge operation of the two most similar Gaussians is carried out as follows. Assume that the i-th and the j-th components are merged into the i-th component of the mixture. Since a mixing coefficient represents the number of observations which affect a distribution, the new number of observations is N′_i = N_i + N_j, and therefore ω_i is updated by

$$\omega_i' = \omega_i + \omega_j. \qquad (10)$$

The mean of the new distribution can be calculated as

$$\mu_i' = \frac{1}{N_i'} \sum_{n=1}^{N_i'} x_n = \frac{1}{N_i'} \left( \sum_{n=1}^{N_i} x_n + \sum_{n=1}^{N_j} x_n \right) = \frac{1}{N_i'} \left( N_i \mu_i + N_j \mu_j \right) = \frac{1}{\omega_i'} \left( \omega_i \mu_i + \omega_j \mu_j \right) \qquad (11)$$
After the mean is computed, the covariance Σ_i can be updated as follows:

$$\Sigma_i' = \frac{1}{N_i'} \sum_{n=1}^{N_i'} (x_n - \mu_i')(x_n - \mu_i')^T = \frac{1}{N_i'} \sum_{n=1}^{N_i'} x_n x_n^T - \mu_i' \mu_i'^T = \frac{1}{N_i'} \left( \sum_{n=1}^{N_i} x_n x_n^T + \sum_{n=1}^{N_j} x_n x_n^T \right) - \mu_i' \mu_i'^T$$

$$= \frac{1}{N_i'} \left( N_i (\Sigma_i + \mu_i \mu_i^T) + N_j (\Sigma_j + \mu_j \mu_j^T) \right) - \mu_i' \mu_i'^T = \frac{1}{\omega_i'} \left( \omega_i (\Sigma_i + \mu_i \mu_i^T) + \omega_j (\Sigma_j + \mu_j \mu_j^T) \right) - \mu_i' \mu_i'^T. \qquad (12)$$
After the merge operation, the j-th component can be reused for a new observation to represent a new Gaussian. It can be regarded as a kernel estimate with a Gaussian kernel function. For a new observation, the camera position is assigned to x_j and the covariance is set to σ²I, where I is the identity matrix and σ determines the size of the Parzen window. The parameter σ affects the smoothness of the resulting mixture model and must be chosen with respect to the world coordinate system. If, for example, the camera position is given in cm, then σ = 5 yields a convincing probability distribution for indoor camera tracking. The weight ω_j is initialized with ω_j = 1/N_c, where N_c is the number of observations of the assigned class.
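A sketch of the merge of Eqs. (10)–(12) for two mixture components is given below (NumPy); representing a component as a (weight, mean, covariance) tuple is an assumption of this sketch.

```python
import numpy as np

def merge_components(w_i, mu_i, cov_i, w_j, mu_j, cov_j):
    """Merge component j into component i following Eqs. (10)-(12)."""
    w = w_i + w_j                                            # Eq. (10)
    mu = (w_i * mu_i + w_j * mu_j) / w                       # Eq. (11)
    second_moment = (w_i * (cov_i + np.outer(mu_i, mu_i)) +
                     w_j * (cov_j + np.outer(mu_j, mu_j))) / w
    cov = second_moment - np.outer(mu, mu)                   # Eq. (12)
    return w, mu, cov
```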
3.4 Feature Selection
Features which already have a precisely reconstructed 3D coordinate need no further reconstruction or refinement step. If such a feature is not very likely to be tracked from the current camera position, it is probably of no use for the pose estimation and can be disregarded in the tracking step. Features which do not have a valid 3D coordinate are always selected for the tracking step, because it is important that a feature point is triangulated quickly and an exact 3D position is reconstructed, so that the feature becomes beneficial for the camera pose estimation. Before the tracking step, all features which were not tracked successfully in the last frame are projected into the image with the last camera pose in order to provide a good starting position for the iterative alignment. The tracking probabilities of all features located inside the current image are calculated with Eq. (4), and the features are sorted by their probability in descending order. The feature tracking described in Section 2 is then applied to the sorted list of features until a minimum number of features has been tracked successfully. In our implementation we stop after 30 successfully tracked features with a valid 3D coordinate, which is sufficient for a robust pose estimation.
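The selection loop can be summarized by the following sketch; `track`, `has_valid_3d`, `tracking_probability`, and the feature records are illustrative assumptions, not the authors' data structures.

```python
def select_and_track(features, camera_position, track, min_tracked=30):
    """Sketch of the selection step: features without a 3D coordinate are always
    tracked; the rest are tracked in order of decreasing tracking probability
    until `min_tracked` features with a valid 3D coordinate succeed."""
    without_3d = [f for f in features if not f.has_valid_3d]
    with_3d = sorted((f for f in features if f.has_valid_3d),
                     key=lambda f: f.tracking_probability(camera_position),
                     reverse=True)
    for f in without_3d:
        track(f)                        # always attempted, so the point can be triangulated
    tracked = 0
    for f in with_3d:
        if track(f):                    # 2D alignment of Section 2
            tracked += 1
            if tracked >= min_tracked:
                break
    return tracked
```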
The benefit of this approach is that the total number of tracked features is kept to a minimum if most of the features are tracked successfully; but if there are many tracking failures due to occlusion or strong motion blur, as many features as needed are tracked until a robust camera pose estimation is possible.

3.5 Feature Extraction
Most point-based feature tracking methods use the well-known Harris corner detector [14], which is based on an eigenvalue analysis of the gradient structure of an image patch. Another simple but very efficient approach, called FAST (Features from Accelerated Segment Test), was presented by Rosten et al. [15]. Their method analyzes the intensity values on a circle of 16 pixels surrounding the corner point: if at least 12 contiguous pixels are all above or all below the intensity of the center by some threshold, the point is regarded as a corner feature. For reasons of efficiency we use the FAST feature detector in our implementation. To avoid too many features and overlapping patches, a new feature is only extracted if no other feature point exists within a minimum distance in the image. New features are extracted if the total number of features with p_t(x) > 0.5 for the current camera position x falls below a given threshold.
3.6 Feature Removal
In order to decide whether a feature is valuable for further tracking, a measure of usefulness has to be defined. If the tracking probability p_t(x) is smaller than 0.5 for every camera position x, a feature can be regarded as dispensable. The exact computation of the maximum of p_t(x) with the expectation-maximization algorithm for every feature is computationally too expensive. With μ_{k,s} denoting the Gaussian means of the mixture model representing successfully tracked features, we approximate the maximum of the tracking probability by evaluating p_t at all positions μ_{k,s}:

$$p_{\max} \approx \max_k \; p_t(\mu_{k,s}) \qquad (13)$$
If pmax < 0.5 holds, then no camera position exists where this feature is likely to be tracked, and it can be removed from the feature map without the concern of losing valuable information. If a feature point gets lost and the 3D coordinate of that feature has not been reconstructed yet, this feature is removed as well, because without a valid 3D coordinate it is not possible to re-project the feature back into the image for further tracking.
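A short sketch combining Eq. (13) with this removal rule follows; `success_means` and `p_t` are assumed accessors for a feature's success mixture and its tracking probability of Eq. (4).

```python
def is_dispensable(feature, threshold=0.5):
    """Feature removal test of Eq. (13): drop a feature if its tracking
    probability stays below `threshold` at every success-mixture mean."""
    means = feature.success_means()          # the mu_{k,s} of the feature's mixture
    if not means:
        return True                          # never tracked successfully
    p_max = max(feature.p_t(mu) for mu in means)
    return p_max < threshold
```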
4 Experimental Results
To evaluate whether the tracking probability distribution of a single feature is estimated correctly, the following test scenario was created. The camera pose is computed by tracking a set of planar fiducial markers located in the x/y-plane.
Fig. 1. Probability density map of camera position for a single feature. (a) A frame of the test sequence. (b) The Gaussian mixture models of camera positions where the feature was tracked successfully (blue) and where the tracking failed (red). (c) The tracking probability in the x/z-plane.
Fig. 2. Histograms of successfully and unsuccessfully tracked features with their corresponding tracking probability: (a) tracking failures; (b) tracking successes
A point feature is extracted manually on the same plane. When the camera is moved around, the point feature gets lost while it is occluded by an object, but it is tracked successfully again when it becomes visible. The Gaussian mixture model is visualized in Fig. 1(b) by a set of confidence ellipsoids, drawn in blue and red for p(x|C = s) and p(x|C = f) respectively. The number of Gaussians is limited to 8 for each mixture model in this example. Fig. 1(c) shows the probability distribution p_t(x) in the x/z-plane together with the Gaussian means. It can be seen that the camera positions where the point feature was visible or occluded are correctly represented by the mixture models of tracking successes and tracking failures, respectively. The probability distribution clearly shows that the tracking probability falls to 0 at camera positions where the feature is occluded.
Table 1. Average processing time of the individual steps of the tracking approach

Step                                     Time (ms)
Prediction step / build image pyramid        10.53
Feature selection and tracking               29.08
Pose estimation                               2.74
Update feature probability                    1.94
Reconstruct feature points                    5.53
Extract new features                          5.93
Total time without feature extraction        49.82
An image sequence showing an industrial scenario is used for the further experiments. In order to evaluate the quality of the tracking probability estimation, all available features are used as input for the tracking step, and we observe whether the features are tracked successfully or not, compared with their tracking probability. Figure 2 plots histograms of the number of successfully and unsuccessfully tracked features against their corresponding tracking probability. It can be seen that the majority of features with a high tracking probability were indeed tracked successfully. An analysis of the processing time was carried out on a Pentium 4 at 2.8 GHz with a FireWire camera at a resolution of 640 × 480 pixels. The average computational costs of the individual steps are shown in Table 1. Without feature extraction, the tracking system can run at a frame rate of 20 Hz. If no feature selection is performed, on average 93.9 features are used in the feature tracking step, only 49.0% of all features are tracked successfully, and the average runtime of the tracking step is 64.36 ms. With the selection of the most probable features, on average only 48.94 features are analysed per frame in the tracking step. The success rate of the feature tracking rises to 83.0% and the mean computation time is lowered to 29.08 ms, with no significant difference in the quality of the pose estimation.
5 Conclusion
We have presented an approach for real-time camera pose estimation which uses an efficient feature management to store many features and to track only those features which are most likely to be tracked from a given camera position. The tracking probability for every feature is estimated online during the tracking and no preprocessing is necessary. Features which are only visible in a limited area of viewpoints are only tracked at those certain camera positions and ignored at any other viewpoints. Even if they are occluded for a long time, reliable features are not deleted, but kept in the feature set as long as a camera position exists from which the feature can be tracked successfully. Not only the visibility, but also the robustness of a feature is represented by the tracking probability. Tracking failures due to reflections or spotlights at certain camera positions are also modeled correctly.
Acknowledgements This work was partially funded by the European Commission project SKILLS, Multimodal Interfaces for Capturing and Transfer of Skills (IST-035005, www.skills-ip.eu).
References
1. Najafi, H., Genc, Y., Navab, N.: Fusion of 3D and appearance models for fast object detection and pose estimation. In: Narayanan, P.J., Nayar, S.K., Shum, H.-Y. (eds.) ACCV 2006. LNCS, vol. 3852, Springer, Heidelberg (2006)
2. Bleser, G., Pastarmov, Y., Stricker, D.: Real-time 3D camera tracking for industrial augmented reality applications. In: WSCG (Full Papers), pp. 47–54 (2005)
3. Genc, Y., Riedel, S., Souvannavong, F., Akinlar, C., Navab, N.: Marker-less tracking for AR: A learning-based approach. In: IEEE/ACM International Symposium on Mixed and Augmented Reality, pp. 295–304. IEEE Computer Society Press, Los Alamitos (2002)
4. Davison, A.: Real-time simultaneous localisation and mapping with a single camera. In: Proc. International Conference on Computer Vision, Nice (2003)
5. Bleser, G., Wuest, H., Stricker, D.: Online camera pose estimation in partially known and dynamic scenes. In: ISMAR, pp. 56–65 (2006)
6. Shi, J., Tomasi, C.: Good features to track. In: CVPR 1994. IEEE Conference on Computer Vision and Pattern Recognition, pp. 593–600. IEEE Computer Society Press, Los Alamitos (1994)
7. Jin, H., Favaro, P., Soatto, S.: Real-time feature tracking and outlier rejection with changes in illumination. In: IEEE Intl. Conf. on Computer Vision, pp. 684–689. IEEE Computer Society Press, Los Alamitos (2001)
8. Matthews, I., Ishikawa, T., Baker, S.: The template update problem. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(6), 810–815 (2004)
9. Zinßer, T., Gräßl, C., Niemann, H.: Efficient Feature Tracking for Long Video Sequences. In: Rasmussen, C.E., Bülthoff, H.H., Schölkopf, B., Giese, M.A. (eds.) DAGM 2004. LNCS, vol. 3175, pp. 326–333. Springer, Heidelberg (2004)
10. Scott, D.W., Szewczyk, W.F.: From kernels to mixtures. Technometrics 43, 323–335 (2001)
11. Goldberger, J., Roweis, S.: Hierarchical clustering of a mixture model. In: Saul, L.K., Weiss, Y., Bottou, L. (eds.) Advances in Neural Information Processing Systems 17, pp. 505–512. MIT Press, Cambridge (2005)
12. Zhang, K., Kwok, J.: Simplifying mixture models through function approximation. In: Schölkopf, B., Platt, J., Hoffman, T. (eds.) Advances in Neural Information Processing Systems 19, MIT Press, Cambridge (2007)
13. Zivkovic, Z., van der Heijden, F.: Recursive unsupervised learning of finite mixture models. IEEE Trans. Pattern Anal. Mach. Intell. 26(5), 651–656 (2004)
14. Harris, C., Stephens, M.: A combined corner and edge detector. In: Proc. Alvey Vision Conf., Univ. Manchester, pp. 147–151 (1988)
15. Rosten, E., Drummond, T.: Fusing points and lines for high performance tracking. In: IEEE International Conference on Computer Vision, vol. 2, pp. 1508–1511 (2005)
Measurement of Reflection Properties in Ancient Japanese Drawing Ukiyo-e

Xin Yin, Kangying Cai, Yuki Takeda, Ryo Akama, and Hiromi T. Tanaka
Ritsumeikan University, Nojihigashi 1-1-1, Kusatsu, Shiga, 5258577, Japan
http://cv.ci.ritsumei.ac.jp
Abstract. Ukiyo-e is a famous traditional Japanese woodblock print. Some patterns printed with special printing techniques can only be seen from particular directions; this phenomenon is related to the reflection properties of the Ukiyo-e surface. In this paper, we propose a method to measure these reflection properties. First, the normals on the surface and the directions of the fibers in the Japanese paper are computed from photos taken with a measuring machine named OGM. Then, a reflection model is fitted to the measured data and the reflection properties of the Ukiyo-e are obtained. Based on these parameters, the appearance of the Ukiyo-e can be rendered in real time. Keywords: Ukiyo-e, measurement, fibers in Japanese paper, cultural heritage.
1 Introduction
Several rendering techniques such as NPR (Non-Photorealistic Rendering) were developed over the last two decades. These studies mainly focused on simulating the pen stroke and the distribution of pigment on the paper. On the other hand, the scattering of light is also important for representing the appearance of a drawing: the pigment particles and the fibers in the paper affect the scattering of light and produce special effects on the surface. In this paper, we measure the appearance of a type of ancient Japanese drawing named Ukiyo-e and observe both the isotropic reflection from the pigment particles and the anisotropic reflection from the fibers of the Japanese paper. Based on these observations, we propose a shading technique which blends these two types of reflection and renders the appearance of Ukiyo-e in real time. Ukiyo-e is a traditional Japanese drawing whose origin lies in depictions of life in Kyoto in the 16th century. Special printing techniques were developed to obtain particular effects; the Karazuri and Kirazuri techniques are introduced here. Karazuri does not use any pigment: the woodblock is pressed onto the paper by force, creating a bump pattern. Kirazuri uses mica and gold particles to draw patterns on the Ukiyo-e. As shown in Figure 1, the pattern of snow is made by Karazuri and the pattern on the clothing of the woman is made by Kirazuri.
Fig. 1. The photos of Ukiyo-e
Fig. 2. Photomicrographs of the Ukiyo-e. The pigment on the left is gold particles; the pigment in the middle is ink; there is no pigment on the right.
As a result of these printing techniques, the color of the Ukiyo-e varies according to the position of the light source and the viewpoint. Another special effect comes from the fibers in the Japanese paper. The length of the fibers in the Japanese paper used for Ukiyo-e is 6.0–15.0 mm, 7–8 times that in ordinary paper. Figure 2 shows photomicrographs of the Ukiyo-e surface; the fibers can be seen clearly in these photos, even in the parts that appear to be filled with pigment particles. Since the reflection of the fibers is anisotropic, we need to design an anisotropic shading model to represent their reflection effect. As mentioned above, NPR techniques mainly represent the effects of drawings such as paintings, sculpture, block prints, and dyeing. Some work has also been done to reproduce the effect of Ukiyo-e: processing 2D photos ([1]) or simulating the Ukiyo-e printing process ([2]) can reproduce Ukiyo-e in a virtual world. These works focus on simulating the isotropic color of the Ukiyo-e and do not simulate the light scattering on it. To our knowledge, we are the first to simulate the light scattering properties of Ukiyo-e.
Our work is based on measured data and is related to previous work on measuring spatially varying BRDFs (Bidirectional Reflectance Distribution Functions) and BTFs (Bidirectional Texture Functions). Such measurements usually place the light source and the camera on a hemispherical dome and take a large number of photos ([3], [4], [5]); from these photos, the BRDF or BTF can be constructed to render a photorealistic scene. Our measurement is similar to [5], using high-density samples to capture detailed color variation on the surface of the Ukiyo-e. Constructing geometric parameters such as surface normals from photos has been studied for a long time. The principle of photometric stereo ([6]) can be used to construct geometric parameters and BRDFs. The surface normal can be obtained from the color variation across different photos or video ([7], [8], [9]). By comparing the reflection of example objects such as a sphere with the reflection of the target object under the same illumination conditions, the geometric parameters of the target object can be computed ([10]). With the development of techniques for scanning 3D geometric data by laser, it has become possible to improve the precision of the reflection parameters for high-quality rendering by comparing with the scanned data ([11], [12]). For ease of application, the BRDF and surface normal can also be obtained using a small number of photos ([13], [14]). To decrease measurement errors, we measure the data at high density and construct a shading model to render the Ukiyo-e with high realism. Another geometric parameter is the direction of the fibers in the Japanese paper. As a result of fiber reflection, the appearance of the Ukiyo-e shows anisotropic reflection. Several anisotropic shading models have been proposed based on microfacet models and empirical models ([15], [16]); these models assume that the reflected light is distributed over a narrow lobe. For a fiber shading model, the strongest reflection direction lies on a cone around the direction of the fiber ([17], [18]). We develop this type of fiber shading model and fit it to the measured data in this paper. The main idea of our work comes from [14] and [5]: [14] computes the normals on the surface of isotropic reflecting materials, and [5] mainly computes the fiber direction in wood and renders the effect of the fibers. The fibers in the Japanese paper are more complex than in the case of wood; their directions are close to a random distribution. The appearance of the Ukiyo-e blends two effects: the isotropic reflection from the pigment and the anisotropic reflection from the fibers in the Japanese paper. Unlike [14] and [5], we blend these two reflections together and fit the two reflection models to the measured data. Because we combine two different shading models, the errors between the model and the measured data are decreased and high-quality rendering results can be obtained.
2 Taking Photos
We use a system named OGM (Optical Gyro Measuring Machine) to take photos of the Ukiyo-e. The OGM is a 4-axis measuring machine which can place the light source and the camera at any position on a hemispherical dome. Figure 3 shows a photo
Fig. 3. Optical gyro measuring machine (OGM), consisting of a camera, a light source, a light source arm, a camera arm, and an object stage
Fig. 4. BRDFs of different pixels on the Ukiyo-e image: the BRDF of a pigment pixel (golden particle) and the BRDF of a Japanese paper fiber pixel
of the OGM. For measuring the color variation on the Ukiyo-e, the camera is fixed at the position perpendicular to the surface of the Ukiyo-e and only the position of the light source is changed. The light positions are recorded in the computer as a 2D array. To map this 2D array onto positions of the light source on the hemispherical dome, a concentric map ([19]) is used; we use a 37 by 37 grid to set the light positions. To avoid placing the light source behind the camera arm, the object stage is turned 180 degrees. Some markers are placed around the Ukiyo-e to calibrate the pixel positions of the Ukiyo-e image, and photos of a white paper are also taken to calibrate the light distribution on the surface. The technique of [14] is used to calibrate the images. After calibrating the color and the pixel positions, the BRDF of each pixel can be obtained.
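The light positions are indexed through the concentric map of [19]; the following is a minimal sketch, not the authors' code, of how a grid index could be mapped through the Shirley-Chiu square-to-disk transform and then lifted to a direction on the hemispherical dome. The lift used here (z = sqrt(1 - r^2)) and the cell-center sampling are our assumptions.

```python
import numpy as np

def concentric_square_to_disk(u, v):
    """Shirley-Chiu concentric map: (u, v) in [0,1]^2 -> point on the unit disk."""
    a, b = 2.0 * u - 1.0, 2.0 * v - 1.0            # remap to [-1, 1]^2
    if a == 0.0 and b == 0.0:
        return 0.0, 0.0
    if a > -b:
        if a > b:   r, phi = a, (np.pi / 4.0) * (b / a)
        else:       r, phi = b, (np.pi / 4.0) * (2.0 - a / b)
    else:
        if a < b:   r, phi = -a, (np.pi / 4.0) * (4.0 + b / a)
        else:       r, phi = -b, (np.pi / 4.0) * (6.0 - a / b)
    return r * np.cos(phi), r * np.sin(phi)

def grid_to_light_direction(i, j, n=37):
    """Map grid cell (i, j) of an n-by-n array to a unit vector on the upper hemisphere."""
    x, y = concentric_square_to_disk((i + 0.5) / n, (j + 0.5) / n)
    z = np.sqrt(max(0.0, 1.0 - x * x - y * y))     # assumed lift from disk to hemisphere
    return np.array([x, y, z])

# Example: enumerate light directions for the dome. Note the paper reports 1225 photos,
# so the exact indexing of the 37 x 37 array may differ from this illustration.
lights = np.array([grid_to_light_direction(i, j) for i in range(37) for j in range(37)])
print(lights.shape)
```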
Figure 4 shows the two types of BRDFs. The center of each image is black because the light source is blocked by the camera at that position. In regions covered by pigment, the anisotropic reflection is weak and the highlight is concentrated at a point. In regions of bare paper, a strong anisotropic reflection can be observed and the highlight is distributed along a line. This means that we need to construct two shading models and blend them together to render the appearance of the Ukiyo-e.
3 Shading Model
From the measured results, two types of reflection phenomena are observed: an isotropic reflection coming from the pigment and an anisotropic reflection coming from the fibers in the Japanese paper. Modeling these two phenomena and fitting the models to the measured data is introduced in this section.
3.1 Two Type Models
From the photomicrographs of the Ukiyo-e, the distribution of the pigment and the fibers can be observed clearly. As shown in Figure 5(a), the shape of a fiber is approximated as a cylinder. When light refracts from air into a fiber and back to air, its inclination is maintained; as a result, light that enters the fiber along a line leaves it on a cone surface whose axis is the fiber direction. If part of the paper surface is covered by pigment particles, the light is reflected back to the air by the pigment directly, so the light leaving the surface stays along a line. Let αrp denote the angle between the surface normal Np and the viewpoint vector V. The effect of the pigment Ip can then be expressed as

Ip = Idp + ksp · g(σ, αhp) / cos^2(αrp)    (1)
Here, Idp is the diffuse reflection and ksp is the specular reflectance; g(σ, αhp) is a normalized Gaussian with zero mean and standard deviation σ. This model can be used to represent the effect of the pigment on the surface of the Ukiyo-e. Similar to the shading model of the pigment, we can construct the reflection model of the fiber. As shown in Figure 5(b), the blue plane is the normal plane
Fig. 5. Two type shading models of the Ukiyo-e: (a) pigment, (b) fiber geometry with light L, view V, fiber direction F, angles αif and αrf, and the normal plane, (c) normal Nf, highlight line H, and fiber direction F
Γ perpendicular to the fiber direction F. The angle between the light vector L and the normal plane Γ is αif, and the angle between the viewpoint vector V and the normal plane Γ is αrf. If the viewpoint is near the surface of the cone, the reflected light is strong; if the viewpoint is far from the cone surface, the reflected light is weak. We can therefore construct the reflection model of the fiber by extending a traditional reflection model such as the Torrance-Sparrow model. The main difference between the fiber reflection model and the traditional model is that the cone replaces the single mirror-reflection vector. The effect of the fiber If can be represented as

If = Idf + ksf · g(σ, αhf) / cos^2(αrf)    (2)
Here, Idf is the diffuse reflection of the fiber, ksf is the specular reflectance of the fiber, and g(σ, αhf) is the same normalized Gaussian as above; αhf is the half-angle between the normal plane Γ and the viewpoint vector V. Blending the two effects of pigment and fiber, we obtain the final color of the Ukiyo-e as follows:

I = Idp · β + Idf · (1 − β)    (3)
This expression means that the final appearance of the Ukiyo-e is a linear interpolation of the pigment effect and the fiber effect. The next step is to fit this shading model to the measured data and to decide the parameters of the model at each pixel of the image.
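The following is a minimal sketch, not the authors' implementation, of how the two reflection terms of Eqs. (1)-(2) and the blend of Eq. (3) could be evaluated for one pixel. The half-angle definitions (αhp as the angle between the half-vector and the normal, αhf as the angle between the half-vector and the normal plane Γ) and the use of the blend weight β on the full pigment and fiber terms are our assumptions about details the text leaves implicit.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def gauss(sigma, alpha):
    """Normalized zero-mean Gaussian lobe g(sigma, alpha) of Eqs. (1)-(2)."""
    return np.exp(-0.5 * (alpha / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def ukiyoe_pixel(L, V, N, F, Idp, Idf, ksp, ksf, sigma, beta):
    """Blend of the pigment term Ip and the fiber term If for one pixel (our reading of Eq. (3))."""
    L, V, N, F = map(normalize, (np.asarray(L, float), np.asarray(V, float),
                                 np.asarray(N, float), np.asarray(F, float)))
    H = normalize(L + V)                                # half-vector
    a_rp = np.arccos(np.clip(np.dot(N, V), -1, 1))      # angle(normal, view)
    a_hp = np.arccos(np.clip(np.dot(N, H), -1, 1))      # assumed: angle(normal, half-vector)
    Ip = Idp + ksp * gauss(sigma, a_hp) / max(np.cos(a_rp) ** 2, 1e-4)

    # Fiber term: angles are measured from the normal plane perpendicular to F.
    a_rf = np.arcsin(np.clip(np.dot(F, V), -1, 1))      # angle(view, normal plane)
    a_hf = np.arcsin(np.clip(np.dot(F, H), -1, 1))      # assumed: angle(half-vector, normal plane)
    If = Idf + ksf * gauss(sigma, a_hf) / max(np.cos(a_rf) ** 2, 1e-4)

    return beta * Ip + (1.0 - beta) * If

# Toy call with made-up parameters.
print(ukiyoe_pixel(L=[0.3, 0.1, 1], V=[0, 0, 1], N=[0, 0, 1], F=[1, 0, 0],
                   Idp=0.4, Idf=0.3, ksp=0.5, ksf=0.6, sigma=0.15, beta=0.5))
```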
3.2 Computing the Geometry Parameter
The Ukiyo-e is nearly planar. The geometry parameters here are the normal of the micro geometric surface and the direction of the fiber. Although the micro geometric surface could be obtained by integrating the normals, we do not need to reconstruct it: using only the normal and the fiber direction, we can render the appearance of the Ukiyo-e well. What the two shading models have in common is that the normal N lies halfway between the strongest reflection direction R and the light vector L; the difference is that for the fiber the "normal" is a normal plane perpendicular to the fiber direction. For this reason, we can obtain the normal by computing the strongest reflection direction. Since the OGM captures data at sufficient density, it is easy to find the strongest reflection direction R, and the normal can then be computed as N = (L + R)/2. The left image in Figure 6 is the image of the surface normal; the RGB values represent the XYZ components of the normal N. The bump pattern can be seen in this image (the contrast is enlarged for clearer printing). The direction of the fiber F is the axis of the cone on which the highlight can be seen; as a result, the highlight of the fiber reflection is a line on the hemisphere. Figure 5(c) shows the relationship between the normal Nf, the highlight line H, and the fiber direction F: F is perpendicular to the plane that is parallel to N and H. The highlight line H can be obtained by fitting a line to the highlight using Principal Component Analysis, and the fiber direction is then given by F = N × H.
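A minimal sketch, under our own assumptions about the data layout, of the geometry estimation just described: for one pixel, the brightest light direction is taken as the strongest reflection, the normal is formed from it and the fixed view direction (our reading of N = (L + R)/2 up to scale), and the fiber direction follows from a PCA fit to the highlight line.

```python
import numpy as np

def estimate_geometry(light_dirs, intensities, view=np.array([0.0, 0.0, 1.0]),
                      highlight_quantile=0.95):
    """Estimate the surface normal N and fiber direction F for one pixel.

    light_dirs  : (K, 3) unit light directions used by the OGM
    intensities : (K,)   measured brightness of this pixel under each light
    """
    L_star = light_dirs[np.argmax(intensities)]         # brightest light = strongest reflection
    N = L_star + view                                    # halfway vector (assumed fixed view direction)
    N = N / np.linalg.norm(N)

    # Highlight line of the fiber: light directions whose response is in the top few percent.
    mask = intensities >= np.quantile(intensities, highlight_quantile)
    pts = light_dirs[mask] - light_dirs[mask].mean(axis=0)
    _, _, Vt = np.linalg.svd(pts, full_matrices=False)   # PCA via SVD
    H = Vt[0]                                            # principal direction of the highlight line
    F = np.cross(N, H)
    return N, F / np.linalg.norm(F)

# Toy usage with random data, just to show the shapes involved.
rng = np.random.default_rng(0)
dirs = rng.normal(size=(1225, 3)); dirs[:, 2] = np.abs(dirs[:, 2])
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
N, F = estimate_geometry(dirs, rng.random(1225))
print(N, F)
```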
Fig. 6. Geometry parameter images. The left one is the normal image and the right one is the fiber direction image.
The right image in Figure 6 is the image of the fiber direction; the RGB values represent the XYZ components of the fiber direction F. We now know the normal N and the fiber direction F. Fitting the parameters of the model to the measured data is introduced next.
3.3 Fitting the Data
Fitting the model to the measured data is a nonlinear optimization problem: find the parameters that minimize the value of ρ in the following expression,

ρ = Σ_{u,v} (I − Muv)    (4)

Here, I is the theoretical value of the shading model introduced above, Muv is the BRDF data measured by the OGM, and u and v are the coordinates in the measured BRDF image. To obtain the correct parameters of the shading model, a good initial estimate is important: the initial diffuse value is the mean value of the color, the initial Gaussian parameter is computed from the measured data directly, and the initial β is 0.5. The parameters are then obtained by the steepest descent method. With all parameter values available, they are stored as textures, and using these textures the appearance of the Ukiyo-e can be rendered.
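A minimal sketch, not the authors' code, of the per-pixel fit: a plain steepest-descent loop with numerical gradients over (Idp, Idf, ksp, ksf, σ, β), minimizing the squared residual between the model and the measured samples. The squared form, the fixed step size, and the initial σ are our assumptions (the paper only states that steepest descent is used); `ukiyoe_pixel` refers to the sketch given after Eq. (3).

```python
import numpy as np

def fit_pixel(light_dirs, measured, V, N, F, iters=200, lr=1e-2, eps=1e-4):
    """Steepest-descent fit of the shading parameters for one pixel.

    light_dirs : (K, 3) light directions, measured : (K,) observed values.
    Returns theta = [Idp, Idf, ksp, ksf, sigma, beta].
    """
    def residual(theta):
        Idp, Idf, ksp, ksf, sigma, beta = theta
        pred = np.array([ukiyoe_pixel(L, V, N, F, Idp, Idf, ksp, ksf, sigma, beta)
                         for L in light_dirs])
        return np.sum((pred - measured) ** 2)            # squared residual (our choice)

    # Initial estimates as in the paper: diffuse terms from the mean color, beta = 0.5.
    theta = np.array([measured.mean(), measured.mean(), 0.5, 0.5, 0.2, 0.5])
    for _ in range(iters):
        grad = np.zeros_like(theta)
        for k in range(len(theta)):                       # numerical gradient
            d = np.zeros_like(theta); d[k] = eps
            grad[k] = (residual(theta + d) - residual(theta - d)) / (2 * eps)
        theta -= lr * grad                                # steepest-descent step
        theta = np.clip(theta, 1e-4, None)                # keep parameters positive
        theta[5] = np.clip(theta[5], 0.0, 1.0)            # beta stays a blend weight
    return theta
```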
4 Results
The experiments are carried out on a GPU (Graphics Processing Unit) and can render the Ukiyo-e in real time; the graphics card is an NVIDIA GeForce 6800 GS. The experiments use an Ukiyo-e that was made hundreds of years ago. Because we captured 1225 photos (each of 3888 x 2592 pixels) to construct the high-density BRDFs, more than 10 hours were needed to capture the data and more than 4 hours to compute the reflection properties
Fig. 7. Experiment result 1
Fig. 8. Experiment result 2. The left image shows the result without the fiber reflection effect and the right image shows the result with the fiber reflection effect.
(most of the time is spent reading the photos into the computer). Based on the computed results, the Ukiyo-e can be rendered in real time. The image shown in Figure 7 is the rendering result using the shading model proposed in this paper. When the viewpoint changes, the color of the surface changes as well, and the bump pattern of the snow becomes visible or invisible according to the position of the light source. This behavior is similar to the phenomenon that occurs on the real Ukiyo-e. We also compare the cases with and without the fiber effect, using another Ukiyo-e in which the flower pattern is made by Karazuri and the background of the woman is made by Kirazuri. As shown in Figure 8, the left image is the rendering result using only the surface normal; it looks more like plastic than paper. The right image is the rendering result using the surface normal and the fiber direction together; some natural noise appears in the center of the image and the image becomes brighter. The edges of the Karazuri become softer due to the effect of the fibers in the paper, and the result looks more like paper than plastic. All parameters of the model are stored as textures, whose size is about 1/100 of the original BRDF data. Using GPU rendering techniques, the rendering can
be carried out in real time. Because our method is based on real measured data, we can obtain rendering results with high realism.
5 Conclusion
In this paper, a technique for measuring the reflection properties of an ancient Japanese drawing, the Ukiyo-e, is proposed. To our knowledge, this is the first time the reflection properties of Ukiyo-e materials have been measured and their appearance rendered while considering the fiber effect in the Japanese paper. Our method fits the real data well because the isotropic and anisotropic reflections are blended together. The technique can also be used for rendering other similar objects such as cloth. In the future, new techniques for modeling the detail of the fibers in the Japanese paper from images need to be developed. We also plan to develop a VR system that allows a person to hold the Ukiyo-e virtually and to feel its touch at the same time.
Acknowledgments. This work was supported partly by the Grants-in-Aid for Scientific Research "Scientific Research (A) 17200013" and "Encouragement of Young Scientists (B) 19700104" of the Japan Society for the Promotion of Science. This work was also supported partly by the "Kyoto Art Entertainment Innovation Research" project of the Centre of Excellence Program for the 21st Century of the Japan Society for the Promotion of Science.
References
1. Okamoto, T.: http://www.tatuharu.com/ (2007)
2. Okada, M., Mizuno, S., Toriwaki, J.: Virtual sculpting and virtual woodblock printing by model-driven scheme. The Journal of the Society for Art and Science 1(2), 74–84 (2002)
3. Dana, K.J., Ginneken, B., Nayar, S.K., Koenderink, J.J.: Reflectance and texture of real-world surfaces. ACM Transactions on Graphics 18, 1–34 (1999)
4. Gardner, A., Tchou, C., Hawkins, T., Debevec, P.: Linear light source reflectometry. ACM Transactions on Graphics 22(3), 749–758 (2003)
5. Marschner, S.R., Westin, S.H., Arbree, A., Moon, J.T.: Measuring and modeling the appearance of finished wood. In: Proceedings of SIGGRAPH 2005, pp. 727–734 (2005)
6. Woodham, R.J.: Photometric method for determining surface orientation from multiple images. Optical Engineering 19(1), 139–144 (1980)
7. Rushmeier, H., Taubin, G., Gueziec, A.: Applying shape from lighting variation to bump map capture. In: Proceedings of Eurographics Workshop on Rendering, pp. 35–44 (1997)
8. Paterson, J.A., Fitzgibbon, A.W.: Flexible bump map capture from video. In: Proceedings of the Eurographics 2002 Conference, Short Papers (2002)
9. Rushmeier, H., Gomes, J., Giordano, F., El-Shishiny, H., Magerlein, K., Bernardini, F.: Design and use of an in-museum system for artifact capture. In: IEEE/CVPR Workshop on Applications of Computer Vision in Archaeology (2003)
10. Hertzmann, A., Seitz, S.M.: Shape and materials by example: A photometric stereo approach, pp. 533–540. IEEE Computer Society Press, Los Alamitos (2003)
11. Rushmeier, H., Bernardini, F., Mittleman, J., Taubin, G.: Acquiring input for rendering at appropriate levels of detail: Digitizing a Pietà. In: Proceedings of Eurographics Workshop on Rendering, pp. 81–92 (1998)
12. Lensch, H.P.A., Kautz, J., Goesele, M., Heidrich, W., Seidel, H.P.: Image-based reconstruction of spatial appearance and geometric detail. ACM Transactions on Graphics 22, 234–257 (2003)
13. Georghiades, A.S.: Recovering 3-D shape and reflectance from a small number of photographs. In: Proceedings of Eurographics Workshop on Rendering, pp. 230–240 (2003)
14. Paterson, J.A., Claus, D., Fitzgibbon, A.W.: BRDF and geometry capture from extended inhomogeneous samples using flash photography. Computer Graphics Forum 24(3), 383–391 (2005)
15. Ashikhmin, M., Shirley, P.S.: An anisotropic Phong BRDF model. Journal of Graphics Tools 5(2), 25–32 (2000)
16. Ward, G.J.: Measuring and modeling anisotropic reflection. In: Proceedings of SIGGRAPH 1992, pp. 265–272 (1992)
17. Marschner, S.R., Jensen, H.W., Cammarano, M., Worley, S., Hanrahan, P.: Light scattering from human hair fibers. ACM Transactions on Graphics 22(3), 780–791 (2003)
18. Kawai, N.: Reproducing reflection properties of natural textures onto real object surfaces. In: Proceedings of the 4th International Workshop on Texture, pp. 101–106 (2006)
19. Shirley, P., Chiu, K.: A low distortion map between disk and square. Journal of Graphics Tools 2, 45–52 (1997)
Texture-Independent Feature-Point Matching (TIFM) from Motion Coherence Ping Li1 , Dirk Farin1 , Rene Klein Gunnewiek2, and Peter H.N. de With3 1
Eindhoven University of Technology {p.li,d.s.farin}@tue.nl 2 Philips Research Eindhoven
[email protected] 3 LogicaCMG Netherlands B.V.
[email protected]
Abstract. This paper proposes a novel and efficient feature-point matching algorithm for finding point correspondences between two uncalibrated images. The striking feature of the proposed algorithm is that the algorithm is based on the motion coherence/smoothness constraint only, which states that neighboring features in an image tend to move coherently. In the algorithm, the correspondences of feature points in a neighborhood are collectively determined in a way such that the smoothness of the local motion field is maximized. The smoothness constraint does not rely on any image feature, and is self-contained in the motion field. It is robust to the camera motion, scene structure, illumination, etc. This makes the proposed algorithm texture-independent and robust. Experimental results show that the proposed method outperforms existing methods for feature-point tracking in image sequences.
1 Introduction Intensity similarity of image texture is used by most existing algorithms for feature matching, which typically requires that the contrasts of the two images are constant. However, a constant contrast is difficult to maintain in practice. Even if we assume that the camera hardware is identical, for slightly different points of view, the amount of light entering the two cameras can be different, causing dynamically adjusted internal parameters such as aperture, exposure and gain to be different [1]. It is favorable to establish the feature correspondences using the geometric similarity alone, because it appears that the geometric similarity is more fundamental and stable than intensity similarity, as intensity is more liable to change [2]. This paper proposes a feature-point matching algorithm that uses only the smoothness constraint1 [3], which states that neighboring features in an image usually move in similar magnitudes and directions. The smoothness constraint does not rely on any image feature, and is self-contained in the motion field. It is robust to the camera motion, scene structure, illumination, etc. This makes the proposed algorithm texture-independent and robust. 1
We consider the smoothness constraint a geometric constraint, because it is the object rigidity that gives the motion smoothness in an image. For example, a group of points on the surface of a rigid object typically move in similar speeds. This leads to smooth image motion.
1.1 Related Work
Photometric region descriptors have recently been widely used for feature-point matching. In this approach, local image regions are described using image measurements such as the histogram of the pixel intensity, the distribution of intensity gradients [4], image derivatives [5,6], etc. Below, we summarize some well-known descriptors that fall into this category; a review of state-of-the-art region descriptors can be found in [7]. Lowe [4] proposed the Scale-Invariant Feature Transform (SIFT) algorithm for feature-point matching and object recognition, which combines a scale-invariant region detector and a gradient-distribution-based descriptor. The descriptor is a 128-dimensional vector capturing the distribution of gradient orientations in 16 location grids (sub-sampled into 8 orientations and weighted by gradient magnitudes). Features are matched if two descriptors show a small difference. Recently, Bay et al. proposed a new rotation- and scale-invariant interest point detector and descriptor, called SURF (Speeded Up Robust Features) [8]. It is based on sums of 2D Haar wavelet responses and makes efficient use of integral images. The algorithm was shown to have comparable or better performance, while obtaining a much faster execution than previously proposed schemes. Another category of feature-point matching/tracking algorithms does not use descriptors. In [2], a feature-point matching algorithm is proposed using a combination of intensity similarity and geometric similarity: feature correspondences are first detected based on window-based intensity correlation, and outliers are then rejected by a few subsequent heuristic tests involving geometry, rigidity, and disparity. The Kanade-Lucas-Tomasi (KLT) feature tracker [9] combines a feature detector, which detects feature points located in image areas containing sufficient texture variation, with a feature tracker, which determines the displacement vector by minimizing the window-based intensity residue between two image patches around the two feature points. In [10], a global feature-point matching algorithm is proposed, which converts the huge combinatorial search into an efficient implementation using concave programming on its convex hull. In [11], an algorithm based on a proximity constraint is proposed for associating features in two images, which is thus independent from the texture.
1.2 Our Approach
For feature-point tracking in a video/image sequence, the variation of the camera parameters (rotation, zooming, translation) is relatively small. Methods based on local intensity correlation, like the block-matching technique, are often used because of their computational efficiency, but they are usually not robust (to noise, scaling, light change, etc.) because only local image information is used. Descriptor-based algorithms are more robust, but the high computational complexity of the high-dimensional descriptors makes them less efficient. Our approach concentrates on both the computational efficiency and robustness of the feature-point matching algorithm, as well as the fundamental nature of the geometric similarity. Therefore, this paper proposes an efficient and robust point-matching algorithm that uses only the smoothness constraint, targeting feature-point tracking
along successive frames of uncalibrated image/video sequences. Texture information is not required for the feature-point matching. The proposed algorithm is thus referred to as Texture-Independent Feature Matching (TIFM). In TIFM, the correspondences of feature points within a neighborhood are collectively determined in a way such that the smoothness of the local motion field is maximized. TIFM is robust because it is based on only the smoothness constraint, which is robust to the scene structure, camera motion, light change, etc. TIFM is efficient, as the smoothness of the motion field can be efficiently computed using a simple coherence metric (discussed in Section 3). Experimental results on both synthetic and real images show that TIFM outperforms existing algorithms for feature-point tracking in image/video sequences. Assuming the feature points are already detected using the Harris corner detector [12], our focus in this paper is to establish the two-frame correspondences, with which feature points can be easily tracked by simply linking the two-frame correspondences across more images.
2 Notations
Let I = {I1, I2, · · · , IM} and J = {J1, J2, · · · , JN} be two sets of feature points in two related images, containing M and N feature points, respectively. For any point Ii, we want to find its corresponding feature point Jj from its candidate set CIi, which is defined as all the points within a co-located rectangle in the second image, as shown in Fig. 1(b). The dimension of the rectangle and the density of the feature points determine the number of points in the set.
Fig. 1. The set of feature points in neighborhood NIi in the first image and the set of candidate matching points CIi in the second image for feature point Ii
As seen in Fig. 1, the neighborhood NIi of point Ii is defined as a circular area around that point. The displacement between Ii and Jj is represented by its Correspondence Vector (CV) v Ii . The candidate set CIi for Ii leads to a corresponding set of candidate correspondence vectors VIi . Determining the correspondence for Ii is equivalent to finding the corresponding point from CIi or finding the correct CV from VIi .
3 Matching Algorithm
3.1 Coherence Metric
Suggested by the motion coherence theory [3], we assume that the CVs within a small neighborhood have similar directions and magnitudes, which is referred to as the local-translational-motion (LTM) assumption/constraint in this paper. CVs that satisfy this constraint are called coherent.
Fig. 2. (a) two coherent CVs v i and v j within a neighborhood (v i is the reference CV); (b) n feature points in a neighborhood; each point has a varying number of candidate CVs; only those repeated points indicated by “T” have the true/correct CVs
Given two coherent CVs vi and vj, we require that both the difference dij between their magnitudes and the angle deviation θij between their directions should be small, as shown in Fig. 2(a). Combining these two requirements, we obtain the following coherence metric:

dij < ||vi|| × sin(ϕ) = R,    (1)

where ϕ is the maximum allowed angle deviation between two CVs within a neighborhood, and R is a threshold based on the magnitude of the reference CV and ϕ, as illustrated in Fig. 2(a). The allowed degree of deviation ϕ specifies how similar two CVs should be in order to satisfy the coherence criterion. The difference dij is computed in simplified form as

dij = |vi − vj| = |xvi − xvj| + |yvi − yvj|.    (2)
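A minimal sketch (our own, not the authors' code) of the coherence test of Eqs. (1)-(2): the L1 difference between two correspondence vectors is compared against a threshold derived from the reference vector's magnitude and the allowed angular deviation ϕ.

```python
import numpy as np

def coherent(v_ref, v_other, phi_deg=15.0):
    """Eqs. (1)-(2): is v_other coherent with the reference CV v_ref?"""
    v_ref, v_other = np.asarray(v_ref, float), np.asarray(v_other, float)
    d = np.abs(v_ref - v_other).sum()                        # Eq. (2): L1 difference
    R = np.linalg.norm(v_ref) * np.sin(np.deg2rad(phi_deg))  # Eq. (1): threshold
    return d < R

print(coherent([10, 2], [9, 3]))   # small deviation -> True
print(coherent([10, 2], [-4, 8]))  # clearly incoherent -> False
```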
3.2 Smoothness Computation
Given a reference CV vj ∈ VIi, the smoothness S of the motion field with respect to vj within the neighborhood NIi is measured as the ratio between the number of coherent CVs found in NIi and the number of feature points in NIi. This ratio is denoted by S(NIi, vj) and can be computed by

S(NIi, vj) = ( Σ_{Ik ∈ NIi} fIk(vj) ) / n,    (3)

where n is the number of feature points in NIi, and fIk(vj) is a binary indicator variable, indicating whether the CV of feature point Ik is coherent with the reference vj, and is defined by
fIk(vj) = 1 if dik < R, and fIk(vj) = 0 otherwise.    (4)
The LTM assumption suggests that CVs within a neighborhood should have similar directions and magnitudes. This means that S(NIi, vj) should be as high as possible to have a smooth motion field. We compute S(NIi, vj) for every vj ∈ VIi. The maximum is considered as the smoothness for the neighborhood, which is computed by

Sm(Ni) = max_{vj ∈ VIi} S(NIi, vj).    (5)
With the above equation, the problem of determining the correspondence for feature point Ik ∈ NIi is converted into selecting a CV vIk ∈ VIk such that the maximum smoothness Sm(Ni) is found.
3.3 Steps to Compute Correspondences for Feature Points Within a Neighborhood
S1: Given a reference CV vj ∈ VIi (j = 1, · · · , m), find the most similar CV from VIk for every Ik ∈ NIi (k = 1, · · · , n), so that the difference dik of Eq. (2) is minimal.
S2: Set the indicator variable fIk(vj) according to Eq. (4); compute the smoothness S(NIi, vj) of the motion field using Eq. (3).
S3: Compute the maximum smoothness Sm(Ni) using Eq. (5); true correspondences are found if Sm(Ni) is higher than a given threshold.
A code sketch of steps S1–S3 is given below. The LTM constraint requires that correct matches for all points within a neighborhood give coherent CVs. Once we find the true CV for one point, the CVs for the other points can be found as well. As a result, the point-matching process is highly constrained: the combinations between the n points and the (n × m) candidates are reduced from m^n to approximately m in TIFM by using the LTM constraint. Fig. 2(b) illustrates the possible matching combinations.
3.4 Rationale of the Algorithm
We now explain why the maximum number of coherent CVs gives the true correspondences with a high probability. According to the LTM assumption, the (n × α) true CVs are coherent, and they compose a smooth motion field with a smoothness of α. Due to the random texture, feature points appear randomly along any other, non-true CV. Thus, the probability of finding another set of coherent CVs with a smoothness higher than α is low. Once the highest smoothness is detected, the true CVs are found. We do not assign any correspondence if the highest smoothness is below a threshold; this occurs, for example, in low-repetition-ratio areas such as trees and grass.
α is the repetition ratio of feature points within a neighborhood, which is defined as the ratio between the #points that appear in both images and the #points that appear in the 1st image.
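Below is a minimal sketch, under our own data-structure assumptions, of steps S1–S3 for one neighborhood: each candidate reference CV of the center point is tried; every neighbor picks its most similar candidate CV (S1); coherence is counted to obtain the smoothness of Eq. (3) (S2); and the reference with the maximum smoothness of Eq. (5) decides the correspondences if it exceeds a threshold (S3).

```python
import numpy as np

def l1(a, b):
    return np.abs(np.asarray(a, float) - np.asarray(b, float)).sum()

def match_neighborhood(ref_cands, neighbor_cands, phi_deg=15.0, smooth_thresh=0.5):
    """Steps S1-S3 for one neighborhood.

    ref_cands      : list of candidate CVs (2-vectors) of the center feature point
    neighbor_cands : list (one entry per neighbor) of lists of candidate CVs
    Returns (best smoothness, chosen CV index per neighbor) or (smoothness, None).
    """
    sin_phi = np.sin(np.deg2rad(phi_deg))
    best_S, best_choice = -1.0, None
    for v_ref in ref_cands:                                   # try each reference CV
        R = np.linalg.norm(v_ref) * sin_phi                   # Eq. (1)
        choice, coherent_count = [], 0
        for cands in neighbor_cands:
            dists = [l1(v_ref, v) for v in cands]             # S1: most similar CV
            k = int(np.argmin(dists))
            choice.append(k)
            coherent_count += dists[k] < R                    # S2: indicator of Eq. (4)
        S = coherent_count / max(len(neighbor_cands), 1)      # Eq. (3)
        if S > best_S:                                        # S3: keep the maximum, Eq. (5)
            best_S, best_choice = S, choice
    if best_S < smooth_thresh:                                # reject ambiguous neighborhoods
        return best_S, None
    return best_S, best_choice

# Toy example: three neighbors whose true motion is roughly (10, 2).
ref = [np.array([10.0, 2.0]), np.array([-3.0, 7.0])]
nbrs = [[np.array([9.0, 3.0]), np.array([0.0, 0.0])],
        [np.array([10.0, 1.0])],
        [np.array([11.0, 2.0]), np.array([-5.0, 4.0])]]
print(match_neighborhood(ref, nbrs))
```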
4 Experimental Results
4.1 Evaluation Criteria
Only a portion of the detected points can be matched, and only a fraction of the detected matches are correct. For two-frame point matching, the results are presented with the parameters #CorrectMatches, recall and precision, as introduced below. A correct match is determined based on its conformity to either the homography matrix H or the fundamental matrix F that is computed using the RANSAC [13] algorithm on the obtained data set. A match is considered correct if its associated residual error dr computed by Eq. (6) is smaller than one pixel; this error is computed by

dr = [d(x', Fx) + d(x, F^T x')]/2, given F, or dr = [d(x', Hx) + d(x, H^{-1} x')]/2, given H,    (6)

where (x, x') is a pair of matched points and d(·, ·) is the geometric distance between the point and the epipolar line (given F), or the Euclidean distance between the two points (given H). The measurements recall and precision are computed as follows:

recall = #CorrectMatches / #DetectedMatches,    (7)
precision = #CorrectMatches / #AveragePointsInTwoImages.    (8)
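A minimal sketch (not the authors' code) of how the residual of Eq. (6) for the fundamental-matrix case and the measures of Eqs. (7)-(8) could be computed with NumPy; the point-to-epipolar-line distance is the standard |x'ᵀFx| divided by the norm of the first two line coefficients.

```python
import numpy as np

def point_line_dist(p_h, line):
    """Distance from homogeneous 2D point p_h to homogeneous 2D line."""
    return abs(np.dot(p_h, line)) / np.hypot(line[0], line[1])

def residual_F(x, x2, F):
    """Eq. (6), fundamental-matrix case, for inhomogeneous points x, x2 (2-vectors)."""
    xh, x2h = np.append(x, 1.0), np.append(x2, 1.0)
    d1 = point_line_dist(x2h, F @ xh)        # x2 against the epipolar line of x
    d2 = point_line_dist(xh, F.T @ x2h)      # x against the epipolar line of x2
    return 0.5 * (d1 + d2)

def recall_precision(n_correct, n_detected, n_avg_points):
    return n_correct / n_detected, n_correct / n_avg_points   # Eqs. (7)-(8)

# Toy numbers; 989/1043 echoes the synthetic experiment below, the third value is a placeholder.
print(recall_precision(989, 1043, 1000))
```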
recall is the percentage of correct matches among the total detected matches, which measures the quality of the detected correspondences; precision is the percentage of the total feature points that are correctly matched, which measures the efficiency of an algorithm (more feature points imply more computation). For experiments on image/video sequences, besides the two-frame matching results, structure from motion is conducted, and the tracking performance of TIFM is evaluated in terms of the success or failure of the 3D reconstruction.
4.2 Experiments on Synthetic Images
First, we generate an 800×600 image with 1,000 randomly-distributed points. Second, the 1,000 feature points are rotated and translated with controlled rotation or translation parameters to generate the second image. Third, an equal number of randomly-distributed outliers are injected into both images to generate two corrupted images. TIFM is then applied to the corrupted images to detect feature correspondences. The homography is computed for performance evaluation. Figs. 3(a) and 3(b) show the results obtained by TIFM under different settings of Degree of Rotation (DoR) and Percentage of Injected Outliers (PIO). The Degree of Rotation is the angle by which the image rotates around its image center, which measures
The point is first randomly generated, and then suppressed in a way similar to the Harris corner suppression such that each 3x3 block contains at most one point.
Fig. 3. Results on synthetic images: (a) #CorrectMatches and (b) recall, each plotted against the %Injected Outliers and the Degree of Rotation; (c) CVs superimposed on the uncorrupted first image
how strongly the image motion deviates from translation; the %Injected Outliers is the percentage of outliers injected into both images, which can be considered the noise level of the image, or the inverse of the repetition ratio of the feature points. As we see from Figs. 3(a) and 3(b), TIFM is able to reliably detect the correspondences even when the image contains a large portion of injected outliers and evident rotation. For example, when PIO = 50% and DoR = 4°, we found 989 correct matches; furthermore, 94.8% of the 1,043 detected matches are inliers to the homography. The obtained CVs are shown in Fig. 3(c), where an evident rotation is observed. The simulation results demonstrate the robustness of the LTM assumption to noise and to motion that deviates from pure translation. For real images, this means that TIFM is able to work for image areas with a low repetition ratio and containing non-translational motion, which is certainly very desirable.
4.3 Experiments on Real Images
Tables 1(a) and 1(b) list the test image/video sequences and the individual image pairs used in the experiments. For performance comparison, three other matching algorithms are implemented, i.e., SIFT, KLT and the Block Matching (BM) algorithm. To track feature points by TIFM, SIFT and BM, correspondences between two successive images are first computed; second, feature-point tracks are obtained by linking the two-frame correspondences. For KLT, about 3,000 good features are initially selected in the first image and then tracked or discarded over the remaining frames. The search range for TIFM and BM is set to 50 for image sequences (except that it is set to 70 for BM on castle), and to 15 for video sequences. KLT relies on the image gradient for computing the CVs; it works only for sequences with small motion, and does not work with house, kspoort, castle and church.
Two-Frame Matching Results. Fig. 5 shows the results by TIFM, SIFT and BM on individual image pairs. From the figure, we see that TIFM obtains the largest
Source codes of SIFT and KLT are available from http://www.cs.ubc.ca/˜lowe/ keypoints/ and http://www.ces.clemson.edu/∼stb/klt/, respectively. For point matching by BM, the Sum of the Absolute Difference (SAD) of the luminance intensity between two 7 × 7 windows around the feature points are computed. A correspondence is established if the SAD is minimal and smaller than a given threshold.
Table 1. (a) Test sequences, and (b) test image pairs
(a) Sequences
Seq (#frm) | Description
medusa (194) | Fig. 4(f); from www.cs.unc.edu/~marc/; small motion
castle (26) | Fig. 4(a); from www.cs.unc.edu/~marc/; mod. motion
lab (150) | Fig. 4(e); by hand-held DV; small motion
kspoort (22) | Fig. 4(d); by hand-held DC; mod. motion
house (16) | Fig. 4(b); by hand-held DC; mod. motion
church (25) | Fig. 4(c); by hand-held DC; mod. motion
leuven (6) | Fig. 4(i); from www.robots.ox.ac.uk/~vgg/research/affine/; big light change; small motion
(b) Image pairs
ImagePair | Description
L01 | Fig. 4(i); two brightest images from leuven
L05 | the brightest and darkest images from leuven
IP1 | Fig. 4(a); extracted from castle; mod. motion
IP2 | Fig. 4(b); extracted from house; mod. motion
IP3 | Fig. 4(g); extracted from medusa; small motion
IP4 | Fig. 4(h); by hand-held DC

Fig. 4. Test sequences/images superimposed with detected CVs or tracked feature points by TIFM; all CVs are before outlier rejection: (a) castle (IP1) with CVs, (b) house (IP2) with CVs, (c) church with CVs, (d) kspoort with 524 points tracked along 22 frames, (e) lab with 147 points tracked along 51 frames, (f) medusa with 67 points tracked along 141 frames, (g) 1st image of IP3 with CVs, (h) 1st image of IP4 with CVs, (i) 1st image of L01 with CVs
#CorrectMatches for 5 out of 6 image pairs. The recall of TIFM is comparable to that of SIFT, while its precision is much higher than that of SIFT and BM. This implies that TIFM is accurate and more efficient: without the need to detect many feature points, a large number of correspondences can be obtained with a high accuracy.
The results on L01, L05 and IP2, which contain evident light change, show that TIFM is robust to light change. The results on IP4 show the potential of TIFM for images containing reflecting or non-Lambertian objects; the reason is its texture independence. TIFM works as long as the LTM assumption is satisfied. Not surprisingly, BM works only for IP1 and IP3, which have small light change. We have applied TIFM to every successive image pair of all the test sequences, and results similar to Fig. 5 are obtained.
Fig. 5. Results by TIFM, SIFT, and BM on individual image pairs: (a) #CorrectMatches, (b) recall, (c) precision
Tracking Results. The performance of TIFM is further evaluated in terms of the number of tracked feature points and the failure or success of the factorization-based 3D reconstruction [14]. A feature tracking is considered of a high quality if it can render an accurate reconstruction. Removing incorrect matches from two-frame correspondences is not used in all tested methods. Table 2. #T rackedP oints and the Success (S) or Failure (F) of the 3D reconstruction for (a) medusa, and (b) house; tracking starts from frame #0 and ends at frame #f rm #f rm 5 TIFM 1914F SIFT 676F KLT 753F BM 509F
10 40 80 1363S 550S 211S 413F 87S 23S 526F 220F 97F 187F 8F 0F (a) medusa
140 66S 5F 29F 0F
180 21F 0F 10F 0F
#f rm 5 10 15 TIFM 1152S 846S 699S SIFT 591F 286F 179S BM 430F 169F 79F (b) house
Table 2(a) shows the tracking results on medusa, where we see that TIFM performs better than the other algorithms in terms of both the number and the quality of the tracked feature points. Among the six results, for tracking from 6 frames to 181 frames, only the first and the last tracking fail for the 3D reconstruction. Table 2(b) lists the results on house, which show a similar behavior as medusa: TIFM is able to track more points along more frames and with a higher accuracy than SIFT, KLT and BM. KLT and BM fail for all 3D reconstructions. Using local image information alone for feature-point matching makes it difficult to obtain satisfactory tracking; outlier rejection is necessary for such algorithms.
Fig. 6. Four 3D shapes reconstructed by TIFM: (a) top view of the reconstructed 3D shape of house (16 cameras), (b) top-left view of the reconstructed 3D shape of medusa (161 cameras), (c) top view of the reconstructed 3D shape of kspoort (22 cameras in a 'W' shape), (d) top-front view of the reconstructed 3D shape of castle (4 cameras)
As examples, Fig. 6 depicts four 3D scene structures reconstructed by TIFM. By examining the ‘W’-shape camera track in Fig. 6(c), the zigzag shape of the house in Fig. 6(a), and by comparing Fig. 6(d) with Fig. 4(a), we can see 3D reconstructions are successful. We believe the reconstruction in Fig. 6(b) is also correct, though it is more difficult to see from the figure. The success of the 3D reconstructions on a 161-frame long track in Fig. 6(b), and on a 4-frame short track in Fig. 6(d) fully demonstrates the reliability of TIFM. Experiments on other sequences show similar results.
5 Conclusion
In this paper, we have proposed a novel texture-independent feature-point matching algorithm that uses only a self-contained smoothness constraint. The feature-point correspondences within a neighborhood are collectively determined such that the smoothness of the motion field is maximized. The experimental results on both synthetic and real images show that the proposed method outperforms SIFT, KLT and BM, in terms of both the number and quality of tracked feature points in image/video sequences. It provides an attractive solution for finding feature-point tracks in image/video sequences for tasks such as structure from motion. The accuracy and high trackability of the feature points by TIFM come from the neighborhood-based smooth feature-point matching. First, collective determination of correspondences for a group of points constrains the feature-matching process and decreases the chance of detecting individual incorrect correspondences. Second, the use of local neighborhoods and the allowance for motion deviating from pure translation ensure a robust and accurate feature-point matching; TIFM obtains a good balance between using global and local image information. Third, the Harris corner detector guarantees the localization accuracy.
Correspondences by TIFM can be wrong for a complete neighborhood; however, such erroneous two-frame correspondences are less likely to be propagated to succeeding images.
References
1. Ogale, A.S., Aloimonos, Y.: Robust Contrast Invariant Stereo Correspondence. In: Proc. IEEE Int. Conf. Robotics and Automation, pp. 819–824 (2005)
2. Hu, X., Ahuja, N.: Matching point features with ordered geometric, rigidity, and disparity constraints. IEEE Trans. Pattern Analysis and Machine Intelligence 16(10), 1041–1049 (1994)
3. Yuille, A., Grzywacz, N.: A Mathematical Analysis of the Motion Coherence Theory. In: Proc. 2nd Int. Conf. Computer Vision (1988)
4. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. of Computer Vision 60(2), 91–110 (2004)
5. Baumberg, A.: Reliable feature matching across widely separated views. In: Proc. IEEE Comp. Vision and Pattern Recognition, vol. 1, pp. 774–781 (2000)
6. Schaffalitzky, F., Zisserman, A.: Multi-view Matching for Unordered Image Sets. In: Proc. 7th European Conf. Computer Vision, pp. 414–431 (2002)
7. Mikolajczyk, K., Schmid, C.: A Performance Evaluation of Local Descriptors. IEEE Trans. Pattern Analysis and Machine Intelligence 27(10), 1615–1629 (2005)
8. Bay, H., Tuytelaars, T., Gool, L.V.: SURF: Speeded Up Robust Features. In: Proc. 9th European Conf. Computer Vision (2006)
9. Tomasi, C., Kanade, T.: Detecting and Tracking of Point Features. Carnegie Mellon University Technical Report CMU-CS-91-132 (1991)
10. Maciel, J., Costeira, J.P.: A Global Solution to Sparse Correspondence Problems. IEEE Trans. Pattern Analysis and Machine Intelligence 25(2), 187–199 (2003)
11. Scott, G., Longuet-Higgins, H.: An Algorithm for Associating the Features of Two Images. In: Proc. of the Royal Society of London, vol. B-244, pp. 21–26 (1991)
12. Harris, C., Stephens, M.: A combined corner and edge detector. In: Proc. 4th Alvey Vision Conf., pp. 147–151 (1988)
13. Fischler, M.A., Bolles, R.C.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24(6), 381–393 (1981)
14. Han, M., Kanade, T.: A perspective factorization method for Euclidean reconstruction with uncalibrated cameras. J. of Visualization and Computer Animation 13(4), 211–223 (2002)
Where’s the Weet-Bix? Yuhang Zhang, Lei Wang, Richard Hartley, and Hongdong Li Research School of Information Sciences and Engineering Australian National University
Abstract. This paper proposes a new retrieval problem and conducts an initial study of it. The problem aims at finding the location of an item in a supermarket by means of visual retrieval. It is modelled as object-based retrieval and approached using local invariant features. Two existing retrieval methods are investigated and their similarity measures are modified to better fit this new problem. More importantly, through this study the new retrieval problem proves itself to be a challenging task. An immediate application is to help customers find what they want without physically wandering around the shelves, but a wide range of potential applications can be expected.
1 Introduction
Given the query image of an object, object retrieval requires finding the same object in a collection of images [1,2]. In this paper, we propose a new object retrieval problem in which the object to retrieve occupies only a small part of a database image and might have multiple copies there. An immediate application is the following: suppose a customer needs a certain brand of biscuit and has a sample or its image. Through our object retrieval, the customer is informed of the shelf where this biscuit lies. This problem is more challenging than those posed in [1,2] due to the following issues: 1. The query image is a close-up view of the object to find, whereas each database image contains dozens of objects that are small in size and different in brand and manufacturer, as shown in Figure 1. As a result, the object to retrieve is presented at very different scales in the query and database images. Worse, in a database image all the objects other than the queried one become background clutter. 2. In each database image, there are often multiple copies of an object. In that case the same or similar local invariant features may be extracted from different locations, which makes some widely used geometrical constraints unsuitable. For example, some feature points found in a database image may not be close to each other, even if their matches are close in the query image;
The last two authors are also affiliated with NICTA, a research institute funded by the Australian Government's Backing Australia's Ability initiative, in part through the Australian Research Council.
3. To attract the attention of customers, the appearance of objects in a supermarket is often full of striking signs and patterns, leading to a very rich set of local features. This challenges the discriminating ability of the descriptors for different local features. Moreover, a large number of objects have round or non-rigid shapes, and glistening is very common. These situations further increase the difficulty of this object retrieval task. 4. Products are often produced in series; for instance, the "BIO" brand includes both a detergent for brighter color and a softener. In this case, the appearances of the objects differ only slightly; however, they are completely different for customers. This fact is reflected in the ground truth defined for this retrieval problem, as shown in Figure 2. In this paper, the image database is created and its characteristics are discussed. From each image, local invariant regions are detected and described by the SIFT feature [3]. Upon all of these features, a visual vocabulary is constructed. Each image is then represented by a histogram vector showing the occurrence frequency of each visual word, and the similarity of two images is measured by comparing the corresponding histogram vectors. To address the issues mentioned above, a loose geometrical constraint is applied, which partitions each database image into multiple overlapped sub-images to mitigate the effect of background clutter.
Fig. 1. Top row: query images; Bottom row: database images
Fig. 2. Definition of ground truth (must be identical rather than similar only): a query next to candidates labeled "Not ground truth" and "Ground truth"
Taking advantage of the existence of multiple copies, a new weighting scheme is designed that assigns high weight to the visual words intensively present in a small number of database images. An inner-product-based similarity measure is developed, which encourages the "co-presence" of the same visual words in query and database images but does not penalize other cases, such as the "co-presence" of different visual words. An experimental study is conducted to compare our approach with the existing methods from [1,2] on this new retrieval problem, demonstrating its superior performance.
2 Related Work
This paper employs local invariant features to represent the visual content of an image. The Harris-Affine interest point detector proposed by Krystian Mikolajczyk and Cordelia Schmid in [4] provides interest points for reliable matching even between images with significant perspective deformations; in the later discussion in [5], it is also demonstrated to be more efficient than most other region detectors when dealing with occlusion and clutter. The image feature generation approach proposed by David G. Lowe in [3] gives SIFT features which are largely invariant to changes in scale, illumination, and local affine distortions. These local invariant features are of intermediate complexity, which means they are not only distinctive enough to determine matches in a big feature database but also robust enough to bear clutter and occlusion. There has been considerable progress in developing real-world object recognition systems with local invariant features in recent work [1,2,6,7]. Among this work, the visual vocabulary [1] has been proved to be a distinctive indexing device in local-invariant-feature-based object retrieval. To build a visual vocabulary, local feature descriptors are quantized into clusters according to their similarity; when a new local feature arrives, all similar features can be found by assigning the new feature's descriptor to its nearest cluster. In [1], retrieval of all occurrences of an object outlined by the user in a movie is carried out. After extracting local invariant features from each frame, two visual vocabularies of different types of local invariant features are built. As complements to the visual vocabulary, the term frequency-inverse document frequency (tf-idf) weighting standard, an L2-norm similarity measure, and a spatial consistency check are employed to improve the retrieval performance, showing excellent results. In [2], the visual vocabulary is extended into a hierarchical vocabulary tree and a scheme is proposed which can quickly retrieve the same CD covers from a large database of music CDs. In [2], all the query and database images contain only a single object, namely a CD cover; that work tries to find out whether a query and a database image have the same content. In our work, however, although a query image contains a single object, a database image always contains dozens of objects of different classes. What we try to find out is whether a database image contains the single object shown in a query image. In [1] the problem seems to be the same as ours, but their query images are cropped from database images, which actually amounts to finding out where an image patch comes from. Moreover, a new issue we need to deal
with is the potential multiple appearances of the query object in a single database image.
3 Creation of the Database
Our database, named WebMarket, contains 3,153 images which were taken in a supermarket named Coles in January 2007. This supermarket has eighteen 30-meter-long shelves, each of which has approximately six levels; ten shelves are captured in this database. Starting from one end of the first shelf, the photographing was carried out roughly following the order of the shelves. Two adjacent images have some overlap (less than one third of the image size) to ensure that no part of the shelf is missed. Three digital cameras were used and all the images are saved in JPEG format. The resolution of the images is either 2,272 × 1,704 or 2,592 × 1,944 pixels. Each image generally covers an area of about 1.5 meters in height and 2 meters in width on the shelves, imaging all the objects within three or four shelf levels. The size of each single object in an image is usually small. During photographing, no restriction on the viewpoint and distance was imposed, although most images are frontal views, and no special illumination was used. To build the query set, about 200 different objects were randomly selected, put on the ground, and captured one by one; for each query object, three images are taken from different view angles or distances. For the same object, there is a large difference in its scale between the query and database images. This image database, WebMarket, will soon be published on the web.
4 Our Approach
4.1 Image Representation
Local invariant features have shown excellent performance in representing images under a certain degree of change in scale, view angle, and illumination, and under partial occlusion. In this work, the Harris-Affine interest region detector [4] is applied to each image and the SIFT descriptor [3] is used to describe the detected interest regions. The binaries are downloaded from the Visual Geometry Group, Univ. of Oxford, and the threshold parameter is set to 20,000 to control the number of detected regions. For each database image, about 2,000 SIFT features are extracted, leading to 6.8 million SIFT features in total. When dealing with query images, to handle the large scale difference between the query and database images, the object is manually segmented out (an automatic segmentation algorithm can also be used) and downsized to a 300 × 300 image to partially alleviate the scale problem. The local features are then extracted from it.
4.2 Construction of the Visual Vocabulary
To ensure the quality of the visual vocabulary, it is built upon all of the 6.8 million SIFT features rather than a randomly selected subset. Hierarchical k-means clustering [8] is applied: step by step, all of the features in the database are
clustered into 1,000, 20,000, and 200,000 clusters, leading to a visual vocabulary containing 200,000 visual words. The hierarchical clustering is stopped at 200,000 because it gives reasonably good retrieval performance on our problem. A larger-sized vocabulary may be used; however, it runs the risk that features which are already sufficiently similar (up to some noise level) are separated into different clusters, which adversely affects the retrieval.
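A minimal sketch of hierarchical k-means vocabulary construction as described above, using scikit-learn's KMeans; the branching factors (1,000, then 20, then 10, matching the 1,000 / 20,000 / 200,000 levels), the leaf criterion, and the recursion structure are our illustrative assumptions, not the authors' code.

```python
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_kmeans(descriptors, branching=(1000, 20, 10), min_points=50):
    """Recursively cluster SIFT descriptors; returns a flat array of leaf cluster centers."""
    def split(data, level):
        if level == len(branching) or len(data) < max(branching[level], min_points):
            return [data.mean(axis=0)]                 # leaf visual word
        k = branching[level]
        km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(data)
        centers = []
        for c in range(k):
            centers.extend(split(data[km.labels_ == c], level + 1))
        return centers
    return np.vstack(split(np.asarray(descriptors, dtype=np.float32), 0))

def assign_words(descriptors, vocabulary):
    """Quantize descriptors to their nearest visual word (brute force, for clarity only)."""
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

# Toy run on random 128-D "descriptors" with tiny branching factors.
rng = np.random.default_rng(0)
toy = rng.normal(size=(2000, 128)).astype(np.float32)
vocab = hierarchical_kmeans(toy, branching=(10, 5))
print(vocab.shape, assign_words(toy[:5], vocab))
```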
4.3 A Loose Geometry Constraint
In our problem, the local features are much richer in a database image than in a query image. As a result, a high matching score can easily be obtained between a query image and most database images, leading to very poor retrieval performance. The spatial consistency check in [1,3] cannot be used here, because there are often multiple copies of an identical object in a database image and identical local features can be extracted from different places. This paper imposes a loose geometry constraint: it evenly partitions a database image into 25 sub-images, each of which is one ninth of the original one. A sub-image is large enough to contain the object to retrieve but has much less background clutter. (An ideal way to implement this constraint would be to ensure that each sub-image contains exactly one object; however, this is no less difficult than the object retrieval problem itself.) Two neighboring sub-images have half of their area overlapped to reduce the risk of separating one object into two sub-images. In total, about 78,000 sub-images are obtained. For a query image, a match of a sub-image means a match of the corresponding database image. After the partition, the computational load of retrieval remains the same.
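A minimal sketch (our own) of the 25-sub-image partition: a 5×5 grid of windows, each one third of the image width and height, stepping by one sixth so that neighbors overlap by half, which yields windows of one ninth of the original area.

```python
def subimage_windows(width, height, grid=5):
    """Return (x0, y0, x1, y1) windows: grid x grid sub-images, each 1/3 of each dimension,
    with half-overlap between neighbors (so 5 x 5 = 25 windows of 1/9 the area)."""
    win_w, win_h = width // 3, height // 3
    step_x, step_y = width // 6, height // 6
    boxes = []
    for i in range(grid):
        for j in range(grid):
            x0, y0 = i * step_x, j * step_y
            boxes.append((x0, y0, x0 + win_w, y0 + win_h))
    return boxes

boxes = subimage_windows(2272, 1704)
print(len(boxes), boxes[0], boxes[-1])   # 25 windows; the last one ends near the image border
```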
4.4 Similarity Measure
A similarity measure defines the visual similarity between a query and the sub-images; those with higher measure values are selected as the retrieval result. Let xi = [xi1, xi2, · · · , xin] denote the i-th sub-image, where xij (1 ≤ i ≤ m, 1 ≤ j ≤ n) is the number of occurrences of the j-th visual word in this sub-image, m the total number of sub-images, and n the total number of visual words. Similarly, the query image is represented as q = [q1, q2, · · · , qn]. In our proposed similarity measure, q and xi are mapped respectively to q̃ = [q̃1, q̃2, · · · , q̃n] and x̃i = [x̃i1, x̃i2, · · · , x̃in], where

q̃j = 0 if qj = 0, and q̃j = 1 if qj > 0;   x̃ij = 0 if xij = 0, and x̃ij = dj if xij > 0,    (1)

where

dj = ( Σ_{k=1}^{m} xkj / Σ_{k=1}^{m} sign(xkj) ) · log( m / Σ_{k=1}^{m} sign(xkj) ),    (2)

sign(xij) = 0 if xij = 0, and sign(xij) = 1 if xij > 0.    (3)
The similarity measure is defined as the inner product between q̃ and x̃i,

S1(q, xi) = ⟨q̃, x̃i⟩.    (4)
Equation (2) shows the weighting scheme designed for our problem. It is a product of two terms: Σ_{k=1}^{m} xkj / Σ_{k=1}^{m} sign(xkj) and log(m / Σ_{k=1}^{m} sign(xkj)). Here, Σ_{k=1}^{m} xkj represents the total number of occurrences of the j-th visual word throughout the database, and Σ_{k=1}^{m} sign(xkj) represents the number of images containing at least one j-th visual word. If a visual word appears in certain image(s) with high repetition, then it is probably a stable feature that can be extracted from different copies of the same object. On the other hand, if a feature appears in different images but only once in each, it is more likely to be noise. These two cases are weighted via the first term. Moreover, for a very popular feature that can be extracted from many images, no matter whether stable or not, its importance is scaled down via the second term, since it is not discriminating. In addition, the similarity measure computes the inner product of two unnormalized vectors. By doing so, the "co-presence" of the same visual words is rewarded between two compared images, while the case that a visual word appears in only one of them is not penalized. We also tried a similarity measure which considers the exact number xij of occurrences of a visual word in a sub-image. It is defined as

q̃j = 0 if qj = 0, and q̃j = 1 if qj > 0;   x̃ij = 0 if xij = 0, and x̃ij = xij dj if xij > 0,    (5)

S2(q, xi) = ⟨q̃, x̃i⟩.    (6)
Theoretically, the similarity measure of Equation (6) rewards database images that have more copies of the visual words contained in the query image, which indicates a higher possibility of a true match rather than noise. However, in this manner database images that have only a single copy of the queried object may lose against images that do not contain the queried object but possess a certain amount of similar noise. On the other hand, the similarity measure based on Equation (2) focuses more on how many types of visual words are matched, which is expected to perform better when few copies of the queried object are present in a database image. Another two similarity measures have been proposed in [1,2]. In [1], each query or database image i is mapped to an n-dimensional vector Vi = [ti1, ti2, · · · , tin], where

tij = ( xij / Σ_{k=1}^{n} xik ) · log( m / Σ_{k=1}^{m} xkj ).    (7)

Then the similarity between each pair of images is measured by

S3(Vq, Vxi) = ⟨ Vq / ||Vq||, Vxi / ||Vxi|| ⟩.    (8)
In [2], each query or database image i is also mapped to an n-dimensional vector Ṽi = [t̃i1, t̃i2, · · · , t̃in], where

t̃ij = xij · log( m / Σ_{k=1}^{m} sign(xkj) ).    (9)

Then the similarity between each pair of images is measured by

S4(Ṽq, Ṽxi) = || Ṽq/||Ṽq|| − Ṽxi/||Ṽxi|| ||.    (10)
All four similarity measures will be compared in the experiments.
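As a concrete illustration, the sketch below evaluates the four measures for dense count vectors with NumPy; the function names are ours, the small guards against empty words are added only to keep the toy code safe, and the paper itself exploits sparsity rather than dense vectors.

```python
# S1-S4 of Eqs. (1)-(10) for dense count vectors; `counts` is the m x n matrix
# of visual-word occurrences over the database, q and x are single count vectors.
import numpy as np

def word_weights(counts):
    m = counts.shape[0]
    df = np.maximum((counts > 0).sum(axis=0), 1.0)   # images containing word j
    rep = counts.sum(axis=0) / df                    # average repetition when present
    return rep * np.log(m / df)                      # d_j, Eq. (2)

def s1(q, x, d):                                     # Eq. (4)
    return float((q > 0).astype(float) @ ((x > 0) * d))

def s2(q, x, d):                                     # Eq. (6)
    return float((q > 0).astype(float) @ (x * d))

def s3(q, x, counts):                                # Eqs. (7)-(8)
    m = counts.shape[0]
    idf = np.log(m / np.maximum(counts.sum(axis=0), 1.0))
    vq = q / max(q.sum(), 1.0) * idf
    vx = x / max(x.sum(), 1.0) * idf
    return float(vq @ vx / (np.linalg.norm(vq) * np.linalg.norm(vx) + 1e-12))

def s4(q, x, counts):                                # Eqs. (9)-(10)
    m = counts.shape[0]
    df = np.maximum((counts > 0).sum(axis=0), 1.0)
    vq, vx = q * np.log(m / df), x * np.log(m / df)
    vq = vq / (np.linalg.norm(vq) + 1e-12)
    vx = vx / (np.linalg.norm(vx) + 1e-12)
    return float(np.linalg.norm(vq - vx))
```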
5 Experimental Results
This experiment investigates the performance of the proposed similarity measures S_1 and S_2, as well as the previous measures proposed in [1,2], denoted by S_3 and S_4 respectively, on the new retrieval task. All of the 3,153 database images are used. Thirty retrievals are conducted and the average retrieval performance is reported. The 30 query images are randomly sampled from the query set of 600 images. The ground truth for each of them is created by manually checking the database images one by one. On average, a query image has merely 6.63 database images that are true matches. The retrieval rank of a database image is determined as the highest rank among the sub-images cropped from it. Each image is represented as a 200,000-dimensional vector. Although of very high dimension, this vector is quite sparse: the total number of non-zero entries is no more than the total number of local features extracted from the image, about 2,000 in our case. Taking advantage of this sparsity allows us to evaluate the similarity of two images efficiently. The retrieval performance is measured by the Precision and Recall widely used in information retrieval. They are defined as

$$
\mathrm{Precision} = \frac{\#\text{positive retrieved}}{\#\text{retrieved}} , \qquad
\mathrm{Recall} = \frac{\#\text{positive retrieved}}{\#\text{total positive}} ,
\tag{11}
$$
where #positive retrieved is the number of correctly retrieved images, #retrieved is the number of retrieved images, and #total positive is the number of ground-truth images in the database for a query. The average normalized rank of the true matching images used in [1] is also reported; it is defined as

$$
\mathrm{avg\_rank} = \frac{1}{m \cdot m_{pos}}\left(\sum_{k=1}^{m_{pos}} R_k - \frac{m_{pos}(m_{pos}+1)}{2}\right)
\tag{12}
$$

where m is the total number of images in the database, m_pos is the number of ground-truth images in the database for a query, and R_k is the rank of the k-th ground-truth image for that query. The smaller this average rank value is, the better the retrieval performance.
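A short evaluation helper matching Eqs. (11)-(12) might look as follows; `ranked_ids` is the full database ranking for one query and `relevant` its ground-truth set, both illustrative names of ours.

```python
# Precision/Recall at k and the average normalized rank of Eq. (12).
def precision_recall_at(ranked_ids, relevant, k):
    hits = sum(1 for r in ranked_ids[:k] if r in relevant)   # positive retrieved
    return hits / k, hits / len(relevant)                    # Eq. (11)

def avg_normalized_rank(ranked_ids, relevant, m):
    pos = {img: i + 1 for i, img in enumerate(ranked_ids)}   # 1-based ranks R_k
    ranks = [pos[r] for r in relevant]
    m_pos = len(relevant)
    return (sum(ranks) - m_pos * (m_pos + 1) / 2) / (m * m_pos)   # Eq. (12)
```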
Fig. 3. Comparison of the four similarity measures: (a) Precision curve, (b) Recall curve, (c) percentage of queries with at least one positive return

Table 1. Comparison of the average rank value (S_1, S_2: proposed measures)

Measure      S_1     S_2     S_3     S_4
avg_rank    15.71   15.83   17.13   15.99   (×10^-2)
The Precision and Recall curves are plotted in Figure 3. The horizontal axis is the number of retrieved images, and the top 50 retrieved images are evaluated. As shown in sub-figure (a), the proposed similarity measures, S_1 and S_2, achieve better retrieval performance than S_3 and S_4. The main difference between the two proposed measures and those in [1,2] lies in our new weighting scheme and the unnormalized inner product, and this result verifies their effectiveness. In addition, S_2 outperforms the other three at the first retrieved image, whereas S_1 becomes the best as the number of retrieved images increases. This shows that taking the number of occurrences of a visual word into account helps to identify a perfect matching database image that contains many copies of the queried object.
Fig. 4. Some retrieval examples; the number of ground-truth images for each query is listed in brackets
However, this also makes the similarity measure more sensitive to noise: when a database image does not contain the query object but contains multiple copies of another kind of object that shares one or two types of visual word with the query object, it may score higher under S_2 than a true matching image that contains only one copy of the query object. In other words, S_2 focuses on how many visual-word occurrences the query and database images have in common, while S_1 focuses on how many types of visual word they have in common. Recall is shown in sub-figure (b), from which a similar conclusion can be drawn. In sub-figure (c), the horizontal axis is the number of retrieved images, and the vertical axis shows the percentage of queries for which at least one correct match is found. As it shows, with S_1 and S_2, over 70% of the query images find at least one correct match in the top 50 retrieved images, whereas with S_3 and S_4 this number is only between 50% and 60%. This demonstrates again that the proposed S_1 and S_2 achieve better performance than S_3 and S_4 in our retrieval task. The average ranks are listed in Table 1. It can be seen that the proposed S_1 produces the lowest value, which means that, on average, the ground-truth images are assigned higher ranks under this measure. This coincides with the Precision and Recall results. Some examples of the top 3 retrieved images are shown in Figure 4.
6 Conclusion
A new retrieval problem has been proposed in this paper. The experimental results demonstrate the better performance of the proposed similarity measures. Meanwhile, it can be seen that less than half of the first retrieved images are correct answers and that only about 40% of the true matches can be found after retrieving twenty images. Such a result indicates that this retrieval task is quite challenging, and more work, such as the verification of matches and the consideration of feature dependency, needs to be explored to further boost the retrieval accuracy.

Acknowledgments. Many thanks to the Coles store located in Woden Plaza, ACT, where we were allowed to collect the images for our research. Their understanding and support made this work possible and are highly appreciated.
References
1. Sivic, J., Zisserman, A.: Video Google: A text retrieval approach to object matching in videos. In: IEEE International Conference on Computer Vision, pp. 1470–1477. IEEE Computer Society Press, Los Alamitos (2003)
2. Nister, D., Stewenius, H.: Scalable recognition with a vocabulary tree. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2161–2168. IEEE Computer Society Press, Los Alamitos (2006)
3. Lowe, D.G.: Object recognition from local scale-invariant features. In: IEEE International Conference on Computer Vision, pp. 1150–1157. IEEE Computer Society Press, Los Alamitos (1999)
4. Mikolajczyk, K., Schmid, C.: Scale & affine invariant interest point detectors. International Journal of Computer Vision, 63–86 (2004)
5. Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T., Van Gool, L.: A comparison of affine region detectors. International Journal of Computer Vision 65(1/2), 43–72 (2005)
6. Lowe, D.G.: Local feature view clustering for 3D object recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 682–688. IEEE Computer Society Press, Los Alamitos (2001)
7. Schmid, C., Mohr, R.: Local grayvalue invariants for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 19, 530–535 (1997)
8. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. John Wiley and Sons, Chichester (2001)
How Marginal Likelihood Inference Unifies Entropy, Correlation and SNR-Based Stopping in Nonlinear Diffusion Scale-Spaces

Ramūnas Girdziušas and Jorma Laaksonen

Laboratory of Computer and Information Science, Helsinki University of Technology, P.O. Box 5400, FI-02015 TKK, Finland
[email protected],
[email protected]
Abstract. Iterative smoothing algorithms are frequently applied in image restoration tasks. The result depends crucially on the optimal stopping (scale selection) criterion. An attempt is made towards the unification of two frequently applied model selection ideas: (i) the earliest time when the 'entropy of the signal' reaches its steady state, suggested by J. Sporring and J. Weickert (1999), and (ii) the time of the minimal 'correlation' between the diffusion outcome and the noise estimate, investigated by P. Mrázek and M. Navara (2003). It is shown that both ideas are particular cases of marginal likelihood inference. Better entropy measures are discovered and their connection to the generalized signal-to-noise ratio is emphasized.
1 Introduction
Scale-space methods allow one to restore and enhance semantically important features of images, such as edges. One particular strategy is to employ edge-preserving diffusions [9] supplied with grid-based regularization and splitting techniques [11]. The scale then becomes the diffusion time. Practice indicates the existence of an optimal stopping time which yields the diffusion outcome closest to the desired signal assumed to exist in the observations. A further need to automate the choice of the optimal time has been emphasized in [2]: "Attentive viewing of a computer screen for quite long periods of time may be necessary, and, because changes from one iteration to the next are usually imperceptible, locating the optimal point at which to terminate the process becomes highly elusive." We shall make an attempt to unify two ideas which seem, at first glance, to be completely different: the entropy criterion suggested in [10], and the use of the correlation studied in [8] and [5]. A maximization of the signal-to-noise ratio (SNR) will partially be covered too. The entropy criterion arises from the stability analysis of the nonlinear diffusion scale-space in discrete space and time. It is well known in majorization theory that iteration with doubly stochastic matrices diminishes any
Schur-convex (isotone) function. When assuming that the signal has only non-negative values, one constructs the Shannon entropy $-\sum_k u(x_k)\ln u(x_k)$, which is proven to be isotone in [7]. A further investigation [10] of the entropy increase contains the following statement, presumably based on unreported experiments: "This correspondence has focused on the maximal entropy change by scale to estimate the size of image structures. The minimal change by scale, however, indicates especially stable scales with respect to evolution time. We expect these scales to be good candidates for stopping times in nonlinear diffusion scale-spaces." The idea is of a certain interest as it relates Liapunov stability to the second law of thermodynamics and the MaxEnt inference. However, it is unsatisfactory that the authors neglect an explicit probabilistic model of the observations. An image is considered as a probability density (histogram) of a single scalar-valued random variable in [10]. The assignments of an image intensity value to a given spatial location are not reflected in the stopping criteria. Mixing the concept of 'observation' with the 'probability density' raises unnecessary questions, e.g.: Is Liapunov stability supposed to replace model selection? Is there any best way to preprocess a given image so that, when viewed as a scalar-valued function of the spatial coordinates, it would become a probability density? The current status of the entropy-based stopping [10] is summarized in [8]: "However, as the entropy can be stable on whole intervals, it may be difficult to decide on a single stopping instant from that interval; we are unaware of their idea being brought into practice in the field of image restoration." Instead, the suggestion in [8] is to stop the diffusion at the time when the 'correlation' between the signal and the noise estimate is minimal. It is rather evident that most of the critique directed against the entropy-based stopping applies to the correlation as well. In particular, the remark on the 'entropic stability' in [8] pertains to rare cases in which the correlation might have a very shallow minimum as well, or no minimum at all, as indicated in [5]. The experimental evidence of [5] suggests that correlation-based stopping might be suboptimal in the SNR sense and overestimates the stopping time for textured images. In our view, neither the entropy- nor the correlation-based stopping should be excluded by the developments related to robust statistics. We suggest a unification which allows one to: (i) avoid unnecessary preprocessing of signals, (ii) arrive at a more general criterion, which merges both ideas into a single equation and further clarifies their probabilistic assumptions, and (iii) view optimal diffusion stopping as an example where Bayesian arguments simplify the likelihood inference, not vice versa, as is commonly practiced. Section 2 presents the 'inverse covariance trick' applied to regularize a certain Gaussian model with a singular concentration matrix. This model unifies the entropy, correlation and SNR-based stopping, which is discussed in Section 3. Section 4 describes a univariate numerical example which shows typical evolutions of the suggested criteria. Multivariate extensions do not go beyond numerical aspects when the diffusion propagators satisfy the conditions of Section 3.2. Section 5 summarizes the conclusions.
2 Construction of Joint Probability Density
Given an image stored as a matrix of size n^{1/2} × n^{1/2}, one may consider it as a vector u_0 = y ∈ ℝ^n. At present, a variety of iterative smoothing algorithms are known [11,2] which, given an initial image u_0, provide a set of images u_t, t = τ, ..., mτ, with scale-space properties in a certain sense, e.g. non-increase of global and local extrema, sign changes, and a variety of Liapunov sequences. This can be summarized as

$$
u_{m\tau} = P_\theta^{-1}(u_{0:(m-1)\tau})\, u_0 .
\tag{1}
$$

Here P_θ(u_{0:(m-1)τ}) ∈ ℝ^{n×n} is assumed to be nonsingular, and is often chosen so that P_θ^{-1}(u_{0:(m-1)τ}) is doubly stochastic or totally positive [6]. The dependence on parameters and evolution will be suppressed. The subscript θ indicates the presence of the parameters, which comprise a vector θ ∈ ℝ^p and are typically set by a practitioner. A single globally optimal parameter setting may not even exist, but practice indicates that the choice of the stopping time can be automated.

Assumption 1 (Gaussian hypothesis space). Let the model outputs u ∈ ℝ^n and the observations y ∈ ℝ^n be distributed according to the joint Gaussian probability density with zero mean and the covariance matrix

$$
\Sigma = \begin{pmatrix} \Sigma_{uu} & \Sigma_{uy} \\ \Sigma_{uy}^T & \Sigma_{yy} \end{pmatrix},
\qquad \Sigma_{ab} \equiv \mathrm{Cov}(A, B),
\tag{2}
$$

where Cov(A, B) ≡ ⟨(A − ⟨A⟩)(B − ⟨B⟩)^T⟩ and ⟨·⟩ denotes the expectation. As the joint covariance is not specified yet, this assumption does not tell anything more than that one prefers to work with positive definite matrices. In accordance with the maximum likelihood inference, the parameters θ are not included in the joint random variable; the covariances depend implicitly on them.

Assumption 2 (Model H_1). Let Σ_uu = Σ_uy = Σ_yu and Σ_yy = Σ_uu + Σ_nn, where n stands for the 'noise' variable N. An explicit inverse reads:
$$
\Sigma_{H_1}^{-1} =
\begin{pmatrix} \Sigma_{uu} & \Sigma_{uu} \\ \Sigma_{uu} & \Sigma_{uu} + \Sigma_{nn} \end{pmatrix}^{-1}
=
\begin{pmatrix} \Sigma_{nn}^{-1} + \Sigma_{uu}^{-1} & -\Sigma_{nn}^{-1} \\ -\Sigma_{nn}^{-1} & \Sigma_{nn}^{-1} \end{pmatrix}.
\tag{3}
$$
This particular covariance model is one of the simplest. Formally, the conditional concentration Σ_{u|y}^{-1} = Σ_nn^{-1} + Σ_uu^{-1} reads directly from the upper block of the partitioned inverse, and it ensures that the conditioning is variance-reducing. Eq. (1) can now be given the meaning of a conditional expectation:

$$
\mu_{u|y} \equiv \langle U \,|\, y, H_1 \rangle = \Sigma_{uu}(\Sigma_{uu} + \Sigma_{nn})^{-1} y ,
\qquad \Sigma_{uu}^{-1}\Sigma_{nn} = P - I .
\tag{4}
$$
Given a nonsingular Σ nn , the diffusion propagator P uniquely defines the product Σ uu if and only if P has no eigenvalues equal to unity. If we further assume that the noise N is uniformly (isotropically) white, i.e. Σ nn = θ0 I for some
θ_0 > 0, then the covariance satisfies Σ_uu^{-1} = θ_0^{-1}(P − I), and the model H_1 becomes completely specified. The discrete space and time propagator P attains the form
$$
P \equiv (I - L(u_0))\,(I - L(u_\tau)) \cdots (I - L(u_{(m-1)\tau})) ,
\tag{5}
$$
where L : ℝ^n → ℝ^{n×n} is the generalized Laplacian matrix [11,6]. In order to preserve the average value of the signal, one applies the von Neumann boundary conditions, which yield singular Laplacians L irrespective of the evolution u_{0:(m-1)τ}. Thus, nonlinear diffusion scale-spaces result in a propagator P which has an eigenvalue equal to unity. Therefore, the model H_1 has a singular concentration Σ_uu^{-1}, and the problem of infinite covariance matrices emerges thereupon. We suggest resolving this difficulty via the following trick. Instead of adding the uncorrelated noise variable N with the covariance matrix Σ_nn to the signal variable U with the covariance matrix Σ_uu, one can add an uncorrelated 'noise' with the covariance Σ_uu^{-1} to the 'signal' with the covariance Σ_nn^{-1}.

Assumption 3 (Model H_2). Let Σ_uu = Σ_uy = Σ_yu = Σ_nn^{-1} and Σ_yy = Σ_nn^{-1} + Σ_uu^{-1}, and the overall covariance matrix possesses the following inverse:
$$
\Sigma_{H_2}^{-1} =
\begin{pmatrix} \Sigma_{nn}^{-1} & \Sigma_{nn}^{-1} \\ \Sigma_{nn}^{-1} & \Sigma_{nn}^{-1} + \Sigma_{uu}^{-1} \end{pmatrix}^{-1}
=
\begin{pmatrix} \Sigma_{uu} + \Sigma_{nn} & -\Sigma_{uu} \\ -\Sigma_{uu} & \Sigma_{uu} \end{pmatrix}.
\tag{6}
$$
The reader may check that the model H2 retains the conditional expectation given by Eq. (4). Contrary to the model H1 , the concentration Σ −1 uu is now allowed to be singular. The choice Σ nn = θ0 I completely specifies the model H2 even when P has an eigenvalue equal to unity. The problem with infinities is removed at the expense that the additive noise is no longer white.
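To make Eqs. (1), (4) and (5) concrete, the following sketch builds the propagator of a 1-D semi-implicit nonlinear diffusion; the Perona–Malik-type diffusivity, the step size and the dense matrix assembly are illustrative choices of ours, not the exact scheme of Section 4.

```python
# One semi-implicit step is u_{k+1} = (I - L(u_k))^{-1} u_k; the accumulated
# inverse propagator gives mu_{u|y} = P^{-1} u_0 as in Eqs. (1) and (4).
import numpy as np

def generalized_laplacian(u, tau=0.2, lam=5.0):
    n = len(u)
    g = np.gradient(u)
    c = 1.0 / (1.0 + (g / lam) ** 2)          # an edge-stopping diffusivity (example)
    w = 0.5 * (c[:-1] + c[1:])                # conductivities on cell interfaces
    L = np.zeros((n, n))
    for i in range(n - 1):                    # tau * div(c grad(.)), Neumann ends
        L[i, i] -= tau * w[i]
        L[i, i + 1] += tau * w[i]
        L[i + 1, i + 1] -= tau * w[i]
        L[i + 1, i] += tau * w[i]
    return L                                  # -L is symmetric positive semidefinite

def diffuse(u0, m):
    u, P_inv = u0.astype(float).copy(), np.eye(len(u0))
    for _ in range(m):
        step = np.linalg.inv(np.eye(len(u0)) - generalized_laplacian(u))
        u, P_inv = step @ u, step @ P_inv     # u_{(k+1)tau} and the running P^{-1}
    return u, P_inv
```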
3 Applications of Models H_1 and H_2
One can further decompose marginal likelihoods of the parameters of the models H1 and H2 into ‘decorrelation’ of the noise estimate with the model output and entropy maximization. Section 3.1 will also introduce differential entropies which: (i) avoid the normalization problems present in [10], (ii) are consistent with our experience that signals are less random when the scale is coarsened, and (iii) when diffusions are linear and time-homogeneous, the entropies do not depend on the signal, but only on the variance of the additive noise. Two sections are further included to emphasize a special role that the entropies play in the marginal likelihood maximization. Section 3.2 establishes conditions which guarantee that the entropies are monotonous in time. A Bayesian viewpoint is briefly outlined in Section 3.3, where the reduction of the marginal likelihood maximization to ‘decorrelation’ can be seen as a way of imposing a certain a priori density on the parameters which are no longer viewed as deterministic quantities.
3.1 Marginal Likelihood, Correlation, Entropy and SNR
It is somewhat paradoxical that a rather dull formal expression of the marginal likelihood can be seen as a conglomerate of different model selection ideas.

Lemma 1 (Marginal likelihood p(y|H_1)). Assume a white covariance Σ_nn = θ_0 I for some θ_0 > 0. Then,

$$
-2\ln p(y|H_1) = \frac{1}{\theta_0}\Big(\|y-\mu_{u|y}\|^2 + (y-\mu_{u|y})^T\mu_{u|y}\Big) + \ln\big|2\pi(\Sigma_{uu}+\theta_0 I)\big| .
\tag{7}
$$

Proof. It follows from the definition of the marginal likelihood that

$$
2\ln p(y|\theta) = -y^T\Sigma_{yy}^{-1}y - \ln|2\pi\Sigma_{yy}| ,
\tag{8}
$$

where Σ_yy = Σ_uu + θ_0 I. Furthermore,

$$
y^T(\Sigma_{uu}+\theta_0 I)^{-1}y = y^T\Sigma_{uu}^{-1}\mu_{u|y}
\tag{9}
$$
$$
= \theta_0^{-1}\, y^T(y-\mu_{u|y})
\tag{10}
$$
$$
= \theta_0^{-1}\big(\|y-\mu_{u|y}\|^2 + (y-\mu_{u|y})^T\mu_{u|y}\big) . \quad \text{Q.E.D.}
\tag{11}
$$
The difference y − μ_{u|y} can be thought of as the noise estimate, and minimizing the second term on the right-hand side of Eq. (7) is 'orthogonalization', except that (y − μ_{u|y})^T μ_{u|y} can be negative. This quantity can be compared with the correlation [8], which stops the diffusion at the time when

$$
\frac{\mathrm{cov}(y-\mu_{u|y},\,\mu_{u|y})}{\sqrt{\mathrm{var}(y-\mu_{u|y})\,\mathrm{var}(\mu_{u|y})}}
\tag{12}
$$

is minimal. Here cov(u, v) ≡ tr(Cov(W)) with W being a joint vector which takes values w: w^T = (u^T, v^T), and var(u) ≡ tr(Cov(U)). The authors of [8] study only the dot-product estimator which, when neglecting the subtraction of means and the normalization, turns out to be (y − μ_{u|y})^T μ_{u|y}. Given a random variable X, distributed according to the Gaussian density p_{μ,Σ}(x) with mean μ and covariance Σ, the differential entropy is

$$
h(X) \equiv -\int_{\mathbb{R}^n} p(x)\ln p(x)\,dx = \frac{1}{2}\ln|2\pi e\,\Sigma_x| .
\tag{13}
$$

Thus, ln|2π(Σ_uu + θ_0 I)| = 2h(Y|H_1) − n. The meaning of this entropy can also be appreciated by noticing that Σ_uu + θ_0 I = θ_0 Σ_uu(θ_0^{-1}I + Σ_uu^{-1}), and, thus,

$$
2h(Y|H_1) = 2h(U|H_1) - \ln\left|\frac{\mathrm{Cov}(U|y, H_1)}{\theta_0}\right| .
\tag{14}
$$

Therefore, minimizing ln|2π(Σ_uu + θ_0 I)| reduces the uncertainty of the prior density p(u|H_1) and maximizes the generalized signal-to-noise ratio.
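As a small illustration of how Eq. (7) can be evaluated, the sketch below assembles its three terms for a time-homogeneous propagator P = (I − L)^m; it assumes a symmetric, negative definite L (so that Σ_uu = θ_0(P − I)^{-1} exists), and the function name and interface are our own.

```python
# Sketch of the three terms of Eq. (7) for P = (I - L)^m with -L positive definite.
import numpy as np

def h1_criterion(y, L, m, theta0):
    n = len(y)
    P = np.linalg.matrix_power(np.eye(n) - L, m)
    mu = np.linalg.solve(P, y)                      # mu_{u|y} = P^{-1} y, Eq. (4)
    resid = np.sum((y - mu) ** 2) / theta0          # residual term
    orth = (y - mu) @ mu / theta0                   # 'orthogonality' term
    # Sigma_uu + theta0*I = theta0 (P - I)^{-1} P, so its log-determinant is:
    _, logdet = np.linalg.slogdet(2 * np.pi * theta0
                                  * np.linalg.solve(P - np.eye(n), P))
    return resid + orth + logdet                    # = -2 ln p(y | H1), Eq. (7)
```

Scanning m over a range of stopping times and taking the minimizer then realizes the marginal-likelihood stopping rule discussed above.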
It should not be very hard to verify that the marginal likelihood p(y|H_2) decomposes into

$$
-2\ln p(y|H_2) = \theta_0\Big(\|\mu_{u|y}\|^2 + (y-\mu_{u|y})^T\mu_{u|y}\Big) + \ln\big|2\pi(\Sigma_{uu}^{-1}+\theta_0^{-1} I)\big| .
\tag{15}
$$

By noticing that Cov(Y|H_2) = (Cov(U|y, H_1))^{-1}, one discovers that

$$
h(Y|H_2) = n\ln(2\pi e) - h(U|y, H_1) .
\tag{16}
$$
In Section 3.2 we shall prove that the entropy h(Y|H_2) does not decrease in time. Eq. (16) would then imply that the conditional entropy h(U|y, H_1) is nonincreasing (diminishing). The conditional expectation, given by the nonlinear diffusion scale-space via Eqs. (4) and (5), tends towards the steady state, which is a constant signal equal to the average value of the observations y. Intuitively, the signal becomes less random in time, which is reflected in the diminishing of h(U|y, H_1). This can be contrasted with the view in [10], which prefers to apply the Shannon entropy functional directly to the signal. That entropy increases as the constant signal represents a density, and the uniform density is known to attain the highest entropy value. However, the application of entropic arguments in [10] is inconsistent with the fact that each diffusion outcome is conditioned on the knowledge at the previous time instant. Before discussing the monotonicity, it is worth emphasizing that evaluating the entropies h(Y|H_1) and h(Y|H_2) is more difficult than evaluating the criteria in [10]. However, a linear scaling w.r.t. the number of observations can be achieved, irrespective of the dimension of the domain in which the diffusion propagator P is defined. The problem can first be reduced to the inner-product representation via the identities [1]

$$
\ln|I-A| = -\sum_{k=1}^{\infty}\frac{\mathrm{tr}(A^k)}{k}
= -n\sum_{k=1}^{\infty}\frac{1}{k}\left\langle\frac{X^T A^k X}{X^T X}\right\rangle ,
\tag{17}
$$

where the first equality holds for any A ∈ ℝ^{n×n} whose spectral radius does not exceed unity. The variable X ∼ N(0, I) is a standard normal variable which takes values x ∈ ℝ^n. The reader may verify that application of Eq. (17) leads to

$$
\ln\big|2\pi(\Sigma_{uu}+\theta_0 I)\big| = n\ln(2\pi\theta_0) + n\sum_{k=1}^{\infty}\frac{1}{k}\left\langle\frac{X^T P^{-k} X}{X^T X}\right\rangle ,
$$
$$
\ln\big|2\pi(\Sigma_{uu}^{-1}+\theta_0^{-1} I)\big| = n\ln(2\pi\theta_0^{-1}) + n\sum_{k=1}^{\infty}\sum_{m=0}^{k}\frac{(-1)^m (k-1)!}{m!\,(k-m)!}\left\langle\frac{X^T P^{-m} X}{X^T X}\right\rangle .
$$
The matrix-vector product can further be split into univariate diffusions. Splitting is a common practice in approximating multivariate flows with the univariate ones and is nicely documented in [4].
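A hedged sketch of the stochastic estimator behind Eq. (17), in the spirit of Barry and Pace [1], is given below; the truncation order and number of probe vectors are arbitrary choices of ours.

```python
# Monte Carlo estimate of ln|I - A| via Eq. (17); valid when the spectral radius
# of A is below one. Applying it to A = P^{-1} and negating the result gives the
# -ln|I - P^{-1}| term that enters h(Y|H1).
import numpy as np

def logdet_I_minus_A(A, n_probes=50, order=30, seed=0):
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    total = 0.0
    for _ in range(n_probes):
        x = rng.standard_normal(n)
        v, acc = x.copy(), 0.0
        for k in range(1, order + 1):
            v = A @ v                          # v = A^k x, reusing the previous product
            acc += (x @ v) / (x @ x) / k       # <x^T A^k x / x^T x> / k
        total += -n * acc
    return total / n_probes
```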
3.2 Monotonicity of Differential Entropies
Sufficient conditions can be stated which indicate when the entropies h(Y|H_1) and h(Y|H_2) are nondecreasing. Another way to say the same thing is that the negative entropies are Liapunov functions (sequences). More colloquially, the second law of thermodynamics takes place in a virtual world of discrete nonlinear diffusions.

Lemma 2 (Monotonicity of entropies in a discrete time mτ). Let the propagator be time-homogeneous, i.e. P = (I − L)^m. The entropy is nondecreasing, i.e.

$$
h(Y|m+1, H_1) \ge h(Y|m, H_1) ,
\tag{18}
$$

provided that the matrix −L is positive definite. If the propagator is given by the more general Eq. (5), the following inequality is true:

$$
h(Y|m+1, H_2) \ge h(Y|m, H_2) ,
\tag{19}
$$
provided that each matrix −L(u_{mτ}) is positive semidefinite for every m ∈ ℕ.

Proof. The time behavior of the entropy h(Y|m, H_1) is determined by the term ln|2π(Σ_uu + θ_0 I)| which, up to irrelevant constants, equals −ln|I − P_t^{-1}|. Let the eigenvalues λ(−L) be denoted as λ_i for i = 1, ..., n. The Taylor series expansion leads to

$$
-\ln|I-P_t^{-1}| = -\sum_{i=1}^{n}\ln\big(1-(1+\lambda_i)^{-m}\big) = \sum_{i=1}^{n}(1+\lambda_i)^{-m} + \text{h.o.t.} ,
\tag{20}
$$
which follows from ln(1 − x) = −Σ_{k=1}^∞ x^k/k. Clearly, if the matrices −L are positive definite, then each λ_i > 0 and the entropy increases w.r.t. m. If we further assume that the largest term, i.e. (1 + λ_min)^{−(t+1)} with λ_min > 0, is dominating, the decay of the negative entropy will be exponential in time. It follows from Eqs. (4) and (15) that, up to irrelevant constants, the entropy h(Y|H_2) is determined by

$$
\ln|P| = \sum_{i=1}^{n}\ln(1+\lambda_i)^{m} = m\sum_{i=1}^{n}\ln(1+\lambda_i) .
\tag{21}
$$
Therefore, the entropy h(Y|H_2) grows linearly in time, and L is allowed to be singular. The nondecrease of h(Y|H_2) can be established for nonlinear diffusions:

$$
\ln|P| = \sum_{k=0}^{m}\ln|I-L(u_{k\tau})| = \sum_{k=0}^{m}\sum_{i=1}^{n}\ln\big(1+\lambda_i(k)\big) .
\tag{22}
$$

Here each eigenvalue λ_i(k) ≥ 0 comes from the set λ(−L(u_{kτ})) and is now time-dependent. The positivity of the eigenvalues guarantees that the term ln|P| is nondecreasing, which proves the inequality in Eq. (19). Q.E.D.
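A small numerical companion to Eq. (22) is sketched below: it accumulates ln|P| from the eigenvalues of −L(u_k) along a nonlinear diffusion, and with every λ_i(k) ≥ 0 the running sum, and hence h(Y|H_2), never decreases. The callable `laplacian_fn` is any routine returning a symmetric L(u) whose negation is positive semidefinite (e.g. the 1-D sketch of Section 2); it is our own interface, not the paper's.

```python
# Running ln|P| along the diffusion, Eq. (22); the history list is nondecreasing.
import numpy as np

def logdet_P_evolution(u0, m, laplacian_fn):
    u, log_det_P, history = u0.astype(float).copy(), 0.0, []
    for _ in range(m):
        L = laplacian_fn(u)
        lam = np.maximum(np.linalg.eigvalsh(-L), 0.0)   # lambda_i(k) >= 0
        log_det_P += np.sum(np.log1p(lam))              # adds ln|I - L(u_k)|
        history.append(log_det_P)
        u = np.linalg.solve(np.eye(len(u)) - L, u)      # one semi-implicit step
    return history
```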
The most significant term that determines the nondecrease of the entropy h(Y|H_1) in the homogeneous case depends on the smallest eigenvalue λ_min, whereas it is the maximal eigenvalue λ_max which affects the entropy h(Y|H_2). Very convenient bounds follow from the Schur theorem [7], which states that the eigenvalues of a Hermitian matrix majorize its diagonal elements. As a special case, the following inequalities are true:

$$
\lambda_{\min} \le \min_{i\in\{1,2,\dots,n\}}(-l_{ii}) , \qquad
\lambda_{\max} \ge \max_{i\in\{1,2,\dots,n\}}(-l_{ii}) ,
\tag{23}
$$
where l_ii are the diagonal elements of L, and they are typically negative. Positive definiteness of the matrices −L is discussed more thoroughly in [6]. The ideology of [10] now gets a proper justification: utilization of the differential entropy first establishes it as a model complexity measure, and then proves that it is indeed a Liapunov function. In the model of [10], the signal is assumed to be normalized in order to satisfy the constraints of a probability density, which can be written as u_{mτ} = ⟨δ(U − u) | y, H⟩. However, the observations must be preprocessed in order to validate this density, and the diffusions are restricted to positive evolutions. In this work, u_{mτ} ≡ μ_{u|y} ≡ ⟨U | y, H_{1(2)}⟩ and y does not have to be preprocessed.
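Before moving on, the bound in Eq. (23) is easy to check numerically; the snippet below does so for an arbitrary symmetric weighted graph Laplacian standing in for −L (the matrix, its size and the tolerances are illustrative).

```python
# Check of Eq. (23): eigenvalues of a Hermitian matrix majorize its diagonal,
# hence lambda_min <= min(-l_ii) and lambda_max >= max(-l_ii).
import numpy as np

rng = np.random.default_rng(1)
w = rng.uniform(0.1, 1.0, size=9)                        # interface conductivities
negL = np.diag(np.r_[w, 0] + np.r_[0, w]) \
       - np.diag(w, 1) - np.diag(w, -1)                  # a 10 x 10 example of -L
eig = np.linalg.eigvalsh(negL)
diag = np.diag(negL)                                     # the values -l_ii
assert eig.min() <= diag.min() + 1e-12
assert eig.max() >= diag.max() - 1e-12
```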
3.3 Correlation Prior
Eq. (7) suggests that, when maximizing the marginal likelihood, the first and the third terms on its right-hand side prefer small stopping times m. Therefore, if the marginal likelihood p(y|H_1) is to be unimodal w.r.t. the stopping time m, the unnormalized correlation (y − μ_{u|y})^T μ_{u|y} must either possess an extremum, or it should give preference to larger stopping times. In the case of the model H_2, the first term in Eq. (15) gives preference to m → ∞, and the last term acts in the opposite way, so an optimal balance should exist even without the unnormalized correlation. This is what the theory predicts, assuming the correctness of the marginal likelihood criterion. Dropping a particular term in Eqs. (7) and (15) can be seen as imposing a certain a priori density (a prior). Ignoring the first terms leads to the maximum entropy priors, but one can discover the 'orthogonality', or the 'unnormalized correlation', prior too. For example, minimizing the orthogonality term in Eq. (7) is equivalent to the application of the prior

$$
p(\theta|H_1) \propto \theta_0^{-n/2}\exp\Big(-\ln p(y|\mu_{u|y},\theta,H_1) + \ln p(Y|\theta,H_1)\Big) .
\tag{24}
$$

If one applies ln p(Y|μ_{u|y}, θ, H_1) instead of ln p(y|μ_{u|y}, θ, H_1), the prior p(θ|H_1) becomes a uniform improper prior, because the term θ_0^{-n/2} can then be conveniently introduced into the exponent as the Gaussian entropy. The exponent disappears on the basis of the identity h(A) = h(B) + h(B|A). The prior p(θ|H_1) ∝ θ_0^{-n/2} is known as the Jeffreys prior for the multinomial density with the parameter θ_0. Thus, Bayesian inference simplifies the likelihood inference.
4 Experiment
A synthetic problem is indicated in Fig. 1a, where neither the true signal, whose range is [0, 1], nor its edge structure is visible in the noisy values scattered in [−30, 30]. During the simulation, the number of observations n is one million elements. This setting can be contrasted with common experiments on 'real data' where the edge structure is easy to detect by 'eyeballing'. The propagators P are implemented in [3], and we employ the gradient norm s-dependent Perona–Malik-type diffusivity c(s) ≡ 1 − exp(−ν/(s/λ)^{2m}), as suggested in [8]. The parameters are τ/h² = 0.025, m = 8, λ = 200; ν is determined by the software automatically. During the estimation of the gradient, the function is pre-smoothed via averaging over 1000 neighbours. The figures with sharply recovered fronts are not shown, as the signals are simple; it suffices to state that, at the optimal stopping time m = 5, the location of the right edge is recovered at x = 0.245 and the left edge is restored at x = 0.749. The following five stopping criteria have been applied: (i) the marginal likelihood given by Eq. (15), (ii) the entropy h(Y|H_2) contained therein, (iii) the orthogonality (y − μ_{u|y})^T μ_{u|y}, (iv) the correlation [8] in Eq. (12), and (v) the mean absolute error between the true signal (a rectangular pulse) and the diffusion outcome. Fig. 1b summarizes the results. All the criteria are normalized by subtracting their minimal values, dividing them by their range and adjusting the sign. The optimal stopping is at m = 5. The maximum likelihood criterion underestimates the stopping time, but its simplifications are helpful indeed. Contrary to the speculations in [8], detecting the steady state of the entropy does not present difficulties.
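For comparison, here is a small sketch of the correlation stopping rule of Eq. (12) as we read it; `diffusion_step` is a placeholder for one iteration of whichever scheme is used, not the exact filter of this experiment.

```python
# Correlation-based stopping: monitor the sample correlation between the noise
# estimate y - u_t and the outcome u_t, and stop where it is minimal.
import numpy as np

def correlation(y, u):
    a, b = (y - u) - (y - u).mean(), u - u.mean()
    return (a @ b) / (np.sqrt((a @ a) * (b @ b)) + 1e-12)

def stop_by_correlation(y, diffusion_step, max_iter=100):
    u, scores = y.astype(float).copy(), []
    for _ in range(max_iter):
        u = diffusion_step(u)
        scores.append(correlation(y, u))
    return int(np.argmin(scores)) + 1, scores            # stopping iteration, history
```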
Fig. 1. (a): A binary pulse with edges at x = 0.25 and x = 0.75 is blurred by retaining the first twenty components of its Fourier decomposition, which yields the result shown as ū. A white Gaussian noise with variance θ_0 = 25 is then added to create the observations y. Only a sample of 2000 noisy observations out of the set of n = 10^6 elements is visualized; the actual range of the observations is [−30, 30]. (b): Time evolution of the criteria for optimal stopping (legend: -Logl., Entr., Orth., Corr., M.A.E.; horizontal axis: iterations).
5 Conclusion
Consistent statistical inference postulates the joint probability density of any quantity and the estimation of unknowns emerges as the conditioning on what is known. Estimating the density itself via computer simulations and ‘histogramming’ is hard at best. However, when working with the Gaussian probability density, the ‘data-driven’ approach reduces to extending the knowledge of the conditional mean to the level of the joint covariance. This reveals axiomatic principles behind many heuristic model selection criteria. The suggested formalism clearly avoids the problems with an unnecessary image normalization in [10]. Contrary to the work [10], the introduced entropies are consistent with the fact that as the scale becomes coarser, the signal is less random. Simple arguments of positive definiteness determine whether the decrease of the negative entropy is exponential or linear in time. Up to certain scalings, the correlation statistics employed in [8] has been shown to be connected to the maximization of the entropy with an early stopping. It is important to emphasize that the presented Gaussian density construction results in singular concentration matrices. The ‘inverse covariance’ trick circumvents this difficulty, but there may exist some even better ways to solve this problem.
References
1. Barry, R.P., Pace, R.K.: Monte Carlo estimates of the log determinant of large sparse matrices. Lin. Alg. Appl. 289, 41–54 (1999)
2. Carasso, A.S.: Linear and nonlinear image deblurring: A documented study. SIAM J. Numer. Anal. 36(6), 1659–1689 (1999)
3. D'Almeida, F.: Nonlinear diffusion toolbox. MATLAB Central (2003)
4. Fischer, B., Modersitzki, J.: Fast diffusion registration. In: Inverse Problems, Image Analysis, and Medical Imaging. AMS Contemporary Mathematics, vol. 313, pp. 117–129 (2002)
5. Gilboa, G., Sochen, N., Zeevi, Y.Y.: Estimation of optimal PDE-based denoising in the SNR sense. IEEE Trans. Im. Proc. 15(8), 2269–2280 (2006)
6. Girdziušas, R., Laaksonen, J.: When is a discrete diffusion a scale-space. In: Int. Conf. Comp. Vis.
7. Marshall, A.W., Olkin, I.: Inequalities: Theory of Majorization and Its Applications. Academic Press, London (1979)
8. Mrázek, P., Navara, M.: Selection of optimal stopping time for nonlinear diffusion filtering. Int. Journal of Computer Vision 52(2), 189–203 (2003)
9. Perona, P., Malik, J.: Scale-space and edge detection using anisotropic diffusion. IEEE Trans. on PAMI 12(7), 629–639 (1990)
10. Sporring, J., Weickert, J.: Information measures in scale spaces. IEEE Trans. Inf. Theory 45(3), 1051–1058 (1999)
11. Weickert, J., ter Haar Romeny, B.M., Viergever, M.A.: Efficient and reliable schemes for nonlinear diffusion filtering. IEEE Trans. on Image Processing 7(3), 398–410 (1998)
Kernel-Bayesian Framework for Object Tracking

Xiaoqin Zhang¹, Weiming Hu¹, Guan Luo¹, and Steve Maybank²

¹ National Laboratory of Pattern Recognition, Institute of Automation, Beijing, China {xqzhang,wmhu,gluo}@nlpr.ia.ac.cn
² School of Computer Science and Information Systems, Birkbeck College, London, UK [email protected]
Abstract. This paper proposes a general Kernel-Bayesian framework for object tracking. In this framework, the kernel based method—the mean shift algorithm—is embedded seamlessly into the Bayesian framework to provide heuristic prior information for the state transition model, aiming at effectively alleviating the heavy computational load and avoiding the sample degeneracy suffered by conventional Bayesian trackers. Moreover, the tracked object is characterized by a spatial-constraint MOG (Mixture of Gaussians) based appearance model, which is shown to be more discriminative than the traditional MOG based appearance model. Meanwhile, a novel selective updating technique for the appearance model is developed to accommodate changes in both appearance and illumination. Experimental results demonstrate that, compared with Bayesian and kernel based tracking frameworks, the proposed algorithm is more efficient and effective.
1 Introduction

Object tracking is an important research topic in the computer vision community, because it is the foundation of high-level visual problems such as motion analysis and behavior understanding. Recent years have witnessed great advances in the literature, e.g. the snakes model [1], condensation [2], mean shift [3], appearance models [4], the probabilistic data association filter [5] and so on. Generally speaking, most tracking algorithms involve two major issues: the algorithmic framework and the target representation model. The frameworks of existing tracking algorithms can be roughly divided into two categories: deterministic methods and stochastic methods. Deterministic methods usually reduce to an optimization process, which is typically tackled by an iterative search for the minimum of a similarity cost function. In more detail, there exist two major types of similarity functions: SSD (Sum of Squared Differences) [6] and kernel [3] based cost functions. The SSD based cost function is defined as the sum of squared differences between the current image patch and the template, while the kernel based cost function is defined as the distance between two kernel densities. The deterministic methods are usually computationally efficient but often get trapped in local minima. In contrast, the stochastic methods adopt a state space to model the underlying dynamics of the tracking process, and object tracking is viewed as a Bayesian inference problem, which requires generating a number of hypotheses to estimate and propagate the posterior distribution of the state. Compared with their deterministic counterparts, the stochastic methods usually perform more robustly, but they suffer a heavy computational load due to the large
number of hypotheses, especially in a high-dimensional state space, which may suffer from the curse of dimensionality. Recently, some researchers have combined the merits of these two approaches to achieve more reliable performance [7,8]. In [7], random hypotheses are guided by a gradient based deterministic search which is carried out based on the sum of differences between two frames. Zhou et al. [8] propose an adaptive state transition model which is extracted from the information contained in the particle configuration. In essence, these methods rely on a constant-illumination assumption, which is hard to satisfy in practice, and moreover, they are far from being a general tracking framework that can be extended to other representation models. The target representation model is also a basic issue to be considered in tracking algorithms. The image patch [6], which takes the set of pixels in the target region as the model representation, is a direct way to model the target, but it loses the discriminative information that is implicit in the layout of the target. The color histogram [3] provides global statistical information about the target region and is robust to noise, but it is very sensitive to illumination changes. Recently the MOG (Mixture of Gaussians) [4,8,9] based appearance model has received more and more attention for the following merits: (1) it can model the multi-modal distribution of the appearance; (2) it can easily capture changes of the appearance; (3) it requires little computation and storage. However, the traditional MOG based appearance model considers each pixel independently and with the same level of confidence, which is not reasonable in practice. In view of the foregoing discussion, we propose a general Kernel-Bayesian tracking framework that combines the merits of both deterministic and stochastic methods. The main contributions of the proposed tracking approach are summarized as follows:

1. The kernel based method—the mean shift algorithm—is embedded into the Bayesian framework to give heuristic prior information to the state transition model, which eases the computational burden and avoids sample degeneracy in the Bayesian tracking framework.
2. The appearance of the target is modeled by a spatial constraint MOG, whose parameters are estimated via an on-line EM algorithm.
3. A novel selective adaptation scheme for updating the appearance model is adopted to reliably capture changes in appearance and illumination and to effectively prevent the model from drifting away.

The rest of this paper is structured as follows. A brief review of kernel based and Bayesian based tracking algorithms is presented in Section 2. The details of the Kernel-Bayesian tracking framework are described in Section 3. A spatial constraint MOG based appearance model and its application in the Kernel-Bayesian framework are discussed in Section 4. Experimental results are presented in Section 5, and Section 6 is devoted to the conclusion.
2 Review of Kernel Based and Bayesian Based Trackers

In this section, we briefly review the two typical tracking algorithms: kernel based and Bayesian based trackers.
2.1 Kernel Based Tracker

The kernel based tracker tries to find local minima of a similarity measure between the kernel density estimates of the candidate and target images. The most famous kernel based method is the mean shift algorithm, which first appeared in [10] as the gradient estimate of a density function and was introduced for visual tracking by Comaniciu [3] in 2000. Mean shift is a non-parametric mode seeking technique that shifts each data point to the average of the data points in its neighborhood [10]. Let A be a finite set embedded in an n-dimensional space X; the mean shift vector of x is defined as

$$
ms = \frac{\sum_{a \in A} K(a-x)\,w(a)\,a}{\sum_{a \in A} K(a-x)\,w(a)} - x , \qquad a \in A,\; x \in X,
\tag{1}
$$
where K is a kernel function and w is a weight function. The mean shift algorithm works by iteratively shifting the data point in the direction of the mean shift vector until convergence. In the mean shift based tracking algorithm, the convergence property is described by a Bhattacharyya coefficient [3], which reflects the similarity between the target and candidate kernel densities.

2.2 Bayesian Based Tracker

Another popular way is to view tracking as an on-line Bayesian inference process for estimating the unknown state s_t at time t from sequential observations o_{1:t} perturbed by noise. The dynamic state-space form employed in the Bayesian inference framework is as follows [11]:

$$
\text{state transition model:}\quad s_t = f_t(s_{t-1}, \epsilon_t) ,
\tag{2}
$$
$$
\text{observation model:}\quad o_t = h_t(s_t, \nu_t) ,
\tag{3}
$$

where s_t, o_t represent the system state and observation, ε_t, ν_t are the system noise and observation noise, f_t(·,·) characterizes the kinematics of the object, and h_t(·,·) models the observation. The key idea of Bayesian inference is to approximate the posterior probability distribution by a weighted sample set {(s^(n), π^(n)) | n = 1, ..., N}. Each sample consists of an element s^(n), which represents a hypothetical state of the object, and a corresponding discrete sampling probability π^(n), where Σ_{n=1}^N π^(n) = 1. First, the sample set is resampled to avoid the degeneracy problem, and the new samples are propagated according to the state transition model. Then each element of the set is weighted with probability π^(n) = p(o_t | S_t = s_t^(n)), which is calculated from the observation model. Finally, the state estimate ŝ_t can be either the minimum mean square error (MMSE) estimate or the maximum a posteriori (MAP) estimate.
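A minimal sketch of one such weighted-sample recursion is given below; the random-walk transition, the generic `likelihood` callable and the MAP read-out are illustrative stand-ins, not the adaptive model introduced in Section 3.

```python
# One SIR-style step: resample, propagate through the transition model,
# reweight with the observation model, and read off a MAP state estimate.
import numpy as np

def particle_filter_step(samples, weights, observation, likelihood, rng,
                         noise_std=1.0):
    n = len(samples)
    idx = rng.choice(n, size=n, p=weights)                  # resampling
    samples = samples[idx]
    samples = samples + noise_std * rng.standard_normal(samples.shape)
    weights = np.array([likelihood(observation, s) for s in samples])
    weights = weights / weights.sum()                       # pi^(n), sums to one
    return samples, weights, samples[np.argmax(weights)]    # MAP estimate
```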
3 Kernel-Bayesian Based Tracking Framework

The kernel based methods enjoy low computational complexity but often get trapped in local minima/maxima, while Bayesian based methods improve the robustness of the tracking process but suffer a large computational load, since a huge number of hypotheses must be generated to cover the target. As a result, we propose a unified Kernel-Bayesian tracking framework to combine the merits of both methods.
3.1 Kernel-Bayesian Framework

A state transition model is a basic component to be considered when Bayesian inference is adopted for tracking. Most existing approaches use a naive random walk around the previous system state [12] or learn the model from pre-labeled video sequences [13]. The former contains little information about the motion of the target, and thus involves a quite large computational load, since many hypotheses need to be randomly generated to cover the target, while the latter often suffers from an overfitting problem and is consequently applicable only to the training sequences. The mean shift algorithm provides the direction of motion toward the ground truth in its iterations, which motivates us to embed the kernel method into the Bayesian framework to provide a heuristic prior. In detail, the mean shift algorithm is first applied to the current frame to obtain the direction of motion and the offset of the state, which are then incorporated into the transition model as prior information. In this way, the kernel based method and the Bayesian based method are combined in a unified framework. Furthermore, it has been shown [14] that symmetric kernels are amenable to mean shift iterations, which means that our framework is general enough for all symmetric appearance models.

3.2 An Optimization View

A reinterpretation of the Kernel-Bayesian framework from an optimization point of view is presented to show why this framework can combine the merits of both the kernel method and the Bayesian method. To give a clear view, an input image with three templates superimposed, corresponding to the initialization, the local maximum and the global maximum, is illustrated in the left column of Fig. 1, and its cost function based on our appearance model is shown in the right column of Fig. 1. As witnessed by Fig. 1, starting from the initial position, the kernel method converges to the local maximum, which is near the global maximum. It is clear that a small number of hypotheses generated around the local maximum is enough to cover the global maximum. Otherwise, if the tracker starts from the initial position, numerous hypotheses need to be generated in order to reach the target, and the algorithm may even run into the curse of dimensionality in the high-dimensional case. In our proposed framework, the deterministic optimization method is used to refine the initial position and provide a heuristic prior, and the stochastic method is then adopted to reach the global optimum.
Fig. 1. (left) An input image with three templates superimposed, corresponding to the initialization (red), local maximum (green) and global maximum (blue), and (right) its cost function
4 The Proposed Tracking Algorithm

An overview of the proposed algorithm is presented in Fig. 2. First, kernel based prior information is obtained through mean shift iterations, which controls both the number of hypotheses and the directional offset of the state in the state transition model. After the hypothesis generation process, each hypothesis is evaluated by the spatial constraint MOG based observation model. Finally, a maximum a posteriori (MAP) estimate of the state is obtained based on the probability of each hypothesis. Meanwhile, a selective updating scheme is developed to update the parameters of the appearance model to accommodate the changes of the object and the environment. Each component of this algorithm is described in detail in the following sections.
Fig. 2. The flow chart of our Kernel-Bayesian based tracking algorithm
4.1 Spatial Constraint MOG Based Appearance Model

The appearance of the target is modeled by a spatial constraint MOG, with the parameters estimated by an on-line EM algorithm.

Appearance Model: Similar to [4,8], the appearance model consists of three components S, W, F, where the S component captures temporally stable images, the W component characterizes the two-frame variations, and the F component is a fixed template of the target to prevent the model from drifting away. However, this appearance model treats each pixel independently and discards the spatial outline of the target, so it may fail when, for instance, there are several similar objects close to the target or partial occlusion occurs. In our work, we apply a 2-D Gaussian spatial constraint to the SWF based appearance model, whose mean vector is the coordinate of the center position and whose diagonal covariance elements are proportional to the size of the target in the corresponding spatial direction, as illustrated in Fig. 3. As a result, the likelihood function of the spatial constraint appearance model can be formulated as

$$
p(o_t|s_t) = \prod_{j=1}^{d}\left\{ N\big(x(j);\, x_c, \Sigma_c\big) \cdot \sum_{i=s,w,f} \pi_{i,t}(j)\, N\big(o_t(j);\, \mu_{i,t}(j),\, \sigma_{i,t}^2(j)\big) \right\}
\tag{4}
$$
where N(x; μ, σ²) is a Gaussian density,

$$
N(x;\mu,\sigma^2) = (2\pi\sigma^2)^{-1/2}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) ,
\tag{5}
$$
Fig. 3. A 2-D Gaussian spatial constraint MOG based appearance model
and {π_{i,t}, μ_{i,t}, σ_{i,t}, i = s, w, f} represent the mixture probabilities, mixture centers and mixture variances respectively, d is the number of pixels inside the target, and x_c and Σ_c represent the center of the target and its covariance matrix in the spatial space.

Parameter Estimation: In order to make the model parameters depend more heavily on the most recent observations, we assume that the previous appearance is exponentially forgotten and new information is gradually added to the appearance model. To avoid having to store all the data from previous frames, an on-line EM algorithm [4] is used to estimate the parameters as follows.

Step 1: During the E-step, the ownership probability of each component is computed as

$$
m_{i,t}(j) \propto \pi_{i,t}(j)\, N\big(o_t(j);\, \mu_{i,t}(j),\, \sigma_{i,t}^2(j)\big) ,
\tag{6}
$$

which fulfills Σ_{i=s,w,f} m_{i,t}(j) = 1.

Step 2: The mixing probability of each component is estimated as

$$
\pi_{i,t+1}(j) = \alpha\, m_{i,t}(j) + (1-\alpha)\,\pi_{i,t}(j) ;\quad i = s, w, f ,
\tag{7}
$$

and a recursive form for the moments {M_{k,t+1}; k = 1, 2} is evaluated as

$$
M_{k,t+1}(j) = \alpha\, o_t^k(j)\, m_{s,t}(j) + (1-\alpha)\, M_{k,t}(j) ;\quad k = 1, 2 ,
\tag{8}
$$

where α = 1 − e^{-1/τ} acts as a forgetting factor and τ is a predefined constant.

Step 3: The mixture centers and variances are estimated in the M-step:

$$
\mu_{s,t+1}(j) = \frac{M_{1,t+1}(j)}{\pi_{s,t+1}(j)} , \qquad
\sigma_{s,t+1}^2(j) = \frac{M_{2,t+1}(j)}{\pi_{s,t+1}(j)} - \mu_{s,t+1}^2(j) ,
$$
$$
\mu_{w,t+1}(j) = o_t(j) , \quad \sigma_{w,t+1}^2(j) = \sigma_{w,1}^2(j) , \qquad
\mu_{f,t+1}(j) = \mu_{f,1}(j) , \quad \sigma_{f,t+1}^2(j) = \sigma_{f,1}^2(j) .
$$
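A per-pixel sketch of these updates is given below; the dictionary layout, the moment bookkeeping and the fixed W/F variances are our own illustrative choices.

```python
# On-line EM for one pixel j of the SWF model: E-step ownerships (Eq. 6),
# mixing probabilities (Eq. 7), moment recursions (Eq. 8), and the M-step.
import numpy as np

def normal_pdf(x, mu, sigma2):
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def online_em_update(pi, mu, sigma2, M1, M2, o_t, alpha):
    # E-step: ownership probabilities
    m = {i: pi[i] * normal_pdf(o_t, mu[i], sigma2[i]) for i in 'swf'}
    z = sum(m.values())
    m = {i: m[i] / z for i in 'swf'}
    # mixing probabilities
    pi = {i: alpha * m[i] + (1 - alpha) * pi[i] for i in 'swf'}
    # moments of the stable component
    M1 = alpha * o_t * m['s'] + (1 - alpha) * M1
    M2 = alpha * o_t ** 2 * m['s'] + (1 - alpha) * M2
    # M-step: S re-estimated, W follows the current frame, F stays fixed
    mu['s'] = M1 / pi['s']
    sigma2['s'] = M2 / pi['s'] - mu['s'] ** 2
    mu['w'] = o_t
    return pi, mu, sigma2, M1, M2
```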
In fact, updating the appearance model every frame may be dangerous when, for instance, some background is absorbed into the target or the target is occluded. Thus, we develop a selective adaptation scheme to tackle such cases, described in detail in Section 4.3.

4.2 Kernel-Bayesian Based Tracker

As stated in Section 3, the motivation for embedding the mean shift algorithm into the Bayesian filtering framework is to provide a heuristic prediction to the state transition
model, and thus to ease the computational burden and avoid the sample degeneracy problem. Suppose the target is well localized at x_{t−1} in frame t − 1; we first apply mean shift iterations to frame t, and the convergent position is taken as the refined initialization, denoted x̂_t. In order to embed the spatial constraint appearance model into the mean shift algorithm, the weighted kernel function is defined as

$$
\omega(x) = N(x;\, x_c, \Sigma_c) \sum_{i=w,s,f} \pi_{i,t}(x)\, N\big(o_t(x);\, \mu_{i,t}(x),\, \sigma_{i,t}^2(x)\big) .
\tag{9}
$$
The flat kernel is chosen, so the mean shift iteration can be written as

$$
\hat{x}_t = \frac{\sum_{x_i=1}^{d} w(x_i)\, x_i}{\sum_{x_i=1}^{d} w(x_i)} , \qquad x_i \in \text{candidate}.
\tag{10}
$$
The result obtained from the mean shift iterations is then integrated into a first-order state transition model to form an adaptive state transition model:

$$
s_t = \hat{s}_{t-1} + \mathrm{Affine}(\hat{x}_t - x_{t-1}) + \epsilon_t ,
\tag{11}
$$

where Affine(·) denotes the affine transformation. Meanwhile, the accuracy of the refined position is evaluated by our appearance model in order to adaptively control the number of hypotheses and the system noise ε_t. Finally, Bayesian inference is carried out based on the adaptive state transition model to achieve a robust and efficient tracking algorithm.

4.3 Selective Adaptation for the Appearance Model

In most tracking applications, the tracker must simultaneously deal with changes of both the target and the environment, so it is necessary to design an adaptation scheme for the appearance model. However, over-updating the model may gradually introduce background noise into the target model, eventually causing the model to drift away. Thus, a proper updating scheme is of significant importance for the tracking system. In this part, we propose a selective updating scheme based on three different confidence measures of the appearance model. First, the MAP-estimated state is evaluated by the full appearance model, the combined S and W components, and the F component, giving scores denoted π_a, π_sw, π_f, and {T_a, T_sw, T_f} represent the corresponding thresholds. Each component of the appearance model is then updated selectively as in Table 1. The S and W components together effectively capture the variations of the target, and F prevents the model from drifting away. As a result, such a selective updating strategy not only effectively captures the variations of the target, but also reliably prevents drifting during the tracking process.
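The sketch below strings Eqs. (9)–(11) together: a flat-kernel mean shift refinement of the centre, followed by hypothesis generation around the shifted previous state. The callbacks, the translation-only reading of Affine(·) and the noise scale are our own simplifications, not the paper's exact implementation.

```python
# Mean shift refinement with the weighted kernel of Eqs. (9)-(10), then the
# adaptive first-order transition of Eq. (11) for a 6-dim affine state.
import numpy as np

def mean_shift_refine(get_candidate_pixels, weight_fn, x_init, n_iter=10, tol=1e-3):
    """get_candidate_pixels(x): (d, 2) pixel coordinates of the candidate region
    centred at x; weight_fn(p): the weight omega at pixel p."""
    x = np.asarray(x_init, dtype=float)
    for _ in range(n_iter):
        pts = get_candidate_pixels(x)
        w = np.array([weight_fn(p) for p in pts])
        x_new = (w[:, None] * pts).sum(axis=0) / w.sum()   # Eq. (10)
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x                                               # refined centre x_hat_t

def generate_hypotheses(s_prev, x_prev, x_hat, n_hyp, noise_scale, rng):
    shift = np.zeros(6)
    shift[:2] = x_hat - x_prev                             # offset applied to (tx, ty)
    return s_prev + shift + noise_scale * rng.standard_normal((n_hyp, 6))  # Eq. (11)
```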
5 Experimental Results

In our experiments, an affine transformation is chosen to model the object motion. Specifically, the motion is characterized by s = (t_x, t_y, a_1, a_2, a_3, a_4), where {t_x, t_y} denote the 2-D translation parameters and {a_1, a_2, a_3, a_4} are deformation parameters. Each candidate image is rectified to a 30×15 patch, and thus the feature is a 450-dimensional vector with zero-mean, unit-variance normalization. All of the experiments run in real time on a dual-CPU Pentium IV 3.2 GHz PC with 512 MB memory.
Table 1. Selective Adaptation for the Appearance Model

if (πa > Ta)
  if (πsw > Tsw) && (πf > Tf)
    Update the appearance model of the target;
  else if (πsw > Tsw) && (πf ≤ Tf)
    Only update the SW components of the appearance model;
  else if (πsw ≤ Tsw) && (πf > Tf)
    Only update the F component of the appearance model;
  else if (πsw ≤ Tsw) && (πf ≤ Tf)
    Keep the appearance model of the target;
  end if
end if
(a) Tracking performance in Kernel-Bayesian framework
(b) Tracking performance in traditional Bayesian framework
(c) Tracking performance in traditional kernel based framework Fig. 4. Experimental performance in the different tracking frameworks
5.1 Single Object Tracking

In this section, three sets of experiments are presented to demonstrate the claimed contributions of the proposed tracking algorithm. The first part shows the experimental performance of our tracking framework and a comparison with the traditional Bayesian framework and the kernel based framework in both tracking accuracy and efficiency. As illustrated in Fig. 4, the first row shows the tracking performance of our algorithm, where the tracker efficiently and effectively catches the target. The second row gives similar tracking performance in the traditional Bayesian framework with 400 hypotheses. In the third row, it is clear that the kernel method usually gets trapped in local maxima, leading to inaccurate localization. Furthermore, the accuracy and efficiency of these tracking frameworks are quantitatively evaluated for a more thorough analysis. The tracking time with respect to the frame index is shown in the left panel of Fig. 5, and the tracking accuracy is measured by the MSE (mean square error) between the tracked position and the ground truth, shown in the right panel of Fig. 5.
Fig. 5. (left) Tracking time with respect to the frame index, (right) MSE between estimated points and groundtruth (red: kernel, green: Bayesian, blue: kernel-Bayesian)
The results in Fig. 5 show that the kernel method is efficient but performs poorly in localization. In contrast, the Bayesian tracking algorithm achieves more accurate performance due to the large number of hypotheses: it takes 81 ms (milliseconds) of tracking time per frame, and its average MSE is 8.6521. In the Kernel-Bayesian framework, the tracking time per frame is only 55 ms on average, which greatly eases the computational burden of the Bayesian framework, and the average MSE of our algorithm is only 5.8012, because the kernel method provides a heuristic prior to the state transition model, which avoids the sample degeneracy suffered by the Bayesian framework and thus leads to accurate localization. The comparison of the spatial constraint MOG based appearance model with the traditional MOG based appearance model is presented in the second part. It is clear that the SMOG based appearance model handles well the case where there are similar objects around the target, while the traditional MOG based appearance model fails, as shown in Fig. 6. The mechanism behind this is that the former extracts the spatial layout of the target, which makes the model more discriminative. The last part tests the proposed algorithm in varying scenes. In Fig. 7(a), it is clear that the selective updating scheme easily absorbs the illumination changes. Fig. 7(b) shows the result of our algorithm tracking a girl's head with an out-of-plane rotation, from which we notice that the scheme also effectively captures the variations of appearance.
(a) Tracking with spatial constraint SWF model
(b) Tracking with traditional SWF model Fig. 6. Experimental results with different appearance models in the clutter scene
(a) Scene with large illumination changes
(b) Object with out-plane rotation Fig. 7. Experimental results in different scenes (illumination change, out plane rotation)
Fig. 8. Results of multiple objects tracking with the proposed tracking algorithm
5.2 Multiple Object Tracking
Although the experiments above mainly involve a single object, our algorithm can easily be extended to multiple object tracking. As shown in Fig. 8, three objects are initialized manually and are tracked well in the subsequent frames, including some cases of partial occlusion, because the spatial constraint on appearance makes the model less dependent on peripheral pixels and the selective updating scheme effectively prevents noise from being introduced into the appearance model. Due to its computational efficiency, our algorithm performs better and has more potential to handle the various problems of multiple object tracking than other tracking methods.
6 Conclusion
This paper has proposed a robust and efficient Kernel-Bayesian framework for visual tracking. In this framework, the object to be tracked is characterized by a spatial constraint MOG based appearance model, which is shown to be more discriminative than the traditional MOG based appearance model. Our tracking framework combines the merits of both stochastic and deterministic tracking approaches in a unified way: the mean shift algorithm is embedded seamlessly into the Bayesian framework to give a heuristic prediction for the state transition model, which effectively alleviates the heavy computational load and avoids the sample degeneracy suffered by conventional Bayesian trackers. Moreover, a selective updating scheme is developed to effectively accommodate changes in both appearance and illumination. Experimental results have demonstrated the efficiency and effectiveness of the proposed tracking algorithm.
Acknowledgment This work is partly supported by NSFC (Grant No. 60520120099 and 60672040) and the National 863 High-Tech R&D Program of China (Grant No. 2006AA01Z453).
Markov Random Field Modeled Level Sets Method for Object Tracking with Moving Cameras Xue Zhou, Weiming Hu, Ying Chen, and Wei Hu National Laboratory of Pattern Recognition, Institute of Automation, Beijing, China
Abstract. Object tracking using active contours has attracted increasing interest in recent years due to acquisition of effective shape descriptions. In this paper, an object tracking method based on level sets using moving cameras is proposed. We develop an automatic contour initialization method based on optical flow detection. A Markov Random Field (MRF)-like model measuring the correlations between neighboring pixels is added to improve the general region-based level sets speed model. The experimental results on several real video sequences show that our method successfully tracks objects despite object scale changes, motion blur, background disturbance, and gets smoother and more accurate results than the current region-based method.
1 Introduction
Object tracking is an active research topic in the computer vision community, because it is the foundation of high-level visual problems such as motion analysis and behavior understanding. Current object tracking methods generally use predefined coarse shape models (rectangle or ellipse) to track objects [5,11]. Due to their inflexibility in dealing with scale changes, these methods have difficulty in accurately tracking non-rigid objects, especially with moving cameras. In order to overcome this disadvantage, methods based on active contours have been proposed, which provide detailed shape information for rigid or non-rigid objects. Level set is an implicit representation of active contours [3]. Due to being numerically stable and capable of handling topological changes, the level set method is getting more and more popular, compared with explicit representation modes characterized by parameterized contours. Active contour-based tracking can be viewed as an iterative process of evolving the initial contour to the desired object boundary based on minimizing an energy function. This energy function often consists of three terms corresponding to internal energy, external energy and shape energy respectively. The first term concerns the internal constraints such as the evolution force based on curvature, the second term concerns the image attachment, which has no correlation with the contour itself, and the last term reflects shape prior constraints constructed by some statistical learning methods [18,19]. In this paper, we mainly consider the first two terms. In terms of the different measurements used in the external energy, active contour-based methods are classified into two categories: edge-based ones and region-based ones. Snake [1] is a typical edge-based
active contour model considering the gradient of the image near the boundary of the object. An improved geodesic model, which compared with the snake considers the intrinsic geometric measures of the image, is proposed by Caselles et al. [12]. Edge-based methods are subject to a number of problems: (1) they only consider the local information around the contour, and initialization near the object is necessary; (2) they are sensitive to image noise. Consequently, an effective alternative is region-based methods, which consider the global image information. The measurements of an image can be some statistical quantities, such as the mean, variance, texture or histogram of the region concerned. Zhu and Yuille [15] present a statistical and variational framework for image segmentation using a region competition algorithm. Recently, Yilmaz et al. [16] adopted the features of both object and background regions in the level sets speed model. Current region-based methods usually establish the energy function in a Bayesian framework based on the segmentation idea, which means dividing an image into object and background regions. They assume the pixels in each region are independent when computing the region likelihood function. This assumption in some sense ignores the correlations between pixels, with the result that the contour is sensitive to background disturbance (similar color or texture between object and background) and is not smooth. Another difficulty in contour-based tracking is the contour initialization, which is still an open problem. Although the general manual method is accurate, it needs human interaction. Background subtraction methods [13] for initializing contours are only effective when using stationary cameras. In [9], the initial contour can be anywhere in the image; however, it is then time-consuming to converge to the correct boundary. The method we propose in this paper tries to solve the problems mentioned above. In our method, we adopt the region-based active contours method and represent the contour using a level sets mode. Our method has the following features:
– We model correlations between neighboring pixels (instead of treating them as independent) using a Markov Random Field (MRF)-like model. The computation of a single pixel's likelihood function not only depends on the pixel itself, but also considers the neighboring pixels. The correlations between neighboring pixels are measured by a penalty term. With this penalty term, the contour can be evolved to the desired object boundary more tightly and smoothly. Furthermore, our method gets rid of the influence of the background disturbance to some extent, compared with general methods without this penalty term.
– An automatic and fast initialization method based on optical flow detection is proposed. Closed initial contours near the boundaries of extracted objects are obtained.
The remainder of this paper is organized as follows: Section 2 describes the initialization process, which comprises generating the initial contour and establishing the prior models. Section 3 introduces the penalty term and the improved level sets speed function. Section 4 shows experimental results. The last section summarizes the paper.
2 Initialization
The initialization process of our method consists of two steps: (1) locating the initial contour and establishing the level set function; (2) modeling the object and background regions using features such as color, texture, etc.
2.1 Initialization of Closed Contours
Previous methods draw a closed contour near an object manually. Methods using the motion detection boundaries acquired by background subtraction as the initial contours of the moving objects are only effective with stationary cameras. With respect to a moving camera, motion detection based on optical flow is very popular [4]. Thus, we use the motion detection boundaries obtained by optical flow as the initial contours. Generally, the optical flow field can be viewed as the motion field. For each pixel, a velocity vector $(u, v)$ is defined, where $u$ and $v$ represent the velocity components in the x and y directions respectively. The detailed process of our method for initializing contours is described as follows:
1. Optical flow is computed iteratively using consecutive frames, based on the gradient of the image. Then the velocity vector $(u, v)$ of each pixel is obtained.
2. Reducing noise. The velocity vector is set to $(0, 0)$ when its magnitude is less than a predefined threshold T, i.e.,
$(u, v) = (0, 0) \quad \text{if} \quad \sqrt{u^{2} + v^{2}} < T$   (1)
3. Find the most probable moving regions, which should exhibit large and coherent motion. A coarse shape model is moved over the image to detect the most probable moving regions using a three-step algorithm. Firstly, a series of initial contour candidates is obtained by changing the position of the coarse shape model with fixed parameters (e.g., the radius of a circle). Each candidate is assigned a weight according to
$\text{weight} = \alpha \sum_{x \in \Omega} \lVert v_{x} \rVert^{2} - (1 - \alpha)\, \sigma^{2}\big(\arg(v_{x})\big)$   (2)
where $x = (x, y)$ is the coordinate vector of a pixel belonging to the internal area $\Omega$ of the candidate contour, $v_{x}$ is its flow vector, $\sigma^{2}(\arg(v_{x}))$ is the variance of the flow phase over $\Omega$, and $\alpha$ (ranging between 0 and 1) is the parameter used to weight the two terms. Secondly, the candidates are sorted by weight in descending order and the top N (N ≥ 1) are chosen as the initial contours; N, the number of detected initial contours, is determined by the principle that the weights of the top N candidates are far larger than those of the others. It is also assumed that the detected initial contours are not too close to one another, which guarantees non-repetitive initialization. Thirdly, after the optimal positions are obtained, the shape parameter of each detected initial contour is refined with its position fixed. For each contour the optimal parameter $\lambda^{*}$ should satisfy
$\lambda^{*} = \arg\max_{\lambda}(\text{weight})$   (3)
Consequently, the coarse shapes near the moving objects are obtained as the initial contours of these objects. Although our initialization method is not as accurate as some manual methods, for region-based active contour methods a precise initialization is not necessary [14], and the coarse initial contour can still be evolved to enclose the object tightly. After the initial contours are obtained, we compute the level set function $\phi(x, y, t)$ for each contour. The level set function is the signed Euclidean distance between the point $x = (x, y)$ and the contour C(t). In our method, we assume $\phi(x, y, t)$ is positive when x belongs to the external part of the contour C(t) and negative for the internal part of C(t).
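A rough sketch of the initialization procedure described above (Eqs. 1-3) is given below. It assumes the dense flow field (u, v) has already been computed by some optical-flow routine; the circular shape model, the grid search step, and the exact weight of Eq. (2) follow our reconstruction and are not guaranteed to match the authors' implementation.

```python
import numpy as np

def init_contours(u, v, radius=20, alpha=0.5, thresh=0.2, n_contours=1, step=8):
    """Score circular candidate contours on a dense optical-flow field (Eqs. 1-3).

    u, v  : H x W arrays with the per-pixel flow components.
    alpha : weight between the flow-magnitude and phase-variance terms of Eq. (2).
    """
    mag = np.hypot(u, v)
    u = np.where(mag < thresh, 0.0, u)       # Eq. (1): suppress weak, noisy flow
    v = np.where(mag < thresh, 0.0, v)
    mag = np.hypot(u, v)
    phase = np.arctan2(v, u)
    h, w = mag.shape
    yy, xx = np.mgrid[0:h, 0:w]

    candidates = []
    for cy in range(radius, h - radius, step):
        for cx in range(radius, w - radius, step):
            inside = (xx - cx) ** 2 + (yy - cy) ** 2 <= radius ** 2
            m = mag[inside]
            moving = m > 0
            if not np.any(moving):
                continue
            # Eq. (2): reward strong flow, penalise incoherent flow directions
            weight = alpha * np.sum(m ** 2) - (1 - alpha) * np.var(phase[inside][moving])
            candidates.append((weight, cx, cy))

    candidates.sort(key=lambda c: c[0], reverse=True)
    # A further pass could refine the radius of each kept circle (Eq. 3).
    return candidates[:n_contours]
```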
2.2 Construction of Prior Models
Modeling the object and background regions is indispensable to region-based active contour methods. In this paper, we present a hierarchical method for fusing color and texture features using a Gaussian Mixture Model (GMM), which is a variant of Stauffer's method [7]. The first step is to train a GMM using the color feature, where the HSV color space is chosen. The second step is to label each sample based on the trained color GMM; the label is the index j of the Gaussian in the mixture. The final step is to model these labeled samples using the texture feature, which is computed by a gray level co-occurrence matrix (GLCM) method [6]. We assume that samples with the same color are often adjacent; thus, the samples with the same label are modeled as a single Gaussian. As a whole, the estimated probability density function (pdf) at pixel $x_{i}$ in the joint color-texture space can be formulated as:
$p(x_{i}) = \sum_{j=1}^{k} \omega_{j}\, \mathcal{N}(x_{i}^{c}; \mu_{j}^{c}, \Sigma_{j}^{c})\, \mathcal{N}(x_{i}^{t}; \mu_{j}^{t}, \Sigma_{j}^{t})$   (4)
where $x_{i}^{c}$ and $x_{i}^{t}$ are respectively the color feature and the texture feature at pixel $x_{i}$, $\mathcal{N}(x; \mu, \Sigma)$ is a Gaussian pdf, $\omega_{j}$ is the weight parameter of the GMM, and k is the number of Gaussian modes. The method for updating the GMM parameters is similar to [7], which uses a learning-rate parameter.
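The joint color-texture density of Eq. (4) can be evaluated as in the sketch below; the container format for the per-mode parameters is purely an assumption for illustration, and the color and texture features would in practice come from the HSV conversion and GLCM computation mentioned above.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density N(x; mean, cov)."""
    x = np.atleast_1d(x).astype(float) - np.atleast_1d(mean).astype(float)
    cov = np.atleast_2d(cov).astype(float)
    k = cov.shape[0]
    norm = 1.0 / np.sqrt((2 * np.pi) ** k * np.linalg.det(cov))
    return float(norm * np.exp(-0.5 * x @ np.linalg.inv(cov) @ x))

def joint_color_texture_pdf(x_color, x_texture, weights, color_params, texture_params):
    """Eq. (4): p(x_i) = sum_j w_j N(x_i^c; mu_j^c, S_j^c) N(x_i^t; mu_j^t, S_j^t).

    color_params / texture_params: lists of (mean, covariance) pairs, one per mode.
    """
    p = 0.0
    for w_j, (mc, sc), (mt, st) in zip(weights, color_params, texture_params):
        p += w_j * gaussian_pdf(x_color, mc, sc) * gaussian_pdf(x_texture, mt, st)
    return p
```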
3 Evolving the Contour
Our method for evolving contours is motivated by the segmentation idea [15,16]. Its objective is to find the optimal partition, represented by a contour, based on the initial contour in the current frame. The segmentation result in the current frame is used as the initial contour of the object in the next frame; then, in consecutive images, the object is tracked iteratively. The segmentation problem in the current frame can be modeled as a MAP problem. We let the posterior probability of obtaining a partition $\mathcal{P}(R)$ of a given image I be represented by $P(\mathcal{P}(R) \mid I)$. In accordance with the Bayesian formula, and considering that the term P(I) is a constant, this posterior is proportional to:
$P(\mathcal{P}(R) \mid I) \propto P(I \mid \mathcal{P}(R))\, P(\mathcal{P}(R))$   (5)
Generally the prior probability $P(\mathcal{P}(R))$ is modeled as a smoothness regularization term which depends on the length of the contour [17]. In our method, $P(\mathcal{P}(R))$ is omitted, because the penalty term introduced later has a better smoothing effect than the prior probability $P(\mathcal{P}(R))$, which only focuses on minimizing the curve length of the contour and lacks interaction with the image. The assumption that the regions of the optimal partition are independent is made. This assumption is reasonable, since the aim of the segmentation is to separate out the regions of the image whose properties are different. Thus, the following equation is obtained:
$P(I \mid \mathcal{P}(R)) = P(I \mid \mathcal{P}(R_{in}))\, P(I \mid \mathcal{P}(R_{out}))$   (6)
where $R_{in}$ and $R_{out}$ denote respectively the regions inside and outside the contour, and $P(I \mid \mathcal{P}(R_{in}))$ and $P(I \mid \mathcal{P}(R_{out}))$ are the object and background region likelihood functions respectively. Current methods usually assume that the pixels in each region are also independent [15,16], so the above formula can be rewritten as:
$P(I \mid \mathcal{P}(R)) = \prod_{x \in R_{in}} P(I(x) \mid \mathcal{P}(R_{in})) \prod_{x \in R_{out}} P(I(x) \mid \mathcal{P}(R_{out}))$   (7)
However, the hypothesis of pixel independence within each region is very weak for textured regions or those with repeated patterns, where there is local interaction between the pixels. To avoid this problem, we rewrite the region likelihood function so that it takes account of the neighboring relationships. We take the object region likelihood function as an example; the background region likelihood function is formulated by analogy. MRF theory states that the conditional probability of a pixel depends only on its neighborhood [10]. So the object region likelihood function can be approximated as:
$P(I \mid \mathcal{P}(R_{in})) = \prod_{x_{i} \in R_{in}} P(x_{i} \mid \mathcal{N}_{x_{i}}, \theta_{in})$   (8)
where $\theta_{in} = \{\omega_{in}, \mu_{in}^{c}, \Sigma_{in}^{c}, \mu_{in}^{t}, \Sigma_{in}^{t}\}$ denotes the parameters of the object GMM model and $\mathcal{N}_{x_{i}}$ is the neighborhood of pixel $x_{i}$ in the 2D image lattice. We introduce a penalty term to measure the influence of neighboring pixels on the center pixel. The penalty term encourages nearby pixels to fall into the same region, which is reasonable in most applications. A $(2w+1) \times (2w+1)$ square neighborhood centered at pixel $x_{i}$ is defined. Before explaining the penalty term in detail, let us define the label set first. The label of a pixel depends on its membership of the object or the background: if the pixel's posterior of belonging to the object is larger than that of belonging to the background, the label of that pixel is set to 1, otherwise to 0. The single pixel's object likelihood function, considering the interaction in the neighborhood, is proportional to the product of the pure likelihood function and the penalty term:
$P(x_{i} \mid \mathcal{N}_{x_{i}}, \theta_{in}) \propto P(x_{i} \mid \theta_{in}) \cdot \exp\left[\mathrm{sign} \cdot \frac{1}{\sigma^{2}}\left(\frac{\max(N_{1}, N_{0})}{N_{1} + N_{0}}\right)^{2}\right]$   (9)
where the pure likelihood function is the general likelihood function that only considers the pixel itself and is computed using (4), the exponential function is the penalty term, $N_{1}$ and $N_{0}$ are the numbers of neighboring pixels with label 1 and label 0 respectively, $\sigma$ is the parameter controlling how fast the exponential function converges to zero, and sign is a piecewise function:
$\mathrm{sign} = \begin{cases} 1 & \text{if } (L - \tfrac{1}{2})(N_{1} - N_{0}) > 0 \\ 0 & \text{if } (L - \tfrac{1}{2})(N_{1} - N_{0}) = 0 \\ -1 & \text{if } (L - \tfrac{1}{2})(N_{1} - N_{0}) < 0 \end{cases}$   (10)
We define L as:
$L = \begin{cases} 1 & \text{when computing } P(x_{i} \mid \mathcal{N}_{x_{i}}, \theta_{in}) \\ 0 & \text{when computing } P(x_{i} \mid \mathcal{N}_{x_{i}}, \theta_{out}) \end{cases}$   (11)
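A small sketch of the penalty-weighted likelihood follows. The exponent implements our reconstruction of Eq. (9), which may differ in detail from the published formula; the neighborhood size w and the convergence parameter σ default to the values reported in Section 4.

```python
import numpy as np

def neighbourhood_penalty(labels, i, j, L, w=6, sigma=0.21):
    """Penalty factor of Eqs. (9)-(11) for the pixel at row i, column j.

    labels : H x W array of 0/1 labels (1 = object, 0 = background).
    L      : 1 when evaluating the object likelihood, 0 for the background one.
    """
    patch = labels[max(0, i - w): i + w + 1, max(0, j - w): j + w + 1]
    n1 = int(np.count_nonzero(patch)) - int(labels[i, j])   # neighbours with label 1
    n0 = (patch.size - 1) - n1                               # neighbours with label 0
    if n1 == n0:
        return 1.0                                           # feature 3: no influence
    sign = 1.0 if (L - 0.5) * (n1 - n0) > 0 else -1.0        # Eq. (10)
    ratio = max(n1, n0) / float(n1 + n0)
    return float(np.exp(sign * ratio ** 2 / sigma ** 2))     # Eq. (9), penalty term

def penalised_likelihood(pure_likelihood, labels, i, j, L, w=6, sigma=0.21):
    """P(x_i | N_{x_i}, theta) ~ P(x_i | theta) * penalty, as in Eq. (9)."""
    return pure_likelihood * neighbourhood_penalty(labels, i, j, L, w, sigma)
```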
The single pixel's background likelihood function $P(x_{i} \mid \mathcal{N}_{x_{i}}, \theta_{out})$ can be formulated similarly, just by replacing $\theta_{in}$ with $\theta_{out}$ in Formula (9). The penalty term has the following features:
1. If the label of the center pixel is the same as the labels of most neighboring pixels, the penalty term has an increasing effect on the likelihood function. The increasing extent depends on the difference between $N_{1}$ and $N_{0}$: the bigger the difference, the more the likelihood is increased.
2. If the label of the center pixel is not identical to the labels of most neighboring pixels, the penalty term has a decreasing effect on the likelihood function. The bigger the difference between $N_{1}$ and $N_{0}$, the more the likelihood is decreased.
3. If $N_{1}$ is equal to $N_{0}$, the penalty term equals one and has no influence on the likelihood function.
With the penalty term, the posterior partition probability is expressed as:
$P(\mathcal{P}(R) \mid I) \propto \prod_{x_{i} \in R_{in}} P(x_{i} \mid \mathcal{N}_{x_{i}}, \theta_{in}) \prod_{x_{j} \in R_{out}} P(x_{j} \mid \mathcal{N}_{x_{j}}, \theta_{out})$   (12)
Converting the MAP problem to an energy minimization problem, the energy equation is obtained:
$E = -\log P(\mathcal{P}(R) \mid I) = -\int_{x_{i} \in R_{in}} \log P(x_{i} \mid \mathcal{N}_{x_{i}}, \theta_{in})\, dx_{i} - \int_{x_{j} \in R_{out}} \log P(x_{j} \mid \mathcal{N}_{x_{j}}, \theta_{out})\, dx_{j}$   (13)
Minimizing the above energy function by solving the corresponding Euler-Lagrange equations [16], we obtain the level sets evolution speed model, in which a $(2l+1) \times (2l+1)$ square neighboring subregion around the center pixel is defined, resembling the definition of the square neighborhood in the penalty term. The object and background posterior probabilities, which we denote by $P_{R_{in}}(I \mid \tilde{x})$ and $P_{R_{out}}(I \mid \tilde{x})$, are also calculated in the speed model under the assumption that they have the same prior probabilities:
$P_{R_{in}}(I \mid \tilde{x}) = \dfrac{P(\tilde{x} \mid \mathcal{N}_{\tilde{x}}, \theta_{in})}{P(\tilde{x} \mid \mathcal{N}_{\tilde{x}}, \theta_{in}) + P(\tilde{x} \mid \mathcal{N}_{\tilde{x}}, \theta_{out})}$   (14)
$P_{R_{out}}(I \mid \tilde{x}) = \dfrac{P(\tilde{x} \mid \mathcal{N}_{\tilde{x}}, \theta_{out})}{P(\tilde{x} \mid \mathcal{N}_{\tilde{x}}, \theta_{in}) + P(\tilde{x} \mid \mathcal{N}_{\tilde{x}}, \theta_{out})}$   (15)
The level sets advection speed model of each pixel is obtained by:
$F_{x,y} = \sum_{i=-l}^{l} \sum_{j=-l}^{l} \Big[\log P_{R_{in}}(I \mid \tilde{x})\, H_{a}(\phi(\tilde{x}, t)) - \log P_{R_{out}}(I \mid \tilde{x})\, \big(1 - H_{a}(\phi(\tilde{x}, t))\big)\Big]$   (16)
where $\tilde{x}$ ranges over the neighboring pixels of (x, y), $\tilde{x} = (x + i, y + j)$, and $H_{a}(\phi(\tilde{x}, t))$ is a Heaviside function:
$H_{a}(\phi(\tilde{x}, t)) = \begin{cases} 0 & \phi(\tilde{x}, t) < 0 \\ 1 & \phi(\tilde{x}, t) \geq 0 \end{cases}$   (17)
The contour is evolved to the desired boundary by modifying $\phi$ iteratively with the overall speed F in the normal direction:
$\dfrac{\partial \phi}{\partial t} + (F_{adv} + F_{curv})\, \lvert \nabla \phi \rvert = 0$   (18)
where $F_{adv} = F_{x,y}$ is the external force reflecting the data attachment and $F_{curv}$ is the internal force proportional to the curvature $\kappa$. The detailed stable numerical approximation scheme of the above equation is given in [8].
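The speed model and the contour update can be sketched as follows, assuming the per-pixel log-posteriors of Eqs. (14)-(15) are available as dense arrays. The sign conventions follow our reconstruction of Eqs. (16)-(18), and the dense explicit update shown here stands in for the Narrow Band scheme [2] actually used in the paper.

```python
import numpy as np
from scipy.ndimage import convolve

def advection_speed(log_p_in, log_p_out, phi, l=2):
    """Eq. (16): window-summed region log-posteriors, gated by Eq. (17)."""
    h_a = (phi >= 0).astype(float)                 # Eq. (17): 1 outside, 0 inside
    kernel = np.ones((2 * l + 1, 2 * l + 1))       # (2l+1) x (2l+1) subregion
    term_in = convolve(log_p_in * h_a, kernel, mode='nearest')
    term_out = convolve(log_p_out * (1.0 - h_a), kernel, mode='nearest')
    return term_in - term_out

def evolve_level_set(phi, f_adv, curvature, dt=0.1):
    """Eq. (18): one explicit step of d(phi)/dt + (F_adv + F_curv)|grad phi| = 0."""
    gy, gx = np.gradient(phi)
    return phi - dt * (f_adv + curvature) * np.hypot(gx, gy)
```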
4 Experiments
To verify our method, we have performed a number of experiments on various sequences. Furthermore, some comparisons have been implemented. In our experiments, all the videos are captured with a moving camera, and a tracked object is represented with a gray contour (colored in the color images). The evolution of the level sets is implemented using the fast Narrow Band approach [2]. The contour is initialized based on the optical flow detection results in the first frame. Both object and background features are adopted. T in (1) is set to 0.2. The parameter σ in (9) is set to 0.21 for all the sequences. l in (16) is independent of the sequences and is fixed to 2. Among the above parameters, the most important one is w, which controls the size of the square neighborhood in the penalty term. Obviously, the bigger w is, the smoother the contour we can obtain. When w is zero, the neighborhood contains only the pixel itself and the penalty term does not work. This is evident from Fig. 1, where we show the tracking results with different values of w. This experiment is implemented considering only the color feature. From the experimental results, we can find that as w increases, the contour becomes smoother at the cost of losing more detail. Thus, choosing an appropriate parameter w is a trade-off between accuracy and smoothness of the tracking results. In our experiments, the parameter w is set to 6.
Fig. 1. Tracking results with different values of parameter w; from (a) to (f), w is 0, 2, 3, 4, 7 and 10
Fig. 2. Tracking results of several real sequences: (a) tracking results of indoor human walking sequence, the frame numbers are, respectively, 1, 16, 36, 70 and 114; (b) tracking results of two moving faces sequence, the frame numbers are, respectively, 1, 82, 144, 180 and 269; (c) tracking results of the football sequence, the frame numbers are, respectively, 1, 48, 75, 102 and 141
4.1 Results on Real Video Sequences
We first present tracking results on three real video sequences. In the first experiment, we track a person with a moving camera from Frame 1 to 114. The background is cluttered with items whose color is similar to that of the object. Both color and texture features are adopted to model the object and background regions in this sequence. As shown in Fig. 2 (a), despite the background disturbance, we still track the contour of the walking person accurately, and the arms and legs are also successfully tracked. (Please see the supplied video "indoor human tracking.avi".) In the second experiment, we demonstrate the performance of our method on a sequence of two moving faces. The camera zooms and moves as the people change their face poses continuously. Both color and texture features are considered. Different people are labeled with different grays (colors in the color image). The tracking results are shown in Fig. 2 (b). We are still able to track the contours of these two faces with high accuracy even though the faces' scales change a lot. (Please see the supplied video "two moving faces tracking.avi".) In the third experiment, we track a fast moving football from Frame 1 to 141. Based on the results of optical flow detection, we obtain many different initial contours corresponding to different moving objects (including players and the football). We choose the football as the object of interest and only model the object and background regions around the football. The color feature is enough to distinguish between foreground and background. As we can see from the tracking results shown in Fig. 2 (c), we keep good track of the football, pointed to by an arrow, even when it is moving with a very high speed. In the
Fig. 3. Tracking results for two comparisons: the first and third column are obtained by our method with the penalty term, and the second and fourth column are using Yilmaz’s speed model without considering the penalty term. (a) the first comparison, from top to bottom, the frame numbers are, respectively, 1, 28, 96 and 153; (b) the second comparison, from top to bottom, the frame numbers are, respectively, 1, 24, 55 and 74.
102nd frame, there is severe motion blur, but the football is still tracked robustly. The zoomed-in contour is shown in the top right corner of each image. (Please see the supplied video "football tracking.avi".)
4.2 Comparison
One of the characteristics of our method is to encode the Markov property into an additional penalty term expressed in our speed model, while Yilmaz's method [16] does not. Yilmaz's method is a typical region-based active contour method which does not consider the interaction between pixels when computing the likelihood function. Here we show two comparisons between these two methods, where only the color feature is taken into account. The first comparison is implemented on the Mickey head tracking sequence, in which we artificially introduce some disturbances for the tracked object. The color of the scrips in the background is the same as the color of the Mickey head. The camera zooms and tilts as the object moves. From the comparison results shown in Fig. 3 (a), it is obvious that the tracking results obtained using our method are more accurate and much smoother than those obtained using Yilmaz's method. The influence of the background disturbance is eliminated to some extent by our method. (Please see the supplied video "comparison 1.avi".)
A similar comparison is implemented on the real outdoor human walking sequence captured with a moving camera. The colors of some areas in the background are similar to those of the person’s clothes. The tracking results are shown in Fig. 3 (b) from Frame 1 to 74. The comparison results have demonstrated that the method without the penalty term is more sensitive to background disturbance and our method more accurately tracks the objects when background disturbance occurs. (Please see the supplied video “comparison 2.avi”)
5 Conclusion
In this paper, we have proposed an MRF modeled level sets method for object tracking with moving cameras. In our method, the contour is initialized automatically based on optical flow detection. The penalty term reflecting the interaction between pixels is introduced to reduce the influence of background disturbance and to smooth the contour. Our method has been tested on several real video sequences; objects are accurately tracked even when only the color information is considered. The comparison experiments have demonstrated that our method outperforms the general region-based method which does not consider correlations between neighboring pixels.
Acknowledgments This work is partly supported by NSFC (Grant No. 60520120099 and 60672040) and the National 863 High-Tech R&D Program of China (Grant No. 2006AA01Z453).
References
1. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: active contour models. In: IJCV, vol. 1, pp. 321–331 (1988)
2. Adalsteinsson, D., Sethian, J.: A fast level set method for propagating interfaces. J. Comput. Phys. 118, 269–277 (1995)
3. Osher, S., Sethian, J.: Fronts propagation with curvature-dependent speed: algorithms based on Hamilton-Jacobi formulations. J. Comput. Phys. 79, 12–49 (1988)
4. Horn, B., Schunck, B.: Determining optical flow. Artificial Intelligence 17, 185–203 (1981)
5. Wren, C.R., Azarbayejani, A., Darrell, T., Pentland, A.: Pfinder: real-time tracking of the human body. IEEE Trans. PAMI 19, 780–785 (1997)
6. Partio, M., Cramariuc, B., Gabbouj, M.: Rock texture retrieval using gray level co-occurrence matrix. In: NSPS, NORSIC 2002, October 4-7, 2002, p. 5 (2002)
7. Stauffer, C., Grimson, W.E.L.: Adaptive background mixture models for real-time tracking. In: CVPR, vol. 2, pp. 246–252 (1999)
8. Sethian, J.A.: Level set methods and fast marching methods: evolving interfaces in computational geometry, fluid mechanics, computer vision, and materials science. Cambridge University Press, Cambridge (1999)
9. Chan, T.F., Vese, L.A.: Active contours without edges. IEEE Trans. on IP 10, 266–277 (2001)
10. Xu, D.X., Hwang, J.N., Yuan, C.: Segmentation of multi-channel image with markov random field based active contour model. In: JVLSI, vol. 31, pp. 45–55 (2002)
11. Haritaoglu, I., Harwood, D., Davis, L.S.: W4: real-time surveillance of people and their activities. IEEE Trans. PAMI 22, 809–830 (2000)
12. Caselles, V., Kimmel, R., Sapiro, G.: Geodesic active contours. In: IJCV, vol. 22, pp. 61–79 (1997)
13. Paragios, N., Deriche, R.: Geodesic active contours and level sets for the detection and tracking of moving objects. IEEE Trans. PAMI 22, 266–280 (2000)
14. Bailloeul, T.: Active contours and prior knowledge for change analysis: application to digital urban building map updating from optical high resolution remote sensing images. PhD thesis, October 2005
15. Zhu, S.C., Yuille, A.: Region competition: unifying snakes, region growing and Bayes/MDL for multiband image segmentation. IEEE Trans. PAMI 18, 884–900 (1996)
16. Yilmaz, A., Li, X., Shah, M.: Object contour tracking using level sets. In: ACCV (2004)
17. Shi, Y., Karl, W.C.: Real-time tracking using level sets. In: CVPR, vol. 2, pp. 34–41 (2005)
18. Leventon, M., Grimson, E., Faugeras, O.: Statistical shape influence in geodesic active contours. In: CVPR, vol. 1, pp. 316–323 (2000)
19. Cremers, D.: Dynamical statistical shape priors for level set based tracking. IEEE Trans. PAMI 28, 1262–1273 (2006)
Continuously Tracking Objects Across Multiple Widely Separated Cameras Yinghao Cai, Wei Chen, Kaiqi Huang, and Tieniu Tan National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences P.O.Box 2728, Beijing, 100080, China {yhcai,wchen,kqhuang,tnt}@nlpr.ia.ac.cn
Abstract. In this paper, we present a new solution to the problem of multi-camera tracking with non-overlapping fields of view. The identities of moving objects are maintained when they are traveling from one camera to another. Appearance information and spatio-temporal information are explored and combined in a maximum a posteriori (MAP) framework. In computing appearance probability, a two-layered histogram representation is proposed to incorporate spatial information of objects. Diffusion distance is employed to histogram matching to compensate for illumination changes and camera distortions. In deriving spatio-temporal probability, transition time distribution between each pair of entry zone and exit zone is modeled as a mixture of Gaussian distributions. Experimental results demonstrate the effectiveness of the proposed method.
1 Introduction
Nowadays, a distributed network of video sensors is applied to monitor activities over a complex area. Instead of having a high resolution camera with a limited field of view, multiple cameras provide a solution to wide area surveillance by extending the field of view of a single camera. Various types of camera overlap and non-overlap can be employed in multi-camera surveillance systems. Continuously tracking objects across cameras is usually termed "object handover". The objective of handover is to maintain the identities of moving objects when they are traveling from one camera to another. More specifically, when an object appears in one camera, we need to determine whether it has previously appeared in other cameras or is a new object. In earlier work on handover, either calibrated cameras or overlapping fields of view are required. Subsequent approaches to handover recover the relative positions between cameras by statistical consistency. Statistical information reveals a trend of how people are likely to move between cameras. Possible cues for tracking across cameras include appearance information and spatio-temporal information. Appearance information includes the size, color and height of a moving object, etc., while spatio-temporal information refers to transition time, velocity, entry zone, exit zone, trajectory, etc. These cues impose a constraint on possible transitions between cameras; for example, a person who leaves the field of view of one camera
at exit zone A will never appear at entry zone B of another camera at the opposite direction of his or her moving direction. Combining appearance information with spatio-temporal information is promising since it does not require a priori calibration and is able to adapt to changes in the cameras’ positions. In this context, tracking objects across cameras is achieved through computing the probability of correspondence according to appearance and spatio-temporal cues. Since cameras are non-overlapping, the appearances of moving objects under multiple non-overlapping cameras may exhibit significant differences due to different illumination conditions, poses and camera parameters. Even under the same scene, the illumination conditions vary over time. As to spatio-temporal information, the transition time from one camera to another differs dramatically from person to person. Some people may wander along the way, while others are rushing against time. In addition, as pointed out in [1], the more dense the observations and the longer the transition time, the more likely the false correspondences. In this paper, we solve these problems under a maximum a posteriori (MAP) framework. The probability of two observations under two cameras generated from the same object is dependent on both appearance probability and spatiotemporal probability. At the off-line training stage, we assume the correspondences between objects are known. The parameters for appearance matching and transition distributions between each pair of entry and exit zone are learned. At the testing stage, correspondences are assigned according to appearance and spatio-temporal probability under the MAP framework. Experimental results demonstrate the effectiveness of the proposed algorithm. In the remainder of this paper, an overview of the related work is in Section 2. In Section 3, experimental setup is described. The MAP framework is presented in Section 4 with appearance probability and spatio-temporal probability described. Experimental results and conclusions are given in Section 5 and Section 6 respectively.
2 Related Work
To compensate color variations under two separated cameras, one solution is by color normalization. Niu et al. [2] employ a comprehensive color normalization algorithm (CCN) to remove image dependency on lighting geometry and illuminant color. This procedure is an iterative process until no change is detected. An alternative solution to the problem is by finding a transformation matrix [3] or a mapping function [4] which map the appearance of one object to its appearance under another view. In [3], the transformation matrix is obtained by solving a linear matrix equation. Javed et al. [4] show that all brightness transfer functions (BTF) from one camera to another lie in a low dimensional subspace. [4] assumes planar surfaces and uniform lighting which are undesirable in real applications. In determining the spatio-temporal relationship between pairs of cameras, Javed et al. [5] employ a non-parametric Parzen window technique to estimate the spatio-temporal pdfs between cameras. In [6], it is assumed that all pairs
of arrival and departure events contribute to the distribution of transition time. Observations of transition time are accumulated into a reappearance period histogram. The peak of the reappearance period histogram indicates the most popular transition time. No appearance information is used in [6]. Furthermore, [2,3] weight the temporally correlating information by appearance information: only those observations which look similar in appearance are used to derive the spatio-temporal pdfs. Both [6] and [2,3] assume a single-mode transition distribution and are not flexible enough to deal with multi-modal transition situations. In this paper, a two-layered histogram representation is proposed to incorporate spatial information of objects. This representation provides more descriptive ability than computing the histogram of the whole body directly. Furthermore, instead of modeling color changes between cameras explicitly as a mapping function or a transformation matrix, we apply the diffusion distance [7] to histogram matching to compensate for illumination changes and camera distortions. To deal with multi-modal transition situations, we model the spatio-temporal probability between each pair of entry zone and exit zone as a mixture of Gaussians. Correspondences are assigned according to the appearance and spatio-temporal probabilities under the MAP framework.
Fig. 1. (a) The layout of the camera system, (b) Three views from three widely separated cameras
3 Experimental Setup
The experimental setup consists of three cameras with non-overlapping fields of view. The cameras are widely separated, including two outdoor settings and one indoor setting. The layout is shown in Figure 1(a). As we can see from Figure 1(b), illumination conditions are quite different. In single camera motion detection and tracking, Gaussian Mixture Model(GMM) and Kalman filter are applied, respectively. Figure 2(a), (b) and (c) show numbers of people in camera C1 , C2 , C3 respectively. The number of people in each view is obtained by single camera tracking.
Fig. 2. (a-c) Numbers of people in camera C1 , C2 , and C3 respectively
Dense observations make the handover problem more difficult. However, the proposed method provides a satisfactory result given the difficulties above.
4 Bayesian Framework
Suppose we have m people $p_1, p_2, \ldots, p_m$ under n cameras $C_1, C_2, \ldots, C_n$; the observation under camera i (j) of moving object $p_a$ ($p_b$) is represented as $O_i^a$ ($O_j^b$). The observations of a moving object $p_a$ include appearance and spatio-temporal properties, which are represented as $O_i^a(app)$ and $O_i^a(st)$ respectively. According to Bayesian theory, given two observations $O_i^a$ and $O_j^b$ under two cameras, the probability of these observations being generated from the same object is [5]:
$P(a = b \mid O_i^a, O_j^b) = \dfrac{P(O_i^a, O_j^b \mid a = b)\, P(a = b)}{P(O_i^a, O_j^b)}$   (1)
where the denominator $P(O_i^a, O_j^b)$ is the normalization term and $P(O_i^a, O_j^b \mid a = b)$ depends on both the appearance probability and the spatio-temporal probability. $P(a = b)$ is a constant term denoting the probability of a transition from camera i to camera j, defined as
$P(a = b) = \dfrac{\text{Num of transitions from } C_i \text{ to } C_j}{\text{Num of people exiting } C_i}$   (2)
Since the appearance of each object does not depend on its spatio-temporal property, we assume independence between $O_i^a(app)$ and $O_i^a(st)$. So we have
$P(a = b \mid O_i^a, O_j^b) \propto P(O_i^a(app), O_j^b(app) \mid a = b) \times P(O_i^a(st), O_j^b(st) \mid a = b)$   (3)
The handover problem is now formalized as follows: given an observation $O_i^a$ under camera i, we need to find the observation among the candidates $Q_i^a$, in a time sliding window of $O_i^a$ under camera j, which maximizes the posterior probability $P(a = b \mid O_i^a, O_j^b)$:
$h = \arg\max_{\forall O_j^b \in Q_i^a} P(a = b \mid O_i^a, O_j^b)$   (4)
The appearance probability $P(O_i^a(app), O_j^b(app) \mid a = b)$ and the spatio-temporal probability $P(O_i^a(st), O_j^b(st) \mid a = b)$ are computed in Sections 4.2 and 4.3, respectively.
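The decision rule of Eq. (4) amounts to the loop below over the candidates in the sliding window. The arguments p_app, p_st and p_transition are placeholders for the appearance probability (Sect. 4.2), the spatio-temporal probability (Sect. 4.3) and the prior of Eq. (2); their names are our own and not part of the paper.

```python
import math

def handover(query_obs, candidates, p_app, p_st, p_transition):
    """Pick the candidate observation maximizing the posterior of Eqs. (3)-(4).

    query_obs    : observation O_i^a from camera i
    candidates   : observations O_j^b from camera j inside the time sliding window
    p_app, p_st  : callables returning the appearance / spatio-temporal likelihoods
    p_transition : the prior P(a = b) of Eq. (2) for this camera pair
    """
    best, best_score = None, -math.inf
    for cand in candidates:
        score = p_app(query_obs, cand) * p_st(query_obs, cand) * p_transition
        if score > best_score:
            best, best_score = cand, score
    return best, best_score   # best is None if the window is empty
```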
4.1 Moving Object Representation
The purpose of moving object representation is to describe appearance of each object so as to be discriminable from other objects. Histogram is a widely used appearance descriptor. The main drawback of histogram-based methods is that they lose spatial information of the color distribution which is essential to discriminate different moving objects. For example, histogram-based methods can not tell a person wearing a white shirt and blue pants from another person who dresses in a blue shirt and white pants.
Fig. 3. A two-layered histogram representation: (a,e) Histogram of the body, (b-d, f-h) Histograms of head, torso and legs respectively
In this paper, we propose a new moving object representation method based on a two-layered histogram. As pedestrians are our primary concern, the human body is divided into three subregions (head, torso and bottom) in the vertical direction, similar to the method in [8]. The first layer of the proposed representation corresponds to the color histogram of the whole body, Htotal, while the second layer consists of the histograms of the head, torso and legs, represented by Hh, Ht and Hl respectively. Histograms are quantized into 30 bins in the R, G, B channels separately. It is worth pointing out that coarse quantization discards too much discriminatory information, while fine quantization results in sparse histogram representations; our preliminary experiments validate the adequacy of thirty bins in terms of discriminability and accuracy. Figure 3 shows the separated regions and their histogram representations. A two-layered histogram representation captures both a global image description and local spatial information. Figure 3 shows that two different people can have visually similar Htotal, yet their Ht s are quite different, which demonstrates that the proposed two-layered representation provides more discriminability than computing the histogram of the whole body directly. Each layer of the representation under one view is matched against its corresponding layer under another view in the next subsection.
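A minimal sketch of the two-layered representation is shown below. The 20%/40%/40% vertical split into head, torso and legs is an assumption (the paper only states that the body is divided into three subregions); the per-channel 30-bin quantization follows the text.

```python
import numpy as np

def two_layer_histogram(person_rgb, bins=30):
    """Two-layered colour histograms: whole body plus head, torso and legs.

    person_rgb : H x W x 3 RGB crop of the detected person (values in 0..255).
    Returns a dict of per-part histograms, each a 3 x bins array (one row per channel).
    """
    h = person_rgb.shape[0]
    parts = {
        'total': person_rgb,
        'head':  person_rgb[: int(0.2 * h)],            # assumed 20% / 40% / 40% split
        'torso': person_rgb[int(0.2 * h): int(0.6 * h)],
        'legs':  person_rgb[int(0.6 * h):],
    }
    out = {}
    for name, region in parts.items():
        hist = np.stack([
            np.histogram(region[..., c], bins=bins, range=(0, 255))[0]
            for c in range(3)                            # R, G and B treated separately
        ]).astype(float)
        out[name] = hist / max(hist.sum(), 1.0)          # normalise for comparability
    return out
```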
4.2 Histogram Matching
As we mentioned in Section 1, the appearances of moving objects under multiple non-overlapping cameras exhibit significant differences due to different illumination conditions, poses and camera parameters. To compute the appearance probability given observations under two cameras, we first obtain the two-layered histogram representation of Section 4.1. The histogram representation provides robustness to pose changes to some degree. In this section, we apply the diffusion distance to histogram matching to compensate for illumination changes and camera distortions. The diffusion distance was first proposed by Ling et al. [7]. This approach models the difference between two histograms as a temperature field. Firstly, an initial distance between two histograms is defined. The diffusion process on this temperature field diffuses the difference between the two histograms by a Gaussian kernel; as time increases, the difference between these two histograms approaches zero. Therefore, the distance between two histograms can be defined as the sum of dissimilarities over this process [7]:
$K(hist1, hist2) = \sum_{i=0}^{N} k(\lvert d_i(x) \rvert)$   (5)
where
$d_0(x) = hist1(x) - hist2(x)$   (6)
$d_i(x) = [d_{i-1}(x) * \phi(x, \sigma)] \downarrow_2, \quad i = 1, \ldots, N$   (7)
"$\downarrow_2$" denotes half-size downsampling, $\sigma$ is the standard deviation of the Gaussian filter, which can be learned in the training phase, and $k(\lvert \cdot \rvert)$ is chosen as the L1 norm. Each subsequent distance $d_i(x)$ is defined as the half-size downsampling of its former layer.
Fig. 4. Diffusion distance plotted on the same figure. (a) Diffusion process for the difference of histograms of the same person under two views, (b) Diffusion process for different people.
Then, the ground distance between two histograms is defined as the sum of norms over the N scales of the pyramid. An intuitive illustration is shown in Figure 4: Figure 4(a) shows the diffusion process for the difference between histograms of the same person under two views, and Figure 4(b) shows the diffusion process for different people. We can see that (a) decays faster than (b). In our method, we compare Htotal, Hh, Ht and Hl of one object with its corresponding histograms under another view using the diffusion distance. The histogram representation is one-dimensional, since we treat each channel R, G, B separately. The four diffusion distances dtotal, dh, dt and dl are combined by a weighted sum. At the training stage, we fit a Gaussian distribution to the distances between the same object under different views; finally, distances are transformed into probabilities to obtain the appearance probability $P(O_i^a(app), O_j^b(app) \mid a = b)$. A comparison with other histogram distances is shown in Section 5.
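The pyramid computation of Eqs. (5)-(7) can be sketched for a one-dimensional histogram as follows. The Gaussian width σ and the number of layers are fixed here purely for illustration, whereas the paper learns σ during training.

```python
import numpy as np

def diffusion_distance(hist1, hist2, n_layers=5, sigma=1.0):
    """Diffusion distance of Eqs. (5)-(7) for one-dimensional histograms."""
    def smooth(x, sigma):
        radius = max(1, int(3 * sigma))
        t = np.arange(-radius, radius + 1)
        kernel = np.exp(-t ** 2 / (2 * sigma ** 2))
        return np.convolve(x, kernel / kernel.sum(), mode='same')

    d = np.asarray(hist1, dtype=float) - np.asarray(hist2, dtype=float)  # d_0, Eq. (6)
    total = np.abs(d).sum()                       # k(|d_0|) with k the L1 norm
    for _ in range(n_layers):
        d = smooth(d, sigma)[::2]                 # Eq. (7): Gaussian filter, downsample
        total += np.abs(d).sum()
        if d.size <= 1:
            break
    return float(total)
```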
4.3 Spatio-temporal Information
To estimate the spatio-temporal relationship between pairs of cameras, at the off-line training stage we group the locations where objects appear (entry zones) and disappear (exit zones) by k-means clustering. The transition time distribution between each pair of entry zone and exit zone is modeled as a mixture of Gaussian distributions. In this paper we choose K = 3; the three Gaussian distributions correspond to people walking slowly, at normal speed, and quickly. The probability of a test transition time x is
$P(x) = \sum_{i=1}^{3} \omega_i\, \eta(x, \mu_i, \sigma_i)$   (8)
where $\omega_i$ is the weight of the ith Gaussian in the mixture; $\omega_i$ can be interpreted as the prior probability of the random variable being generated by the ith Gaussian distribution. $\mu_i$ and $\sigma_i$ are the mean value and the standard deviation of the ith Gaussian, and $\eta$ is the Gaussian probability density function
$\eta(x, \mu, \sigma) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$   (9)
Fig. 5. (a) Transition distribution from Camera 1 to Camera 2, (b) Transition distribution from Camera 2 to Camera 3
The parameters of the model are estimated by expectation maximization (EM). It should be noted that a single Gaussian distribution cannot accurately model the transition time distributions between cameras, due to the variability of walking paces. Figure 5 shows a transition distribution and its approximations by a mixture of Gaussian distributions and by a single Gaussian distribution.
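As an illustration of the transition-time model of Eqs. (8)-(9), the mixture can be fitted and evaluated as below, using scikit-learn's EM implementation as a stand-in for the paper's own EM procedure.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_transition_model(transition_times, k=3):
    """Fit the K = 3 Gaussian mixture of Eq. (8) to observed transition times (seconds)."""
    t = np.asarray(transition_times, dtype=float).reshape(-1, 1)
    return GaussianMixture(n_components=k, covariance_type='full').fit(t)

def transition_probability(model, x):
    """P(x) of Eq. (8) for a test transition time x."""
    return float(np.exp(model.score_samples(np.array([[float(x)]])))[0])
```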
5 Experimental Results
Experiments are carried out on two outdoor settings and one indoor setting as shown in Figure 1. The off-line training phase lasts 40 minutes, and evaluation of the effectiveness of the algorithm is performed using ground-truthed sequences lasting an hour. At the off-line training stage, locations where people appear and disappear are grouped together as entry zones and exit zones respectively in Figure 6. It takes approximately 40-70 seconds to exit from Camera 1 to Camera 2 and from Camera 2 to Camera 3. Some sample images under the three views are shown in Figure 7. Our first experiment consists of transitions from Camera 1 to Camera 2 with two outdoor settings. Our second experiment is carried out on Camera 2 and Camera 3. Numbers of correspondence pairs in the training stage, transitions and detected tracks in the testing stage are summarized in Table 1.
Fig. 6. (a-c) Entry zones and exit zones for Camera 1, 2 and 3, respectively
Fig. 7. Each column contains the same person under two different views

Table 1. Experimental Description
                Training Stage          Testing Stage
                Correspondence Pairs    Transition Nums    Detected Tracks
Experiment 1    100                     107                150
Experiment 2    50                      75                 100
Fig. 8. Rank Matching Performance. “app” denotes using appearance information only, “st” means using spatio-temporal information only, “app & st” means both appearance information and spatio-temporal information are employed. (a) Rank Matching Performance of Experiment 1. (b) Rank Matching Performance of Experiment 2.
Fig. 9. Continuously tracking objects across three non-overlapping views
Fig. 10. Rank 1 rates for diffusion distance, L1 distance and histogram intersection
Figure 8 shows our rank matching performance. Rank i (i = 1...5) performance is the rate that the correct person is in the top i of the handover list. Different people with similar appearances bring uncertainties into the system which can explain the rank one accuracy of 87.8% in Experiment 1 and 76% in Experiment 2. By taking the top three matches into consideration, the performance is improved to 97.5% and 98.6% respectively. People are tracked correctly in Figure 9. As a comparison between diffusion distance, the widely used L1 distance and histogram intersection distance [9], we use the same framework and replace the diffusion distance with L1 and histogram intersection distance. The rank 1 rates for different distances are shown in Figure 10, which demonstrates the superiority of the proposed diffusion distance.
6 Conclusion and Future Work
In this paper, we have presented a new solution to the problem of multi-camera tracking with non-overlapping fields of view. People are tracked correctly across the widely separated cameras by combining appearance and spatio-temporal cues under the MAP framework. Experimental results validate the effectiveness of the proposed algorithm. The proposed method requires an off-line training phase where parameters for appearance matching and transition probabilities are learned. Future work will focus on evaluation of the proposed method on larger datasets.
Acknowledgement This work is partly supported by National Basic Research Program of China (No. 2004CB318110), the National Natural Science Foundation of China (No. 60605014, No. 60335010 and No. 2004DFA06900) and CASIA Innovation Fund for Young Scientists.
References
1. Tieu, K., Dalley, G., Grimson, W.E.L.: Inference of non-overlapping camera network topology by measuring statistical dependence. In: Computer Vision, 2005. Proceedings. Ninth IEEE International Conference on, pp. 1842–1849. IEEE Computer Society Press, Los Alamitos (2005)
2. Niu, C., Grimson, E.: Recovering non-overlapping network topology using far-field vehicle tracking data. In: ICPR 2006. Pattern Recognition, 18th International Conference on, pp. 944–949 (2006)
3. Gilbert, A., Bowden, R.: Tracking objects across cameras by incrementally learning inter-camera colour calibration and patterns of activity. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 125–136. Springer, Heidelberg (2006)
4. Javed, O., Shafique, K., Shah, M.: Appearance modeling for tracking in multiple nonoverlapping cameras. In: CVPR 2005. Computer Vision and Pattern Recognition, pp. 26–33. IEEE Computer Society, Los Alamitos (2005)
5. Javed, O., Rasheed, Z., Shafique, K., Shah, M.: Tracking across multiple cameras with disjoint views. In: Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, pp. 952–957. IEEE Computer Society Press, Los Alamitos (2003)
6. Makris, D., Ellis, T., Black, J.: Bridging the gaps between cameras. In: CVPR 2004. Computer Vision and Pattern Recognition, pp. 205–210. IEEE Computer Society Press, Los Alamitos (2004)
7. Ling, H., Okada, K.: Diffusion distance for histogram comparison. In: Computer Vision and Pattern Recognition, pp. 246–253. IEEE Computer Society Press, Los Alamitos (2006)
8. Hu, M., Hu, W., Tan, T.: Tracking people through occlusions. In: ICPR 2004. Pattern Recognition, 17th International Conference on, pp. 724–727 (2004)
9. Swain, M.J., Ballard, D.H.: Indexing via color histograms, pp. 390–393 (1990)
Adaptive Multiple Object Tracking Using Colour and Segmentation Cues Pankaj Kumar, Michael J. Brooks, and Anthony Dick University of Adelaide School of Computer Science South Australia 5005
[email protected],
[email protected],
[email protected]
Abstract. We consider the problem of reliably tracking multiple objects in video, such as people moving through a shopping mall or airport. In order to mitigate difficulties arising as a result of object occlusions, mergers and changes in appearance, we adopt an integrative approach in which multiple cues are exploited. Object tracking is formulated as a Bayesian parameter estimation problem. The object model used in computing the likelihood function is incrementally updated. Key to the approach is the use of a background subtraction process to deliver foreground segmentations. This enables the object colour model to be constructed using weights derived from a distance transform operating over foreground regions. Results from foreground segmentation are also used to gain improved localisation of the object within a particle filter framework. We demonstrate the effectiveness of the approach by tracking multiple objects through videos obtained from the CAVIAR dataset.
1 Introduction
Reliably tracking multiple objects in video remains a highly challenging and unsolved problem. If, for example, we aim to track several people in an airport or shopping mall, we face difficulties associated with appearance and scale changes as each person moves around. Compounding this are occlusion problems that can arise when people meet or pass by each other. This paper is concerned with improving the reliability of multiple object tracking in surveillance video. Visual tracking of multiple objects is formulated in this work as a parameter estimation problem. Parameters describing the state of the object are estimated using a Bayesian technique where the constraints of Gaussianity and linearity do not apply. In Bayesian estimation, the posterior probability density function (pdf) p(Xt |Z T ) of the state vector Xt given a set of observations Z T obtained from the camera is computed at every step, as new observations become available. Many tracking algorithms with a fixed object model have already been designed [1], [2]. However, trackers with a fixed object model are typically unable to track objects for long because of changes in lighting conditions, pose, scale and view point and also due to camera noise.
One of the ways of improving object tracking has been to update the object model with the observation data. Nummiaro et al. [3] developed an adaptive particle filter tracker, updating the object model by taking a weighted average of the current and a new histogram of the object. Zhou et al. [4] proposed an observation generated by adapting the appearance model, motion model, noise variance and number of particles. Ross et al. [5] proposed an adaptive probabilistic real-time tracker that updates the model using an incremental update of a so-called eigenbasis. Another way to improve the tracking of an object in video is to use multiple cues such as colour, texture, motion, shape, etc. Brasnett et al. [6] integrated colour and texture cues in a particle filter framework for tracking an object. Wu and Huang [7] investigated the relationship amongst different modalities for robust visual tracking and identified efficient ways to facilitate tracking with simultaneous use of different modalities. Spengler and Schiele [8] integrated skin colour and intensity change cues using CONDENSATION [2] for tracking multiple human faces. Perez et al. [9] proposed a multiple cue tracker for tracking objects in front of a web cam. They introduced a generic importance sampling mechanism for data fusion and applied it to fuse various subsets of colour, motion, and stereo sound for tele-conferencing and surveillance using fixed cameras. Appearance update is not factored into the approach. Shao et al. [10] improved a multiple cue particle filter by using a motion model comprising background and foreground motion parameters. R. Collins and Y. Liu [11] and B. Han and L. Davis [12] presented methods for online selection of the most discriminative feature for tracking objects. In these methods, multiple feature spaces are evaluated and adjusted while tracking, in order to improve tracking performance. The hypothesis is that the features that best discriminate between object and background are also best for tracking the object. In this paper we utilise multiple cues and object model adaptation to achieve improved robustness and accuracy in tracking. We make use of two object description cues: a colour histogram capturing appearance, and spatial dimensions obtained from background-foreground segmentation capturing location and size. Object model adaptation is implemented via an autoregressive update with the region where the mode of the particles of the state vector for an object lies in the current frame.
2 Proposed Scheme
A particle filter is a special case of a Bayesian estimation process (see [13] for a tutorial on particle filters for real-time, nonlinear, non-Gaussian Bayesian tracking). The key idea of a particle filter is to approximate the probability distribution of the state $X_t$ of the object with a set of $N_s$ particles/hypotheses and weights,
$$\{X_t^i, w_t^i\}_{i=1}^{N_s}. \qquad (1)$$
Each particle is a hypothetical state of the object and the weight/belief for each hypothesis is computed using a likelihood function. Particle filter based
tracking algorithms have four main components, namely: object representation, observation representation, hypothesis generation and hypothesis evaluation. This paper proposes improvements in (a) the object and observation representation, using the information obtained from background-foreground segmentation, and (b) the hypothesis evaluation methodology. Background-foreground segmentation is a well-developed technology and many real-time tracking systems use it for the detection of moving objects. There are algorithms that detect moving foreground objects even when the camera is gradually moving [14].

Figure 1 presents a schematic of the approach taken in this paper. The image frame obtained from the video stream is processed by background subtraction using the method presented in [15]. Each foreground blob is measured as a rectangular region, specified by centroid, width and height. A data association and merge-split analysis is carried out between the objects and measurements using the method presented in [16]. A distance transform [17], [18] is applied to the foreground segmentation result. The foreground pixel intensity obtained from the distance transform is used to weight the pixel's contribution when building the object's histogram model. Our contention is that this gives a better object and candidate representation than that obtained using other kernel functions. The hypothesis of the object's state is also evaluated using the measurement of the object obtained from the foreground segmentation process. The beliefs from the two hypothesis evaluation processes are combined to compute the weights of the particles. The mode of the particles is then evaluated, and the state at the mode of the particles is used to update the object model in an auto-regressive formulation. Object update is suspended for objects that have undergone a merge.
2.1 Hypothesis Generation
An object state is given by $X_t = [x_c, y_c, W, H]^T$, where $x_c, y_c$ are the co-ordinates of the centroid of the object and $W, H$ are its width and height in the image frame. The hypothesis generation process is also known as the prediction step, and is denoted $p(X_{t+1}|X_t)$. New particles are generated using a proposal function $q(\cdot)$, called an importance density, and the object dynamics. Using the predicted particles and the hypothesis evaluation from the observation, the posterior probability distribution of the object state is computed. We use a random walk for the object dynamics for the following reasons:
1. Using constant velocity or constant acceleration object dynamics instead increases the dimensionality of the state space, which in turn increases exponentially the number of particles needed to track the object with similar accuracy.
2. In real-life situations, especially with humans walking and interacting with other objects in the scene, it is very difficult to know the object dynamics beforehand. Different people have different dynamics, and human motions and interactions are relatively unpredictable.
Fig. 1. This schematic highlights the flow of information in the proposed multi-cue, adaptive object model tracking method
The particles are predicted using the update
$$X_{t+1} = X_t + v_t, \qquad (2)$$
where $v_t$ is independent, identically distributed, zero-mean Gaussian noise. The importance density is chosen to be the prior,
$$q(X_{t+1}|X_t^i, Z_{t+1}) = p(X_{t+1}|X_t^i). \qquad (3)$$
The result of using this importance density is that, after resampling, the particles of the current instance are used to generate the particles for the next iteration.
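As a concrete illustration of Eqs. (2)-(3), the following NumPy sketch performs one random-walk prediction of the particle set and a multinomial resampling step driven by the particle weights. The noise scales are arbitrary values chosen for the example, not parameters reported by the authors.

```python
import numpy as np

def predict_particles(particles, rng, noise_std=(2.0, 2.0, 1.0, 1.0)):
    """Random-walk prediction X_{t+1} = X_t + v_t for particles [xc, yc, W, H]."""
    noise = rng.normal(0.0, noise_std, size=particles.shape)
    return particles + noise

def resample_particles(particles, weights, rng):
    """Multinomial resampling: draw N_s particles proportionally to their weights."""
    weights = weights / weights.sum()
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx]

# Example with N_s = 20 particles around an initial state.
rng = np.random.default_rng(0)
particles = np.tile([100.0, 50.0, 30.0, 60.0], (20, 1))
particles = predict_particles(particles, rng)
weights = rng.random(20)            # placeholder weights; Sect. 2.4 defines the real ones
particles = resample_particles(particles, weights, rng)
```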
2.2 Object Representation
An object is represented by its (previously specified) state $X_t$ and a colour model. The non-parametric representation of the colour histogram of the object is $P = \{p^{(u)}\}_{u=1\ldots m}$, where $m$ is the number of bins in the histogram. It has been argued in previous works [3], [1] that not all pixels contribute equally to the object or candidate model. Thus, for example, pixels on the boundary of a region are
typically more prone to errors than pixels in the interior of the region. A common strategy for overcoming this problem has been to use a kernel function such as the Epanechnikov kernel [19] to weight the pixels' contribution to the histogram. The same kernel function is applied irrespective of the position of the region. Our contention is that blind application of a kernel function can lead to (a) a drift problem when the object model is updated and (b) poor localisation of the object during a merge. Small errors can accumulate and ultimately the target model can become completely different from the actual object. Our strategy in building the object and candidate histograms is to weight a pixel's contribution by taking into account background-foreground segmentation information. To achieve this, the foreground segmentation result is first cleaned up using morphological operations. The Manhattan distance transform [18], [17] is then applied to obtain the weights of the pixels for their contribution to the object/candidate histogram. In a binary image the distance transform replaces the intensity of each foreground pixel with the distance of that pixel to its nearest background pixel. Thus, centrally located pixels (in the sense of being further from the background) receive greater weight and pixels on the boundary separating foreground and background receive small weights. The distance transform appears to be better suited for this purpose than more traditional kernel functions. The scores $p^{(u)}$ of the bins of the histogram model of the object, $P = \{p^{(u)}\}_{u=1\ldots m}$, are computed using the following equation:
$$p^{(u)} = \sum_{x_j \in \text{Foreground Region}} w(x_j)\, \delta(g(x_j) - u), \qquad (4)$$
where $\delta$ is the Kronecker delta function, $g(x_j)$ assigns a bin in the histogram to the colour at location $x_j$, and $w(x_j)$ is the weight of the pixel at location $x_j$ obtained by applying the distance transform to the foreground-segmented region. The weights for background pixels are almost zero, which makes it very unlikely that the tracker will shift to background regions of the scene. When two or more objects merge, this is detected using a merge-split algorithm [16], and the updating of the object model is temporarily halted.
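The following is a minimal sketch of the distance-transform-weighted histogram of Eq. (4), assuming OpenCV and NumPy are available. The bin count and colour quantisation are illustrative choices, not values specified by the authors.

```python
import numpy as np
import cv2

def dt_weighted_histogram(image_bgr, fg_mask, bins_per_channel=8):
    """Colour histogram weighted by the Manhattan (L1) distance transform.

    image_bgr : HxWx3 uint8 image
    fg_mask   : HxW uint8 mask, non-zero on foreground pixels
    """
    # Manhattan distance of each foreground pixel to the nearest background pixel.
    weights = cv2.distanceTransform((fg_mask > 0).astype(np.uint8), cv2.DIST_L1, 3)

    # Quantise each channel and form a joint bin index g(x_j).
    quant = (image_bgr.astype(np.int32) * bins_per_channel) // 256
    bin_idx = (quant[..., 0] * bins_per_channel + quant[..., 1]) * bins_per_channel + quant[..., 2]

    fg = fg_mask > 0
    hist = np.bincount(bin_idx[fg], weights=weights[fg],
                       minlength=bins_per_channel ** 3)
    return hist / max(hist.sum(), 1e-12)   # normalised histogram P

# Tiny synthetic example: a bright square on a dark background.
img = np.zeros((40, 40, 3), np.uint8); img[10:30, 10:30] = (0, 128, 255)
mask = np.zeros((40, 40), np.uint8);   mask[10:30, 10:30] = 255
P = dt_weighted_histogram(img, mask)
```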
2.3 Observation Representation
To estimate the posterior probability of the state of the object, $N_s$ hypotheses of the object are maintained (recall Eq. (1)). Each hypothesis gives rise to an observation representation which is used to evaluate the likelihood of it being the tracked object. The histogram for a hypothesised region in the current image frame, $Q = \{q^{(u)}\}_{u=1\ldots m}$, where $m$ is the number of bins in the histogram, is defined as
$$q^{(u)} = \sum_{x_j \in \text{Foreground Region}} w(x_j)\, \delta(g(x_j) - u), \qquad (5)$$
analogously to Eq. (4). The observations from the foreground segmentation are the centroids, widths and heights of the different foreground blobs in the current frame.
Nearest-neighbour data association is used to associate a measurement $(x_c^m, y_c^m, W^m, H^m)$ with an object. In the event of a merger of objects, the centroid used for evaluating a hypothesis is computed as the weighted mean of the foreground pixels in the region defined by the hypothesis/particle, where the weights are those obtained from the distance transform applied to the foreground segmentation result.
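A minimal NumPy sketch of this step is given below: nearest-neighbour association of blob measurements to object states, and a distance-transform-weighted centroid for use during merges. The Euclidean matching criterion on centroids is an assumption made for the illustration; the paper itself only specifies nearest-neighbour association.

```python
import numpy as np

def associate_nearest(object_states, blob_measurements):
    """Assign to each object state [xc, yc, W, H] the nearest blob measurement (by centroid)."""
    assignments = {}
    for oid, state in object_states.items():
        d = np.linalg.norm(blob_measurements[:, :2] - state[:2], axis=1)
        assignments[oid] = int(np.argmin(d))
    return assignments

def weighted_centroid(dt_weights, x0, y0, w, h):
    """Distance-transform-weighted centroid of the foreground pixels inside a particle's box."""
    H, W = dt_weights.shape
    y1, y2 = max(0, int(y0 - h / 2)), min(H, int(y0 + h / 2))
    x1, x2 = max(0, int(x0 - w / 2)), min(W, int(x0 + w / 2))
    ys, xs = np.mgrid[y1:y2, x1:x2]
    ww = dt_weights[ys, xs]
    total = ww.sum()
    if total <= 0:
        return x0, y0                       # no foreground support: keep the hypothesis centre
    return (ww * xs).sum() / total, (ww * ys).sum() / total
```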
2.4 Hypothesis Evaluation
Each hypothesis for the object state is evaluated using colour information and foreground information. A likelihood function [6] is used to compute the weight of each particle, integrating the colour and foreground cues as follows:
$$L(Z_t|X_t^i) = L_{colour}(Z_{c,t}|X_t^i) \times L_{fg}(Z_{fg,t}|X_t^i), \qquad (6)$$
where $Z_{c,t}$ is the current frame and $Z_{fg,t}$ is the measurement from the current frame, after foreground segmentation and eight-connected component analysis, associated with the object. Here,
$$L_{colour}(Z_{c,t}|X_t^i) = \exp\!\left(-d_c(P_t, Q_t)/\sigma_{Z_c}\right), \qquad (7)$$
where $d_c(P,Q) = \sqrt{1 - \rho(P,Q)}$ is the Bhattacharyya distance based on the Bhattacharyya coefficient $\rho(P,Q) = \sum_{i=1}^{m}\sqrt{p^{(i)} q^{(i)}}$, and $\sigma_{Z_c}$ is the zero-mean Gaussian noise associated with the colour observation. The term $L_{fg}(Z_{fg,t}|X_t^i)$ is the likelihood based on the foreground segmentation measurement and is given by
$$L_{fg}(Z_{fg,t}|X_t^i) = \exp\!\left(-d_{fg}(X_t^i, X_t^m)/\sigma_{z_{fg}}\right), \qquad (8)$$
where $d_{fg}(X_t^i, X_t^m) = 1 - \exp(-\lambda)$ and
$$\lambda = \frac{(x_c^i - x_c^m)^2 + (y_c^i - y_c^m)^2 + (W_c^i - W_c^m)^2 + (H_c^i - H_c^m)^2}{W_c^i \times H_c^i} \qquad (9)$$
when there is a match for the object by data association. In the case of a merge, $d_{fg}(X_t^i, X_t^m) = 1 - \exp\!\big(-[(x_c^i - x_c^M)^2 + (y_c^i - y_c^M)^2]/(W_c^i \times H_c^i)\big)$, where $(x_c^M, y_c^M)$ is the weighted centroid of the foreground pixels in the region defined by particle $X_t^i$.

For a meaningful, balanced integration of cues, the functions $d_c$ and $d_{fg}$ should behave similarly. To test this, we plotted $d_c$ against $(1 - \rho(P,Q)) \in [0,1]$, where $\rho = 1$ means the best match and $\rho = 0$ the worst match of histograms $P$ and $Q$. We also plotted $d_{fg}$ for $\lambda \in [0,2]$, where zero means a good match and two a bad match. Figure 2 shows that the plots of the two distance functions are very similar.
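The sketch below evaluates Eqs. (6)-(9) for one particle, assuming normalised histograms and an already associated measurement are available. The noise parameters sigma_c and sigma_fg are free parameters of the likelihood; the values used here are placeholders, not the ones used by the authors.

```python
import numpy as np

def bhattacharyya_distance(p, q):
    """d_c(P, Q) = sqrt(1 - sum_i sqrt(p_i * q_i)) for normalised histograms."""
    rho = np.sum(np.sqrt(p * q))
    return np.sqrt(max(1.0 - rho, 0.0))

def likelihood(particle, measurement, p_model, q_candidate,
               sigma_c=0.2, sigma_fg=0.5):
    """Combined colour/segmentation likelihood L = L_colour * L_fg of Eq. (6)."""
    # Colour cue, Eq. (7).
    l_colour = np.exp(-bhattacharyya_distance(p_model, q_candidate) / sigma_c)

    # Segmentation cue, Eqs. (8)-(9): normalised squared distance between the
    # particle state [xc, yc, W, H] and the associated blob measurement.
    xc, yc, w, h = particle
    xm, ym, wm, hm = measurement
    lam = ((xc - xm) ** 2 + (yc - ym) ** 2 + (w - wm) ** 2 + (h - hm) ** 2) / (w * h)
    d_fg = 1.0 - np.exp(-lam)
    l_fg = np.exp(-d_fg / sigma_fg)
    return l_colour * l_fg
```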
2.5 Model Update
To handle the appearance change of the object due to variation in illumination, pose, distance from the camera, etc., the object model is updated using the auto-regressive learning process
$$P_{t+1} = (1 - \alpha) P_t + \alpha P_t^{est}. \qquad (10)$$
Fig. 2. Left: the plot of $d_c$ against $(1 - \rho(P,Q))$ on the x-axis, for matching the colour observation. Right: the plot of $d_{fg}$ against $\lambda$ on the x-axis. The two plots are very similar.
Here $P_t^{est}$ is the histogram of the region defined by the mode of the particles used in tracking the object, and $\alpha$ is the learning rate. The higher the value of $\alpha$, the faster the object model is updated to the new region. The model update is applied when the likelihood of the current estimate of the state of the object, $X_t^{est}$, with respect to the current measurement $Z_t$, given by
$$L(Z_t|X_t^{est}) = L_{colour}(Z_{c,t}|X_t^{est}) \times L_{fg}(Z_{fg,t}|X_t^{est}), \qquad (11)$$
is greater than an empirical threshold.
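A small sketch of the gated auto-regressive update of Eqs. (10)-(11) follows; the learning rate and acceptance threshold are illustrative values, since the paper only states that the threshold was set empirically.

```python
import numpy as np

def update_model(p_model, p_estimate, likelihood_at_mode,
                 alpha=0.1, threshold=0.4, merged=False):
    """Auto-regressive model update P_{t+1} = (1 - alpha) P_t + alpha P_t^est,
    applied only when the estimate is trusted and the object is not merged."""
    if merged or likelihood_at_mode <= threshold:
        return p_model                       # suspend the update
    updated = (1.0 - alpha) * p_model + alpha * p_estimate
    return updated / updated.sum()           # keep the histogram normalised
```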
3 Results
Figure 3 shows the tracking result for a video from the CAVIAR data set. The person on the left of the frame undergoes significant scale and illumination change. As the person walks past the shop window the illumination on the person changes and hence there is a significant change in appearance. This is evident from the model histogram plots of the object at different instances of time in Figure 4, which shows the colour model for the person tracked with the dashed bounding box in Figure 3. In such a case an ordinary colour tracker suffers large localisation errors, and an adaptive colour tracker is likely to drift to other parts of the scene. The tracker proposed in this paper tracks the object accurately throughout the duration of the video.

Figure 5 shows the improvement in localisation of two targets when the targets overlap. Figure 5a shows the tracking result using the colour cue alone. Even though the colours of the two targets are different, nothing precludes the mode of the particles from converging to positions that include parts of the other object. Under such circumstances, if the tracker is adaptive then it is quite possible that it will drift to parts of the scene other than the object of interest. Incorporating cues from foreground segmentation gives better localisation of the targets in the case of overlap, as is evident from Figures 5b and 6.

Figures 6 and 7 show some more tracking results. These two sequences are particularly difficult because there are instances of long and complete occlusions of targets by each other. In the former sequence there is an occlusion that lasts for 280 frames. In the latter sequence the objects are small, there are partial occlusions by background objects, and the noise level is high. The complete tracking results can be downloaded from http://www.cs.adelaide.edu.au/~vision/projects/accv07-tracking/.
Fig. 3. These frames show successful tracking of three objects in a video from the CAVIAR data set
Fig. 4. The images show the RGB histogram model of the person on the left, for three different instances as tracking progresses. Because of the change in illumination due to shop windows there is significant change in appearance and hence the object model.
Fig. 5. The left image shows the poor localisation of the object when only the colour cue is used for tracking. The right image shows the improved localisation of the object when both colour and segmentation cues are integrated for tracking.
The proposed approach is more reliable for tracking objects when there are changes in scale, pose and illumination, and under occlusion. In our experiments we have been able to track objects with as few as 20 particles. However, two drawbacks were observed in the proposed method of tracking: (1) when shadows are detected as foreground, the localisation of the object is less accurate. This can be improved by using shadow removal
Fig. 6. These frames show successful tracking of objects in spite of almost complete occlusion
Fig. 7. These frames show successful tracking of people in spite of poor illumination, small object size, and several occlusions. The left and middle images show tracking of two persons. The right image shows tracking of the three persons present simultaneously in the scene.
methods; (2) during almost complete occlusions there are errors in localisation, but correct tracking resumes when the objects separate after the occlusion. Correct localisation of an occluded target during complete occlusion with a single sensor is a very difficult problem. Given the unconstrained environment of the real-life situations in the CAVIAR dataset, quite good tracking results are obtained by the scheme presented in this paper.
4 Conclusion
An enhanced scheme for tracking multiple objects in video has been proposed and demonstrated. Novel contributions of this work include a new weight function for construction of the object and candidate model. The measurement obtained from foreground segmentation is integrated with a colour cue to achieve better localisation of the object. Sometimes there are errors in segmentation and sometimes the colour cue is not reliable, but integration of the two cues gives a better result. The proposed method improves handling of object models undergoing change, rendering the system less susceptible to the drift problem. Furthermore the tracker can follow an object with as few as 20 particles. The method can be extended to moving cameras by using optical flow, mosaic or epipolar constraint techniques to segment the moving foreground objects.
References

1. Dorin, C., Visvanathan, R., Meer, P.: Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(5), 564–577 (2003)
2. Isard, M., Blake, A.: Condensation – conditional density propagation for visual tracking. International Journal of Computer Vision 29(1), 5–28 (1998)
3. Nummiaro, K., Koller-Meier, E., Gool, L.J.V.: Object tracking with an adaptive color-based particle filter. In: Proceedings of the 24th DAGM Symposium on Pattern Recognition, pp. 353–360. Springer, Heidelberg (2002)
4. Zhou, S.K., Chellappa, R., Moghaddam, B.: Visual tracking and recognition using appearance adaptive models in particle filters. IEEE Transactions on Image Processing 13(11), 1434–1456 (2004)
5. Ross, D., Lim, J., Yang, M.H.: Adaptive probabilistic visual tracking with incremental subspace update. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3022, pp. 470–482. Springer, Heidelberg (2004)
6. Brasnett, P.A., Mihaylova, L., Canagarajah, N., Bull, D.: Particle filtering with multiple cues for object tracking in video sequences. In: Proceedings of SPIE. Image and Video Communications and Processing, vol. 5685, pp. 430–441 (2005)
7. Wu, Y., Huang, T.S.: Robust visual tracking by integrating multiple cues based on co-inference learning. Int. J. Comput. Vision 58(1), 55–71 (2004)
8. Spengler, M., Schiele, B.: Towards robust multi-cue integration for visual tracking. Machine Vision and Applications 14, 50–58 (2003)
9. Perez, P., Vermaak, J., Blake, A.: Data fusion for visual tracking with particles. Proceedings of the IEEE 92(3), 495–513 (2004)
10. Shao, J., Zhou, S.K., Chellappa, R.: Tracking algorithm using background-foreground motion models and multiple cues. In: ICASSP 2005. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 233–236 (2005)
11. Collins, R.T., Liu, Y., Leordeanu, M.: Online selection of discriminative tracking features. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(10), 1631–1643 (2005)
12. Han, B., Davis, L.: Object tracking by adaptive feature extraction. In: ICIP 2004. International Conference on Image Processing, vol. 3, pp. 1501–1504 (2004)
13. Arulampalam, S., Maskell, S., Gordon, N., Clapp, T.: A tutorial on particle filters for on-line non-linear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing 50(2), 174–188 (2002)
14. Kang, J., Cohen, I., Medioni, G., Yuan, C.: Detection and tracking of moving objects from a moving platform in presence of strong parallax. In: Proceedings of the Tenth International Conference on Computer Vision, Beijing, China, vol. 1, pp. 10–17 (2005)
15. Kumar, P., Ranganath, S., Huang, W.: Queue based fast background modelling and fast hysteresis thresholding for better foreground segmentation. In: The Fourth Pacific Rim Conference on Multimedia. Proceedings of the 2003 Joint Conference of the Fourth International Conference on Information, Communications and Signal Processing, vol. 2, pp. 743–747 (2003)
16. Kumar, P., Ranganath, S., Sengupta, K., Huang, W.: Cooperative multitarget tracking with efficient split and merge handling. IEEE Transactions on Circuits and Systems for Video Technology 16(12), 1477–1490 (2006)
17. Jain, A.K.: Fundamentals of Digital Image Processing. Prentice Hall International, Englewood Cliffs (1989)
18. Rosenfeld, A., Pfaltz, J.: Distance functions in digital pictures. Pattern Recognition 1, 33–61 (1968)
19. Dorin, C., Visvanathan, R., Meer, P.: Real-time tracking of non-rigid objects using mean shift. In: IEEE Conference on Computer Vision and Pattern Recognition, Hilton Head Island, SC, USA, vol. 2, pp. 142–149 (2000)
Image Assimilation for Motion Estimation of Atmospheric Layers with Shallow-Water Model

Nicolas Papadakis¹, Patrick Héas¹, and Étienne Mémin¹,²,³

¹ IRISA/INRIA, Campus de Beaulieu, 35042 Rennes, France
² CEFIMAS, Avenida Santa Fe 1145, C1059ABF Buenos Aires, Argentina
³ Fac. de Ing. de la Univ. de Buenos Aires, Av. Paseo Colón 850, C1063ACV Buenos Aires, Argentina
Abstract. The complexity of the dynamical laws governing 3D atmospheric flows, together with incomplete and noisy observations, makes the recovery of atmospheric dynamics from satellite image sequences very difficult. In this paper, we address the challenging problem of jointly estimating time-consistent horizontal motion fields and pressure maps at various atmospheric depths. Based on a vertical decomposition of the atmosphere, we propose a dense motion estimator relying on a multi-layer dynamical model. Noisy and incomplete pressure maps obtained from satellite images are reconstructed according to a shallow-water model on each cloud layer using a framework derived from data assimilation. While reconstructing dense pressure maps, this variational process estimates time-consistent horizontal motion fields related to the multi-layer model. The proposed approach is validated on a synthetic example and applied to a real-world meteorological satellite image sequence.
1 Introduction
Geophysical motion characterization and image sequence analysis are crucial issues for numerous scientific domains involved in the study of climate change, weather forecasting, climate prediction or biosphere analysis. The use of surface station, balloon, and more recently in-flight aircraft measurements and low-resolution satellite images has improved the estimation of wind fields and has been an important step towards a better understanding of meteorological phenomena. However, the temporal and spatial resolutions of these networks may be insufficient for the analysis of mesoscale dynamics. Recently, in an effort to overcome these limitations, a new generation of satellite sensors has been designed, providing image sequences with finer spatial and temporal resolutions. Nevertheless, the analysis of motion remains particularly challenging due to the complexity of atmospheric dynamics at such scales. Tools are needed to exploit this new generation of satellite images, and we believe it is important that the computer vision community gets involved in this domain, as it can bring relevant contributions to the analysis of spatio-temporal data.
Nevertheless, in the context of geophysical motion analysis, standard techniques from computer vision, originally designed for bi-dimensional quasi-rigid motions with stable salient features, are not well adapted [1,2]. The design of techniques dedicated to fluid flow has been a step forward towards reliable methods for extracting the characteristic features of flows [3,4]. However, for geophysical applications, existing fluid-dedicated methods are all limited to frame-to-frame estimation and do not use the underlying physical laws. Geophysical flows, on the other hand, are quite well described by appropriate physical models. As a consequence, in such contexts a physics-based approach can be very powerful for analyzing incomplete and noisy image data, in comparison to standard statistical methods. The inclusion of physical priors leads to advanced techniques for motion analysis which may be of interest to the computer vision community, opening new application domains of broad practical relevance and calling for the design of appropriate and efficient techniques. This is thus a research domain with wide perspectives, and our work is a contribution in this direction.

The method proposed in this paper is significantly different from previous work on motion analysis from satellite imagery. Indeed, our method estimates physically sound and time-consistent motion fields at different atmospheric levels for the whole image sequence. More precisely, we use a shallow-water formulation of the Navier-Stokes equations to control the motion evolution across the sequence. This is done through a variational approach derived from the data assimilation principle, which combines the a priori dynamics and the pressure difference observations obtained from satellite images.
2 Data Assimilation Principle

2.1 Introduction
Data assimilation is a technique related to optimal control theory which allows the state of a system of variables of interest to be estimated over time [5,6,7,8]. The method provides a smoothing of the unknown variables according to an initial state of the system, a dynamical law and noisy measurements of the system's state. Let $V$ be a Hilbert space, identified with its dual, defined over $\Omega$. The evolution of the state variable $X \in \mathcal{W}(t_0, t_f) = \{f \,|\, f \in L^2(t_0; t_f; V)\}$ is assumed to be described by a (possibly nonlinear) differential dynamical model $\mathbb{M} : V \to V$:
$$\partial_t X(x,t) + \mathbb{M}(X(x,t)) = 0, \qquad X(t_0) = X_0, \qquad (1)$$
where $X_0$ is a control parameter. We then assume that noisy observations $Y \in \mathcal{O}$ are available, where $\mathcal{O}$ is another Hilbert space. These observations may live in a different space (a reduced space, for instance) from the state variable. We nevertheless assume that there exists a differential operator $\mathbb{H} : V \to \mathcal{O}$ that maps the variable space to the observation space. A least squares estimation of the control variable with respect to the whole sequence of measurements available
within the considered time range $[t_0; t_f]$ amounts to minimizing, with respect to the control variable $X_0 \in V$, a cost function of the following form:
$$J(X_0) = \frac{1}{2} \int_{t_0}^{t_f} \|Y - \mathbb{H}X(X_0, t)\|_R^2 \, dt, \qquad (2)$$
where $R$ is the covariance matrix of the observations $Y$. A first approach consists in computing the functional gradient through finite differences. Denoting by $N$ the dimension of the control parameter $X_0$, such a computation is impractical for a control space of large dimension, since it requires $N$ integrations of the evolution model for each required value of the gradient. Adjoint models, first introduced in meteorology by Le Dimet and Talagrand [7], allow the functional gradient to be computed with a single backward integration of an adjoint variable: the value of this adjoint variable at the initial time provides the value of the gradient at the desired point. This approach is widely used in the environmental sciences for the analysis of geophysical flows [7,8].
2.2 Differentiated Model
To obtain the adjoint model, the system of equations (1) is first differentiated with respect to a small perturbation $dX = \frac{\partial X}{\partial X_0} dX_0$:
$$\partial_t\, dX(x,t) + \partial_X \mathbb{M}\, dX = 0, \qquad dX(t_0) = dX_0, \qquad (3)$$
where $\partial_X \mathbb{M}$ is the tangent linear operator of $\mathbb{M}$ defined by its Gâteaux derivative. The gradient of the functional in the direction $dX_0$ must also be computed:
$$\left\langle \frac{\partial J}{\partial X_0}, dX_0 \right\rangle = \int_{t_0}^{t_f} \left\langle Y - \mathbb{H}X(X_0),\ \mathbb{H}\frac{\partial X}{\partial X_0} dX_0 \right\rangle_R dt = \int_{t_0}^{t_f} \left\langle \mathbb{H}^* R\,(Y - \mathbb{H}X(X_0)),\ dX \right\rangle_V dt, \qquad (4)$$
where $\mathbb{H}^*$ is the adjoint operator of $\mathbb{H}$, defined by $\forall X \in V,\ Y \in \mathcal{O}:\ \langle X, \mathbb{H}Y \rangle_V = \langle \mathbb{H}^* X, Y \rangle_{\mathcal{O}}$.
2.3 Adjoint Model
We then introduce the adjoint variable $\lambda \in \mathcal{W}(t_0, t_f)$. The first equation of the differentiated model (3) is multiplied by this adjoint variable and integrated in the time interval $[t_0; t_f]$:
$$\int_{t_0}^{t_f} \left\langle \partial_t\, dX(x,t) + \partial_X \mathbb{M}\, dX,\ \lambda \right\rangle_V dt = 0.$$
After an integration by parts, we have:
$$\int_{t_0}^{t_f} \left\langle -\partial_t \lambda + \partial_X \mathbb{M}^* \lambda,\ dX(x,t) \right\rangle_V dt = \langle \lambda(t_0), dX(t_0) \rangle_V - \langle \lambda(t_f), dX(t_f) \rangle_V, \qquad (5)$$
where the adjoint operator $\partial_X \mathbb{M}^*$ is defined by $\forall X, Y \in V:\ \langle X, \partial_X \mathbb{M} Y \rangle_V = \langle \partial_X \mathbb{M}^* X, Y \rangle_V$.

To perform the computation of the functional gradient, we assume that $\lambda(t_f) = 0$ and define the following adjoint problem:
$$-\partial_t \lambda + \partial_X \mathbb{M}^* \lambda = \mathbb{H}^* R\, (Y - \mathbb{H}X(X_0)), \qquad \lambda(t_f) = 0. \qquad (6)$$

2.4 Functional Gradient
Combining (4), (5) and (6), we finally obtain the functional gradient as:
$$\frac{\partial J}{\partial X_0} = \lambda(t_0). \qquad (7)$$
Hence, the assimilation principle enables the functional gradient to be computed with a single backward integration. In the next section, we adapt this process to the control of high-dimensional state variables characterizing the dynamics of layered atmospheric flows.
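To illustrate the forward/backward structure of Eqs. (1)-(7), here is a toy NumPy sketch for a linear model M(X) = AX observed directly (H = I), discretized with explicit Euler. It is only a schematic illustration under these simplifying assumptions, not the scheme used in the paper (which relies on non-oscillatory spatial schemes and a shallow-water operator); with the forcing sign used in Eq. (6), adding a small multiple of lambda(t0) to X0 decreases the misfit.

```python
import numpy as np

def assimilate_linear(A, Y, X0, dt, n_steps, n_iters=200, step=0.05):
    """Toy variational assimilation for dX/dt = -A X with observations Y[k] of X."""
    for _ in range(n_iters):
        # Forward integration of the state (explicit Euler).
        X = [X0.copy()]
        for _ in range(n_steps):
            X.append(X[-1] - dt * A @ X[-1])

        # Backward integration of the adjoint, lambda(t_f) = 0,
        # forced by the observation misfit (R taken as identity here).
        lam = np.zeros_like(X0)
        for k in range(n_steps, 0, -1):
            misfit = Y[k] - X[k]
            lam = lam + dt * (misfit - A.T @ lam)

        # Update of the control variable (initial condition).
        X0 = X0 + step * lam
    return X0

# Example: recover the initial condition of a damped 2D linear system.
rng = np.random.default_rng(0)
A = np.array([[0.5, -0.2], [0.1, 0.3]])
dt, n_steps = 0.05, 40
true_X0 = np.array([1.0, -2.0])
traj = [true_X0.copy()]
for _ in range(n_steps):
    traj.append(traj[-1] - dt * A @ traj[-1])
Y = [x + 0.05 * rng.normal(size=2) for x in traj]   # noisy observations
X0_est = assimilate_linear(A, Y, np.zeros(2), dt, n_steps)
```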
3 Application to Atmospheric Layer Motion Estimation

3.1 Layer Decomposition
The layering of atmospheric flow in the troposphere is valid in the limit of horizontal scales much greater than the vertical scale height, thus roughly for horizontal scales greater than 100 km. In order to make the layering assumption valid in the case of satellite images of kilometer order, low-resolution observations on a coarser grid are considered. One can thus decompose the 3D space into elements of variable thickness, corresponding to layers. An analysis based on such a decomposition has the main advantage of operating at different atmospheric pressure ranges and avoids mixing heterogeneous observations.

Let us present the 3D space decomposition that we chose for the definition of the $K$ layers. The $k$-th layer corresponds to the volume lying between an upper surface $s^{k+1}$ and a lower surface $s^k$. These surfaces $s^k$ are defined by the height of the top of the clouds belonging to the $k$-th layer; they are thus defined only in areas where there exist clouds belonging to the $k$-th layer, and remain undefined elsewhere. The membership of cloud tops to the different layers is determined by cloud classification maps. Such classifications, which are based on thresholds of top-of-cloud pressure, are routinely provided by the EUMETSAT consortium, the European agency which supplies the METEOSAT satellite data, as illustrated in Figure 1.
3.2 Sparse Pressure Difference Observations
Top of cloud pressure images are also routinely provided by the EUMETSAT consortium. They are derived from a radiative transfer model using ancillary data
Fig. 1. Top of cloud classification. Satellite image of the visible channel at 0.8μm (a), visualization (in the same channel) of top of clouds classified by the EUMETSAT consortium : low layer (b), middle layer (c) and high layer (d).
obtained from analyses or short-term forecasts. Multi-channel techniques enable the determination of the pressure at the top of semi-transparent clouds [9]. We denote by $C^k$ the class corresponding to the $k$-th layer. Note that the top-of-cloud pressure image, denoted by $p$, is composed of segments of top-of-cloud pressure functions $p(s^{k+1})$ related to the different layers, that is to say $p = \bigcup_k \{p(s^{k+1}, s);\ s \in C^k\}$. Thus, pressure images of cloud tops are used to constitute sparse pressure maps of the layer upper boundaries $p(s^{k+1})$. As the lower boundaries of clouds are always occluded in satellite images, we coarsely approximate the missing pressure observations $p(s^k)$ by an average pressure value $\overline{p}^k$ observed on the cloud tops of the layer underneath. Finally, for layer $k \in [1, K]$, we define the observations $h_{obs}^k$ as pressure differences in hectopascal (hPa) units:
$$h_{obs}^k = \begin{cases} \overline{p}^k(s) - p & \text{if } s \in C^k, \\ 0 & \text{if } s \in \bar{C}^k. \end{cases} \qquad (8)$$
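A minimal NumPy sketch of Eq. (8) is given below; it assumes a top-of-cloud pressure image, a per-pixel cloud-class map and a pre-computed mean pressure for the layer underneath, all of which are hypothetical inputs used only for illustration.

```python
import numpy as np

def pressure_difference_obs(pressure_img, class_map, mean_pressure_below, k):
    """Sparse pressure-difference observations h_obs^k of Eq. (8).

    pressure_img        : top-of-cloud pressure image p (hPa)
    class_map           : integer map of cloud-layer membership C^k
    mean_pressure_below : average top-of-cloud pressure of the layer underneath (hPa)
    """
    in_layer = (class_map == k)
    h_obs = np.zeros_like(pressure_img, dtype=float)
    h_obs[in_layer] = mean_pressure_below - pressure_img[in_layer]
    return h_obs, in_layer      # observations and their validity mask C^k

# Example with a 4x4 toy image: layer-2 clouds occupy the upper-left block.
p = np.full((4, 4), 900.0); p[:2, :2] = 450.0
classes = np.zeros((4, 4), int); classes[:2, :2] = 2
h2, mask2 = pressure_difference_obs(p, classes, mean_pressure_below=820.0, k=2)
```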
3.3 Shallow-Water Model
In order to provide a dynamical model for the previous pressure difference observations, we use the shallow-water approximation (horizontal motion much greater than vertical motion) derived under the assumption of layer incompressibility (layers are characterized by mean densities $\rho^k$ which can be approximated according to the layer average pressure [10]). The shallow-water approximation is valid for mesoscale analysis in a layered atmosphere. As friction components can be neglected, the vertical integration of the momentum equation between the boundaries $s^k$ and $s^{k+1}$ yields, for the $k$-th layer, the equation [6,11,12]:
$$\frac{\partial \mathbf{q}^k}{\partial t} + \mathrm{div}\!\left(\frac{1}{h^k}\mathbf{q}^k \otimes \mathbf{q}^k\right) + \frac{1}{2\rho^k}\nabla_{xy}(h^k)^2 + g h^k \nabla_{xy}(s^{k+1}) + f^{\phi}\begin{pmatrix}0 & -1\\ 1 & 0\end{pmatrix}\mathbf{q}^k = 0, \qquad (9)$$
with
$$h^k = p(z = s^k) - p(z = s^{k+1}), \qquad (10)$$
$$\mathbf{v}^k = (u^k, v^k) = \frac{1}{h^k}\int_{p(z=s^{k+1})}^{p(z=s^k)} \mathbf{v}\, dp, \qquad (11)$$
$$\mathbf{q}^k = h^k \mathbf{v}^k, \qquad (12)$$
$$\mathrm{div}\!\left(\frac{1}{h^k}\mathbf{q}^k \otimes \mathbf{q}^k\right) = \begin{pmatrix} \dfrac{\partial (h^k (u^k)^2)}{\partial x} + \dfrac{\partial (h^k u^k v^k)}{\partial y} \\[6pt] \dfrac{\partial (h^k u^k v^k)}{\partial x} + \dfrac{\partial (h^k (v^k)^2)}{\partial y} \end{pmatrix}, \qquad (13)$$
where $f^{\phi}$ represents the Coriolis factor depending on the latitude $\phi$. By adding the integrated continuity equation to Eq. (9), we obtain independent shallow-water equation systems [12] for the layers $k \in [1, K]$:
$$\begin{cases} \dfrac{\partial h^k}{\partial t} + \mathrm{div}(\mathbf{q}^k) = 0, \\[8pt] \dfrac{\partial \mathbf{q}^k}{\partial t} + \mathrm{div}\!\left(\dfrac{1}{h^k}\mathbf{q}^k \otimes \mathbf{q}^k\right) + \dfrac{1}{2\rho^k}\nabla_{xy}(h^k)^2 + f^{\phi}\begin{pmatrix}0 & -1\\ 1 & 0\end{pmatrix}\mathbf{q}^k = 0, \end{cases} \qquad (14)$$
where we have assumed that the surfaces $s^k$ and $s^{k+1}$ are locally flat in the vicinity of a pixel. This expression is discretized spatially with non-oscillatory schemes [13] and integrated in time with a third-order Runge-Kutta scheme. The equation system describes the dynamics of physical quantities expressed in standard units; thus, some dimensional factors appear when it is discretized on a pixel grid with velocities expressed in pixels per frame and pressure in hectopascal (hPa). As one pixel represents $\Delta x$ meters and one frame corresponds to $\Delta t$ seconds, the densities $\rho^k$, expressed in pascal times square seconds per square meter ($\mathrm{Pa\,s^2/m^2}$), must be multiplied by $10^{-2}\Delta x^2/\Delta t^2$, and the Coriolis factor $f^{\phi}$, expressed per second, must be multiplied by $\Delta t$. By a scale analysis, and as also observed in our experiments, for $\Delta t = 900$ seconds the third term of Eq. (9) has a magnitude similar to the other terms if $\Delta x \geq 25$ km. This is in agreement with the shallow-water assumption.
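For intuition about the dynamical model, the following is a highly simplified, self-contained 1D shallow-water integrator with the classical g h^2/2 pressure term and a first-order Lax-Friedrichs flux. It is only a didactic stand-in for the pressure-based multi-layer system (14): the paper's implementation uses non-oscillatory spatial schemes and third-order Runge-Kutta time integration, neither of which is reproduced here.

```python
import numpy as np

def shallow_water_1d_step(h, q, dx, dt, g=9.81):
    """One Lax-Friedrichs step of the 1D shallow-water system
    dh/dt + d(q)/dx = 0,  dq/dt + d(q^2/h + g h^2/2)/dx = 0 (periodic boundaries)."""
    U = np.stack([h, q])                                   # conserved variables
    flux = np.stack([q, q**2 / h + 0.5 * g * h**2])
    Up, Um = np.roll(U, -1, axis=1), np.roll(U, 1, axis=1)
    Fp, Fm = np.roll(flux, -1, axis=1), np.roll(flux, 1, axis=1)
    U_new = 0.5 * (Up + Um) - dt / (2.0 * dx) * (Fp - Fm)
    return U_new[0], U_new[1]

# Small dam-break style example on a periodic domain.
x = np.linspace(0.0, 1.0, 200)
h = np.where(x < 0.5, 2.0, 1.0)
q = np.zeros_like(x)
dx = x[1] - x[0]
dt = 0.2 * dx / np.sqrt(9.81 * h.max())                    # CFL-limited time step
for _ in range(100):
    h, q = shallow_water_1d_step(h, q, dx, dt)
```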
3.4 Assimilation of Layer Motion and Pressure Differences
We can now define all the components of the assimilation system allowing the recovery of the pressure difference observations of Section 3.2 through the dynamical model of Section 3.3. The final system enables the tracking of the pressure difference $h^k$ and average velocity $\mathbf{q}^k$ related to each of the $k \in [1, K]$ layers. Referring to Section 2, we then have $X^k = [h^k, \mathbf{q}^k]^T$. The evolution model $\mathbb{M}$ is given by the mesoscale dynamics (14). The only observations available are the pressure difference maps $h_{obs}^k$. For each layer $k$, the observation operator then reads $\mathbb{H} = [1, 0]$ and the process minimizes
$$J^k(h_0^k, \mathbf{q}_0^k) = \int_{t_0}^{t_f} \| h_{obs}^k - h^k(h_0^k, \mathbf{q}_0^k, t) \|_{R^k}^2 \, dt, \qquad (15)$$
through a backward integration of the adjoint model $(\partial_X \mathbb{M})^*$ defined by:
$$\begin{cases} -\partial_t \lambda_h^k(t) + \mathbf{w}^k \cdot (\mathbf{w}^k \cdot \nabla)\lambda_{\mathbf{q}}^k - h^k\, \mathrm{div}(\lambda_{\mathbf{q}}^k) = R^k \big(h_{obs}^k(t) - h^k(t)\big), \\[4pt] -\partial_t \lambda_{\mathbf{q}}^k(t) - (\mathbf{w}^k \cdot \nabla)\lambda_{\mathbf{q}}^k - (\nabla \lambda_{\mathbf{q}}^k)\mathbf{w}^k - \nabla \lambda_h^k + f^{\phi}\begin{pmatrix}0 & 1\\ -1 & 0\end{pmatrix}\lambda_{\mathbf{q}}^k = 0, \\[4pt] \lambda_h^k(t_f) = 0, \\ \lambda_{\mathbf{q}}^k(t_f) = 0. \end{cases} \qquad (16)$$
In this expression, $\lambda_h^k$ and $\lambda_{\mathbf{q}}^k$ are the two components of the adjoint variable $\lambda^k$ of layer $k$ [11]. More details on the construction of adjoint models can be found in [8]. One can finally define a diagonal covariance matrix $R^k$ using the observation mask $C^k$:
$$R^k(s, s) = \begin{cases} \alpha & \text{if } s \in C^k, \\ 0 & \text{if } s \in \bar{C}^k, \end{cases} \qquad (17)$$
where $\alpha$ is a fixed parameter (set to 0.1 in our applications) defining the observation covariances. However, as the observations are sparse, a nine-diagonal covariance matrix is employed to diffuse information in a 3x3 pixel vicinity. As the assimilation process is not guaranteed to reach a global minimum, the results depend on the initialization. Thus, the state variables $h_0^k$ are initialized with a constant value, while the initial values of the variables $\mathbf{q}_0^k$ are provided by an optic-flow algorithm dedicated to atmospheric layers [3].
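The sketch below builds the masked observation weighting of Eq. (17) and its 3x3 "diffused" variant as a sparse matrix, assuming SciPy is available. The way the off-diagonal weights are shared among neighbours is an illustrative choice, since the paper only states that a nine-diagonal matrix is used to spread information over a 3x3 vicinity.

```python
import numpy as np
from scipy.sparse import lil_matrix

def observation_covariance(mask, alpha=0.1, diffuse=True):
    """Observation weighting R^k over an H x W grid, restricted to the mask C^k."""
    H, W = mask.shape
    R = lil_matrix((H * W, H * W))
    ys, xs = np.nonzero(mask)
    for y, x in zip(ys, xs):
        i = y * W + x
        if not diffuse:
            R[i, i] = alpha
            continue
        # Spread the weight alpha over the observed pixel and its 3x3 neighbourhood.
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                yy, xx = y + dy, x + dx
                if 0 <= yy < H and 0 <= xx < W:
                    R[i, yy * W + xx] += alpha / 9.0
    return R.tocsr()

# Example: a 5x5 grid observed on a 2x2 cloudy patch.
mask = np.zeros((5, 5), bool); mask[1:3, 1:3] = True
Rk = observation_covariance(mask)
```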
4 Results

4.1 Synthetic Experiments
For an exhaustive evaluation, we relied on image observations generated by short-time numerical simulation of atmospheric layer motion according to the shallow-water dynamical model (Eq. (14)). Realistic initial conditions on the layer pressure function and motion were chosen to derive a synthetic sequence of 10 images. The sequence was then deteriorated by different noises and by a masking operation to form 4 different data sets. The first two synthetic image sequences, named $e_1$ and $e_2$, are composed of dense observations of $h_{obs}^k$ in hectopascal (hPa) units corrupted by Gaussian noise with standard deviation equal to 10 and 20% of the pressure amplitude, respectively. A real cloud classification map (used in the next experiment) was employed to extract regions of data sets $e_1$ and $e_2$ in order to create two noisy and incomplete synthetic sequences, $e_3$ and $e_4$ (see Figure 3). For initializing the assimilation system, we did not rely on an optic-flow algorithm in this synthetic case; we instead used known values of the variables $h_0^k$ and $\mathbf{q}_0^k$ deteriorated by Gaussian noise.

The results of the joint motion-pressure estimation performed by image assimilation are summarized in Fig. 2. It clearly appears that, for noisy observations, the assimilation process induces a significant decrease of the RMSE between real and estimated velocities and pressures. Moreover, this evaluation demonstrates the efficiency of the proposed estimator, for incomplete and noisy observations, at both estimating dense motion fields and reconstructing the pressure maps $h^k$. Examples of reconstruction for experiments $e_2$ and $e_3$ are presented in Figure 3.
4.2 Real Meteorological Image Sequence
We then turned to qualitative comparisons on a real meteorological image sequence. The benchmark data consisted of a sequence of 10 METEOSAT
Experiment | Mask | Noise % | $h_{obs}^k$ RMSE (hPa) | final $h^k$ RMSE (hPa) | initial $|v_0^k|$ RMSE (pixel/frame) | final $|v_0^k|$ RMSE (pixel/frame)
e1 | - | 10 | 15.813880 | 5.904791 | 0.22863 | 0.03457
e2 | - | 20 | 22.361642 | 8.133384 | 0.21954 | 0.05078
e3 | x | 10 | 15.627055 | 6.979769 | 0.22351 | 0.04978
e4 | x | 20 | 22.798671 | 10.930078 | 0.21574 | 0.05944
Fig. 2. Numerical evaluation. Decrease of the Root Mean Square Error (RMSE) of estimates hk and |v0k | by image assimilation for noisy (experiments e1 , e2 , e3 and e4 ) and sparse observations (experiments e3 and e4 ).
[Figure 3 panels, for experiments e2 and e3: (a) actual maps, (b) noised (and masked) maps, (c) estimated maps]
Fig. 3. Synthetic sequences: results of experiments e2 and e3, where the pressure maps have been noised (e2 and e3) and masked (e3)
Second Generation (MSG) images, showing top-of-cloud pressures with a corresponding cloud classification sequence. The 1024 x 1024 pixel images cover an area over the north Atlantic Ocean during part of one day (5 June 2004), at a rate of one image every 15 minutes. The spatial resolution is 3 kilometers at the center of the whole Earth image disk. Clouds from a cloud classification were used to segment the images into K = 3 broad layers, at low, intermediate and high altitude. In order to make the layering assumption valid, low-resolution observations on an image grid of 128 x 128 pixels are obtained by smoothing and sub-sampling the original data for each layer. By applying the methodology described in Section 3.4 to the images at this coarser resolution, average motion and pressure difference maps are estimated from the image sequence for these 3 layers. The estimated vector fields superimposed on the observed pressure difference maps are displayed in Figure 4 for each of the 3 layers. The motion fields estimated for the different layers on the cloudy
Fig. 4. First (left) and last (right) estimated horizontal wind fields superimposed on observed pressure difference maps, for the low, middle and high layers (original images have been subsampled into images of 128 x 128 pixels)
observable parts are consistent with the visual inspection of the sequence. In particular, several motion differences between layers are very relevant. For instance, near the bottom left corner of the images, the lower layer possesses a southward motion while the intermediate layer moves northward. Moreover, the temporal coherence of the retrieved motion demonstrates the efficiency of this spatio-temporal method under physical constraints.
5 Conclusion
In this paper, we have presented a new method for estimating time-consistent horizontal winds in a stratified atmosphere from satellite image sequences of top-of-cloud pressure. The proposed estimator applies to a set of sparse image observations related to a multi-layer atmosphere, which obey independent shallow-water models. In order to manage the incomplete and noisy observations while considering this non-linear physical model, a variational assimilation scheme is proposed. This process estimates time-consistent motion fields related to the layer components while performing the reconstruction of dense pressure difference maps. The merit of the joint motion-pressure estimator by image assimilation is demonstrated on both synthetic and real satellite images. In view of the various meteorological studies relying on the analysis of experimental data of atmospheric dynamics, we believe that the proposed multi-layer horizontal wind field estimation technique constitutes a valuable tool.
Acknowledgments

This work was supported by the European Community through the IST FET Open FLUID Project (http://fluid.irisa.fr).
References

1. Horn, B., Schunck, B.: Determining optical flow. Artificial Intelligence 17, 185–203 (1981)
2. Leese, J., Novack, C., Clark, B.: An automated technique for obtaining cloud motion from geosynchronous satellite data using cross correlation. Journal of Applied Meteorology 10, 118–132 (1971)
3. Héas, P., Mémin, E., Papadakis, N., Szantai, A.: Layered estimation of atmospheric mesoscale dynamics from satellite imagery. IEEE Trans. Geoscience and Remote Sensing (2007)
4. Zhou, L., Kambhamettu, C., Goldgof, D.: Fluid structure and motion analysis from multi-spectrum 2D cloud image sequences. In: Proc. Conf. Comp. Vision Pattern Rec., Hilton Head Island, USA, vol. 2, pp. 744–751 (2000)
5. Bennet, A.: Inverse Methods in Physical Oceanography. Cambridge University Press, Cambridge (1992)
6. Courtier, P., Talagrand, O.: Variational assimilation of meteorological observations with the direct and adjoint shallow-water equations. Tellus 42, 531–549 (1990)
7. Le Dimet, F.X., Talagrand, O.: Variational algorithms for analysis and assimilation of meteorological observations: theoretical aspects. Tellus, 97–110 (1986)
8. Talagrand, O., Courtier, P.: Variational assimilation of meteorological observations with the adjoint vorticity equation. I: Theory. J. of Roy. Meteo. Soc. 113, 1311–1328 (1987)
9. Schmetz, J., Holmlund, K., Hoffman, J., Strauss, B., Mason, B., Gaertner, V., Koch, A., Berg, L.V.D.: Operational cloud-motion winds from Meteosat infrared images. Journal of Applied Meteorology 32(7), 1206–1225 (1993)
10. Holton, J.: An Introduction to Dynamic Meteorology. Academic Press, London (1992)
11. Honnorat, M., Le Dimet, F.X., Monnier, J.: On a river hydraulics model and Lagrangian data assimilation. In: ADMOS 2005. International Conference on Adaptive Modeling and Simulation, Barcelona (2005)
12. de Saint-Venant, A.: Théorie du mouvement non-permanent des eaux, avec application aux crues des rivières et l'introduction des marées dans leur lit. C. R. Acad. Sc. Paris 73, 147–154 (1871)
13. Xu, Z., Shu, C.W.: Anti-diffusive finite difference WENO methods for shallow water with transport of pollutant. Journal of Computational Mathematics 24, 239–251 (2006)
Probability Hypothesis Density Approach for Multi-camera Multi-object Tracking

Nam Trung Pham¹,², Weimin Huang¹, and S.H. Ong²

¹ Institute for Infocomm Research, Singapore
² Department of Electrical and Computer Engineering, National University of Singapore
Abstract. Object tracking with multiple cameras is more efficient than tracking with a single camera. In this paper, we propose a multiple-camera multiple-object tracking system that can track 3D object locations even when objects are occluded in some cameras. Our system tracks objects and fuses data from multiple cameras using the probability hypothesis density filter. This method avoids data association between observations and object states, and tracks multiple objects in the single-object state space; it therefore has a lower computational cost than methods using a joint state space. Moreover, our system can track a varying number of objects. The results demonstrate that our method is highly reliable when tracking the 3D locations of objects.
1 Introduction
Tracking moving objects is an important part of many applications. Several methods have been proposed to track objects using a single camera [1]. However, when persons may be occluded by other persons in the scene, tracking them with one camera is difficult, because the information available from a single camera is not sufficient to resolve the occlusion. One way to address this problem is to use multiple cameras to recover information that may be missing from a particular camera. Furthermore, multiple cameras can be used to recover the 3D information of objects.

There are several approaches to tracking with multiple cameras. Most of them have two stages: a single-view stage and a multiple-view data fusion stage. In the single-view stage, observations and estimates are extracted; in the second stage, these data are fused to obtain the final results. Some methods track one object using multiple cameras [2], [3]; they track an object and switch to another camera when the system predicts that the current camera no longer has a good view of the object. However, these methods need to consider data association when extended from tracking one object to multiple objects. Other methods can track multiple objects [4], [5], [6], [7]. Among them, some match objects between different camera views [4], [5] or incorporate classification methods [6] to perform the data association between observations and objects in multiple views. These methods can combine multiple cameras for multiple-object tracking. However, when the appearances of objects
are similar or occlusions occur, these methods might not be suitable, because wrong matches may occur. Another idea is to find 3D observations that correspond to observations from different views [7]; however, associating observations from different views can increase the computational cost of searching for 3D observations.

Recently, there has been increasing research interest in using random set theory to solve multiple-object tracking. Here, the states of objects and the measurements are represented as random finite sets (RFS). Mahler [8] presented a probability hypothesis density (PHD) filter that operates on the single-object state space. Vo [9], [10] proposed implementations of the PHD filter; in particular, the implementation in [10] is a closed-form solution of the PHD filter, called the Gaussian mixture probability hypothesis density (GMPHD) filter.

In this paper, we extend the GMPHD filter from a single sensor to multiple sensors to track several people in a room using multiple cameras. It is assumed that the projection matrices from 3D space to the cameras are available. Our method can recover the 3D object locations and handle occlusions at each camera. We assume that colour models are available; the proposed tracking method can then be applied efficiently to track a varying number of objects. Furthermore, because the fusion of multiple cameras to obtain 3D object locations is based on the GMPHD filter, it requires less computation than methods based on search or on the particle filter.
2 PHD Filter Approach
In multiple-object tracking, it is difficult to obtain the posterior density function when the number of objects increases. Fortunately, this density function can be approximately recovered from a probability hypothesis density (PHD) [8]. To obtain the PHD at each time step, the PHD filter [8] can be applied. We now review one implementation of the PHD filter, the GMPHD filter [10], which is a closed-form solution of the PHD filter under linear Gaussian assumptions. These assumptions are as follows. Each object follows a linear Gaussian model, i.e.,
$$f_{k|k-1}(x|\zeta) = \mathcal{N}(x; F_{k-1}\zeta, Q_{k-1}), \qquad (1)$$
$$g_k(z|x) = \mathcal{N}(z; H_k x, R_k), \qquad (2)$$
where $\mathcal{N}(\cdot; m, P)$ denotes a Gaussian density with mean $m$ and covariance $P$, $F_{k-1}$ is the state transition matrix, $Q_{k-1}$ is the process noise covariance, $H_k$ is the observation matrix, and $R_k$ is the observation noise covariance. The survival and detection probabilities are $p_{S,k}$ and $p_{D,k}$, respectively. The intensity of the spontaneous birth RFS is
$$\gamma_k(x) = \sum_{i=1}^{J_{\gamma,k}} w_{\gamma,k}^{(i)}\, \mathcal{N}(x; m_{\gamma,k}^{(i)}, P_{\gamma,k}^{(i)}), \qquad (3)$$
where $J_{\gamma,k}$ is the number of birth Gaussian components. It is assumed that the posterior intensity at time $k-1$ is a Gaussian mixture of the form
$$v_{k-1}(x) = \sum_{i=1}^{J_{k-1}} w_{k-1}^{(i)}\, \mathcal{N}(x; m_{k-1}^{(i)}, P_{k-1}^{(i)}), \qquad (4)$$
where $J_{k-1}$ is the number of Gaussian components of $v_{k-1}(x)$. Under these assumptions, the predicted intensity at time $k$ is given by
$$v_{k|k-1}(x) = v_{S,k|k-1}(x) + \gamma_k(x), \qquad (5)$$
where
$$v_{S,k|k-1}(x) = p_{S,k} \sum_{j=1}^{J_{k-1}} w_{k-1}^{(j)}\, \mathcal{N}(x; m_{S,k|k-1}^{(j)}, P_{S,k|k-1}^{(j)}),$$
$$m_{S,k|k-1}^{(j)} = F_{k-1} m_{k-1}^{(j)}, \qquad P_{S,k|k-1}^{(j)} = Q_{k-1} + F_{k-1} P_{k-1}^{(j)} F_{k-1}^T.$$
Because $v_{S,k|k-1}(x)$ and $\gamma_k(x)$ are Gaussian mixtures, $v_{k|k-1}(x)$ can be expressed as a Gaussian mixture of the form
$$v_{k|k-1}(x) = \sum_{i=1}^{J_{k|k-1}} w_{k|k-1}^{(i)}\, \mathcal{N}(x; m_{k|k-1}^{(i)}, P_{k|k-1}^{(i)}). \qquad (6)$$
Then, the posterior intensity at time $k$ is also a Gaussian mixture, and is given by
$$v_k(x) = (1 - p_{D,k})\, v_{k|k-1}(x) + \sum_{z \in Z_k} v_{D,k}(x; z), \qquad (7)$$
where
$$v_{D,k}(x; z) = \sum_{j=1}^{J_{k|k-1}} w_k^{(j)}(z)\, \mathcal{N}(x; m_{k|k}^{(j)}, P_{k|k}^{(j)}),$$
$$w_k^{(j)}(z) = \frac{p_{D,k}\, w_{k|k-1}^{(j)}\, q_k^{(j)}(z)}{\kappa_k(z) + p_{D,k} \sum_{l=1}^{J_{k|k-1}} w_{k|k-1}^{(l)}\, q_k^{(l)}(z)},$$
$$q_k^{(j)}(z) = \mathcal{N}\!\left(z;\, H_k m_{k|k-1}^{(j)},\, R_k + H_k P_{k|k-1}^{(j)} H_k^T\right),$$
$$m_{k|k}^{(j)}(z) = m_{k|k-1}^{(j)} + K_k^{(j)}\big(z - H_k m_{k|k-1}^{(j)}\big),$$
$$P_{k|k}^{(j)} = \big[I - K_k^{(j)} H_k\big] P_{k|k-1}^{(j)}, \qquad K_k^{(j)} = P_{k|k-1}^{(j)} H_k^T \big(H_k P_{k|k-1}^{(j)} H_k^T + R_k\big)^{-1}.$$
3
N.T. Pham, W. Huang, and S.H. Ong
System Overview
We propose a method to track 3D locations of heads of people using multiple cameras with assumptions that the cameras are calibrated and the field of views of cameras overlap. The proposed method, as shown in Fig. 1, consists of two major components: single-view tracking and multiple-camera fusion. In the first component, at each camera at time k, we find color observations and then use the i i , ..., ym,k } GMPHD filter to estimate the 2D locations of objects. Let Yki = {y1,k be the set of 2D estimations of objects at time k, view i. We have n single views, so the set of 2D estimations of objects at time k can be defined by (8) Yk = Yk1 ; Yk2 ; . . . ; Ykn More details on the first step will be shown in Section 5. In the second component, we consider the set of 2D estimations of objects Yk as observations for a data fusion step to estimate the 3D locations of objects by the GMPHD filter. This method can avoid the data association between observations and states of objects. More details of the second step will be shown in Section 6.
Fig. 1. The sketch of our system for multiple object tracking using multiple cameras
4
Color Likelihood
The state of single object in each camera view is described by x = {xc , yc , Hx , Hy }. This is a rectangle with center and size defined by {xc , yc } and {Hx , Hy }, respectively. Let the color histogram of object be denoted as p(u), the color histogram of template as q(u). The similarity function between an object and a template is measured by the Bhattacharyya distance [11]. p(u)q(u)du (9) D = 1−
Probability Hypothesis Density Approach
879
In multiple-object tracking, we can have many color models of templates, and let these models be as {q1 (u), q2 (u), ..., qn (u)}. The similarity function between an object and templates is modified by
D = min 1− p(u)qi (u)du (10) i
The color likelihood function is defined as in [1]
D2 1 exp − 2 lz (x) = N (D; 0, σ 2 ) = √ 2σ 2πσ
(11)
where z is the current image, x is the state of object and σ 2 is the variance of noise.
5
Single-View Tracking
At each single view, we assume that the object state does not change much between frames and each object in multiple-object tracking is evolved from a dynamic moving equation (12) xk = xk−1 + wk where the state of an object in a single view xk = {xc , yc , Hx , Hy }, and wk is the process noise. Single-view tracking consists of two parts: obtaining the color measurement random set, and using these color measurements to obtain the PHD. Now, we consider the ith camera. Let vki (x) be the PHD of the ith camera at time k and i (x) be the predicted PHD of the ith camera at time k. From [12], we have vk|k−1 i vki (x) ∝ v˜ki (x) = lz (x)vk|k−1 (x)
(13)
where lz (x) is the color likelihood that is defined in Section 4. Hence, peaks of v˜ki (x) are also peaks of vki (x). We apply the method in [12] to collect peaks in v˜ki (x). The set of these peaks is considered as the color measurement random set. Secondly, we use the color measurement random set to update the PHD by the updating step in the GMPHD filter (Equation (7)). After updating predicted i (x) with the color measurement random set, we obtain PHD vki (x). PHD vk|k−1 From PHD vki (x), we find Gaussian components whose weights are larger than a threshold (0.5). The set of means of these Gaussian components are 2D estii i mations of objects at the ith camera. They are denoted as Yki = {y1,k , ..., ym,k }. (See [12] for more details of single-view tracking).
6 Multiple-Camera Fusion
We assume that the dynamic model for 3D tracking is
$$x_k = x_{k-1} + w_k, \qquad (14)$$
where the state of an object $x_k = \{x_{1,k}, x_{2,k}, x_{3,k}\}$ is a 3D coordinate and $w_k$ is the process noise. The observations are the 2D estimates from the multiple cameras, so the measurement equation at the $i$-th camera is
$$\begin{pmatrix} l_{1,k} \\ l_{2,k} \\ l_{3,k} \end{pmatrix} = \begin{pmatrix} a_{11}^i & a_{12}^i & a_{13}^i & a_{14}^i \\ a_{21}^i & a_{22}^i & a_{23}^i & a_{24}^i \\ a_{31}^i & a_{32}^i & a_{33}^i & a_{34}^i \end{pmatrix} \begin{pmatrix} x_{1,k} \\ x_{2,k} \\ x_{3,k} \\ 1 \end{pmatrix}, \qquad \begin{pmatrix} y_{1,k}^i \\ y_{2,k}^i \end{pmatrix} = \begin{pmatrix} l_{1,k}/l_{3,k} \\ l_{2,k}/l_{3,k} \end{pmatrix} + u_k, \qquad (15)$$
where $u_k$ is the measurement noise and the $a_{mn}^i$ are the projection parameters from the 3D coordinate system to the $i$-th camera plane. Assuming that the cameras are calibrated, the projection parameters $a_{mn}^i$ are known.

The idea of fusing data from multiple cameras is to use the GMPHD filter sequentially at each camera; related work has used sequential sensor updating in the PHD approach [8]. Let $V_k(x)$ be the PHD for multiple-camera tracking at time step $k$. We propose the following fusion stage:

– Step 1: Assuming that we have the PHDs of the previous time step $k-1$ for the multiple-camera fusion stage, $V_{k-1}(x)$, and for the single-view tracking stage at camera 1, $v_{k-1}^1(x)$, we employ the method in Section 5 to obtain the set of 2D estimates of objects, $Y_k^1$, and the PHD $v_k^1(x)$. Then, from $V_{k-1}(x)$, we use the dynamic model (14) and the measurement equation (15) to predict $V_{k|k-1}^1(x)$ at camera 1 via Equation (5). Because the measurement equation (15) is not linear, we use the unscented transform in the prediction step (more details are given in [10]). The set of 2D estimates at camera 1, $Y_k^1$, is then used to update $V_{k|k-1}^1(x)$ to $V_k^1(x)$ by the updating step of the GMPHD filter (Equation (7)). From the assumptions of the GMPHD filter, $V_{k-1}(x)$ is a Gaussian mixture, so $V_k^1(x)$ is also a Gaussian mixture.
– Step 2: Set $i = 2$.
– Step 3: At camera $i$, set $V_{k|k-1}^i(x) = V_k^{i-1}(x)$. Assuming that we have the PHD of the previous time step $k-1$ of the single-view tracking stage at camera $i$, $v_{k-1}^i(x)$, the method described in Section 5 is performed to obtain the set of 2D estimates of objects at camera $i$, $Y_k^i$, and the PHD $v_k^i(x)$. Because $V_{k|k-1}^i(x)$ is a Gaussian mixture, we can use the updating step of the GMPHD filter to update $V_{k|k-1}^i(x)$ with the observations in $Y_k^i$. This means
$$V_k^i(x) = (1 - p_{D,k})\, V_k^{i-1}(x) + \sum_{y \in Y_k^i} V_{D,k}(x; y), \qquad (16)$$
and we obtain $V_k^i(x)$.
– Step 4: Set $i = i + 1$. If $i \leq n$ then repeat Step 3. Otherwise, we have $V_k^n(x)$.

The PHD of the system is $V_k(x) = V_k^n(x)$. To estimate the 3D object locations, we examine the PHD of the system $V_k(x)$ and choose the
Gaussian components whose weights are larger than a threshold (0.5) to obtain the 3D estimates of the objects. We note that the multiple-camera fusion stage is implemented by sequential sensor updating; hence, the most reliable camera should be updated first. Another remark is that the GMPHD filter in [10] does not include track labels for the objects. For label tracking, our method is as follows. Each Gaussian component is associated with a label; birth Gaussian components are assigned a special label (for example, -1). After the updating step in the first camera, the Gaussian components with their labels become the predicted Gaussian components for the second camera and are then used to update the PHD in the second camera. At the last camera, for each label we choose the Gaussian component that has the largest weight, and the estimates of the object locations are given by the means of these largest components. If a Gaussian component has the special label and its weight is large enough, we assign it a new label; this means that a new person has appeared. Hence, the identities of people are defined during tracking. This track-labelling method extends the work in [13] from a single sensor to multiple sensors and applies it to multiple-camera multiple-object tracking.
7 Experimental Results
We test the performance of our method with data from the first and second cameras in scenarios seq24-2p-0111, seq35-2p-1111, and seq44-3p-1111 of the test database [14]. There are about 4500 time steps (9000 image frames). The errors of the 3D estimations are measured by the Wasserstein distance [9] and are shown in Table 1.

Table 1. Error of 3D estimation

Scenarios        Mean error (m)
seq24-2p-0111    0.06
seq35-2p-1111    0.05
seq44-3p-1111    0.07

For visualization, we show the results from the test case 'seq44-3p-1111'. In this scenario, there are three persons, who appear and disappear at different times. The scenario is challenging because occlusions occur when the persons cross paths. Moreover, the lighting of the room changes during tracking, so it is difficult to apply segmentation methods. In addition, because the color models of the heads differ between views, it is sometimes difficult to apply methods such as stereo matching to find correspondences; hence, 3D reconstructions from correspondences are not reliable on this data. Nevertheless, our method successfully tracks the 3D object locations in this scenario. At each camera, we use 400 samples to detect the peaks of the PHD, and the maximum number of Gaussian mixture components is 30. We assume that persons enter the tracking
Fig. 2. 3D results of tracking multiple people using PHD filter
area from two entrances. Hence, the birth intensity is a mixture of Gaussian components whose means are the locations of these entrances. The clutter density in the multiple-view camera fusion is a uniform distribution over the 3 m × 2 m × 2 m tracking area, and the clutter density in the single-view tracking stage is a uniform distribution over the image size (the projection of the tracking area onto the cameras) and the range of the radii Hx and Hy ([5, 15]). The probability of survival is pS = 0.99 and the probability of detection is pD = 0.98. These parameters were set by experiment. Figure 2 shows the performance of 3D people tracking. The dots are the ground truth and the lines are the estimations from our method. The results indicate that the tracks of the people are maintained. The x and y components are reliable, while the z component has some errors, for example at steps 600 to 700; this is because, at those steps, the color of the background near the person's location at camera 2 is similar to the color of the templates. However, these errors are quite small. In this sequence, when a person moves out of view and then moves back, we assign a new label, which is treated as a correct detection. Figure 3 shows the results when we project the 3D locations onto the camera planes. Each cell in the figure has two images: the left image is from camera 1 and the right image is from camera 2. At times k = 99, 144, 247, the first, second, and third persons appear in the overlapped region sequentially; they are detected and tracked automatically. At times k = 264, 295, occlusions between the second and third persons occur in cameras 1 and 2, yet the tracks are maintained after the occlusions. At time k = 809, the occlusion between the first and
Fig. 3. Projection of the 3D estimations onto the two camera planes
third persons occurs at camera 1, and the occlusion between the first and second persons occurs at camera 2. We can see in the figure that our method handles these cases, because the PHD from camera 1 is a good prediction for the PHD at camera 2. Information from the two cameras is fused to obtain reliable 3D estimations without using data-association methods.
8 Conclusion
This paper described a method that uses the GMPHD filter to track the 3D locations of objects. The method can track a varying number of objects. Moreover, it can solve some occlusion problems with which a single-camera system has difficulty. The fusion stage using the GMPHD filter requires far less computation than methods that search the whole space or use a particle filter over multiple objects. Experimental results have shown that the proposed approach is promising.
Acknowledgements. The authors would like to thank Prof. Ba Ngu Vo of Melbourne University for his help and fruitful discussions. This work is partially supported by the EU project ASTRALS (FP6-IST-0028097).
References
1. Czyz, J., Ristic, B., Macq, B.: A color-based particle filter for joint detection and tracking of multiple objects. In: ICASSP (2005)
2. Cai, Q., Aggarwal, J.K.: Automatic tracking of human motion in indoor scenes across multiple synchronized video streams. In: ICCV, Bombay, India (1998)
3. Nummiaro, K., Koller-Meier, E., Svoboda, T., Roth, D., Gool, L.V.: Color-based object tracking in multi-camera environments. In: Michaelis, B., Krell, G. (eds.) DAGM 2003. LNCS, vol. 2781, Springer, Heidelberg (2003)
4. Chang, T., Gong, S.: Tracking multiple people with a multi-camera system. In: IEEE Workshop on Multi-Object Tracking, IEEE Computer Society Press, Los Alamitos (2001)
5. Mittal, A., Davis, L.S.: M2Tracker: a multi-view approach to segmenting and tracking people in a cluttered scene. International Journal of Computer Vision 51(3), 189–203 (2003)
6. Kim, K., Davis, L.S.: Multi-camera tracking and segmentation of occluded people on ground plane using search-guided particle filtering. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, Springer, Heidelberg (2006)
7. Dockstader, S., Tekalp, A.M.: Multiple camera tracking of interacting and occluded human motion. Proceedings of the IEEE 89(10) (2001)
8. Mahler, R.: Multi-target Bayes filtering via first-order multi-target moments. IEEE Trans. on Aerospace and Electronic Systems 39(4), 1152–1178 (2003)
9. Vo, B.N., Singh, S., Doucet, A.: Sequential Monte Carlo methods for Bayesian multi-target filtering with random finite sets. IEEE Trans. Aerospace and Electronic Systems 41(4), 1224–1245 (2005)
10. Vo, B.N., Ma, W.K.: The Gaussian mixture probability hypothesis density filter. IEEE Transaction Signal Processing 54(11), 4091–4104 (2006)
11. Comaniciu, D., Meer, P.: Mean shift analysis and applications. In: ICCV (1999)
12. Pham, N.T., Huang, W.M., Ong, S.H.: Tracking multiple objects using probability hypothesis density filter and color measurements. In: ICME (2007)
13. Clark, D., Panta, K., Vo, B.: The GM-PHD filter multiple target tracker. In: Proceedings of FUSION 2006, Florence (2006)
14. Lathoud, G., Odobez, J., Perez, D.: AV16.3: an audio-visual corpus for speaker localization and tracking. In: Bengio, S., Bourlard, H. (eds.) MLMI 2004. LNCS, vol. 3361, Springer, Heidelberg (2005)
AdaBoost Learning for Human Detection Based on Histograms of Oriented Gradients

Chi-Chen Raxle Wang and Jenn-Jier James Lien

Robotics Laboratory, Dept. of Computer Science and Information Engineering, National Cheng Kung University, No. 1, Ta-Hsueh Road, Tainan, Taiwan
{raxle,jjlien}@csie.ncku.edu.tw
http://robotics.csie.ncku.edu.tw
Abstract. We developed a novel learning-based human detection system that can detect people of different sizes and orientations against a wide variety of backgrounds, even in crowds. To overcome the effects of geometric and rotational variations, the system automatically assigns the dominant orientations of each block-based feature encoding using rectangular- and circular-type histograms of oriented gradients (HOG), which are insensitive to various lightings and noise in outdoor environments. Moreover, this work demonstrates that Gaussian weighting and tri-linear interpolation in HOG feature construction increase detection performance. In particular, a powerful feature selection algorithm, AdaBoost, is used to automatically select a small set of discriminative HOG features with orientation information in order to achieve robust detection results. The overall computational time is further reduced significantly, without any performance loss, by using a cascade-of-rejectors structure whose hyperplanes and weights are estimated at each stage by the AdaBoost approach.
Keywords: Human Detection, Histograms of Oriented Gradients, Cascaded AdaBoost.
1 Introduction

Human detection is a key capability for applications in robotics, surveillance, and automated personal assistance. The main challenge is the amount of variation in visual appearance owing to clothing, articulation, cluttered backgrounds, and illumination conditions, particularly in outdoor scenes. A number of approaches to detecting humans in images using various feature representations and learning methods have been proposed in the literature. The work in [4] offers a detailed survey of human detection and analysis. Papageorgiou et al. [14] describe a polynomial Support Vector Machine (SVM) method to detect pedestrians in images; this work used Haar wavelet features to represent a detection window. A part-based variant was presented by Mohan et al. [12], and an optimized version by Depoortere et al. [2]. Viola et al. [19] and Yang et al. [21] used Haar-like features and the AdaBoost algorithm and then built cascaded systems for efficient moving-person detection. Felzenszwalb et al. [3] build an
articulated body from its parts, where each part is represented by a Gaussian derivative filter at different scales and orientations. Their approach is similar to the works of Ioffe & Forsyth [7] and Ronfard et al. [15]. Gavrila et al. [5] implement a real-time pedestrian detection system by comparing edge images to an exemplar dataset using the chamfer distance. Leibe et al. [8] use the Implicit Shape Model (ISM) combined with a global verification stage based on silhouettes to detect humans; the improved version by Seemann et al. [17] presented a 4-D ISM approach. Mikolajczyk et al. [11] used a single hierarchical codebook representation and Munder et al. [13] used an SVM with LRF features, both capable of detecting humans in images. Orientation histograms have been used extensively in [1], [9], [11], [20], and [22]. The Histograms of Oriented Gradients (HOG) features of Dalal & Triggs [1] have provided excellent performance compared with other existing edge- and gradient-based features. They used a dense grid of normalized HOG features computed over 16×16 fixed-size blocks to represent a 64×128-pixel detection window, and then trained a linear SVM as a binary classifier for their human detection system. Unfortunately, the computational time of their system was approximately 7 seconds for a 320×240-pixel image using a dense scanning methodology. Zhu et al. [22] improved the Dalal & Triggs approach by integrating a cascade-of-rejectors approach with HOG features of variable-size blocks in order to achieve a fast and accurate human detection system; this sped up detection by up to 70 times while maintaining an accuracy level similar to that of [1]. Inspired by their works, this paper designs a large set of blocks of multiple types, sizes, and locations. The system automatically assigns the dominant orientations of each block to overcome the effects of geometric and rotational variations; therefore, each local pixel in a block can be described relative to its dominant orientations in order to achieve rotation invariance. For the construction of the HOG features of each block, our method differs from that of Zhu et al. [22], which omitted the Gaussian block-weighted window and tri-linear interpolation steps. These steps are important for constructing the HOG features, as demonstrated in our experiments. The AdaBoost approach has established itself as a powerful learning algorithm that can be used for feature selection [18]. Therefore, this work uses the AdaBoost approach to select a small set of discriminative HOG features well suited for human detection by constructing a cascade-of-rejectors system. Our experimental results show that the performance of our system is better than that of previous works.
2 Training Process

Our human detection system consists of a training process and a testing process, as shown in Figs. 1 and 6. The training process contains four modules. The gradient computation module creates the gradient image for each positive or negative training example image. The second module designs rectangular- and circular-type blocks, which vary in size and position within the training examples. The system automatically evaluates the dominant orientations of each block to achieve
rotation invariance. The HOG feature construction module then encodes the information of each block to construct its HOG features. Finally, the human detection module applies a cascaded AdaBoost approach for human detection.

2.1 Gradient Computation

The gradients of all sample pixels in the positive and negative training example images, each of which has a 64×128-pixel resolution, are computed using the 1-D discrete derivative mask [-1, 0, 1]. The central differences across x and y at pixel location (x, y) are d_x(x, y) and d_y(x, y), respectively:

d_x(x, y) = I(x + 1, y) - I(x - 1, y)   (1)

d_y(x, y) = I(x, y + 1) - I(x, y - 1)   (2)

where I(x, y) is the pixel gray value at location (x, y) in the positive or negative training example image. The gradient magnitude m(x, y) at a sample pixel location (x, y) is evaluated as

m(x, y) = \sqrt{d_x(x, y)^2 + d_y(x, y)^2}   (3)

The gradient direction angle θ(x, y) at sample pixel location (x, y) is measured relative to the x axis of the image space and is given by

θ(x, y) = \tan^{-1}\big( d_y(x, y) / d_x(x, y) \big)   (4)
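For concreteness, a minimal NumPy sketch of Equations (1)–(4) on a grayscale image is given below; the function name and array layout are our own illustrative choices rather than part of the original system.

```python
import numpy as np

def compute_gradients(image):
    """Central differences with the [-1, 0, 1] mask, plus magnitude and angle (Eqs. 1-4)."""
    img = image.astype(np.float64)
    dx = np.zeros_like(img)
    dy = np.zeros_like(img)
    dx[:, 1:-1] = img[:, 2:] - img[:, :-2]     # d_x(x, y) = I(x+1, y) - I(x-1, y)
    dy[1:-1, :] = img[2:, :] - img[:-2, :]     # d_y(x, y) = I(x, y+1) - I(x, y-1)
    magnitude = np.sqrt(dx ** 2 + dy ** 2)     # Eq. (3)
    angle = np.arctan2(dy, dx)                 # signed version of Eq. (4), in radians
    return magnitude, angle
```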
Fig. 1. Workflow of the training process
2.2 Block Type Assignment and Block Rotation Invariance

Block Type Assignment: In [1], the authors used only fixed-size blocks of 16×16 pixels to construct the HOG features. Each block consisted of 2×2 spatial cells, and each cell was 8×8 pixels. However, fixed-size blocks encode very limited information about the positive and negative training example images. Therefore,
Zhu et al. [22] used variable-size blocks to improve the detection performance of [1]. They considered all blocks whose sizes range from 12×12 to 64×128 pixels and whose width-to-height ratio is one of (1:1), (1:2), and (2:1). These three ratios define three types of blocks, as shown in Figs. 2(a), (c), and (e). In total, 5031 blocks are defined in a 64×128-pixel training example image in the approach of [22]. However, they used only rectangular blocks to encode information from the training example images. In this work, we additionally use circular blocks to encode more information from the training example images; the sizes and ratios of the circular blocks vary in the same way as those of the rectangular blocks, as shown in Figs. 2(b), (d), and (f). Therefore, a total of 10062 blocks are defined in a 64×128-pixel training example image in this work.
Fig. 2. Six types of blocks and corresponding multivariate Gaussian-weighted windows. (c: cell, (a:b): ratio between width and height of block).
Block Rotation Invariance: For each block, our system automatically evaluates one or more orientations, so that each local pixel in the block can be described relative to the block orientations to achieve rotation invariance. This is very different from the approaches in [1] and [22], which omit the orientation information of blocks. The gradient direction angles of all sample pixels in the block are voted into a 36-bin orientation histogram whose bins are evenly spaced over 0°–360°. Each sample pixel contributes to the orientation histogram a weight given by its gradient magnitude multiplied by a multivariate Gaussian-weighted window with covariance matrix C, defined as

C = \begin{pmatrix} 1.5 \times S_W & 0 \\ 0 & 1.5 \times S_H \end{pmatrix}^{2}   (5)

where S_W and S_H are one-half the width and one-half the height of the block, respectively. The multivariate Gaussian-weighted window for each type of block is shown in Fig. 2. A parabola is then fitted to the values of the 3 bins nearest the histogram peak to interpolate the orientation for improved accuracy. Finally, the maximum value in the orientation histogram determines the dominant orientation of the block. In addition, after the maximum value in the histogram is detected, any other peak higher than 80% of the maximum is used to create a new block example based on that orientation. This process is very similar to the keypoint orientation assignment in [9].
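The dominant-orientation assignment just described can be sketched as follows: 36-bin voting weighted by gradient magnitude and the Gaussian window of Eq. (5), parabolic peak refinement, and secondary peaks above 80% of the maximum. The code reuses the magnitude/angle arrays of the previous sketch; all names are illustrative.

```python
import numpy as np

def dominant_orientations(mag, ang, n_bins=36, peak_ratio=0.8):
    """Return the dominant orientation(s) of a block, in degrees.
    mag, ang: gradient magnitude and angle arrays covering the block's pixels."""
    h, w = mag.shape
    # Multivariate Gaussian weight centred on the block, sigma = 1.5 * half-size (Eq. 5)
    ys, xs = np.mgrid[0:h, 0:w]
    sw, sh = w / 2.0, h / 2.0
    gauss = np.exp(-0.5 * (((xs - sw) / (1.5 * sw)) ** 2 + ((ys - sh) / (1.5 * sh)) ** 2))
    deg = np.degrees(ang) % 360.0
    hist, _ = np.histogram(deg, bins=n_bins, range=(0.0, 360.0), weights=mag * gauss)
    orientations = []
    max_val = hist.max()
    for b in range(n_bins):
        left, right = hist[(b - 1) % n_bins], hist[(b + 1) % n_bins]
        if hist[b] >= peak_ratio * max_val and hist[b] > left and hist[b] > right:
            # Parabolic interpolation over the 3 nearest bins around the peak
            offset = 0.5 * (left - right) / (left - 2.0 * hist[b] + right)
            orientations.append(((b + 0.5 + offset) * 360.0 / n_bins) % 360.0)
    return orientations
```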
2.3 HOG Feature Construction

The HOG feature is fundamentally a nonlinearly normalized gradient descriptor, very similar to the SIFT descriptor, in which each descriptor covers a 2×2 subregion [9]. In the Dalal & Triggs approach [1], each cell of a 16×16-pixel block consists of 8×8 gradient direction angles, which are weighted and voted into a 9-bin orientation histogram whose bins are evenly spaced over 0°–180° ("unsigned" gradient); the votes are weighted by the gradient magnitudes and by a Gaussian block-weighted window with standard deviation equal to one-half the block width. The votes are accumulated into the orientation bins of histograms spread over the block's rectangular region. To reduce aliasing, tri-linear interpolation is used to distribute the value of each gradient into adjacent orientation bins of the histogram. The four orientation histograms of each block are then integrated into a 36-dimensional (36-D) HOG feature vector, and each HOG feature vector is normalized to L2 unit length by Lowe's normalization method [9]. Finally, each 64×128-pixel training example image is described by a dense (in fact overlapping) grid of 105 HOG features, which are used to train an SVM-based window classifier. In this work, the HOG features are constructed from 10062 blocks of multiple types, sizes, and locations. The system automatically evaluates one or more orientations for each block, based on the local image properties determined in the previous module, and the information of each block is encoded into its HOG feature with respect to its orientation. To construct the HOG feature, each cell of a block is weighted and voted into a 9-bin orientation histogram according to its gradient magnitudes and a multivariate Gaussian block-weighted window with covariance matrix

K = \begin{pmatrix} S_W & 0 \\ 0 & S_H \end{pmatrix}^{2}

Following this, the four orientation histograms are integrated into a 36-D HOG feature vector, as shown in Fig. 3, and each HOG feature is normalized to L2 unit length. Therefore, more than 10062 normalized HOG features are defined in a 64×128-pixel training example image in this work. Although the approach in [22] demonstrated that the variable-size block method gives higher detection accuracy than the fixed-size block method, it did not use the multivariate Gaussian block-weighted windows and tri-linear interpolation when the sample pixels of each cell were weighted and voted into an orientation histogram.
Fig. 3. Two examples of HOG features, created by computing the gradient magnitude and direction angle of each image sample pixel within the block. The sample pixels of each cell are accumulated, weighted by their gradient magnitudes and the corresponding multivariate Gaussian block-weighted window (indicated by the overlaid circle), and voted into a 9-bin orientation histogram. The four orientation histograms are integrated into a 36-D normalized HOG feature vector.
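A minimal sketch of the per-block 36-D HOG construction described above is shown below (2×2 cells, 9 unsigned-orientation bins per cell, magnitude and Gaussian weighting, L2 normalization). For brevity it uses simple hard binning instead of the full tri-linear interpolation, so it is an approximation of the descriptor rather than the exact one.

```python
import numpy as np

def block_hog(mag, ang, eps=1e-7):
    """36-D HOG vector for one block: 2x2 cells x 9 unsigned orientation bins."""
    h, w = mag.shape
    ys, xs = np.mgrid[0:h, 0:w]
    sw, sh = w / 2.0, h / 2.0
    gauss = np.exp(-0.5 * (((xs - sw) / sw) ** 2 + ((ys - sh) / sh) ** 2))
    weights = mag * gauss
    deg = np.degrees(ang) % 180.0            # "unsigned" gradient, 0-180 degrees
    feature = []
    for cy in range(2):                      # 2x2 spatial cells
        for cx in range(2):
            sl = (slice(cy * h // 2, (cy + 1) * h // 2),
                  slice(cx * w // 2, (cx + 1) * w // 2))
            hist, _ = np.histogram(deg[sl], bins=9, range=(0.0, 180.0),
                                   weights=weights[sl])
            feature.extend(hist)
    feature = np.asarray(feature)
    return feature / (np.linalg.norm(feature) + eps)   # L2 normalization
```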
In our experiment, omitting these two steps reduces the recall rate from 94% to 91% at 10^{-4} FPPW. Therefore, we do not omit the multivariate Gaussian block-weighted windows and tri-linear interpolation in this work.

2.4 Human Detection Using a Cascaded AdaBoost Approach

More than 10062 HOG features are now defined in each positive or negative 64×128-pixel training example image. We intend to select a meaningful subset of HOG features that are discriminative and distinctive. In this work, we use the AdaBoost algorithm [18] to select a small number of weighted HOG features, i.e., weak classifiers, which are integrated into a strong classifier. Each weak classifier is selected by evaluating the positive and negative training datasets, and the classifier showing the lowest error is chosen. We use the normalized difference score s as the similarity measurement function of a weak classifier:

s(q, p) = \frac{q}{\|q\|} - \frac{p}{\|p\|}   (6)

where q is the separating hyperplane that exhibits the lowest error rate for the HOG feature, and p is the HOG feature of the query image.
Fig. 4. The cascaded AdaBoost consists of a sequence of detection stages. The first several stages can eliminate a large number of negative examples and retain almost all positive examples, with little processing time. The last several stages eliminate remaining negative examples, but take much more processing time than did the first several stages.
Fig. 5. Some details of the cascaded AdaBoost detector. (a) The number of weak classifiers in each stage. (b) The rejection rate as cumulative sum over cascaded stages.
To increase the speed of the detector, we construct a cascaded AdaBoost detector, which rejects many negative examples while detecting almost all positive examples. The cascade is a sequence of detection stages, each designed to have a high detection rate while achieving a high rejection rate. In this work, we require a minimum detection rate of 0.9995 and a maximum false positive rate of 0.55 for each stage. The cascaded training process took a few days on a PC with a 3.2 GHz CPU and 4 GB of memory. A schematic depiction of the cascaded AdaBoost
approach is shown in Fig. 4. The final detector is a 34-stage cascaded AdaBoost detector comprising 813 HOG features. Each HOG feature involves parameters specifying the type, size, and position of a block in the 64×128-pixel detection window. The first stage in the cascade is constructed using five HOG features and rejects approximately 60% of non-humans (negatives) while correctly detecting nearly 100% of humans (positives). The next stage has five features and rejects 80% of non-humans while detecting almost all humans. More stages are added until the false positive rate is nearly zero, while still maintaining a high correct detection rate. Details of the cascaded AdaBoost detector are shown in Fig. 5.
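Schematically, a detection window is evaluated by such a cascade as in the sketch below; the stage weights, thresholds, and feature functions are placeholders rather than the trained 34-stage detector described above.

```python
def classify_window(window, stages):
    """Evaluate one detection window against a cascade of boosted stages.
    `stages` is a list of (weak_learners, stage_threshold) pairs; each weak learner
    is a (feature_fn, weight) pair whose feature_fn returns a score in {-1, +1}."""
    for weak_learners, stage_threshold in stages:
        score = sum(alpha * feature_fn(window) for feature_fn, alpha in weak_learners)
        if score < stage_threshold:
            return False          # rejected early: most negatives exit in the first stages
    return True                   # survived all stages: report a human detection
```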
3 Testing Process

In the testing process, as shown in Fig. 6, each testing image, e.g., 320×240 pixels, is first down-sampled iteratively by a factor of 8/9 from the original resolution of 320×240 pixels (level 0) to 178×133 pixels (level 5); the image height at level 5 is only slightly larger than the height of the detection window. Detection windows are generated by shifting pixel by pixel across the image at each level, producing approximately 70000 windows for each input 320×240-pixel image. Each detection window is then classified as a human (positive) or non-human (negative) example by our cascaded AdaBoost detector. On average, about 8.5 block evaluations are needed to classify a 64×128-pixel detection window, which is more than 12.3 times faster than the approach in [1] with its 105 blocks. The computational time of our system is 0.55 seconds per 320×240-pixel image, which is better than the approach of [1], which required 7 seconds.
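The multi-scale scanning loop described above can be sketched as follows. The 8/9 pyramid factor, six levels, and 64×128 window come from the text; the nearest-neighbour resize and the `classifier` callable (e.g., the cascade sketch above) are illustrative stand-ins.

```python
import numpy as np

def resize_nearest(img, new_h, new_w):
    """Nearest-neighbour resize, a simple stand-in for a proper image resampler."""
    rows = np.arange(new_h) * img.shape[0] // new_h
    cols = np.arange(new_w) * img.shape[1] // new_w
    return img[rows][:, cols]

def scan_image(image, classifier, win_h=128, win_w=64, factor=8.0 / 9.0, levels=6, stride=1):
    """Slide a 64x128 window over a pyramid down-sampled by 8/9 per level (levels 0-5).
    `classifier` is any callable mapping a window to True (human) or False."""
    detections = []
    level_img, scale = image, 1.0
    for _ in range(levels):
        h, w = level_img.shape[:2]
        for y in range(0, h - win_h + 1, stride):
            for x in range(0, w - win_w + 1, stride):
                if classifier(level_img[y:y + win_h, x:x + win_w]):
                    # map the hit back to original-image coordinates
                    detections.append((x / scale, y / scale, win_w / scale, win_h / scale))
        scale *= factor
        level_img = resize_nearest(image, int(round(image.shape[0] * scale)),
                                   int(round(image.shape[1] * scale)))
    return detections
```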
Fig. 6. Workflow of our testing process

Table 1. Two datasets compiled from the website of the work in [1] for research purposes. Each example image contains 64×128 pixels. (Unit: number of examples.)

Dataset                                  Positive examples    Negative examples
Training dataset (2416+1218 images)      2416                 12180
Testing dataset (741 images)             555 persons in the 741 images
Each successful human detector usually produces redundant detection windows, of the same or different scales, around each human in an image. The work in [18] combined all overlapping candidate windows and obtained good results for face detection, but this non-overlapping constraint may be too strict for closely spaced targets, which themselves cause overlapping candidate windows. To solve the overlapping problem, this work uses the non-maximum suppression method of [1]. In [1], the authors propose a robust fusion of overlapping detections in the 3-D position (x, y) and scale (s) space, taking the detection response into account. This non-maximum suppression algorithm applies a mean-shift mode detection procedure to locate a pre-defined model density, defined as a human, in the 3-D position and scale space. The list of all located modes then gives the final fused detections.
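The paper adopts the mean-shift-based fusion of [1]; as a rough and deliberately simpler stand-in, the sketch below shows a greedy overlap-based non-maximum suppression that merges redundant windows by score. It illustrates the general idea only and is not the mean-shift procedure described above.

```python
def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes and drop boxes that overlap them too much.
    boxes: list of (x, y, w, h); scores: matching list of detection scores."""
    def iou(a, b):
        ax, ay, aw, ah = a
        bx, by, bw, bh = b
        ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
        iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
        inter = ix * iy
        return inter / (aw * ah + bw * bh - inter + 1e-9)

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in kept):
            kept.append(i)
    return [boxes[i] for i in kept]
```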
4 Experimental Results

4.1 Databases and Performance Evaluation Method

Databases: We downloaded two datasets from the website of [1], as shown in Table 1. The first is the training dataset, containing 2416 person images and 1218 person-free images. We selected the 2416 person images as positive training examples and collected 12180 person-free windows, sampled randomly from the 1218 person-free images, as negative training examples. Both the positive and negative training example images are 64×128 pixels. To reduce the occurrence of false alarms, additional negative (non-human) examples are compiled from the false-acceptance windows obtained by applying the human detection process to the 1218 person-free images. The second set is the testing dataset, containing 555 persons in 741 images of different sizes.

Performance Evaluation Method: The detection performance is evaluated in the same way as in [1], by plotting the Detection Error Tradeoff (DET) curve of miss rate versus FPPW (False Positives Per Window). The terms are defined as follows:

miss rate = 1 - recall rate = \frac{\text{Number of false alarms}}{\text{Number of detected positives} + \text{Number of false alarms}}   (7)

FPPW = \frac{\text{Number of false alarms}}{\text{Total number of testing negative examples}}   (8)
Based on the DET curve, a better detector should achieve both a lower miss rate and a lower FPPW. In this work, we often use the miss rate at 10^{-4} FPPW as a reference point on the DET curve, as in [1].

4.2 Experiments

We performed a variety of experiments with the proposed system to evaluate its performance using the training and testing datasets shown in Table 1. In the first experiment, we compared the performance of our system with four combinations of block types on the testing dataset in order to choose the
best combination. Fig. 7(a) shows the classification performance of our system for the four combinations of block types as four curves. The DET curves clearly show that using only the 16×16-pixel blocks gives the least accuracy and that adding block types contributes significantly to the detection performance of our system. The combination of all six types of blocks gives the lowest miss rate (6% at 10^{-4} FPPW); restricting the combination to block types 1, 3, and 5, as in [22], reduces the performance by 2.5% at 10^{-4} FPPW. In the second experiment, we demonstrated that block orientation assignment significantly increases the detection performance of our system. In Fig. 7(b), the detection performance increases by approximately 3% at 10^{-4} FPPW when the block orientation assignment is performed before the HOG feature construction module; the results thus confirm that the local orientation assignment for each block gives the most stable performance. In the third experiment, we examined the importance of the multivariate Gaussian block-weighted window and tri-linear interpolation steps during construction of the HOG features. In Fig. 7(c), the results indicate that omitting these two steps decreases the detection performance from 94% to 91% at 10^{-4} FPPW; including them is effective in avoiding aliasing effects, in which the HOG descriptor changes abruptly as a sample shifts smoothly from one histogram to another, or from one orientation to another. In [1], the authors demonstrated that the HOG feature outperforms other existing feature representations, such as Haar-like wavelets, PCA-SIFT, and Shape Contexts. Therefore, in the final experiment we compared only
Fig. 7. (a) Classification performance of our detection system using four combinations of block types. (b) Without block orientation assignment to achieve invariance to image rotation, the performance decreases by about 3%. (c) Without the multivariate Gaussian block-weighted window and tri-linear interpolation in HOG feature construction, the detection rate decreases by about 3%. (d) DET curves comparing the approaches in [1] and [22] with our detection system; our system achieves a lower miss rate at 10^{-4} FPPW.
Fig. 8. Some typical results obtained with our detection system
the Dalal & Triggs approach [1] and the Zhu et al. approach [22] with our detection system. The DET curves are presented in Fig. 7(d). The results indicate that our detection system has better detection performance than the approaches in [1] and [22]. We observe that HOG features located in specific types, sizes, locations, and orientations of blocks achieve much higher accuracy when selected by the AdaBoost approach. The results demonstrate that the performance of our system is superior to those of [1] and [22] because the most informative blocks, in contrast to background blocks, are selected. Furthermore, the computational time of our system, using the cascaded AdaBoost approach, is reduced by a factor of 12.3 compared with the approach in [1]. Fig. 8 shows some typical results of our proposed detection system.
5 Conclusion

We have developed a novel learning-based human detection system. For the feature representation, we use rectangular- and circular-type HOG features, which are insensitive to various lightings and noise, and construct a large set of blocks of multiple types, sizes, locations, and orientations to overcome the effects of geometric and rotational variations. The discriminative HOG features are automatically selected using the AdaBoost approach. This work has examined the effects of block types and orientation assignment on HOG feature construction to obtain good performance. In addition, we have demonstrated that constructing HOG features with Gaussian block-weighted windows and tri-linear interpolation increases detection performance by 3%. The cascaded AdaBoost approach reduces the computational time by a factor of 12.3 compared with the approach in [1]. Finally, our detection system achieves a lower miss rate than the previous approaches in [1] and [22] at 10^{-4} FPPW.
References 1. Dalal, N., Triggs, B.: Histogram of Oriented Gradients for Human Detection. In: CVPR. IEEE Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 886–893 (2005) 2. Depoortere, V., Cant, J., Bosch, B.V., Prins, J.D., Fransens, R., Gool, L.V.: Efficient Pedestrian Detection: A Test Case for SVM Based Categorization. In: Workshop on Cognitive Vision (2002)
3. Felzenszwalb, P.F., Huttenlocher, D.P.: Pictorial Structures for Object Recognition. International Journal of Computer Vision (IJCV), 55–79 (2005) 4. Gavrila, D.M.: The Visual Analysis of Human Movement: A Survey. Computer Vision and Image Understanding 73, 82–98 (1999) 5. Gavrila, D.M., Giebel, J., Munder, S.: Vision-Based Pedestrian Detection: The Protector System. In: IEEE Intelligent Vehicles Symposium, pp. 13–18. IEEE Computer Society Press, Los Alamitos (2004) 6. Gerónimo, D., Sappa, A.D., López, A., Ponsa, D.: Pedestrian Detection Using Adaboost Learning of Features and Vehicle Pitch Estimation. In: International Conf. on Visualization, Imaging and Image Processing, pp. 400–405 (2006) 7. Ioffe, S., Forsyth, D.A.: Probabilistic Methods for Finding People. In: IJCV, pp. 45–68 (2001) 8. Leibe, B., Seemann, E., Schiele, B.: Pedestrian Detection in Crowded Scenes. In: IEEE Conf. on CVPR, pp. 878–885. IEEE Computer Society Press, Los Alamitos (2005) 9. Lowe, D.: Distinctive Image Features from Scale-Invariant Keypoints. In: IJCV, pp. 91– 110 (2004) 10. Mikolajczyk, K., Leibe, B., Schiele, B.: Multiple Object Class Detection with a Generative Model. In: IEEE Conf. on CVPR, pp. 26–36. IEEE Computer Society Press, Los Alamitos (2006) 11. Mikolajczyk, K., Schmid, C., Zisserman, A.: Human Detection Based on a Probabilistic Assembly of Robust Part Detections. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, pp. 69–81. Springer, Heidelberg (2004) 12. Mohan, A., Papageorgiou, C., Poggio, T.: Example-Based Object in Image by Components. IEEE Tran. on Pattern Analysis and Machine Intelligence, 349–361 (2001) 13. Munder, S., Gavrila, D.M.: An Experimental Study on Pedestrian Classification. IEEE Tran. on Pattern Analysis and Machine Intelligence (PAMI), 1863–1868 (2006) 14. Papageorgiou, C., Poggio, T.: A Trainable System for Object Detection. In: IJCV, pp. 15– 33 (2000) 15. Ronfard, R., Schmid, C., Triggs, B.: Learning to Parse Pictures of People. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, pp. 700–714. Springer, Heidelberg (2002) 16. Schneiderman, H., Kanade, T.: Object Detection Using the Statistics of Parts. In: IJCV, pp. 151–177 (2004) 17. Seemann, E., Leibe, B., Schiele, B.: Multi-Aspect Detection of Articulated Objects. In: IEEE Conf. on CVPR, pp. 1582–1588. IEEE Computer Society Press, Los Alamitos (2006) 18. Viola, P., Jones, M.: Rapid Object Detection Using a Boosted Cascade of Simple Features. In: IEEE Conf. on CVPR, pp. 511–518. IEEE Computer Society Press, Los Alamitos (2001) 19. Viola, P., Jones, M., Snow, D.: Detecting Pedestrians Using Patterns of Motion and Appearance. In: IJCV, pp. 153–161 (2005) 20. Wu, B., Nevatia, R.: Detection of Multiple, Partially Occluded Humans in a Single Image by Bayesian Combination of Edgelet Part Detectors. In: ICCV, pp. 90–97 (2005) 21. Yang, T., Li, J., Pan, Q., Zhao, C., Zhu, Y.: Active Learning Based Pedestrian Detection in Real Scenes. In: International Conf. on Pattern Recognition, pp. 20–24 (2006) 22. Zhu, Q., Avidan, S., Yeh, M.C., Cheng, K.T.: Fast Human Detection Using a Cascade of Histograms of Oriented Gradients. In: IEEE Conf. on CVPR, pp. 1491–1498. IEEE Computer Society Press, Los Alamitos (2006)
Multi-posture Human Detection in Video Frames by Motion Contour Matching

Qixiang Ye, Jianbin Jiao, and Hua Yu

Graduate University of Chinese Academy of Science
{Qxye,jiaojb,Yuh}@gucas.ac.cn
Abstract. In this paper, we propose a method for moving-human detection in video frames by motion contour matching. First, temporal and spatial differences between frames are calculated and contour pixels are extracted by global thresholding as the basic features. Then, skeleton templates with multiple representative postures are built on these features to represent multi-posture human contours. In the detection procedure, a dynamic programming algorithm is adopted to find the best global match between the built templates and the extracted contour features. Finally, a thresholding method is used to classify a matching result as a moving human or a negative. Scale variation and interpersonal contour differences are also considered in the matching process. Experiments on real video data prove the effectiveness of the proposed method.
Keywords: Human detection, motion contour, dynamic programming.
1 Introduction

Detecting humans in video frames is important for many applications, such as visual surveillance, traffic systems, smart rooms, and early threat assessment [1][2]. Precise moving-human detection algorithms with low false alarm rates will push forward the development of automated visual surveillance techniques on optical or infrared images. In the past years, much work has been done on human detection in images and video frames. The various image features and methodologies employed in these works can be categorized into three classes: 1) human detection by background subtraction, 2) human detection directly in a single image (frame) using image features, and 3) human detection using motion features, tracking cues, or a combination of motion and static image features. In the early years, human detection was performed based on background subtraction technologies (frame differencing or background modeling) and simple region-analysis features such as region area, region width/height ratio, and region moments [2][3]. It is difficult for these methods to discriminate moving humans from other objects, since pure region features cannot represent humans effectively; when the background is moving or there are illumination changes, many false alarms are detected. In [4], Gavrila et al. used a shape-based grey-value template matching method to detect humans. The dissimilarity between a template and a human candidate is evaluated
using the chamfer distance. They argue that the template matching method can deal with the challenging scenario of a moving camera mounted on a vehicle. The authors also included a second verification stage based on a neural network architecture that operates on the image patches detected by template matching. In [5], Mohan et al. proposed a human detection method that learns component classifiers using Haar wavelet features and an SVM (support vector machine) classifier. In [6], Dalal et al. presented a descriptor of oriented gradients and motion flow for human shape representation, and then trained an SVM classifier on this descriptor to detect pedestrians in video frames. In [7], Leibe et al. combined local information from sampled appearance features with global cues about an object's silhouette for human detection; by using both segmentation and classification, the flexible nature of their approach allows it to operate against complex backgrounds. Although using pure grey-value features for human contour representation is an intuitive idea, it usually requires a very large number of training samples to build a model, especially when facing humans of various views and postures. In [8], Cutler et al. used time-frequency features from the short-time Fourier transform (STFT) for pedestrian detection, exploiting the fact that human motion is periodic and repetitive. Viola et al. used an AdaBoost classifier trained on human shape and motion features for human detection; their shape features were extracted by rectangle filters on frame grey values and optical flow, and the method obtained good detection results on a large pedestrian dataset [9]. Haga et al. classified moving objects as human or other using motion uniqueness and continuity features with a linear classifier [10]. In [11], Zhang et al. proposed a method for detecting humans against a moving background (a moving camera on a car): they first calculate image interest points and then use the FOE (focus of expansion) with residuals to judge whether an object can be classified as a moving human. There has also been research on pedestrian detection in infrared images in recent years [12]. By using motion features, static features, or a combination of the two, the performance of human detection can be improved to some extent. However, the multi-posture problem in human detection has not been fully considered in these works, which may limit their usability in many real applications. Our approach builds on motion contour features and template matching. The contributions of this paper are: i) the development of a multi-posture human contour representation method, and ii) the implementation of human detection by template matching with a dynamic programming approach, which performs global optimization and tolerates small contour deformations.
2 Human Detection Algorithm

The human detection algorithm has three parts. The first part is feature extraction, on which the multi-posture templates are built. The second is optimized contour matching with dynamic programming, and the third is human/non-human determination. During detection, features are extracted in the same way as in template building. In presenting the detection algorithm we consider only single-scale human detection; multi-scale human detection is carried out by resizing the original frame.
2.1 Template Building on Contour Features

Videos with simple, static backgrounds are selected for building the templates, to ensure that the foreground objects can be easily segmented from the background. Each video frame is smoothed and then intra- and inter-frame difference values are calculated as follows:
F_{Intra}(x, y, σ) = \big( (L_t(x-1, y, σ) - L_t(x, y, σ))^2 + (L_t(x, y, σ) - L_t(x, y-1, σ))^2 \big)^{1/2}   (1)

F_{Inter}(x, y, σ) = | L_t(x, y, σ) - L_{t-n}(x, y, σ) |   (2)

L_t(x, y, σ) = G(x, y, σ) * I_t(x, y)   (3)

where I_t(x, y) represents the brightness of the pixel at location (x, y) in the t-th frame, and G(x, y, σ) is a Gaussian function in which σ is the smoothing scale factor. Based on the results of (1) and (2), the contour features F(x, y) can be calculated by the following function:

F(x, y) = \begin{cases} 1, & \text{if } F_{Inter} > T_{Inter} \text{ and } F_{Intra} > T_{Intra} \\ 0, & \text{otherwise} \end{cases}   (4)

where T_{Inter} and T_{Intra} are two global thresholds determined by histogram cavity analysis methods [13]. When detecting moving humans in a video, the values of T_{Inter} and T_{Intra} should ensure that enough candidate moving pixels are obtained so that moving objects are not missed.
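As an illustration of Equations (1)–(4), here is a minimal NumPy sketch of the thresholded intra/inter-frame contour feature map. The separable Gaussian smoothing, frame gap, and default thresholds are illustrative choices rather than values from the paper.

```python
import numpy as np

def gaussian_smooth(frame, sigma):
    """Separable Gaussian smoothing L_t = G * I_t (Eq. 3), written directly in NumPy."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    g = np.exp(-0.5 * (x / sigma) ** 2)
    g /= g.sum()
    tmp = np.apply_along_axis(lambda r: np.convolve(r, g, mode="same"), 1, frame.astype(float))
    return np.apply_along_axis(lambda c: np.convolve(c, g, mode="same"), 0, tmp)

def contour_features(frame_t, frame_t_minus_n, sigma=1.0, t_inter=15.0, t_intra=10.0):
    """Binary contour map F(x, y) from intra- and inter-frame differences (Eqs. 1, 2, 4)."""
    L_t = gaussian_smooth(frame_t, sigma)
    L_p = gaussian_smooth(frame_t_minus_n, sigma)
    f_intra = np.zeros_like(L_t)
    f_intra[1:, 1:] = np.sqrt((L_t[1:, :-1] - L_t[1:, 1:]) ** 2 +   # horizontal difference
                              (L_t[:-1, 1:] - L_t[1:, 1:]) ** 2)    # vertical difference
    f_inter = np.abs(L_t - L_p)
    return ((f_inter > t_inter) & (f_intra > t_intra)).astype(np.uint8)
```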
Fig. 1. Multi-posture human contour templates
The built templates should account for variation in human height. At present, four persons with heights of 1.60, 1.70, 1.80, and 1.90 meters are employed to capture the templates. Ten typical postures of a moving human are selected for each person; the templates of one person are shown in Fig. 1. Clearly, 40 templates (10 postures of 4 persons) cannot represent all moving humans, because there are certainly differences between a person in the templates and a person who was not included. Therefore, the matching algorithm requires some tolerance for deformation of the human body. In this paper we aim only to justify the feasibility of the matching method; more postures of more persons will be used to build the templates in future work. Foreground features can be extracted from a given video frame using equation (4). A built template is then matched with these moving pixels to search for moving humans. Taking the first template as an example, the detected
foreground pixels form a profile, as shown in Fig. 2. Discrete key points are then sampled on this profile to obtain the human contour features. In the sampling process, equally spaced horizontal lines (as shown in the second image of Fig. 2b) scan the foreground profile. Each line has two intersection points with the profile, on the left and right sides respectively, which are the two features we want. The reason we use discrete features instead of the continuous profile is that experiments have shown that the former is more tolerant of various human body shapes and of some body deformation.
Fig. 2. Human contour features. (a) The extracted profile; (b), (c) the discrete features; (d) the angle and distance between two features; (e) the searching window for features.
2.2 Single Template Matching
Given a built human contour template, we represent its features by \{F_t\}, t = 0, 1, ..., T, where T is the number of features, obtained by searching from the top left of the profile to the bottom right and then to the top right, as illustrated in the second image of Fig. 2. During the matching process, we first extract the foreground features and represent them as \{\tilde{F}_{t,i}\}, t = 0, 1, ..., T, i = 0, 1, ..., N_t, where N_t is the number of candidate features at the t-th step. The feature set of the t-th step is constructed by a square window (the searching window) shown in Fig. 2e. Given a template feature set A and a candidate feature set B, our goal is to find the best match as indicated by the following target function:

D(A, B) = \min_{\tilde{F}_t \in \{\tilde{F}_{t,i}\}} \left\{ \sum_{t=1}^{T} D(F_t, \tilde{F}_t) \right\}   (5)

where D is the distance function and D(A, B) represents the overall matching distance between the model and the candidate feature set; \tilde{F}_t is the globally optimized matching result at the t-th step. At each step, the function D is calculated as

D(F_t, \tilde{F}_t) = K_\theta(\theta_t - \tilde{\theta}_t) \cdot K_\rho(\rho_t - \tilde{\rho}_t)   (6)

where \theta_t is the angle between the line F_{t-1}F_t and the horizontal line, \rho_t is the distance between the two feature points F_{t-1} and F_t, as shown in Fig. 2, and \tilde{\theta}_t, \tilde{\rho}_t denote the corresponding quantities of the matched candidate features. K_\theta and K_\rho are two functions describing the dissimilarity between the model and the real data. In our experiments, they are chosen as Gaussian functions:

K_\theta(\theta_t - \tilde{\theta}_{t,i}) = \frac{1}{\sqrt{2\pi}\,\sigma_\theta} \exp\!\left( -\left( \frac{\theta_t - \tilde{\theta}_{t,i}}{\sigma_\theta} \right)^{2} \right), \qquad K_\rho(\rho_t - \tilde{\rho}_{t,i}) = \frac{1}{\sqrt{2\pi}\,\sigma_\rho} \exp\!\left( -\left( \frac{\rho_t - \tilde{\rho}_{t,i}}{\sigma_\rho} \right)^{2} \right)   (7)
The values of σ_θ and σ_ρ are determined experimentally with reference to the size of the template image. Higher values imply higher acceptance of posture variations of the detected targets, at the cost of higher false alarm rates; this is discussed further in the experimental part of this paper. In the detection process, a template is matched to the foreground features according to the target function (5). To solve it, we use a standard Viterbi decoding algorithm [13] to obtain the globally optimized matching result. A threshold is then applied to the matching result to determine whether a region of the image is a moving human:

H = \begin{cases} 1, & \text{if } D(A, B) > T_g \\ 0, & \text{otherwise} \end{cases}   (8)
where H represents the detection result: 1 stands for a human and 0 for a negative. T_g is a threshold whose value is determined in terms of the false alarm rate and the recall rate. Since all templates have similar sizes, T_g is the same for all templates.

2.3 Detecting Multi-scale and Multi-posture Humans
Suppose that n templates are built; we can then regard them as n human detectors representing n postures. In the detection process, a candidate image block is matched against all of the built templates in series, as shown in Fig. 3. If the matching result of any template satisfies equation (8), the image block is regarded as a human.
Fig. 3. Multi-scale and multi-posture human detection
Templates of fixed size cannot handle humans of much larger or smaller sizes. To enable the method to process multi-scale humans, a pyramid of resized video frames is used for multi-scale detection; experiments show that 8–10 scales can cover humans of various sizes. Letting D1–Dn denote the n posture detectors, the multi-scale and multi-posture human detection process is illustrated in Fig. 3.
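The dynamic-programming matching of Section 2.2, which the multi-scale, multi-posture loop applies once per template and per pyramid level, can be sketched as follows. For numerical convenience the sketch minimizes a sum of squared normalized angle/distance differences, i.e. the negative log of the Gaussian terms in Eq. (7), rather than maximizing their product; the candidate representation and parameters are illustrative placeholders.

```python
import math

def viterbi_match(template, candidates, sigma_theta=15.0, sigma_rho=5.0):
    """Dynamic-programming (Viterbi) matching of a contour template to candidate features.
    template: list of (x, y) template feature points F_0..F_T.
    candidates: candidates[t] is the list of (x, y) candidate points at step t."""
    def angle_dist(p, q):
        dx, dy = q[0] - p[0], q[1] - p[1]
        return math.degrees(math.atan2(dy, dx)), math.hypot(dx, dy)

    def step_cost(t, prev_pt, cur_pt):
        # Dissimilarity between the template segment (F_{t-1}, F_t) and the
        # candidate segment (prev_pt, cur_pt), cf. Eqs. (6)-(7).
        th_m, rh_m = angle_dist(template[t - 1], template[t])
        th_c, rh_c = angle_dist(prev_pt, cur_pt)
        return ((th_m - th_c) / sigma_theta) ** 2 + ((rh_m - rh_c) / sigma_rho) ** 2

    T = len(template) - 1
    cost = [0.0] * len(candidates[0])        # best accumulated cost per candidate at step 0
    back = []
    for t in range(1, T + 1):
        new_cost, pointers = [], []
        for cur in candidates[t]:
            best_i = min(range(len(candidates[t - 1])),
                         key=lambda i: cost[i] + step_cost(t, candidates[t - 1][i], cur))
            new_cost.append(cost[best_i] + step_cost(t, candidates[t - 1][best_i], cur))
            pointers.append(best_i)
        cost, back = new_cost, back + [pointers]
    return min(cost)    # overall matching cost D(A, B); backtrack through `back` for the path
```

Classifying a block as human then amounts to thresholding this matching result as in Eq. (8), and repeating the match for each template and each resized frame.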
3 Experiments

We have prepared a dataset of 220 video clips containing about 200,000 video frames captured from natural scenes. The frame sizes are 640×480 or 720×576 pixels. The test set covers a variety of cases, such as moving humans against a static background, moving humans against a moving background, and static humans against a moving background. The backgrounds of most of the video frames are complex, including swaying trees, moving cars, moving animals, and buildings. Fig. 4 illustrates foreground detection results for static and moving backgrounds; when the background is moving, more moving pixels are detected and detecting the human becomes more difficult.
Fig. 4. Examples of detected foreground pixels
Recall rate and false alarm rate are used to evaluate the proposed method. By adjusting the threshold in (8), we obtain the curves of recall rate and false alarm rate shown in Fig. 5, on which a tradeoff between the two can be chosen. For example, an average 86% recall rate with a 4.3% false alarm
Fig. 5. Detection performance on static (left) and moving (right) backgrounds
rate can satisfy the requirements of many real applications in intelligent video surveillance systems. Two kinds of results are given in Fig. 5 for three types of moving humans. The figures show that when the background is moving, human detection is more difficult and the performance is worse than for a static background. Since detection is performed on videos, the detection results can be smoothed by integrating multi-frame detections: if a moving human is detected in any of ten consecutive frames, we regard a moving human as present in all ten frames. The detection performance improves after this integration of multi-frame results. The figure also shows that, without the dynamic programming process, the detection performance drops considerably. Examples of contour matching results are shown in Fig. 6. After the dynamic-programming-based optimization, the final matching results are reasonable: given a moving feature set (the left image of each example), the algorithm automatically finds the most similar template and projects the feature points to reasonable positions so that the final matching distance is minimized. These results intuitively show the effectiveness of dynamic programming for contour matching.
Fig. 6. Examples of feature matching with dynamic programming
Fig. 7. Examples of human detection results
Fig. 7 shows examples of detected moving humans. Most of the humans are well detected despite variations in their postures, sizes, etc. The results also show that, even against cluttered backgrounds, the proposed method performs well under most conditions. Fig. 7d is a frame from a video clip captured from a moving vehicle with a shaking camera; two moving persons are well detected in this case. Fig. 7e contains a human that is missed by the algorithm (the lady beside the tree), which shows that a human whose brightness is similar to the background is more likely to be missed, since the foreground pixels cannot be correctly separated from the background. Fig. 7f contains a false alarm: some building exteriors look quite like human contours and are falsely classified as humans. However, we believe these false alarms could be eliminated by integrating the trajectory cues of a tracking algorithm in future work.
4 Conclusion and Future Works

In this paper, a new method is proposed for moving-human detection. Multiple human contour templates are built to represent multi-posture humans in video frames, and a dynamic programming method is employed to find the best match between candidates and templates. Experimental results on video frames have proved the effectiveness of the template matching method for human detection. The multi-scale, multi-posture detection algorithm is effective in detecting humans of various sizes and postures in videos. The speed of the algorithm should be addressed in future work so that it can run in real time. More representative templates should be built to detect more postures, such as creeping postures. Moving-object tracking algorithms can also be integrated to improve the performance of the proposed method.

Acknowledgement. This research is supported by the Bairen Project of the Chinese Academy of Sciences and partly supported by the National Science Foundation of China (NO. 60672147).
References 1. Lee, D.J., Zhan, P., Thomas, A., Schoenberger, R.B.: Shape-based Human Detection for Threat Assessment. In: Proceedings of Visual Information Processing, SPIE (2004) 2. Beleznai, C., Fruhstuck, B., Bischof, H.: Human Detection in Groups Using a Fast Meanshift Procedure. International Conference on Image Processing 1, 349–352 (2004) 3. Wren, C.R., Azarbayejani, A., Darrell, T., Pentland, A.P.: Pfinder: Real-time Tracking of Human Body. IEEE Trans. on PAMI 19, 780–785 (1997) 4. Gavrila, D.M., Giebel, J.: Shape-based Pedestrian Detection and Tracking. IEEE Intelligent Vechicle Symposium 1, 8–14 (2002) 5. Mohan, A., Papageorgiou, C., Poggio, T.: Example-based Object Detection in Images by Components. IEEE Trans.PAMI 23 (2001) 6. Dalal, N.m., Triggs, B., Schmid, C.: Human Detection Using Oriented Histograms of Flow and Appearance. European Conference on Computer Vision 2006.
7. Leibe, B., Seemann, E., Schiele, B.: Pedestrian Detection in Crowded Scenes. In: International Conference on Computer Vision and Pattern Recognition (2005) 8. Cutler, R., Davis, L.S.: Robust Real-time Periodic Motion Detection, Analysis and Applications. IEEE Trans. on PAMI 22, 781–796 (2000) 9. Viola, P., Jones, M.J., Snow, D.: Detecting Pedestrians using Patterns of Motion and Appearance. IEEE International Conference on Computer Vision 2, 734–741 (2003) 10. Haga, T., Sumi, K., Yagi, Y.: Human Detection in Outdoor Scene Using Spatio-temporal Motion Analysis. International Conference on Pattern Recognition 4, 331–334 (2004) 11. Zhang, Y., Kiselewich, S.J., Bauson, W.A., Hammoud, R.: Robust Moving Object Detection at Distance in the Visible Spectrum and Beyond Using A Moving Camera. In: Workshop of International Conference on Computer Vision and Pattern Recognition, pp. 131–134 (2006) 12. Dai, C., Zheng, Y., Li, X.: Pedestrian Detection and Tracking in Infrared Imagery Using Shape and Appearance. Int., J. CVIU. 106, 288–299 (2007) 13. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. John Wiley and Sons press, Chichester (2001)
A Cascade of Feed-Forward Classifiers for Fast Pedestrian Detection

Yu-Ting Chen(1,2) and Chu-Song Chen(1,3)

1 Institute of Information Science, Academia Sinica, Taipei, Taiwan
2 Dept. of Computer Science and Information Engineering, National Taiwan University
3 Graduate Institute of Networking and Multimedia, National Taiwan University
{yuhtyng,song}@iis.sinica.edu.tw
Abstract. We develop a method that can detect humans in a single image based on a new cascaded structure. In our approach, both rectangle features and 1-D edge-orientation features are employed in the feature pool for weak-learner selection; they can be computed via the integral-image and integral-histogram techniques, respectively. To make the weak learners more discriminative, Real AdaBoost is used for feature selection and for learning the stage classifiers from the training images. Instead of the standard boosted cascade, we propose a novel cascaded structure that exploits both the stage-wise classification information and the inter-stage cross-reference information. Experimental results show that our approach can detect people with both efficiency and accuracy.
1 Introduction
Detecting pedestrians in an image has received considerable attention in recent years. It has a wide variety of applications, such as video surveillance, smart rooms, content-based image retrieval, and driver-assistance systems. Detecting people against a cluttered background is still a challenging problem, since different postures and illumination conditions can cause large variations in appearance. In object detection, both efficiency and accuracy are important issues. In [1], Viola and Jones proposed a fast face detection framework based on a boosted cascade. This cascade structure has since been applied to many other object detection problems. For instance, Viola et al. [2] used the cascade framework for pedestrian detection. Rectangle features, which can be evaluated efficiently via the integral-image technique, are employed as the basic elements for constructing the weak learners of the AdaBoost classifier at each stage of the cascade. While rectangle features are effective for object-detection tasks such as face detection, they still encounter difficulties in detecting people, because they are built using only intensity information, which is not sufficient to encode the variance of human appearance caused by factors that produce large gray-value changes, such as clothing. Recently, Dalal and Triggs [3] presented a people detection method with promising detection performance. This method can detect people in a single
image. In this work, edge-based features, HOG (Histograms of Oriented Gradients), are designed to capture the edge-orientation structure that characterizes human images effectively. HOG features are a variant of Lowe's SIFT [4] (Scale Invariant Feature Transform), but they are computed on a dense grid with uniform spacing. Nevertheless, a limitation of this method is that a very high-dimensional feature vector is used to describe each block in an image, which requires a long computation time. To speed up the detection, Zhu et al. [5] combined the above two methods by using a linear SVM classifier with HOG features as a weak learner in the AdaBoost stages of the cascaded structure, enhancing the efficiency of the HOG approach. In this paper, we develop an object detection framework with both efficiency and accuracy. Our approach employs rectangle features and 1-D edge-orientation features that can be computed efficiently. To make the weak learner more discriminative, we use Real AdaBoost as a stage classifier in the cascade. Instead of learning a standard boosted cascade [1] for detection, a new cascading structure is introduced in this paper that exploits not only the stage-wise classification information but also the inter-stage cross-reference information, so that detection accuracy and efficiency can be further increased.
2 Previous Work
There are two main types of approaches to pedestrian detection: the holistic approach and the component-based approach. In holistic approaches, a full-body detector is used to analyze a single detection window. The method of Gavrila and Philomin [6] detects pedestrians in images by extracting edge images and matching them to a template hierarchy of learned exemplars using chamfer distances. Papageorgiou and Poggio [7] adopted a polynomial SVM to learn a pedestrian detector, where Haar wavelets are used as feature descriptors. In [1], Viola and Jones proposed a boosted cascade of Haar-like wavelet features for face detection. Subsequently, this work was extended to integrate intensity and motion information for walking-person detection [2]. Dalal and Triggs [3] designed HOG appearance descriptors, which are fed into a linear SVM for human detection. Zhu et al. [5] employed the HOG descriptor in the boosted cascade structure to speed up the people detector. Dalal et al. [8] further extended the approach in [3] by combining the HOG descriptors with oriented histograms of optical flow to handle space-time information for moving humans. The holistic approaches may fail to detect pedestrians when occlusion happens. Some component-based approaches have been proposed to deal with the occlusion problem. Generally, a component-based approach searches for a pedestrian by looking for its apparent components instead of the full body. For example, Mohan et al. [9] divided the human body into four components, head, legs, and left/right arms, and a detector is learned using an SVM with Haar features for each component. In [10], Mikolajczyk et al. used position-orientation histograms of binary edges as features to build component-based detectors of frontal/profile heads, faces, upper bodies, and legs. Though component-based approaches can
cope with the occlusion problem, a high image resolution of the detection window is required for capturing sufficient information about human components. This restricts the range of applications. For example, the resolution of humans in some surveillance videos is too low for component-based approaches to detect. In this paper, we propose a holistic human detection framework. Our approach can detect humans in a single image. It is thus applicable to cases where only single images are available, such as detecting people in home photos. The rest of this paper is organized as follows: In Section 3, the Real AdaBoost algorithm using rectangle features and EOH features is introduced. A novel cascaded structure of feed-forward classifiers is proposed in Section 4. Experimental results are shown in Section 5. Finally, a conclusion is given in Section 6.
3 Real AdaBoost with Rectangle and EOH Features

3.1 Feature Pool
Features based on edge orientations have been shown to be effective for human detection [3]. In the HOG representation [3], each image block is represented by 7 × 15 overlapping sub-blocks. Each sub-block contains 4 non-overlapping regions, where each region is represented as a 9-bin histogram, with each bin corresponding to a particular edge orientation. In this way, a 3780-dimensional feature, encoding part-based edge-orientation distribution information, is used to represent an image block. Such a representation is powerful for people detection, but it has some limitations. First, the representation is too complex to evaluate quickly, and thus the detection speed is slow. Second, all of the dimensions in an HOG feature vector are employed simultaneously, which precludes the use of only part of them, which may be sufficient to reject non-human blocks, for fast pre-filtering. A high-dimensional edge-orientation feature like HOG can be treated as a combination of many low-dimensional ones. In our approach, instead of employing a high-dimensional feature vector, we use a set of one-dimensional features derived from edge orientations, as suggested by Levi and Weiss [11]. Similar to HOG, the EOH (Edge Orientation Histogram) feature introduced in [11] also employs edge-orientation information for feature extraction, but an EOH feature characterizes only one orientation at a time, and each EOH feature is represented by a real value. Unlike HOG, which is uniquely defined for an image region, many EOH features (with respect to different orientations) can be extracted from an image region, each of which is only one-dimensional. Therefore, there is a pool of EOH features from which to select for a region. The EOH feature is thus suitable for integration into the AdaBoost or boosted-cascade approaches for weak-learner selection. In our approach, the EOH feature is employed in the AdaBoost stages of our cascading structure. Since the weak learners employed are all one-dimensional with scalar outputs, the resulting AdaBoost classifier is more efficient to compute than one that uses high-dimensional features (e.g., HOG) to build the weak learners [5]. We briefly review the EOH features in the following.
To compute EOH features, the pixel gradient magnitude m and gradient orientation θ in a block B are calculated by the Sobel edge operator. The edge orientation is evenly divided into K bins over 0° to 180°; the sign of the orientation is ignored, so orientations between 180° and 360° are treated the same as those between 0° and 180°. Then, the edge orientation histogram E_k(B) in each orientation bin k of block B is built by summing up all of the edge magnitudes whose edge orientations belong to bin k. The EOH feature we adopt is measured by the ratio of the bin value of a single orientation to the sum of all the bin values as follows:

$$F_k(B) = \frac{E_k(B) + \epsilon}{\sum_i E_i(B) + \epsilon}, \qquad (1)$$

where $\epsilon$ is a small positive value to avoid the denominator being zero. Each block thus has K EOH features, F_1(B), . . . , F_K(B), which are allowed to be selected as weak learners. Similar to the use of the integral-image technique for fast evaluation of the rectangle features, the integral histogram [12] can be used to compute the EOH features efficiently. The feature pool employed in our approach for AdaBoost learning contains the EOH features. To further enhance the detection performance, we also include the rectangle features used in [1] for weak-learner selection.
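As a concrete illustration of Eq. (1) and of the integral-histogram evaluation, the following is a minimal sketch (not the authors' code) that bins Sobel gradients into K orientation channels, builds one integral image per channel, and reads off the K one-dimensional EOH features of a block; the function names, the use of NumPy/SciPy, and the value of ε are our assumptions.

```python
import numpy as np
from scipy.ndimage import sobel  # any Sobel implementation would do

def orientation_integral_images(gray, K=9):
    """One integral image per orientation bin of the edge-magnitude histogram."""
    g = gray.astype(np.float64)
    gx, gy = sobel(g, axis=1), sobel(g, axis=0)
    mag = np.hypot(gx, gy)
    theta = np.degrees(np.arctan2(gy, gx)) % 180.0      # sign of orientation ignored
    bins = np.minimum((theta / (180.0 / K)).astype(int), K - 1)
    integrals = []
    for k in range(K):
        channel = np.where(bins == k, mag, 0.0)
        integrals.append(channel.cumsum(axis=0).cumsum(axis=1))  # integral image of bin k
    return integrals

def block_sum(ii, y0, x0, y1, x1):
    """Sum over the rectangle [y0, y1) x [x0, x1) using an integral image."""
    s = ii[y1 - 1, x1 - 1]
    if y0 > 0: s -= ii[y0 - 1, x1 - 1]
    if x0 > 0: s -= ii[y1 - 1, x0 - 1]
    if y0 > 0 and x0 > 0: s += ii[y0 - 1, x0 - 1]
    return s

def eoh_features(integrals, y0, x0, y1, x1, eps=1e-3):
    """K features F_k(B) = (E_k(B) + eps) / (sum_i E_i(B) + eps) of Eq. (1)."""
    E = np.array([block_sum(ii, y0, x0, y1, x1) for ii in integrals])
    return (E + eps) / (E.sum() + eps)
```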
3.2 Learning Via Real AdaBoost
After forming the feature pool, we learn an AdaBoost classifier for some stages of our cascading structure. Typically, the AdaBoost algorithm selects weak learners with binary-valued outputs obtained by thresholding the feature values, as shown in Fig. 1(a) [1,2,5,11]. However, a disadvantage of thresholded-type weak learners is that they are too crude to discriminate the complex distributions of the positive and negative training data. To deal with this problem, Schapire et al. [13] suggested the use of the Real AdaBoost algorithm. To represent the distributions of positive and negative data, the domain of the feature value is evenly partitioned into N disjoint bins (see Fig. 1(b)). The real-valued output in each bin is calculated according to the ratio of the training data falling into the bin. The weak-learner output then depends only on the bin to which the input data belongs. Real AdaBoost has shown better discriminating power between positive and negative data [13]. This algorithm is employed to find an AdaBoost classifier for each stage of the cascade; more details can be found in [13].
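The binned, real-valued weak learner of Fig. 1(b) can be sketched as follows. The half-log-ratio output and the smoothing constant follow Schapire and Singer's confidence-rated formulation [13]; these details, the class names, and the bin handling are our assumptions rather than the authors' implementation.

```python
import numpy as np

class BinnedRealWeakLearner:
    """Real-valued weak learner over N equal-width bins of one scalar feature."""

    def __init__(self, n_bins=10, eps=1e-6):
        self.n_bins, self.eps = n_bins, eps

    def fit(self, f, y, w):
        # f: feature values, y in {+1, -1}, w: current AdaBoost sample weights.
        self.lo, self.hi = float(f.min()), float(f.max())
        idx = self._bin(f)
        wp = np.bincount(idx, weights=w * (y > 0), minlength=self.n_bins)
        wn = np.bincount(idx, weights=w * (y < 0), minlength=self.n_bins)
        self.out = 0.5 * np.log((wp + self.eps) / (wn + self.eps))  # per-bin real output
        self.Z = 2.0 * np.sum(np.sqrt(wp * wn))  # selection criterion: smaller Z is better
        return self

    def _bin(self, f):
        t = (f - self.lo) / max(self.hi - self.lo, 1e-12)
        return np.clip((t * self.n_bins).astype(int), 0, self.n_bins - 1)

    def predict(self, f):
        return self.out[self._bin(f)]
```

At each boosting round, the feature whose learner has the smallest Z would be selected, and the sample weights updated as w ← w·exp(−y·h(x)) before renormalization.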
4 Feed-Forward Cascade Architecture
The Viola and Jones cascade structure containing S stages is illustrated in Fig. 2, where A_i denotes an AdaBoost or Real AdaBoost classifier in the i-th stage. In this cascaded structure, negative image blocks that do not contain humans can be discarded in the early stages of the cascade. Only the blocks passing all the stages are deemed positive (i.e., the ones containing humans).
Fig. 1. Two types of weak classifiers: (a) binary-valued weak classifier and (b) real-valued weak classifier

Fig. 2. Viola and Jones cascade structure. To learn each stage, negative images are randomly selected from the bootstrap set, as shown by the dashed arrows.
A characteristic of the cascading approach is that the decision times for negative and positive blocks are unequal: the former take little time, while the latter take much more. To find an object of unknown position and size in an image, one usually has to search blocks at all possible sites and scales in the image. In this case, since the negative blocks to be verified in an image usually far outnumber the positive blocks, saving on the decision time of the negative blocks increases the overall efficiency of the object detector. To train such a cascaded structure, we usually set a goal for each stage. The later the stage, the more difficult the goal. For example, consider the situation in which the first stage is designed so that 99.9% of positive examples are accepted and 50% of negative examples are rejected. Then, in the second stage, the positive examples remain the same, but the negative examples include those from the bootstrap set not successfully rejected by the first stage. If we set the goal of the second stage again as accepting 99.9% of positive examples and rejecting 50% of negative examples, and repeat the procedure for the later stages, the accepting rate of positive examples and the rejecting rate of negative examples after the i-th stage are (99.9%)^i and 1 − (50%)^i, respectively, on the training data. In each stage, the Real AdaBoost algorithm introduced in Section 3.2 can be used to select a set of weak learners from the feature pool to achieve the goal. Since more difficult negative examples are sent to the later stages, it usually happens that more weak learners have to be chosen to fulfill the goal in the later stages.
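The cumulative rates quoted above follow from simple arithmetic; a small sketch, using the per-stage goals of the example (any other setting is an assumption):

```python
def cascade_rates(n_stages, stage_detection=0.999, stage_rejection=0.5):
    """Cumulative positive-accept and negative-reject rates after i stages."""
    return [(stage_detection ** i, 1.0 - (1.0 - stage_rejection) ** i)
            for i in range(1, n_stages + 1)]

# For eight stages at (99.9%, 50%) per stage:
# positives kept ~ 0.999**8 ~ 0.992, negatives rejected ~ 1 - 0.5**8 ~ 0.996.
```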
The degree of prediction accuracy in each stage is evaluated by a confidence score. A high confidence value implies an accurate prediction. Each stage learns its own threshold to accept or reject an image block, as shown in Fig. 3(a). In the Viola and Jones structure, the confidence value is discarded in subsequent stages. That is, once the confidence value is used to make a binary decision (yes or no) in the current stage, it is no longer used in the later stages. This means that the stages are independent of each other and no cross-stage references are allowed. Nevertheless, exploiting the inter-stage information can boost the classification performance further. This is because, by composing the confidence values of multiple stages (say, d stages) into a vector and making a decision in the d-dimensional space, the classification boundaries considered are no longer restricted to hyper-planes parallel to the axes of the stages (as shown in Fig. 3(a)), but can be hyper-planes (or surfaces) of general form. A two-dimensional case is illustrated in Fig. 3(b). One possible way to exploit the inter-stage information is to delay the decision making of all S stages in the cascade and perform a post-classification in the S-dimensional space to make a single final decision. However, making a decision after gathering all the confidence scores would considerably decrease the detection efficiency, since there would be no chance to jump out of the cascade early. In this paper, we propose a novel approach that exploits the inter-stage information while preserving the early jump-out effect.

4.1 Adding Meta-stages
Our method is based on adding meta-stages to the original boosted cascade, as shown in Fig. 4. A meta-stage is a classifier that uses the inter-stage information (represented by the confidence scores) of some of the previous stages for learning. Like an AdaBoost stage, a meta-stage is also designed with a goal to accept and reject pre-defined ratios of positive and negative examples, respectively, and its prediction accuracy is also measured by the confidence score of the classification method adopted for the meta-stage. In our approach, the meta-stages and the AdaBoost stages are arranged in the cascade as AAMAMAM. . . AM, where 'A' and 'M' denote the AdaBoost stages and meta-stages, respectively, as shown in Fig. 4. In this case, each meta-stage is a classifier in a two-dimensional space. The input vector of the first meta-stage M_1 is the two-dimensional vector (C(A_0), C(A_1)), where C(A_i) is the confidence score of the i-th AdaBoost stage. The input vector of each other meta-stage M_i (i = 2, . . . , H) is also a two-dimensional vector, (C(M_{i−1}), C(A_i)), consisting of the confidence values of the two closest previous stages in the cascade, where C(M_i) is the confidence score of the i-th meta-stage. The meta-stage introduced above is light-weight in computation since only a two-dimensional classification is performed. However, it can help us further reject negative examples during training of the entire cascade. In our implementation, we usually set the goal of the meta-stage as allowing all the positive training examples to be correctly classified, and finding the classifier with the highest rejection rate of the negative training examples under this condition.
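To make the data flow of the AAMAM. . . AM arrangement concrete, the following is a hedged sketch of how one detection window would be evaluated; the stage objects, their confidence()/threshold interface, and the early-exit logic are hypothetical, not the authors' implementation.

```python
def evaluate_window(x, ada_stages, meta_stages):
    """Feed-forward cascade A A M A M ... A M; returns True if window x is accepted.

    ada_stages: [A0, A1, ..., AH]; meta_stages: [M1, ..., MH].
    Each stage is assumed to expose confidence(...) and a learned threshold.
    """
    c_prev = ada_stages[0].confidence(x)            # C(A0)
    if c_prev < ada_stages[0].threshold:
        return False
    c_a = ada_stages[1].confidence(x)               # C(A1)
    if c_a < ada_stages[1].threshold:
        return False
    for m, a in zip(meta_stages, ada_stages[2:] + [None]):
        c_prev = m.confidence((c_prev, c_a))        # 2-D input: (C(A0) or C(M_{i-1}), C(A_i))
        if c_prev < m.threshold:
            return False
        if a is None:                               # the cascade ends with a meta-stage
            break
        c_a = a.confidence(x)
        if c_a < a.threshold:
            return False
    return True
```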
Fig. 3. Triangles and circles are negative and positive examples shown in the data space. (a) The data space is separated into object (POS) and non-object (NEG) regions by thresholds th_i and th_{i+1} in stages i and i + 1. (b) The inter-stage information of stages i and i + 1 can be used to learn a new classification boundary, as shown by the green line.
Fig. 4. Feed-forward cascade structure
This criterion does not influence the decisions of the previous AdaBoost classifiers about the positive data, but it helps reject more of the negative data. In our experience, by adding the meta-stages, the total number of required AdaBoost stages can be reduced when the same goals are to be fulfilled.
4.2 Meta-stage Classifier
The classification method used in the meta-stage can be arbitrary. In our work, we choose the linear SVM as the meta-stage classifier due to its high generalization ability and its efficiency in evaluation. To train the meta-stage classifier, 3-fold cross-validation is applied to select the best penalty parameter C of the linear SVM. Then, a maximum-margin hyperplane that separates the positive and negative training data can be learned. To achieve the goal of the meta-stage, we move the hyperplane along its normal direction by applying different thresholds, and find the one with the highest rejection rate for the negative training data (under the condition that no positive examples are falsely rejected). Note that, even though a two-dimensional classifier is used, each meta-stage inherently contains the confidence of all the previous stages. This is because a meta-stage (except the first one) employs the confidence value of its closest previous meta-stage as one of its inputs. Thus, information from the previous stages is iteratively fed forward to the later meta-stages.
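Shifting the learned hyperplane along its normal amounts to choosing a new decision threshold on the signed SVM scores; a minimal sketch of that selection criterion (not the authors' code), assuming scores_pos and scores_neg hold the training scores w·x + b:

```python
import numpy as np

def tune_meta_threshold(scores_pos, scores_neg):
    """Largest threshold that still accepts every positive training example.

    Returns the threshold and the fraction of negative examples rejected by it.
    """
    threshold = float(np.min(scores_pos))       # accept rule: score >= threshold
    rejected = float(np.mean(scores_neg < threshold))
    return threshold, rejected
```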
5 Experimental Result
To evaluate the proposed cascade structure, a challenging pedestrian data set, the INRIA person data set [3], is adopted in our experiments. This data set contains standing people with different orientations and poses, in front of varied cluttered backgrounds. The resolution of the human images is 64 × 128 pixels. Within a 64 × 128 detection block, the feature pool contains 22477 features (6916 rectangle features and 15561 EOH features) for learning the AdaBoost stages, and the domain of the feature value is evenly divided into 10 disjoint bins for each feature in the Real AdaBoost algorithm. The edge orientation is evenly divided into 9 bins over 0° to 180° to calculate the EOH features. A bootstrap set with 3373860 negative images is generated by selecting sub-images from the non-pedestrian training images at different positions and scales. We refer to the method presented in Section 3 as the ErR-cascade method, since it employs the EOH and rectangle features in the Real AdaBoost algorithm for human detection. The method to which the meta-stages are further added (as illustrated in Fig. 4) is referred to as the ErRMeta-cascade method. In the ErRMeta-cascade method, all meta-stages are two-dimensional classifiers, and the linear SVM is adopted as the meta-stage learner. The meta-stages can thus be computed very fast, since only a two-dimensional inner product is needed. We use the same number of positive and negative examples for training each stage of the cascade: the data set provides 2416 positive training examples, and we randomly select 2416 negative images from the bootstrap set as the negative training data. In training each AdaBoost stage, we keep adding weak learners until the predefined goals are achieved. In our experiments, we require that at least 99.95% of positive examples are accepted and at least 50% of negative examples are rejected in each AdaBoost stage. For meta-stages, we only require that all the positive examples be accepted and find the classifier with the highest negative-example rejection rate. If the false positive rate of the cascade falls below 0.5%, the cascade stops learning new stages. We also implemented the Dalal and Triggs method [3] (referred to as the HOG-LSVM method). First, we compare the performance of the ErR-cascade and the HOG-LSVM methods. After training, there are eight AdaBoost stages with 285 weak classifiers in the ErR-cascade, as shown in Fig. 5(a). For a 320 × 240 image (containing 2770 detection blocks), the average processing speeds of the HOG-LSVM and the ErR-cascade are 1.21 and 9.48 fps (frames per second), respectively, on a PC with a 3.4 GHz processor and 2.5 GB memory. Since the HOG-LSVM uses a 3780-dimensional feature, that method is time-consuming. As for the detection results, the ROC curves of the two methods are shown in Fig. 6(a). From the ROC curves, the detection result of the ErR-cascade is overall better than that of the HOG-LSVM method. The introduced ErR-cascade method thus greatly improves the detection speed and also slightly increases the detection accuracy compared with the HOG-LSVM method. Then, we compare these methods with the method with meta-stages. All the goal settings of the AdaBoost stages are the same as those of the ErR-cascade. After training, there are seven AdaBoost stages with 258 weak classifiers, as shown in Fig. 5(b), and six meta-stages.
Fig. 5. The number of weak classifiers learned in each AdaBoost stage of the ErR-cascade method (a) and the ErRMeta-cascade method (b)

Fig. 6. (a) The ROC curves of the HOG-LSVM method and the ErR-cascade method. (b) The ROC curves of the ErR-cascade method and the ErRMeta-cascade method.

Fig. 7. Experimental results of the ErRMeta-cascade method
For a 320 × 240 image, the average processing speed is 10.13 fps. Compared with the ErR-cascade method, the trained cascade contains fewer weak learners, and some non-pedestrian blocks can be rejected early by the meta-stages with less computation. The ROC curves of the ErR-cascade and ErRMeta-cascade are shown in Fig. 6(b). The results demonstrate that, by adding the meta-stages, both detection speed and accuracy can be further improved. Some results are shown in Fig. 7.
6 Conclusion
A novel cascaded structure for pedestrian detection is presented in this paper, which consists of AdaBoost stages and meta-stages. In our approach, the 1-D edge-based EOH feature is employed for weak-learner selection, and the Real AdaBoost algorithm is used as the AdaBoost-stage classifier to make the weak learners more discriminative. As for the meta-stages, the inter-stage information of the previous stages is composed into a vector for learning an SVM hyperplane, so that negative examples can be further rejected. Based on the experimental results, our approach is practically useful since it can detect pedestrians with both efficiency and accuracy. Although the cascade type AAMAMAM. . . AM is used here, our approach can be generalized to other ways of composing the AdaBoost stages and the meta-stages. In the future, we plan to apply our method to other object detection problems, such as faces, vehicles, and motorcycles.

Acknowledgments. This work was supported in part under Grant NSC962422-H-001-001.
References

1. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: IEEE CVPR, vol. 1, pp. 511–518 (2001)
2. Viola, P., Jones, M., Snow, D.: Detecting pedestrians using patterns of motion and appearance. In: IEEE ICCV, vol. 2, pp. 734–741 (2003)
3. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: IEEE CVPR, vol. 1, pp. 886–893 (2005)
4. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV 60(2), 91–110 (2004)
5. Zhu, Q., Yeh, M.C., Cheng, K.T., Avidan, S.: Fast human detection using a cascade of histograms of oriented gradients. In: IEEE CVPR, vol. 2, pp. 1491–1498 (2006)
6. Gavrila, D., Philomin, V.: Real-time object detection for "smart" vehicles. In: IEEE ICCV, vol. 1, pp. 87–93 (1999)
7. Papageorgiou, C., Poggio, T.: A trainable system for object detection. IJCV 38(1), 15–33 (2000)
8. Dalal, N., Triggs, B., Schmid, C.: Human detection using oriented histograms of flow and appearance. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, Springer, Heidelberg (2006)
9. Mohan, A., Papageorgiou, C., Poggio, T.: Example-based object detection in images by components. IEEE PAMI 23(4), 349–361 (2001)
10. Mikolajczyk, K., Schmid, C., Zisserman, A.: Human detection based on a probabilistic assembly of robust part detectors. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, pp. 69–82. Springer, Heidelberg (2004)
11. Levi, K., Weiss, Y.: Learning object detection from a small number of examples: the importance of good features. In: IEEE CVPR, vol. 2, pp. 53–60 (2004)
12. Porikli, F.: Integral histogram: a fast way to extract histograms in cartesian spaces. In: IEEE CVPR, vol. 1, pp. 829–836 (2005)
13. Schapire, R.E., Singer, Y.: Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3), 297–336 (1999)
Combined Object Detection and Segmentation by Using Space-Time Patches

Yasuhiro Murai^1, Hironobu Fujiyoshi^1, and Takeo Kanade^2

1 Dept. of Computer Science, Chubu University, Matsumoto 1200, Kasugai, Aichi, 487-8501 Japan
[email protected], [email protected]
http://www.vision.cs.chubu.ac.jp/
2 The Robotics Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania, 15213-3890 USA
[email protected]
Abstract. This paper presents a method for classifying the direction of movement and for segmenting objects simultaneously using features of space-time patches. Our approach uses vector quantization to classify the direction of movement of an object and to estimate its centroid by referring to a codebook of space-time patch features, which is generated from multiple learning samples. We segment the objects' regions based on a probability calculated from the mask images of the learning samples, using the estimated centroid of the object. Even when occlusions occur because multiple objects moving in different directions overlap, our method detects the objects individually, because their directions of movement are classified. Experimental results show that object detection is more accurate with our method than with the conventional method, which is based on appearance features only.
1 Introduction
Recent achievements in automatic object detection and segmentation have led to applications in robotics, visual surveillance, and ITS [1]. Motion- and part-based approaches have previously been proposed to detect and estimate the positions of objects moving in images. Optical flow, which quantifies the movement of objects as vector data, has previously been proposed [2]. However, dense, unconstrained, and non-rigid motion estimation using optical flow is noisy and unreliable, so estimating the movement of objects by optical flow is difficult. Shechtman et al. [3] proposed a method for detecting similar motions in video streams, despite differences in appearance due to clothing, background, and illumination, by using space-time patches. For short, we refer to a space-time patch as an ST-patch. Niebles et al. [4] proposed a method for categorizing human actions by gathering information from space-time interest points. The part-based approach with local features has been used to categorize unknown objects in difficult real-world images. Agarwal et al. [5] proposed an approach that uses an automatically acquired, sparse, part-based representation
of objects to learn a classifier that can accurately detect occurrences of a category of objects in a static image. Leibe et al. [6,7] proposed a method for categorizing and segmenting objects by estimating the centroids of objects from image patches, which are extracted from a test image, and the corresponding appearance codebook. Moreover, a method for object categorization using object boundary fragments and their relation to the centroid [8], a people detection algorithm using a dense grid of Histograms of Oriented Gradients (HOG) [9], and a face detection system using patterns of appearance obtained by Haar-like features [10] have been proposed. Thus, many recent studies have also used the part-based approach. These approaches have the advantage that they can detect an object even when part of it is occluded. However, it is difficult for them to segment multiple overlapping objects individually, such as pedestrians who are walking in different directions. We developed a method, based on the part-based approach, that uses spatio-temporal features to simultaneously classify the direction of movement and segment the objects. Our approach classifies the direction of movement of an object by using ST-patch features [3] and estimates the position of the centroid of the object based on its direction of motion. The object is segmented by using the estimated position of its centroid and the mask images stored with the learning samples of the ST-patch features.
2 ST-Patch
Our approach classifies the direction of movement of objects by using ST-patch features. When we observe two movements, such as a pedestrian walking to the right and another walking to the left, we obtain different ST-patch features. Therefore, we can generate a codebook based on the different motions of the ST-patch features. In this section, we describe the ST-patch features used to classify the direction of movement of an object, and we describe a method for generating a codebook from the ST-patch features extracted from learning samples.
2.1 Overview of the ST-Patch
The ST-patch features are extracted from a small domain of a spatio-temporal image, i.e., the 3-dimensional data obtained by extending the image in the direction of time. Fig. 1 shows an overview of the ST-patch. Three color lines represent the motion of each pixel, where [u v w]^T is a space-time direction vector in the ST-patch and ∇P_i represents the space-time gradients.
2.2 ST-Patch Features
A locally uniform motion induces parallel lines (see the zoomed-in part of Fig. 1) within the ST-patch P. All the color lines within a single ST-patch are oriented in the space-time direction [u v w]^T. The orientation of [u v w]^T can be different for different points.
Fig. 1. Overview of the ST-patch
It is assumed to be uniform locally, within a small ST-patch P in the video stream. By examining the space-time gradients ∇P_i = (P_{x_i}, P_{y_i}, P_{t_i}) of the intensity at each pixel within the ST-patch P (i = 1, · · ·, n), we find that these gradients all point in the direction of maximum change of the space-time intensity. Namely, these gradients are all perpendicular to the direction [u v w]^T of the color lines:

$$\nabla P_i \begin{bmatrix} u \\ v \\ w \end{bmatrix} = 0. \qquad (1)$$

Stacking these equations for all n pixels within the small ST-patch P, we obtain:

$$\begin{bmatrix} P_{x_1} & P_{y_1} & P_{t_1} \\ P_{x_2} & P_{y_2} & P_{t_2} \\ \vdots & \vdots & \vdots \\ P_{x_n} & P_{y_n} & P_{t_n} \end{bmatrix}_{n \times 3} \begin{bmatrix} u \\ v \\ w \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}_{n \times 1}, \qquad (2)$$

where n is the number of pixels in P; we denote the n × 3 matrix by G. Multiplying both sides of Eq. (2) by G^T (the transpose of the gradient matrix G) yields:

$$G^T G \begin{bmatrix} u \\ v \\ w \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}_{3 \times 1}. \qquad (3)$$

G^T G is a 3 × 3 matrix. We denote it by M:

$$M = G^T G = \begin{bmatrix} \sum P_x^2 & \sum P_x P_y & \sum P_x P_t \\ \sum P_y P_x & \sum P_y^2 & \sum P_y P_t \\ \sum P_t P_x & \sum P_t P_y & \sum P_t^2 \end{bmatrix}. \qquad (4)$$
The matrix M contains information about the appearance and motion of the ST-patch. This matrix can be represented as a 9-dimensional vector e as follows:

$$e = \left( \sum P_x^2,\ \sum P_x P_y,\ \cdots,\ \sum P_t^2 \right). \qquad (5)$$
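A compact sketch of Eqs. (4)–(5) for a single patch is given below; the derivative filters (np.gradient, i.e., simple finite differences) and the patch layout are our assumptions, since the exact implementation is not specified in the paper.

```python
import numpy as np

def st_patch_feature(patch):
    """9-D ST-patch feature e = vec(G^T G) for a small space-time patch.

    patch: array of shape (T, H, W), e.g. 3 x 15 x 15 gray-level values.
    """
    p = patch.astype(np.float64)
    pt, py, px = np.gradient(p)                                   # gradients along t, y, x
    G = np.stack([px.ravel(), py.ravel(), pt.ravel()], axis=1)    # n x 3 gradient matrix
    M = G.T @ G                                                   # 3 x 3 matrix of Eq. (4)
    return M.ravel()                                              # 9-D vector e of Eq. (5)
```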
2.3 The Codebook of the ST-Patch Features
To generate a codebook of ST-patch features for classifying the direction of movement and for segmenting objects, we use the LBG algorithm [11]. The LBG algorithm is a method for clustering features and generating a codebook. Using the LBG algorithm, the feature vectors of the learning samples can be clustered into a group of N representative vectors. Learning samples in which pedestrians or vehicles move to the right and to the left in the image were used to generate the codebook of ST-patch features. The following steps describe the flow of codebook generation.

Step 1. ST-patch features are extracted from multiple learning samples.
Step 2. The ST-patch features are labeled based on their direction of movement o_d = {right, left, other}. Moreover, the position of the centroid and the mask image of the object are stored with each learning sample of the ST-patch feature.
Step 3. A codebook is created by clustering into N groups with the LBG algorithm.
Step 4. The probability of the direction of movement p(o_d | I) of each codebook cluster I is calculated.

When the codebook of ST-patch features is created with the LBG algorithm, not all labels belonging to a codebook cluster are the same. However, within a codebook cluster, the proportion of samples sharing the same label is high. The probability of the direction of movement p(o_d | I) of codebook cluster I is therefore calculated from the number of labels belonging to that cluster. The positions of the centroids of the learning samples and the mask images are used for estimating the centroids of objects and for segmenting the objects' regions.
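A compact sketch of LBG-style codebook generation (Step 3) is shown below; the splitting perturbation, the fixed number of refinement iterations, and the simplified convergence handling are our assumptions, not details from [11] or from the authors' implementation.

```python
import numpy as np

def lbg_codebook(features, target_size=512, eps=1e-3, n_iter=20):
    """Grow a codebook by repeatedly splitting codewords and refining them (LBG [11]).

    features: (n_samples, 9) array of ST-patch feature vectors.
    """
    codebook = features.mean(axis=0, keepdims=True)
    while codebook.shape[0] < target_size:
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])  # split step
        for _ in range(n_iter):                                             # Lloyd refinement
            d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            assign = d2.argmin(axis=1)
            for k in range(codebook.shape[0]):
                members = features[assign == k]
                if len(members) > 0:
                    codebook[k] = members.mean(axis=0)
    return codebook
```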
3 Classifying Direction of Movement and Segmenting Regions of Objects
We quantized the vectors of the ST-patch features acquired from an input image using the codebook of ST-patch features. We estimated the position of the centroid of the object by voting for centroid positions based on the classification of the direction of movement and by sampling the ST-patch features. Then, we classified the direction of movement of the object. The flow of the proposed method is illustrated in Fig. 2.
3.1 Vector Quantization of the ST-Patch Features
The vector quantization of the ST-patch features was performed using the codebook generated in advance. The flow of vector quantization is shown below.
Fig. 2. Flow of the proposed method
Step 1. An image patch is obtained by downsampling the image, and the ST-patch features are extracted from this patch (Fig. 3(a)).
Step 2. Vector quantization is performed on the ST-patch features (Fig. 3(b)). The Euclidean distance between the vector of the input ST-patch features e and each codebook cluster c is calculated, and the codebook cluster I with the minimum Euclidean distance is selected by Eq. (6):

$$I = \arg\min_{c} \| e - c \|^2. \qquad (6)$$

Step 3. The size of the patch is changed to handle changes in scale.
Step 4. Steps 1–3 are repeated over a raster scan of the image.

Thus, we can respond to the scale of an object by changing the size of the patch.
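A sketch of Steps 1–4, i.e., quantizing densely sampled patches over a raster scan at several scales, is given below; the stride, the scale set, and the reuse of the st_patch_feature() sketch from Section 2.2 are our assumptions.

```python
import numpy as np

def quantize(e, codebook):
    """Index I of the nearest codebook cluster (Eq. (6))."""
    return int(np.argmin(np.sum((codebook - e) ** 2, axis=1)))

def scan_frames(volume, codebook, patch_size=15, n_frames=3, step=5, scales=(1.0, 1.5, 2.0)):
    """Raster-scan a space-time volume (T, H, W) at several patch scales.

    Returns (cluster index I, top-left location (y, x), scale) triples.
    st_patch_feature() refers to the sketch given with Eqs. (4)-(5).
    """
    detections = []
    T, H, W = volume.shape
    for s in scales:
        size = int(round(patch_size * s))
        for y in range(0, H - size + 1, step):
            for x in range(0, W - size + 1, step):
                patch = volume[:n_frames, y:y + size, x:x + size]
                I = quantize(st_patch_feature(patch), codebook)
                detections.append((I, (y, x), s))
    return detections
```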
3.2 Estimating Position of Centroid of Object
We estimated the position of the centroid of the object by voting based on the classification of the direction of movement obtained from the vector quantization of the input ST-patch features and from the learning samples.

Voting on Centroid Position. To estimate the position of the centroid of the object, we vote for centroid positions [6,7]. Let e be our evidence, an extracted ST-patch observed at location l. By matching it to our codebook, we obtain a valid interpretation I. The interpretation is weighted with probability p(I | e, l). Here, we use the relative matching score of a codebook cluster I and the ST-patch feature e for p(o_d, x | I, l). If a codebook cluster matches, it can cast votes for different object positions. That is, for learning samples belonging to a codebook cluster I, we obtain votes for several directions of movement o_d and positions x, which we weight with p(o_d, x | I, l). Formally, this can be expressed by the following marginalization:

$$p(o_d, x \mid e, l) = \sum_{I} p(o_d, x \mid e, I, l)\, p(I \mid e, l). \qquad (7)$$
Since we have replaced the unknown ST-patch by a known interpretation, the first term can be treated as independent of the ST-patch e. In addition, we match patches to the codebook independently of their location l. The equation thus reduces to:

$$p(o_d, x \mid e, l) = \sum_{I} p(o_d, x \mid I, l)\, p(I \mid e) \qquad (8)$$
$$= \sum_{I} p(x \mid o_d, I, l)\, p(o_d \mid I, l)\, p(I \mid e). \qquad (9)$$

Fig. 3. Estimating position of centroid of object
The first term is the probabilistic vote for an object position given its identity and the patch interpretation. The second term specifies the confidence that the codebook cluster really matches the direction of movement. The third term reflects the quality of the match between the ST-patch and the codebook cluster. Thus, the total number of votes for object o_d at location x in window W(x) is:

$$\mathrm{score}(o_d, x) = \sum_{k} \sum_{x_j \in W(x)} p(o_d, x_j \mid e_k, l_k). \qquad (10)$$
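The vote casting of Eqs. (7)–(10) can be implemented as a Hough-style accumulator per direction of movement; the sketch below assumes each codebook entry stores p(o_d | I) and the centroid offsets of its learning samples, a data layout that is our assumption rather than the paper's.

```python
import numpy as np
from collections import defaultdict

def cast_votes(patches, codebook, entries, accumulator_shape):
    """Accumulate centroid votes per direction of movement o_d.

    patches: iterable of (feature e, location (y, x), scale).
    entries[I]: dict with "p_dir" = {o_d: p(o_d | I)} and "offsets" = [(dy, dx), ...]
    collected from the learning samples of cluster I.
    """
    acc = defaultdict(lambda: np.zeros(accumulator_shape))       # one vote map per o_d
    for e, (y, x), scale in patches:
        I = int(np.argmin(np.sum((codebook - e) ** 2, axis=1)))  # hard assignment, Eq. (6)
        entry = entries[I]
        w_pos = 1.0 / max(len(entry["offsets"]), 1)              # spread p(x | o_d, I, l) evenly
        for o_d, p_dir in entry["p_dir"].items():
            for dy, dx in entry["offsets"]:
                cy, cx = int(y + scale * dy), int(x + scale * dx)
                if 0 <= cy < accumulator_shape[0] and 0 <= cx < accumulator_shape[1]:
                    acc[o_d][cy, cx] += p_dir * w_pos
    return acc
```

The resulting per-direction vote maps would then be searched for local maxima in (x, y, scale), as described next.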
Mean-Shift Clustering. We search for the positions of the points with the most votes (i.e., the local maxima) by using 3-dimensional (x-y-scale space) Mean-Shift clustering (Fig. 3(c)) [12]. Fig. 3 illustrates this procedure. Local maxima that converge under Mean-Shift clustering are integrated into one cluster by a nearest-neighbor clustering algorithm. When the total weight integrated around a local maximum is below a certain threshold, we reject it as an outlier (Fig. 3(d)). We can therefore remove the outliers among the voted points, and we can then estimate the position of the centroid of the object.
3.3 Segmenting Regions of Objects
We construct the regions of objects based on the number of voting points around the positions of the centroids. Fig. 4 shows the flow of segmenting the regions of objects.

Backprojection of the ST-Patch Features. We perform a backprojection of the ST-patch features, i.e., of the voted points around the position of the centroid of the object, and remove the outliers among the voted points.
Fig. 4. Segmenting regions of object
We can then select information from the reliable ST-patch features. The effect of a backprojected ST-patch e can be expressed as:

$$p(e, l \mid o_d, x) = \frac{p(o_d, x \mid e, l)\, p(e, l)}{p(o_d, x)} = \frac{p(o_d, x \mid I, l)\, p(I \mid e)\, p(e, l)}{p(o_d, x)}, \qquad (11)$$
where the patch votes p(o_d, x | e, l) are obtained from the codebook, as described in Eq. (8).

Estimating Region of Object. To segment the object, we now want to know whether a certain image pixel p is part of the object or of the background, given the backprojected ST-patch e. More precisely, we are interested in the probability p(p = obj. | o_d, x). Given the effect p(e, l | o_d, x), we can obtain information about a specific pixel as follows:

$$p(p = \mathrm{obj.} \mid o_d, x) = \sum_{\mathrm{num}} p(p = \mathrm{obj.} \mid o_d, x, e, l)\, p(e, l \mid o_d, x), \qquad (12)$$
where num is the number of backprojected ST-patches, and p(p = obj. | o_d, x, e, l) denotes patch-specific segmentation information, which is weighted by the effect p(e, l | o_d, x). Again, we can resolve patches by resorting to the learned patch interpretations I stored in the codebook:

$$p(p = \mathrm{obj.} \mid o_d, x) = \sum_{\mathrm{num}} p(p = \mathrm{obj.} \mid o_d, x, e, I, l)\, p(e, I, l \mid o_d, x)$$
$$= \sum_{\mathrm{num}} p(p = \mathrm{obj.} \mid o_d, x, I, l)\, \frac{p(o_d, x \mid I, l)\, p(I \mid e)\, p(e, l)}{p(o_d, x)}. \qquad (13)$$
Then, the segmentation information p(p = obj. | o_d, x, I, l) can be acquired from the mask images of the object stored with the learning samples. This means that for every pixel we calculate a weighted average over all segmentations stemming from the ST-patches. Therefore, we can calculate an object probability for each pixel. Here, an object probability below a certain threshold indicates a pixel in
the background, and an object probability above that threshold indicates a pixel in the object. We can therefore segment the objects' regions into rectangles by using the object probability of each pixel.
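Equation (13) amounts to pasting the learning-sample mask images at the locations of the backprojected patches and averaging them with their weights p(e, l | o_d, x); the sketch below shows that weighted average, with the normalization by the accumulated weight and the boundary clipping being our simplifications.

```python
import numpy as np

def object_probability_map(image_shape, locations, masks, weights):
    """Per-pixel object probability as a weighted average of mask patches.

    locations: patch top-left corners (y, x); masks: binary mask patches of the
    learning samples; weights: the corresponding p(e, l | o_d, x) values.
    """
    prob = np.zeros(image_shape, dtype=np.float64)
    norm = np.zeros(image_shape, dtype=np.float64)
    for (y, x), mask, w in zip(locations, masks, weights):
        h = min(mask.shape[0], image_shape[0] - y)
        ww = min(mask.shape[1], image_shape[1] - x)
        if h <= 0 or ww <= 0:
            continue
        prob[y:y + h, x:x + ww] += w * mask[:h, :ww]
        norm[y:y + h, x:x + ww] += w
    return np.where(norm > 0, prob / np.maximum(norm, 1e-12), 0.0)

# Thresholding the returned map labels object vs. background pixels; the bounding
# rectangle of the object pixels gives the segmented region described in the text.
```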
4 Experiment
This section describes the experimental results of the proposed method and of the conventional method [6], which uses appearance information only.
4.1 Experimental Overview
We extracted 10,198 ST-patch features from sequences of pedestrians walking toward the right, 10,220 ST-patch features from sequences of pedestrians walking toward the left, and 36,982 ST-patch features from the background. We also extracted 9,885 ST-patch features from sequences of vehicles moving toward the right, 9,968 ST-patch features from sequences of vehicles moving toward the left, and 20,047 ST-patch features from the background. Using the pedestrian and vehicle codebooks generated from the extracted ST-patch features, we classified the direction of movement and segmented the regions of the objects. In this experiment, the size of the ST-patch is 15 × 15 pixels × 3 frames, and the codebook size is 512 clusters. The test sequences were taken with a fixed camera at a location different from that where the learning samples were collected. The sequences include objects moving rightward and leftward, such as pedestrians and vehicles. The total number of frames in the test sequences is 23,097.
4.2 Experimental Results
Fig. 5 shows the detection and segmentation results of the conventional method and of our method. As shown in Fig. 5(a)-(d), the proposed method can classify the direction of movement and segment the regions of a pedestrian and a moving vehicle. In particular, separate objects can be segmented correctly even when multiple objects moving in different directions overlap, because our method segments objects' regions based on the classification of the direction of movement. As shown in Fig. 5(b), our method responds to the scale of an object. As shown in Fig. 5(d), a pedestrian whose body is partially occluded can be segmented with the object's region taken into account, because the region is estimated from the mask images of the learning samples. Moreover, as shown in Fig. 5(a), the proposed method detects multiple objects individually, without being affected by shadows. Table 1 shows the object detection results of our method and of the conventional method. Only frames in which an object exists in the image are set as detection targets. As shown in Table 1, the detection rate of our method is higher than that of the conventional method. Thus, because our method is based on classifying the direction of movement, the object detection rate is also better than that of the conventional method.
Fig. 5. Classifying direction of movement and segmenting the objects' regions

Table 1. Detection result

                        conventional method [6]   proposed method
  pedestrian sequence   64.3%                     74.7%
  vehicle sequence      70.7%                     93.3%
  average               67.3%                     84.0%

Fig. 6. Example of failure
From Fig. 6(a), it is difficult to estimate the position of the centroid when multiple objects move in the same direction, such as a group of pedestrians. This is why the segmentation fails. To solve this problem, we will add
more information about appearance to the 9-dimensional vector e in future work. Moreover, for moving objects (for example, a bus or a truck) that do not exist in the learning samples, as shown in Fig. 6(b), detection may also fail because such objects cannot be classified.
5 Conclusion
We developed a method for classifying the direction of movement and for segmenting objects simultaneously by using ST-patch features. Our method can segment objects even when occlusion occurs. Moreover, our method detects objects individually when multiple objects moving in different directions overlap, because the direction of movement is classified. Our future work will address overlapping objects moving in the same direction, and we will develop a method for identifying objects by adding more information about the object's appearance to the ST-patch features.
References

1. Fujiyoshi, H., Komura, T., Yairi, I.E., Kayama, K.: Road Observation and Information Providing System for Supporting Mobility of Pedestrian. In: IEEE International Conference on Computer Vision Systems, pp. 37–44. IEEE Computer Society Press, Los Alamitos (2006)
2. Horn, B.K.P., Schunck, B.G.: Determining optical flow. Artificial Intelligence 17, 185–203 (1981)
3. Shechtman, E., Irani, M.: Space-Time Behavior Based Correlation. Computer Vision and Pattern Recognition 1, 405–412 (2005)
4. Niebles, J.C., Wang, H., Fei-Fei, L.: Unsupervised learning of human action categories using spatial-temporal words. In: British Machine Vision Conference, vol. 3, pp. 1249–1258 (2006)
5. Agarwal, S., Roth, D.: Learning a Sparse Representation for Object Detection. In: European Conference on Computer Vision, pp. 113–130 (2002)
6. Leibe, B., Leonardis, A., Schiele, B.: Interleaved Object Categorization and Segmentation. In: British Machine Vision Conference, Norwich, pp. 759–768 (2003)
7. Leibe, B., Leonardis, A., Schiele, B.: Combined Object Categorization and Segmentation with an Implicit Shape Model. In: European Conference on Computer Vision, Prague, pp. 496–510 (2004)
8. Opelt, A., Pinz, A., Zisserman, A.: Incremental learning of object detectors using a visual shape alphabet. Computer Vision and Pattern Recognition 1, 3–10 (2006)
9. Dalal, N., Triggs, B.: Histograms of Oriented Gradients for Human Detection. IEEE Computer Vision and Pattern Recognition, 886–893 (2005)
10. Viola, P., Jones, M.: Rapid Object Detection using a Boosted Cascade of Simple Features. Computer Vision and Pattern Recognition 1, 511–519 (2001)
11. Linde, Y., Buzo, A., Gray, R.M.: An Algorithm for Vector Quantizer Design. IEEE Trans. on Communications 28(1), 84–95 (1980)
12. Comaniciu, D., Meer, P.: Mean Shift Analysis and Applications. International Conference on Computer Vision 2, 1197–1203 (1999)
Embedding a Region Merging Prior in Level Set Vector-Valued Image Segmentation

Ismail Ben Ayed^1 and Amar Mitiche^2

1 GE Healthcare, 268 Grosvenor, E5-137, London, ON, N6A 4V2, Canada
2 Institut national de la recherche scientifique, INRS-EMT, 800, de La Gauchetière Ouest, Montréal, QC, H5A 1K6, Canada
Abstract. In the scope of level set image segmentation, the number of regions is fixed beforehand. This number occurs as a constant in the objective functional and its optimization. In this study, we propose a region merging prior which optimizes the objective functional implicitly with respect to the number of regions. A statistical interpretation of the functional and learning over a set of relevant images and segmentation examples allow setting the weight of this prior to obtain the correct number of regions. This method is investigated and validated with color images and motion maps.
1 Introduction
Image segmentation by active contours/level sets leads to effective results, as several studies have shown [1] [3] [2] [5] [4] [6] [7] [8]. Current methods assume that the number of regions is given beforehand. It occurs as a constant in the objective functional and its optimization [1] [3] [2] [5] [4] [6] [7] [8]. A few investigations have proposed to estimate the number of regions automatically, but as a process external to the functional optimization [9] [10] [11]. In [9], a preliminary stage based on hierarchical level set splitting is used prior to a classical functional minimization with a fixed number of regions. In [10] [11], local region merging is alternated with curve evolution. Apart from their computational cost, these methods are subject to the well-known limitations of local region splitting/merging operations: (1) dependence on several ad hoc parameters and on the order of local operations [10] [14], (2) need for additional variables for local neighborhood search [10] [11], and (3) sensitivity to the initial conditions [10]. The purpose of this study is to vary the effective number of regions within level set optimization. A region merging prior is proposed for this purpose. This prior favors region merging. Used in conjunction with a data term which measures the conformity of the vector-valued image data in each region to the piecewise constant segmentation model [3] and a length-related term for smooth region boundaries, this prior allows the objective functional to be optimized implicitly with respect to the number of regions. A maximum number of regions is used in the definition of the segmentation functional. The effective number of regions, equal to the maximum number of regions initially, decreases implicitly during curve evolution
to be, ideally, the desired number of regions. The functional minimization is carried out using the partition-constrained minimization scheme developed in [7] [8]. A coefficient must be assigned to the region merging prior in order to balance its contribution with respect to the other functional terms. This coefficient will, of course, affect the number of regions obtained at convergence. We will show that we can determine systematically an interval of values of this coefficient that yields the desired number of regions. This is possible via a statistical interpretation of the coefficient over a set of relevant images and segmentation examples. The method is investigated and validated with color images and motion maps.
2 Segmentation into a Fixed Number of Regions
Consider a vector-valued image I : Ω ⊂ R^2 → R^L represented by L images I^l : Ω ⊂ R^2 → R^+ (l ∈ [1, .., L]) and a partition R = {R_k}_{k∈[1,N]} of Ω defined by a family of simple closed plane curves γ_k(s) : [0, 1] → Ω, k = 1, . . . , N − 1. For each k, region R_k corresponds to the interior R_{γ_k} of curve γ_k: R_k = R_{γ_k}. Let R_N = ∩_{k=1}^{N−1} R_k^c. Level set segmentation of an image I is commonly stated as determining a partition R which minimizes a functional containing two terms: a data term which measures the conformity of the data within each region to a parametric model and a regularization term for smooth segmentation boundaries [1] [2] [3] [4] [5] [6] [7] [8] [11]. Following the piecewise constant model [3], and with a regularization term, multiregion active curve segmentation consists of determining the curves γ_k, k = 1, . . . , N − 1, that minimize the following functional:

$$F(\{\gamma_k\}_{k=1}^{N-1}) = \sum_{k=1}^{N} \int_{R_k} \sum_{l=1}^{L} \| I^l - \mu_k^l \|^2 + \lambda \sum_{k=1}^{N-1} \oint_{\gamma_k} ds, \qquad (1)$$

where μ_k^l is the mean intensity of image I^l in segmentation region k (l ∈ [1, .., L], k ∈ [1, .., N]), and λ is a positive real constant weighing the relative contribution of the two terms of the functional. Let us first consider a simple example that illustrates the usefulness of adding a region merging term to this type of functional. Consider two non-intersecting regions R_1 and R_2 of a partition R. We have [12]:

$$\int_{R_1} \sum_{l=1}^{L} \|I^l - \mu_1^l\|^2 + \int_{R_2} \sum_{l=1}^{L} \|I^l - \mu_2^l\|^2 \le \int_{R_1 \cup R_2} \sum_{l=1}^{L} \|I^l - \mu_{1,2}^l\|^2, \qquad \lambda(\partial R_1 + \partial R_2) = \lambda\, \partial(R_1 \cup R_2), \qquad (2)$$

where μ_{1,2}^l is the mean of R_1 ∪ R_2 for image I^l, and ∂R is the boundary of R. Consequently, the minimization of (1) does not favor merging R_1 and R_2, even when μ_1^l = μ_2^l, ∀l ∈ [1, .., L]. As we will show in the experiments, model (1) may result in an over-segmentation when N is greater than the actual number of regions. An additional term in (1) which can merge regions, such as when μ_1^l = μ_2^l, ∀l ∈ [1, .., L], would be useful. In this study, we propose and investigate such a prior.
3 A Region Merging Prior
A region merging prior P_RM is a function from the set of partitions of Ω to R. This function must satisfy the following condition: for each partition R = {R_k}_{k=1}^N of Ω, and for each subset J of [1..N],

$$P_{RM}\big(\{\cup_{j\in J} R_j,\ \{R_k\}_{k\in[1..N],\, k\notin J}\}\big) < P_{RM}\big(\{R_k\}_{k=1}^{N}\big). \qquad (3)$$

This condition means that any region merging must decrease the prior term. We propose the following prior:

$$P_{RM}\big(\{R_k\}_{k=1}^{N}\big) = -\beta \sum_{k=1}^{N} a_k \log a_k, \qquad (4)$$

where a_k is the area of region R_k, and β is a positive real constant weighing the relative contribution of the region merging term in the segmentation functional. As we will see in Section 4.2, the logarithmic form of this prior has an interesting property which leads to a statistical interpretation that allows us to fix the weight of the region merging term systematically. This prior satisfies condition (3).
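A small sketch evaluating the prior of Eq. (4) on a toy partition, and checking the merging condition (3) numerically (the areas used are arbitrary examples):

```python
import numpy as np

def region_merging_prior(areas, beta=1.0):
    """P_RM({R_k}) = -beta * sum_k a_k log a_k  (Eq. (4)); areas are in pixels."""
    a = np.asarray(areas, dtype=np.float64)
    return -beta * float(np.sum(a * np.log(a)))

# Merging two regions always lowers the prior (condition (3)):
areas = [1000.0, 2500.0, 6500.0]
merged = [1000.0 + 2500.0, 6500.0]
assert region_merging_prior(merged) < region_merging_prior(areas)
```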
4 Segmentation Functional
Let N be the maximum number of regions, i.e., a number such that the actual number of regions is less than or equal to N. Such a number is available in most applications. With the region merging prior (4), the functional for segmentation into a number of regions less than N is:

$$F_{RM}(\{\gamma_k\}_{k=1}^{N-1}) = \underbrace{\sum_{k=1}^{N}\int_{R_k}\sum_{l=1}^{L}\|I^l-\mu_k^l\|^2}_{\text{data term}} \ \underbrace{-\,\beta\sum_{k=1}^{N} a_k\log a_k}_{\text{region merging prior}} \ + \ \underbrace{\lambda\sum_{k=1}^{N-1}\oint_{\gamma_k} ds}_{\text{regularization}}. \qquad (5)$$

In Section 4.1, we will show how the effective number of active curves can decrease as a result of the region merging prior. We minimize F_RM with respect to the curves γ_k, k = 1, .., N − 1, by embedding these into a family of one-parameter curves γ_k(s, t) : [0, 1] × R^+ → Ω and solving the partial differential equations:

$$\frac{d\gamma_k}{dt} = -\frac{\partial F_{RM}}{\partial \gamma_k}, \qquad k = 1, .., N-1. \qquad (6)$$
We use the multiregion minimization scheme developed in [7] [8]. This scheme has several advantages over others: it is fast, stepwise optimal, and robust to initialization [7] [8]. It embeds an efficient partition constraint directly in the curve/level set evolution equations. At each iteration, the scheme involves only two regions for each pixel x: a region R_i which currently contains x, and a region R_j, j ≠ i, which corresponds to the largest decrease in the functional were x transferred to this region (refer to [7] [8] for details). For a level set implementation of curve evolution, we represent each curve γ_k implicitly by the zero level set of a function u_k : R^2 → R, with the region inside γ_k corresponding to u_k > 0. The level set equations minimizing F_RM are given, at each x ∈ Ω, by:

$$\frac{\partial u_i}{\partial t} = \Big(-\lambda\kappa_{u_i} + \underbrace{\beta\,(\log a_i - \log a_j)}_{\text{region merging}} - \underbrace{\sum_{l=1}^{L}\big(\|I^l(x)-\mu_i^l\|^2 - \|I^l(x)-\mu_j^l\|^2\big)}_{\text{region competition}}\Big)\,\|\nabla u_i\|$$
$$\frac{\partial u_j}{\partial t} = \Big(-\lambda\kappa_{u_j} + \beta\,(\log a_j - \log a_i) - \sum_{l=1}^{L}\big(\|I^l(x)-\mu_j^l\|^2 - \|I^l(x)-\mu_i^l\|^2\big)\Big)\,\|\nabla u_j\|, \qquad (7)$$

where κ_{u_k} is the curvature of the zero level set of u_k, i ∈ [1..N] is the index of the region currently containing x, and j is given by:

$$j = \arg\min_{\{k\in[1..N],\; x\notin R_k\}} \Big(-\beta\, a_k \log a_k + \sum_{l=1}^{L}\|I^l(x)-\mu_k^l\|^2\Big). \qquad (8)$$
The level set equations (7) show how region merging occurs: When two disjoint regions Rγ i and Rγ j have close intensities in each image I l (l ∈ [1, .., L]), the L velocity resulting from the data term ( l=1 I l − μlj 2 − I l − μli 2 ) is weak. Ignoring the curvature term, evolution of curves γ i and γ j is guided principally by the region merging prior velocity. As ui increases and uj decreases under the effect of (logai − logaj ), this velocity expands the region with the larger area, and shrinks the other region until only one curve encloses both regions and the other curve disappears. 4.2
How to Fix the Weighting Parameter β
On the one hand, the data term increases N when regions are merged. On the other hand, the region merging term, − k=1 ak logak , decreases when regions are merged. The role of β is to balance the contribution of the region merging term against the other terms so as to, ideally, correspond to the actual number of regions. The weighting parameter β can be viewed as a unit conversion factor between the units of the region merging and the data terms. Therefore, and considering the form of these terms, we can take: L I l − μl 2 (9) β = α Ω l=1 A.logA
Embedding a Region Merging Prior
929
where μl is the mean intensity over the whole image I l (l ∈ [1, .., L]), A is the image domain area, and α is a constant without unit. Using expression (9), we rewrite the sum of the data term and the region merging prior as follows: N N L L l l 2 k=1 ak logak I − μk − α I l − μl 2 (10) A.logA R Ω k k=1 l=1 l=1
close to 1
Now, by applying inequality log(z) ≤ z − 1, ∀z ∈ [0, +∞[ to aAk , ∀k ∈ [1..N ] and using condition (3), one can prove the following important inequalities: N ak logak logN 1− ≤ k=1 ≤1 (11) logA A.logA N
a loga
k k In practice, N is generally much smaller than A and k=1 is close to A.logA 1. For example, for a maximum number of regions equal to 10 and a 256x256 N i=1 ai logai ≤ 1. We now consider the folimage, we have approximately 0.8 ≤ A.logA lowing classical relation in statistical pattern recognition between within-cluster distance, total distance, and in-between cluster distance [12]:
N k=1
L
I l − μlk 2 −
Rk l=1
within−cluster distance
L
Ω l=1
I l − μl 2 =
total distance
−
L N k=1 l=1
μlk − μl 2
(12)
in−between cluster distance
Consequently, with a value of α close to 1 in (10), the sum of the data and the region merging terms will be close to the segmentation in-between cluster distance. However, minimizing the in-between cluster distance is equivalent to minimizing the within-cluster distance because the total distance is independent from the segmentation [12]. This interpretation, which suggests a value of α close to 1, will be confirmed in the next section (section 5) with several simulations. We will show that we can take α in an interval containing, or close to, 1, and which we can use for all the images of the same class. Note that β depends on the image, and once α is fixed, β is given directly by (9).
5 Experiments
We conducted a large number of tests with color images and motion maps. We show representative segmentation examples. To support the possibility of determining via learning an interval of α values applicable to the images of a given class (color images and motion maps, for example), we ran tests showing that a common interval of α values giving the desired number of regions can be found for all the images of the same class.

Color Images. The RGB space is used to represent the color information in each image. We show here results for a set of 6 images, each containing several objects (Figure 2, (1)-(6)).
Fig. 1. Results without the region merging prior on an image composed of 2 regions (object and background): (a) initialization (N = 5, i.e., 4 curves), (b) final curves, (c) final segmentation into 5 regions.

Table 1. Color images: intervals of α values corresponding to the correct number of regions

Images   1      2      3      4      5      6
αmin     1.49   1.00   1.35   0.36   1.03   0.096
αmax     5.6    3.97   4.37   5.5    2.67   3.81
The objects in these images were taken from the ALOI database [15]. We have images with two, three and four regions. With these images, the actual number of regions is known, which allows us to evaluate experimentally the interval of α values giving this number. Segmenting these images without fixing the number of regions is difficult due to the illumination variations inside each object. Figure 1 gives the segmentation result, without the region merging prior, of the color image (1) shown in Figure 2. This image consists of two regions: an object and a background. Segmentation of this image into 5 regions gives the final curves displayed in Figure 1 (b) and the segmentation shown in Figure 1 (c). The corresponding initialization with 4 curves is depicted in Figure 1 (a). Without the region merging prior, the object is fragmented into 4 different regions due to illumination variations. The first line of Figure 2 shows the segmentation results of the same image as in Figure 1, this time with the region merging prior. Only one curve (red) remained at convergence. This final curve correctly separates the object from the background. With the same initialization (5 regions, 4 curves) as in Figure 1 (a), and using the same α (α = 2), the other images ((2) to (6)) were also segmented correctly. The columns of Figure 2 show, respectively, the image, the final curves remaining at convergence, and the corresponding final segmentation. The final segmentation of each image corresponds to the desired number of regions as well as to the objects. We evaluated the interval of α values, [αmin, αmax], which leads to the desired number of regions for each image. The obtained intervals are reported in Table 1.
Fig. 2. Segmentation results with the region merging prior for 6 color images from the same database (α = 2): images (1)-(6) (first column); final curves which remained at convergence (second column); final segmentations (third column).
All α values in [1.49, 2.67] lead to a correct segmentation of the six images. These results conform to expectations and to the statistical interpretation of the coefficient α given in Section 4.2, and they support the possibility of determining an interval of α values via learning.
Fig. 3. Segmentation results with the region merging prior (Marmor sequence, α = 2): (a) 5 initial curves (6 regions) and the motion field (2 moving objects), (b) 2 final curves corresponding to the moving objects, (c) obtained segmentation into 3 regions (2 moving objects and a background), (d)-(e) segmentation regions corresponding to moving objects, (f) segmentation region corresponding to the background.
Fig. 4. Segmentation results without the region merging prior (Marmor sequence, α = 0): (a) segmentation with N = 4, (b) segmentation with N = 6.
Fig. 5. Segmentation results with the region merging prior (Road sequence, α = 2): (a) 5 initial curves (6 regions) and the motion field (1 moving object), (b) 1 final curve, (c) segmentation region corresponding to the moving object (inside the curve), (d) segmentation region corresponding to the background (outside the curve).
Motion Segmentation. In this experiment, we segment optical flow images into motion regions. The optical flow at each pixel is a two-dimensional vector. The method in [16] was used to estimate the optical flow. We show two examples. The first example uses the Marmor sequence, which contains 3 regions: 2 moving objects and a background. The initial curves (5 curves for at most 6 regions) and motion vectors are shown in Figure 3 (a). With the region merging prior, Figure 3 (b) depicts the curves which remained, giving a correct segmentation into 3 regions (Figure 3 (c)). In this example, α = 2. Figures 3 (d) and (e) show the regions corresponding to the 2 moving objects, and (f) shows the background. To illustrate the effect of the region merging prior, Figure 4 (b) shows the segmentation obtained using the same initialization (N = 6) but without the region merging prior (α = 0). The background, in this case, is divided into 3 different regions, and the moving object on the right is divided into 2 regions. We also give in Figure 4 (a) the segmentation obtained with 4 initial regions (N = 4) and α = 0. The results obtained without the region merging prior do not correspond to a meaningful segmentation. The second example uses the Road image sequence (Figure 5 (a)), which contains two regions: a moving vehicle and a background. The same initialization as with the Marmor sequence was used. Figure 5 shows the results obtained using the region merging prior. In (b), one curve remains, which separates the moving object from the background; Figures 5 (c)-(d) display the segmentation regions. We evaluated the interval of α values, [αmin, αmax], which gave the desired number of regions for each sequence. The obtained intervals are reported in Table 2 and conform to the interpretation of the weight of the region merging prior, which suggested a value of α close to 1. All α values which correctly segment the Marmor sequence also give the desired number of regions for the Road sequence.

Table 2. Motion segmentation: intervals of α values corresponding to the desired number of regions

Images   Marmor   Road
αmin     1.217    0.017
αmax     2.69     6.5

6 Conclusion
This study investigated a curve evolution method which allowed the effective number of regions to vary during optimization. This was done via a region merging prior which embeds an implicit region merging in curve evolution. We gave a statistical interpretation of the weight of this prior. We confirmed this interpretation by several experiments with both color images and motion maps. Experiments demonstrated that we can determine by learning an interval of values of this weight applicable to the images of a given class.
References

1. Cremers, D., Rousson, M., Deriche, R.: A Review of Statistical Approaches to Level Set Segmentation: Integrating Color, Texture, Motion and Shape. Int. J. of Computer Vision 62, 249–265 (2007)
2. Rousson, M., Deriche, R.: A variational framework for active and adaptive segmentation of vector valued images. In: Proc. IEEE Workshop on Motion and Video Computing, pp. 56–61. IEEE Computer Society Press, Los Alamitos (2002)
3. Chan, T.F., Sandberg, B.Y., Vese, L.A.: Active Contours without Edges for Vector-Valued Images. J. Visual Communication and Image Representation 11, 130–141 (2000)
4. Vese, L.A., Chan, T.F.: A Multiphase Level Set Framework for Image Segmentation Using the Mumford and Shah Model. Int. J. of Computer Vision 50, 271–293 (2002)
5. Samson, C., Blanc-Féraud, L., Aubert, G., Zerubia, J.: A Level Set Model for Image Classification. Int. J. of Computer Vision 40, 187–197 (2000)
6. Ayed, I.B., Hennane, N., Mitiche, A.: Unsupervised Variational Image Segmentation/Classification using a Weibull Observation Model. IEEE Trans. on Image Processing 15, 3431–3439 (2006)
7. Ayed, I.B., Mitiche, A., Belhadj, Z.: Polarimetric Image Segmentation via Maximum Likelihood Approximation and Efficient Multiphase Level Sets. IEEE Trans. on Pattern Anal. and Machine Intell. 28, 1493–1500 (2006)
8. Ayed, I.B., Mitiche, A.: A Partition Constrained Minimization Scheme for Efficient Multiphase Level Set Image Segmentation. In: Proc. IEEE Int. Conf. on Image Processing, pp. 1641–1644. IEEE Computer Society Press, Los Alamitos (2006)
9. Brox, T., Weickert, J.: Level Set Segmentation With Multiple Regions. IEEE Trans. on Image Processing 15, 3213–3218 (2006)
10. Kadir, T., Brady, M.: Unsupervised non-parametric region segmentation using level sets. In: Proc. Int. Conf. on Computer Vision, pp. 1267–1274 (2003)
11. Zhu, S.C., Yuille, A.: Region Competition: Unifying Snakes, Region Growing, and Bayes/MDL for Multiband Image Segmentation. IEEE Trans. on Pattern Anal. and Machine Intell. 18, 884–900 (1996)
12. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. Wiley Interscience, Chichester (2000)
13. Sethian, J.: Level Set Methods and Fast Marching Methods. Cambridge University Press, Cambridge (1999)
14. Nock, R., Nielsen, F.: Statistical Region Merging. IEEE Trans. on Pattern Anal. and Machine Intell. 26, 1452–1458 (2004)
15. Geusebroek, J.M., Burghouts, G.J., Smeulders, A.W.M.: The Amsterdam Library of Object Images. Int. J. of Computer Vision 61, 103–122 (2005)
16. Vazquez, C., Mitiche, A., Laganiere, R.: Joint Multiregion Segmentation and Parametric Estimation of Image Motion by Basis Function Representation and Level Set Evolution. IEEE Trans. on Pattern Anal. and Machine Intell. 28 (2006)
A Basin Morphology Approach to Colour Image Segmentation by Region Merging

Erchan Aptoula and Sébastien Lefèvre

UMR-7005 CNRS-Louis Pasteur University LSIIT, Pôle API, Bvd Brant, PO Box 10413, 67412 Illkirch Cedex, France
{aptoula,lefevre}@lsiit.u-strasbg.fr
Abstract. The problem of colour image segmentation is investigated in the context of mathematical morphology. Morphological operators are extended to colour images by means of a lexicographical ordering in a polar colour space, which are then employed in the preprocessing stage. The actual segmentation is based on the use of the watershed transformation, followed by region merging, with the procedure being formalized as a basin morphology, where regions are “eroded” in order to form greater catchment basins. The result is a fully automated processing chain, with multiple levels of parametrisation and flexibility, the application of which is illustrated by means of the Berkeley segmentation dataset.
1 Introduction
Automatic, robust and efficient colour image segmentation is nowadays more indispensable than ever, since numerous image repositories have been formed and continue to grow at an increasing speed. As far as the human vision system is concerned, edge information is primarily contained within the luminance component. Hence colour is regarded as an invaluable, yet auxiliary, component when it comes to image segmentation and, more generally, object recognition. The problem of its efficient exploitation in this context remains to be resolved, not only because the principles of human colour vision are not yet fully understood, but also because colour introduces additional parameters into the already elusive problem of general purpose image segmentation. Specifically, one of the major questions is the representation of colour vectors and the choice of the associated colour space. Since the desired segmentation outcome is almost always based on the human interpretation of objects, it is deemed natural to attempt to emulate the sensitivities of human colour vision. That is one reason why polar colour spaces have been gaining popularity in this regard. However, as will be elaborated in Section 2, these spaces also suffer from considerable drawbacks. Among the approaches developed to resolve the problem of colour segmentation, mathematical morphology offers a different perspective from the mostly statistical and clustering based methods, since it is an algebraic image processing framework capable of exploiting not only the spectral, but the spatial relationships
of pixels as well. In this paper, we present a fully automated colour segmentation procedure, designed for polar colour spaces and based on morphological operators. In particular, the proposed approach consists primarily of manipulating the catchment basins resulting from a watershed transformation, by interpreting them as the new processing units for morphological operators, hence leading to a "morphology of basins". A hierarchy of attributes is thus organised, making it possible to merge these regions based on arbitrary characteristics such as mean colour and texture. The resulting method is tested using the Berkeley segmentation dataset [1]. The rest of the paper is organised as follows. Section 2 first discusses the crucial choice of polar colour space. Section 3 then elaborates the proposed segmentation approach and details its individual stages. Finally, Section 4 is devoted to concluding remarks.
Fig. 1. Vertical semi-slice (luminance vs. saturation) of the cylindrical HLS (left) and bi-conic IHLS (right) colour spaces.
2 Choice of Colour Space
For the reasons mentioned in the previous section, here we concentrate on 3d-polar colour spaces, that have appeared as the result of attempts to describe the RGB cube in a more intuitive manner, from the point of view of human interpretation of colour, in terms of luminance, saturation and hue. While luminance L ∈ [0, 1] accounts for the amount of light, saturation S ∈ [0, 1] represents the purity of a colour. The values of the periodical hue interval H ∈ [0, 2π[ on the other hand, denote the dominant wavelength, with 0 corresponding to red. Basically, polar colour spaces achieve this transformation by representing colours with respect to the achromatic axis of RGB. Nevertheless, several implementational variants are available for this single transformation, e. g. HSV, HSB, HLS, HSI, etc [2]. According to Hanbury and Serra [3], all of the aforementioned colour spaces were developed primarily for easy numerical colour specification, while they are ill-suited for image analysis. Specifically, although they were initially designed as conic or bi-conic shaped spaces, later on their cylindrical versions were employed in practice, in order to avoid the computationally expensive (for that period) checking for valid colour coordinates. The passage from conic to cylindrical shape however resulted in many inconsistencies within these spaces,
for instance by allowing fully saturated colours to be defined in zero luminance. Extensive details on this topic can be found in [3]. Here we adopt the suggestion made in [3], and make our colour space choice in favour of the improved HLS space (IHLS), which employs the original biconic version of HLS, hence limiting the maximal allowed value for saturation in relation to luminance (figure 1). Further advantages of IHLS with respect to its counterparts include the independence of saturation from luminance, thus permitting the use of any luminance expression (e. g. RGB average, perceptual luminance, etc) and the comparability of saturation values.
Fig. 2. Summary of the proposed processing chain: input image → preprocessing → basin extraction → merging → post-processing → label image.
3 Proposed Approach
In an ideal world, all images would have the same resolution, number of colours and overall complexity. Unfortunately, this is not the case. The problem of segmentation in its most general form is highly difficult to resolve, as it aims to detect the semantic regions of extremely heterogeneous input. Moreover, semantic-level segmentation naturally requires some a priori information of semantic nature, hence rendering it feasible only for domain specific applications, since no ontology incorporating all types of objects exists. Consequently, a more "practical and realistic" aim, also adopted here, is to attempt to detect the principal regions of images with respect to homogeneity, a task of prime importance for content based image retrieval. Considering the vast heterogeneity of image data, an equally high degree of adaptability is crucial, taking into account the different types of border information contained within an image, e. g. spectral, textural, etc. To this end, we propose the processing chain summarised in figure 2. Briefly, the input image is first simplified using border preserving morphological operators; then, through the combination of a colour gradient and the watershed transformation, the catchment basins are obtained. Next, an iteratively applied hierarchical fusion is carried out, providing a rough approximation of the sought borders, which are finally refined in the last stage by means of a marker based watershed transformation. Details on each step follow.

3.1 Preprocessing
This first step aims to simplify the input image and eliminate any "excessive" detail. The morphological toolbox offers a rich variety of operators for this purpose; however, several issues arise. The first concerns the extension of grayscale
morphological operators to colour images, a theoretical problem stemming from the need to impose a complete lattice structure on the pixel intensity range, which, in the case of multivalued images, is equivalent to the need to order vectorial data [4,5]. Several approaches have been developed to this end, a survey of which may be found in [6]. Here, it has been chosen to order the colour vectors of the IHLS space by means of a lexicographical ordering:

$$
(h_1, s_1, l_1) < (h_2, s_2, l_2) \;\Leftrightarrow\;
\begin{cases}
l_1 < l_2, & \text{or} \\
l_1 = l_2,\ s_1 < s_2, & \text{or} \\
l_1 = l_2,\ s_1 = s_2 \text{ and } h_1 < h_2
\end{cases} \qquad (1)
$$

where $l_1, l_2, s_1, s_2 \in [0, 1]$ and $h_1, h_2 \in [0, 2\pi[$. As the hue component is a circular value, an angular distance from a reference hue $h_0$ [7] is employed for their comparison:

$$
h \div h_0 =
\begin{cases}
|h - h_0| & \text{if } |h - h_0| < \pi \\
2\pi - |h - h_0| & \text{if } |h - h_0| \ge \pi
\end{cases} \qquad (2)
$$

which, for the sake of simplicity, is set as $h_0 = 0.0$. The hue values are then ordered according to their distances from $h_0$:

$$
\forall\, h, h' \in [0, 2\pi[,\quad h < h' \;\Leftrightarrow\; h \div h_0 > h' \div h_0 \qquad (3)
$$

where hues closer to $h_0$ are considered greater. Hence, with the luminance components compared first, this ordering leads to operators that act principally on this channel, which contains the majority of the total variational information.
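A small sketch of this comparison (plain Python, hypothetical function names) is given below; it follows the reconstruction of (1)-(3) above, in which hues closer to the reference are ranked greater.

```python
import math

def hue_dist(h, h0=0.0):
    """Acute angular distance on the hue circle (eq. 2), with h, h0 in [0, 2*pi)."""
    d = abs(h - h0)
    return d if d < math.pi else 2.0 * math.pi - d

def ihls_less(c1, c2, h0=0.0):
    """Lexicographical comparison of two IHLS triplets (h, s, l) following (1):
    luminance first, then saturation, then hue, hues closer to h0 being greater."""
    h1, s1, l1 = c1
    h2, s2, l2 = c2
    if l1 != l2:
        return l1 < l2
    if s1 != s2:
        return s1 < s2
    return hue_dist(h1, h0) > hue_dist(h2, h0)
```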
Fig. 3. From left to right, the original image (#101087), its preprocessed form, and the intensity transition of the white line in the original image, for a reconstructive and standard processing based leveling (plot: luminance vs. vertical dimension; curves: reconstructive, standard, original).
Equipped with this ordering, erosion (ε), dilation (δ) and all derived grayscale morphological operators may be extended to colour data. Nevertheless, a second issue in this regard is the need for border preserving operators. That is why it was chosen to employ a morphological leveling Λ(f, m) [8], which provides
a simplified version of the input image f, by applying iterative geodesic erosions and dilations to the marker m until idempotence, i.e., $\Lambda(f, m)^i = \sup\{\inf[f, \delta^i(m)],\ \varepsilon^i(m)\}$, iterated until $\Lambda(f, m)^{i+1} = \Lambda(f, m)^i$. The marker image is obtained by means of a reconstruction based opening followed by a reconstruction based closing. The result is a "leveled" image, in which the details smaller than the structuring element's (SE) size have been removed, while all region borders are perfectly preserved (figure 3). The size of the SE, typically a square of 7×7 pixels, is determined with respect to the dimensions of the input image.
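The following is a direct, per-channel transcription of that iteration into Python/SciPy, a sketch under our reading of the formula rather than the reference implementation of [8]; the colour version would use the lexicographical ordering instead of the scalar min/max.

```python
import numpy as np
from scipy import ndimage

def leveling(f, m, se=np.ones((3, 3)), max_iter=500):
    """Morphological leveling Lambda(f, m): iterate
    u_i = max( min(f, dilate^i(m)), erode^i(m) ) until idempotence."""
    f = np.asarray(f, dtype=float)
    d = np.asarray(m, dtype=float).copy()   # holds delta^i(m)
    e = d.copy()                            # holds epsilon^i(m)
    u_prev = None
    for _ in range(max_iter):
        d = ndimage.grey_dilation(d, footprint=se)
        e = ndimage.grey_erosion(e, footprint=se)
        u = np.maximum(np.minimum(f, d), e)
        if u_prev is not None and np.array_equal(u, u_prev):
            break
        u_prev = u
    return u
```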
3.2 Basin Extraction
Having simplified the input, this step consists in computing a first segmentation map of the image using the watershed transformation. As this powerful operator can be applied only to a scalar input representing the topographic relief of the image, it has been chosen to combine the colour channels by means of a channel-wise maximum of marginal gradients:

$$
\rho_{HLS}(h, s, l) = \max\{\rho(l),\ \rho(s),\ \rho^H(h)\} \qquad (4)
$$

where $\rho = f - \varepsilon(f)$ is the standard internal morphological gradient. Although the components of the polar colour spaces are highly intuitive, their combination is relatively problematic. In particular, hue is of no importance if saturation is "low", while the bi-conic shape of the colour space assures that no high saturation levels exist if luminance is not "high enough". Hence the hue gradient needs to be weighted with a coefficient that has a strong output only when both compared saturation values are "sufficiently high":

$$
\rho^H(h) = \max_{i \in B}\{j(s, s_i) \times (h \div h_i)\} - \min_{i \in B}\{j(s, s_i) \times (h \div h_i)\} \qquad (5)
$$

where $B$ is the local 8-neighborhood and $j(\cdot, \cdot)$ a double sigmoid controlling the transition from "low" to "high" saturation levels:

$$
j(s_1, s_2) = \frac{1}{\big(1 + \exp(\alpha (s_1 - \beta))\big) \times \big(1 + \exp(\alpha (s_2 - \beta))\big)} \qquad (6)
$$

where $\alpha = -10$ and the offset $\beta = \mu_S$ is set as the mean saturation of the image, hence making it possible to adapt the gradient's sensitivity to colour according to the image's overall colourfulness level. The application of the watershed transformation to the newly computed gradient leads to the result depicted in figure 4.
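A NumPy/SciPy sketch of this gradient is given below (hypothetical function names; the wrap-around borders introduced by np.roll are a simplification of the sketch, not part of the method).

```python
import numpy as np
from scipy import ndimage

def hue_dist(h1, h2):
    """Acute angular distance between two hue images (eq. 2)."""
    d = np.abs(h1 - h2)
    return np.where(d < np.pi, d, 2.0 * np.pi - d)

def double_sigmoid(s1, s2, alpha=-10.0, beta=0.5):
    """Saturation weighting j(., .) of eq. (6)."""
    return 1.0 / ((1.0 + np.exp(alpha * (s1 - beta)))
                  * (1.0 + np.exp(alpha * (s2 - beta))))

def colour_gradient(h, s, l, se=np.ones((3, 3))):
    """Channel-wise maximum of marginal gradients (eq. 4), with the
    saturation-weighted hue gradient of eq. (5) over the 8-neighbourhood."""
    rho_l = l - ndimage.grey_erosion(l, footprint=se)   # internal gradients
    rho_s = s - ndimage.grey_erosion(s, footprint=se)
    beta = s.mean()                                     # offset = mean saturation
    vals = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            h_i = np.roll(np.roll(h, dy, axis=0), dx, axis=1)
            s_i = np.roll(np.roll(s, dy, axis=0), dx, axis=1)
            vals.append(double_sigmoid(s, s_i, beta=beta) * hue_dist(h, h_i))
    vals = np.stack(vals)
    rho_h = vals.max(axis=0) - vals.min(axis=0)
    return np.maximum(np.maximum(rho_l, rho_s), rho_h)
```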
3.3 Merging
Given the sensitivity of the internal gradient, the oversegmented result obtained in the previous step was to be expected. Considering that the sought borders are contained within this complex of adjacency relations, from this point on all efforts
Fig. 4. From left to right, the hand reference segmentation, the proposed colour gradient and its oversegmented watershed transformation result, superposed on the original image
are concentrated on eliminating the unwanted borders, and thus increasing the sizes of the catchment basins. Merging the mosaic of basins obtained by the watershed transformation is a well known technique in automated image segmentation [9,10]. Here we follow a graph based formalisation of this procedure. As each basin represents a locally homogeneous region, the watershed procedure, despite the level of oversegmentation, provides spectrally atomic regions and thus greatly reduces the volume of clustering to be carried out in the later stages. At this point, based on the atomicity of each basin, one can proceed by manipulating the image content with the catchment basins as the new processing "image units", instead of pixels. Hence the image can be viewed as an undirected graph of basins, where each node is characterised by a set of spectral and other properties (e. g. mean colour, variance, etc) as well as its set of adjacent basins, or neighbours. With this point of view, the merging procedure can be defined as an operator on this graph, which propagates labels and modifies adjacency relations. Furthermore, by imposing a complete lattice structure on the "value interval" of basins, one can define morphological operators, hence leading to a basin morphology. In particular, by formulating the merging of basins as the replacement of each node by its closest neighbour with respect to a certain metric, the operator becomes intuitively similar to an erosion, while a dilation would dually replace each node with its most distant neighbour; however, this option is of no interest on its own. Consequently, the basin erosion ($\varepsilon_b$) and dilation ($\delta_b$) of a graph $G = (V, E)$ can be defined respectively as:

$$
\varepsilon_b(G) = G \;\big|\; \forall\, V_i \in V,\ \text{label}(V_i) = \text{label}\Big(\arg\min_{V_j \in N(V_i)} d(V_j, V_i)\Big) \qquad (7)
$$
$$
\delta_b(G) = G \;\big|\; \forall\, V_i \in V,\ \text{label}(V_i) = \text{label}\Big(\arg\max_{V_j \in N(V_i)} d(V_j, V_i)\Big) \qquad (8)
$$
where $N(V_i)$ is the set containing the neighbours of $V_i$. Several operators may be derived from combinations of these two; their efficiency in this context, however, is strongly related to the similarity metric in use ($d(\cdot, \cdot)$). Consequently, one can implement a rich variety of merging strategies, each based on a different basin similarity metric exploiting some of their properties.
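A minimal sketch of one (thresholded) basin erosion pass on a region adjacency graph is shown below; the data structures (dicts keyed by basin id) and the optional threshold are assumptions of the sketch.

```python
def basin_erosion(labels, features, adjacency, dist, threshold=None):
    """One pass of eq. (7): each basin takes the label of its most similar
    neighbour; if a threshold is given, only merges closer than it are applied.
    labels    : dict basin_id -> current label
    features  : dict basin_id -> attribute vector (e.g. mean colour)
    adjacency : dict basin_id -> set of neighbouring basin_ids
    dist      : callable implementing the similarity metric d(., .)"""
    new_labels = dict(labels)
    for b, neighbours in adjacency.items():
        if not neighbours:
            continue
        closest = min(neighbours, key=lambda n: dist(features[n], features[b]))
        if threshold is None or dist(features[closest], features[b]) <= threshold:
            new_labels[b] = labels[closest]
    return new_labels
```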
Fig. 5. Hierarchy of merging criteria: Level 1 — mean colour, variance; Level 2 — basin transition; Level 3 — texture, ... (the vertical axis of the original figure indicates reliability).
We propose a hierarchical approach in this regard, as illustrated in figures 5 and 6, which consists in employing various properties of the basins at different scales in order to compute their distances and hence realise their erosions (i. e. mergings) by means of equation (7). Specifically, it begins with a series of thresholded erosions based on their mean colour, where only the basins that are closer than a predefined limit are taken into account. This first step aims to merge only spectrally similar basins. The colour distance in use is:

$$
\forall\, c_1 = (h_1, s_1, l_1),\ c_2 = (h_2, s_2, l_2),\quad
d(c_1, c_2) = j(s_1, s_2) \times (h_1 \div h_2) + \big(1 - j(s_1, s_2)\big) \times |l_1 - l_2| \qquad (9)
$$

By means of the factor $j(\cdot, \cdot)$, a saturation based continuous transition of priority is realised between hue and luminance. This low level step is carried out iteratively with thresholds starting from $t_0 = 0.01$ and increasing until $t_{max}$, which doubles the initial intra-basin variances. This process results in a preliminary segmentation map, where relatively homogeneous regions appear. Next, we modify the distance metric so as to eliminate intensity gradients, and apply it using the same threshold. For this purpose, the erosions are computed by taking into account only the bordering pixels of basins. At the third step, higher level merging criteria are employed. Specifically, in order to calculate the textural similarity of basins, their mean covariance vector is used:

$$
K(f) = \mathrm{Vol}\big(\varepsilon_{P_{2,v}}(f)\big)\, /\, \mathrm{Vol}(f) \qquad (10)
$$

where $P_{2,v}$ is a pair of points separated by a vector $v$ and $\mathrm{Vol}$ the volume, i. e. the sum of pixel values. Of course, one is by no means limited to these criteria; for instance, border geometry may be further exploited. The threshold in this case is fixed as the mean covariance vector of the entire image.
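For instance, the first-level criterion (9) could be passed to the basin erosion sketch above as the dist callable; a plain-Python transcription (hypothetical names) is:

```python
import math

def j_sigmoid(s1, s2, alpha=-10.0, beta=0.5):
    """Double sigmoid of eq. (6) for two scalar saturations."""
    return 1.0 / ((1.0 + math.exp(alpha * (s1 - beta)))
                  * (1.0 + math.exp(alpha * (s2 - beta))))

def hue_div(h1, h2):
    """Acute angular hue distance of eq. (2)."""
    d = abs(h1 - h2)
    return d if d < math.pi else 2.0 * math.pi - d

def colour_distance(c1, c2, beta=0.5):
    """Basin mean-colour distance of eq. (9): hue dominates when both basins
    are saturated, luminance otherwise."""
    h1, s1, l1 = c1
    h2, s2, l2 = c2
    w = j_sigmoid(s1, s2, beta=beta)
    return w * hue_div(h1, h2) + (1.0 - w) * abs(l1 - l2)
```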
3.4 Post-processing
Once this stage is reached, the principal regions of the input are expected to have been formed. As a last touch, one can eliminate all regions smaller than a certain area by merging them with their closest neighbour.
Fig. 6. From left to right, the three levels of merging using the principle illustrated in figure 5
A more serious problem, however, concerns the possibility of local deviations from the sought borders, since, according to the definition of the proposed erosion operator, basin processing has so far been carried out using only the immediate neighborhood of each basin. To counter this phenomenon, one can employ, for instance, multiple scales by modifying the size of the processed neighborhood, or in other words the shape and extent of the SE. Another possibility is to profit from the topological properties of the marker based watershed transform, which, by limiting the flooding sources, provides absolute control over the number of regions that are formed. As to the markers, one can very simply erode the binary region map while preserving its connectivity. Thus, flexibility areas are formed among the regions, which make it possible to realise topological border corrections (figure 7).
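As an illustration, a hedged sketch of such marker generation (erode each region of the label map while keeping it non-empty) is given below; the final refinement would feed these markers to a marker-based watershed on the colour gradient.

```python
import numpy as np
from scipy import ndimage

def markers_from_regions(label_img, iterations=3):
    """Erode every region of a label image to obtain watershed markers; a region
    that would vanish is kept as-is so that no flooding source is lost.
    (This only guarantees non-empty markers; preserving exact connectivity, as
    described in the paper, would require a more careful erosion.)"""
    markers = np.zeros_like(label_img)
    for r in np.unique(label_img):
        mask = label_img == r
        eroded = ndimage.binary_erosion(mask, iterations=iterations)
        markers[eroded if eroded.any() else mask] = r
    return markers
```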
Fig. 7. From left to right, the segmentation result using the jump connection algorithm [11], the marker image and the final marker based watershed result
4 Discussion and Conclusion
In this paper, an unsupervised and input specific colour image segmentation method has been presented. It has been developed for the improved HLS space, and constitutes an attempt to integrate the spatial sensitivity of morphological operators with spectral image properties. Furthermore, a graph based morphology approach has been formulated in order to manipulate the catchment basins produced by the watershed transform. This formulation aims mainly to provide a more efficient and flexible exploitation framework for the wealth of topological information provided by the aforementioned transform. Pertinent results have been obtained on the Berkeley dataset (figure 8), even by using the basic erosion definition in combination with a hierarchy of multiple merging criteria, ranging from mean colour to covariance based texture features. More sophisticated operators (e. g. geodesic reconstruction of basins, etc.), as well as the exploration of further basin metrics, remain to be investigated. Its execution speed and adaptivity, along with its capacity to provide the "main" borders of its input, render this approach suitable for applications where precision is of secondary importance and a fast and robust segmentation is prioritised (e. g. content based image retrieval).
Fig. 8. From left to right, the original images, their segmentations based on jump connection [11] and based on the proposed approach (top to bottom: #3096, #42049, #143090 and #145086).
Issues that require further attention include improving the estimation of the method's parameters, as well as the use of additional high level merging criteria, such as border geometry.
References

1. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proceedings of the 8th International Conference on Computer Vision, Vancouver, Canada, vol. 2, pp. 416–423 (2001)
2. Gonzalez, R.C., Woods, R.E.: Digital Image Processing, 2nd edn. Addison-Wesley, New York (1992)
3. Hanbury, A., Serra, J.: Colour image analysis in 3d-polar coordinates. In: International Conference on Image Processing and its Applications, Magdeburg, Germany (2003)
4. Serra, J.: Image Analysis and Mathematical Morphology, vol. I. Academic Press, London (1982)
5. Ronse, C.: Why mathematical morphology needs complete lattices. Signal Processing 21(2), 129–154 (1990)
6. Aptoula, E., Lefèvre, S.: A comparative study on multivariate mathematical morphology. Pattern Recognition (2007), doi:10.1016/j.patcog.2007.02.004
7. Hanbury, A., Serra, J.: Morphological operators on the unit circle. IEEE Transactions on Image Processing 10(12), 1842–1850 (2001)
8. Gomila, C., Meyer, F.: Levelings in vector spaces. In: Proceedings of the IEEE Conference on Image Processing, Kobe, Japan (1999)
9. Chen, Q., Zhou, C., Luo, J., Ming, D.: Fast segmentation of high-resolution satellite images using watershed transform combined with an efficient region merging approach. In: Klette, R., Žunić, J. (eds.) IWCIA 2004. LNCS, pp. 621–630. Springer, Heidelberg (2004)
10. Garrido, L., Salembier, P., Garcia, D.: Extensive operators in partition lattices for image sequence analysis. Signal Processing 66(2), 157–180 (1998)
11. Angulo, J., Serra, J.: Modelling and segmentation of colour images in polar representations. Image and Vision Computing (2006), doi:10.1016/j.imavis.2006.07.018
Detecting and Segmenting Un-occluded Items by Actively Casting Shadows

Tze K. Koh 1,2,3, Amit Agrawal 1, Ramesh Raskar 1, Steve Morgan 3, Nicholas Miles 2, and Barrie Hayes-Gill 3

1 Mitsubishi Electric Research Labs (MERL), 201 Broadway, Cambridge MA 02139, USA
{agrawal,raskar}@merl.com
http://www.merl.com/people/agrawal/index.html
2 School of Chemical, Environmental and Mining Engineering, University of Nottingham, UK
{enxtkk,nick.miles}@nottingham.ac.uk
3 School of Electrical and Electronic Engineering, University of Nottingham, UK
{steve.morgan,barrie.hayes-gill}@nottingham.ac.uk
Abstract. We present a simple and practical approach for segmenting un-occluded items in a scene by actively casting shadows. By ‘items’, we refer to objects (or part of objects) enclosed by depth edges. Our approach utilizes the fact that under varying illumination, un-occluded items will cast shadows on occluded items or background, but will not be shadowed themselves. We employ an active illumination approach by taking multiple images under different illumination directions, with illumination source close to the camera. Our approach ignores the texture edges in the scene and uses only the shadow and silhouette information to determine the occlusions. We show that such a segmentation does not require the estimation of a depth map or 3D information, which can be cumbersome, expensive and often fails due to the lack of texture and presence of specular objects in the scene. Our approach can handle complex scenes with self-shadows and specularities. Results on several real scenes along with the analysis of failure cases are presented.
1 Introduction

The human vision system is extremely efficient at scene analysis. Identifying objects in a scene and grasping them is a mundane task for us. However, designing vision algorithms even for such simple tasks has proven to be notoriously difficult. For example, random 3D 'bin-picking', where objects are randomly placed in a bin, is still an unsolved problem. Commercial systems typically address less taxing robot-guidance tasks, such as picking singulated parts from a moving conveyor belt, and employ 2D image processing techniques. Partial occlusion with overlapping parts is a serious problem, and it is important to find un-occluded objects. In this paper, we address the problem of identifying un-occluded items in a scene. By 'items', we refer to objects (or parts of them) enclosed by depth edges. Such an approach could serve as a pre-processing stage for several vision tasks, for example robotic manipulation in factory automation, 3D pose estimation and object recognition. Our motivating application for detecting un-occluded items is to enable a robot-mounted vision system to better plan the picking sequence.
Fig. 1. Segmenting un-occluded items. (Left) Implementation of our active illumination approach using a firewire camera and eight light emitting diodes (LEDs) around it. (Right) A scene with two objects, A and B. B is occluded since it contains a shadow edge (orange). Equivalently, B's shadow region does not contain the complete depth edge contour (green) of B, as its depth edges are intersected by shadow edges. However, A's shadow region contains its complete depth edge contour. Thus, un-occluded items can be obtained by filling in depth edges inside shadow regions.
Although 2D image segmentation approaches can segment an image into semantic regions, in the absence of 3D or depth information these approaches cannot identify occlusion between objects. It is a general belief that once 3D or depth information is obtained, several vision tasks can be simplified. Although past decades have witnessed significant research efforts in this direction, accurate 3D estimation is cumbersome, expensive and usually has limitations (e.g. stereo on non-textured surfaces). Even if the depth map of the scene is available, one would have to do an analysis similar to ours to find un-occluded objects. This is because un-occluded objects may not necessarily be at a smaller distance from the camera than occluded objects. Range segmentation may segment the depth map into regions, but one still needs to determine the occlusions to remove occluded objects. More importantly, we show that such an analysis can be done using depth and shadow edges without obtaining the depth map of the scene. Thus, our approach inherently overcomes the limitations of shape-from-X algorithms. Our approach can easily handle textured and non-textured objects as well as specular objects (to a certain extent), as described in Sect. 3.

Contributions: We make the following contributions in this paper:
– We propose an approach to segment un-occluded items in a scene using cast shadows. We describe a simple implementation of this approach using depth and shadow edges.
– We analyze practical configurations where our approach works and fails, including self-occlusions, mutual occlusions, objects with holes and specular objects.
– We show how to handle missing depth/shadow edges due to noise and lack of shadow information.

1.1 Related Work

2D/Range Segmentation: Image and range segmentation [1,2,3,4,5,6] is a well researched area. Although 2D segmentation can segment an image into semantic regions, it cannot provide occlusion information due to the lack of depth information. Even when a depth map of the scene is available, we need to explicitly find occlusions using depth edges. In Sect. 2, we show that depth edges can be directly obtained using active illumination without first computing a depth map.
Shape from Silhouettes: These approaches [7,8,9,10] attempt to infer 3D shape from silhouettes obtained under different view-points. The computed silhouette of every image, along with the camera center of the corresponding camera, is used to define a volume assumed to bound the object. The intersection of these volumes, known as the visual hull [11], yields a reasonable approximation of the real object. In contrast, we capture images from a single view-point under varying illumination and use the information in cast shadows to segment un-occluded objects.

Active Illumination: Several vision approaches use active illumination to simplify the underlying problem. Nayar et al. [12] recover the shape of textured and textureless surfaces by projecting an illumination pattern on the scene. Shape from structured light [13,14] has been an active area of research for 3D capture. Raskar et al. [15] proposed the multiflash camera (MFC), which attaches four flashes to a conventional digital camera to capture depth edges in a scene. Crispell et al. [16] exploited the depth discontinuity information captured by the MFC for a 3D scanning system which can reconstruct the position and orientation of points located deep inside concavities. The depth discontinuities obtained by the MFC have also been utilized for robust stereo matching [17] and recognition of finger-spelling gestures [18]. Koh et al. [19] use the depth edges obtained using multiflash imaging [15] for automated particle size analysis, with applications in the mining and quarrying industry. Our approach also uses a variant of the MFC (with 8 flashes, Fig. 1) to extract depth discontinuities, which are then used to segment un-occluded objects.

Interpretation of Line Drawings: Our visual system is surprisingly good at the perceptual analysis of line drawings and occluding contours into 3D shapes [20]. Labeling line drawings into different types of edges was proposed by Huffman [21]. Waltz [22] describes a system to provide a precise description of a plausible scene which could give rise to a particular line drawing, for polyhedral scenes. Malik [23] proposed schemes for labeling line drawings of scenes containing curved objects under orthographic projection. Marr [24] argued that a given silhouette could be generated by an infinite variety of shapes, and analyzed the importance of assumptions about viewed surfaces in our perception of 3D from occluding contours. Our goal is not to interpret occluding contours into 3D shapes, but to label the occluding contours corresponding to un-occluded objects in the scene.
2 Segmentation Using Information in Cast Shadows

In this section, we describe the basic idea of segmenting un-occluded items in a scene. We first assume that complete depth and shadow edges are available, i.e., there are no missing edges. We do a thorough analysis of this ideal case for several practical scenes in Sect. 3. In Sect. 4, we extend our approach to handle missing edges. As mentioned earlier, depth edges alone cannot provide a unique interpretation of the objects in the scene. Thus, our goal is not to interpret depth edges into 3D shapes, but to identify or label those depth edges that possibly correspond to un-occluded objects in the scene. In particular, our approach outputs items enclosed by depth edges (see Fig. 2). In Sect. 3, we show how this is affected by self-shadows, self-occlusions and mutual occlusions. We assume that the scene consists of objects lying on a flat surface and on top of each other, and that the view direction is vertical.
Fig. 2. (Left) Different types of edges (depth, concave, convex, shadow) [22]. It is well-known that occluding contours alone cannot provide a unique interpretation of the objects in the scene [24,23]. We therefore find un-occluded items, which are objects or parts of objects enclosed within depth edges. (Right) In this scene, A and B could be parts of the same physical object or two different physical objects. Since A casts shadows on B, our approach will identify only A as the un-occluded item.
Consider the simple scene shown in Fig. 1, where A casts a shadow on B and B casts a shadow on the background. We depict depth edges in green and shadow edges in orange. Suppose we could segment the boundaries of A and B from the captured intensity images. Then we could easily infer that, since region B contains a shadow edge, it must be occluded. Thus, all regions that do not contain any shadow edges are potential candidates for un-occluded objects. The important question is how to obtain such a segmentation so that the segmentation boundaries correspond to object boundaries or shape edges. Note that any 2D segmentation approach relies on image intensities and thus will respond to texture/reflectance edges. A depth edge may not correspond to a texture edge at the same location in the image (e.g. all objects with the same reflectance, Fig. 4), and intensity edges on object surfaces will result in false depth edges. Thus, we need a robust method to find depth edges which can ignore texture edges.

Computing Depth Edges: The active illumination method proposed in [15] is an easy way to find depth edges in a scene. In this approach, four flashes are attached close to the main lens of the camera along the left, right, top and bottom directions. Four images, I1, I2, I3 and I4, are captured, each under a different flash illumination. Since shadows are cast by object boundaries and not by reflectance boundaries, depth edges can be extracted using the shadows. To compute depth edges, first a max composite image (Imax) is obtained by taking the maximum intensity value at every pixel. Imax will be a shadow-free image. Then, ratio images are calculated as ri = Ii/Imax. Depth edges are obtained by estimating the foreground-to-shadow transition in each ratio image and combining all the estimates. In our implementation, we capture eight images with different illumination directions using the setup shown in Fig. 1. Fig. 3 shows an example on a scene containing three overlapping crayons. Note that the shadow edges can be similarly obtained by estimating the shadow-to-foreground/background transition in each ratio image.

Segmenting Un-occluded Items: The basic idea in segmenting un-occluded items is to utilize the cast shadow information. If we trace the depth edges in the clockwise direction, the cast shadows should always be on the left of the depth edge. In other words, if an object is un-occluded, then cast shadows cannot fall inside the object, or to the right of the depth edge.
Fig. 3. Depth and shadow edges can be obtained using active illumination. (Top row) The eight input images captured using our setup. (Middle row) Ratio images obtained by dividing the input images by Imax; note that the ratio images are texture-free and have shadows according to the corresponding LED direction. (Bottom row) Depth edges, depth & shadow edges, shadow regions and un-occluded items obtained using the ratio images. Note that only the shadow region corresponding to the red crayon contains closed depth edge contours. Thus, filling depth edges inside shadow regions correctly outputs the red crayon as the un-occluded item. Matlab source code and input images for this example are included in the supplementary materials.
T-junctions at the intersection of two objects (Fig. 1) can also be handled with this tracing method by always tracing along the rightmost boundary at junctions. In Fig. 1, at the intersection of A and B, the above condition will be satisfied for A but not for B, identifying A as an un-occluded item. Instead of tracing depth edges, which might be cumbersome, we propose a simple equivalent implementation using shadow edges. The shadow edges segment the image into regions. For any un-occluded object, the shadow region should contain the entire depth edge contour of that object. For example, in Fig. 1, shadow region 1 contains the entire depth edge contour of object A. However, for occluded objects such as B, the shadow edge cuts through the depth edge; for shadow region 2, the depth edges inside that region do not form a closed contour. Thus, to find un-occluded items, we simply region-fill the depth edges inside each shadow region. For occluded objects, since the depth edges inside the shadow regions are not complete, they will not get filled. In Fig. 3, the shadow edges form five regions, as shown in the last row. Only the depth edges in the shadow region corresponding to the red crayon form a closed contour. Thus, the red crayon is correctly identified as an un-occluded item. The supplementary materials include Matlab source code and input images for this example.
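To make the two steps concrete, here is a heavily simplified NumPy/SciPy sketch of the ratio-image computation and of the region-filling test. The fixed threshold and the one-pixel transition test are crude stand-ins for the edge detection of [15], and the per-flash direction vectors are an assumption of the sketch.

```python
import numpy as np
from scipy import ndimage

def depth_and_shadow_edges(flash_images, flash_dirs, shadow_thresh=0.7):
    """flash_images: list of K images; flash_dirs: per-flash pixel offsets
    (dy, dx) pointing towards where that flash casts its shadows."""
    I = np.stack([np.asarray(f, dtype=float) for f in flash_images])
    I_max = I.max(axis=0) + 1e-8                 # shadow-free max composite
    depth_edges = np.zeros(I_max.shape, dtype=bool)
    shadow_edges = np.zeros(I_max.shape, dtype=bool)
    for img, (dy, dx) in zip(I, flash_dirs):
        shadow = (img / I_max) < shadow_thresh   # low ratio = shadow pixel
        ahead = np.roll(shadow, shift=(-dy, -dx), axis=(0, 1))  # shadow at p + dir
        depth_edges |= (~shadow) & ahead         # lit-to-shadow transition
        shadow_edges |= shadow & ~ahead          # shadow-to-background transition
    return depth_edges, shadow_edges

def unoccluded_items(depth_edges, shadow_edges):
    """Region-fill the depth edges inside every shadow region (Sect. 2):
    only closed depth-edge contours fill, yielding the un-occluded items."""
    regions, n = ndimage.label(~shadow_edges)
    items = np.zeros_like(depth_edges)
    for r in range(1, n + 1):
        contour = depth_edges & (regions == r)
        filled = ndimage.binary_fill_holes(contour)
        items |= filled & ~contour               # keep the filled interior only
    return items
```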
3 Practical Configurations

In this section, we analyze common scenes which give rise to more complex shadow configurations, such as objects with self-shadows, objects with holes and specular objects. We also analyze two failure cases involving self-occlusions and mutual occlusions.
Fig. 4. Self-shadows. The scene consists of a rabbit shaped object A on top of another object B. Part 'R' of object A casts shadow on itself, as evident from the ratio image corresponding to the right flash. This leads to extra depth and shadow edges, as shown in the third image. If these extra edges do not form closed contours, erroneous shadow regions are not obtained. Note that the shadow region corresponding to the rabbit still contains the closed contour corresponding to the outer boundary of object A. (Panels, left to right: scene, right flash ratio image, depth and shadow edges, shadow regions, shadow region 1, depth edges inside shadow region 1, filled depth edges: un-occluded item.)
Self-shadows: We consider self-shadows to be those shadows of an object which fall on the object itself. Fig. 4 shows an example where the part 'R' of the rabbit shaped object A casts a shadow on itself (part 'R' is a slanted piece whose one side is attached to object A). The self-shadows lead to extra depth and shadow edges. These extra edges can be ignored by our algorithm if they do not form closed contours, or do not cut through the outer boundary of the object. Note that the shadow edges lead to five shadow regions, which would also have been obtained if the self-shadows were not present. By filling in the depth edges inside shadow region 1, we can identify object A as an un-occluded item. In Sect. 3.1, we show that when the extra depth edges due to self-shadows form closed contours with other depth edges, the entire object is not identified as an un-occluded item.

Object with Holes: Our algorithm can handle the challenging case of objects with holes. A common scenario is shown in Fig. 5. Although the depth edges are the same in the two cases, the cast shadows are different. In the two-spheres case, the upper sphere casts shadows on the lower sphere, and thus only the upper sphere will be considered as the un-occluded item. In the doughnut case, note that the shadow cast by the inner region does not contain any depth edges, and hence will be ignored. The shadow cast by the outer region contains both depth edge contours. If the un-occluded item is obtained by filling the depth edges inside the outer shadow region as before, the doughnut hole will also get filled. We can remove the holes by ignoring those filled regions that contain a complete shadow edge contour. The inner filled region (in green) contains the complete shadow edge contour (in orange) due to the inner doughnut boundary, and can be removed.
Fig. 5. Object with holes. Our approach can recover a doughnut shaped object using cast shadow information. (Rows: doughnut, two spheres; columns: cast shadows, depth & shadow edges, shadow region, un-occluded item.)
Specularities and Specular Objects: Specular highlights on objects are a common problem for vision algorithms, as they tend to saturate and are view dependent. In the case of specular highlights, the active illumination approach for finding depth edges results in spurious depth edges [15,25]. We show that, similarly to the self-shadowing case above, our approach can ignore the effect of specular highlights if the spurious depth edges due to specularities do not form closed contours. For example, in Fig. 3, the specularities on the green and the red crayon result in spurious depth edges inside the crayons. But since these edges do not intersect the true depth edges and do not form closed regions, they can be ignored while filling in the true depth edges inside shadow regions.
Fig. 6. Handling specular objects. Using our method, depth edges for specular objects can also be obtained. If spurious depth edges due to specularities do not form closed contours, specular objects can be handled. (Columns: scene, depth and shadow edges, shadow regions, un-occluded items.)
A more general case of a scene having a specular object is shown in Fig. 6. An important point to note is that while specularities may result in spurious depth edges, the true depth edges even for a specular object are obtained by our technique. This is different from other techniques such as stereo/photometric stereo where the estimation is completely incorrect for specular objects. Note that in Fig. 6, the outer depth edges for the specular object are obtained. The shadow edges results in four regions. Once again, by filling in the depth edges in shadow regions, the un-occluded specular object can be recovered. Only the shadow region corresponding to the specular object has closed depth edges, as other objects are shadowed by the specular object.
3.1 Failure Cases

Two important failure cases are described below.

Self-Occlusions: The first case corresponds to self-occlusions such that the depth edges due to the self-occlusion form closed regions with the outer depth edges of the object. Fig. 7 shows an example. Note that the part of the object which is occluded by the object itself cannot be recovered.
Fig. 7. Failure cases. (Top row) Self-occlusions. The scene consists of a single pipe which occludes itself. The shadow edges result in three shadow regions. Only region 3 has closed depth edge contours. However, filling in the depth edges inside region 3, followed by hole removal, only recovers the un-occluded part of the object as the un-occluded item, instead of the entire object. (Bottom row) Mutual occlusions. The scene consists of two mutually occluding pipes. The shadow edges give rise to five regions. However, none of the shadow regions contains complete depth edge contours, as each depth edge is intersected by some shadow edge. Thus, the output will be zero un-occluded items.
Mutual Occlusions: The second failure case corresponds to mutual occlusions, where object A occludes object B but is also occluded by object B at the same time. Fig. 7 shows such a scenario for a scene containing two pipes. For this scene, neither of the two pipes, nor any part of them, will be segmented as an un-occluded item.
4 Handling Missing Depth and Shadow Edges

In the previous section, we showed that if we have complete depth and shadow edge information, we can reliably segment un-occluded items in the scene. However, in some cases complete depth/shadow edges are not obtained, due to noise or dark surfaces. If shadow edges are missing, correct shadow regions will not be obtained and we cannot use the previous approach of filling depth edges within the shadow regions. We now describe an extension to handle such cases. Our approach first tries to complete the depth edges by segmenting the pseudo-depth map [17,15] of the scene. We seek an over-segmentation in this step so that all missing depth edges are accounted for, although this may result in extra regions. We then verify each segmented region for occlusion by checking whether any shadow falls in that region.
Fig. 8. Handling missing edges. In a complex scene with several objects, depth and shadow edges may be missing (pointed by white arrows). We first compute the pseudo-depth map of the scene. We then complete the depth edges by segmenting the pseudo-depth map. Each segmented region is then checked for occlusions using shadow information. All regions intersecting with shadow edges are removed to obtain un-occluded items. (Panels: scene, edges, pseudo-depth map, segmented, un-occluded items.)
Fig. 8 shows a complex scene with several objects. The extracted depth edges have gaps, as shown. A pseudo-depth map of the scene is computed by assigning horizontal/vertical gradients to each depth edge pixel, according to the direction of the light source. The magnitude of the gradient is set proportional to the width of the shadow at that pixel [17]. The gradients at all other pixels are set to zero. The pseudo-depth map is obtained by integrating the resulting 2D gradient field, solving a Poisson equation. We segment the pseudo-depth map using EDISON [3]. The resulting pseudo-depth map and its segmentation are also shown in Fig. 8. Note that all the missing depth edges are completed, but the segmented pseudo-depth map has extra regions. The final step consists of checking each region for occlusions. If we draw a line from any point inside an un-occluded object to a point outside the object, it should intersect a depth edge before intersecting a shadow edge. For an occluded object, since a shadow falls inside the object, such a line may intersect a shadow edge first. For example, in Fig. 1, any line drawn from the inside of object A to the outside will intersect a depth edge (green) first, whereas certain lines drawn from the inside of object B to the outside will intersect a shadow edge (red) first. Thus, for each segmented region, we draw lines from inside the region at several different angles, and count the number of intersections with a shadow edge that occur before an intersection with a depth edge. If this count is greater than some threshold, the region is declared occluded. Fig. 8 shows that all occluded regions were successfully eliminated. The starting points of these lines are taken on the medial axis of each region, to handle general regions with concavities.
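A compact sketch of the gradient-field integration is shown below (NumPy only); the FFT solver assumes periodic boundaries, which is a simplification relative to the solver implied by [17], and the function name is hypothetical. Here gx and gy would be zero everywhere except at depth-edge pixels, where they are set according to the light direction and the local shadow width, as described above.

```python
import numpy as np

def integrate_gradients(gx, gy):
    """Recover a pseudo-depth map u whose gradient best matches (gx, gy) in the
    least-squares sense, by solving the Poisson equation lap(u) = div(gx, gy)
    with an FFT (periodic boundary conditions assumed)."""
    # backward-difference divergence, paired with forward-difference gradients
    div = (gx - np.roll(gx, 1, axis=1)) + (gy - np.roll(gy, 1, axis=0))
    H, W = div.shape
    wy = 2.0 * np.cos(2.0 * np.pi * np.fft.fftfreq(H))[:, None]
    wx = 2.0 * np.cos(2.0 * np.pi * np.fft.fftfreq(W))[None, :]
    denom = wy + wx - 4.0                 # eigenvalues of the 5-point Laplacian
    denom[0, 0] = 1.0                     # avoid 0/0 for the constant mode
    F = np.fft.fft2(div)
    F[0, 0] = 0.0                         # pin the free additive constant
    return np.real(np.fft.ifft2(F / denom))
```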
5 Discussion

Several improvements to our approach are possible. Better region filling approaches could handle cases where only a few pixels are missing from the depth or shadow edges. A gradient based analysis could be used to remove spurious depth edges due to specularities [25]. Since depth edges are view dependent, the labeling of scene parts as un-occluded items is also view dependent. Higher level information could be combined for an object-based interpretation.
Limitations: Our approach shares the limitations described in [15] for finding depth edges. These include dark surfaces/backgrounds, and shadows detached from the objects due to a large baseline between the LEDs and the camera or due to thin objects. Our scheme works better on curved objects than on polyhedral objects. The depth edge at the intersection of polyhedral objects may convert into a concave/convex edge depending on the viewpoint, and thus may not be obtained.

Conclusions: We have proposed a simple and practical approach to segment un-occluded items in a scene using cast shadows, by analyzing the resulting depth and shadow edges. A depth map of the scene is not required, and our approach can handle complex scenes with specularities, specular objects, self-shadows and objects with holes. We showed several real examples using our approach and analyzed the failure cases, including self-occlusions and mutual occlusions. To handle missing depth and shadow edges, we proposed an extension based on segmenting the scene using the pseudo-depth map and analyzing each region for occlusions. We believe that our approach could serve as a pre-processing stage for several vision tasks, including bin-picking, 3D pose estimation and object recognition.
References 1. Shapiro, L.G., Stockman, G.C.: Computer Vision. Prentice-Hall, Englewood Cliffs (2001) 2. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Machine Intell 22(8), 888–905 (2000) 3. Christoudias, C.M., Georgescu, B., Meer, P.: Synergism in low level vision. In: Proc. Int’l Conf. Pattern Recognition, vol. IV, pp. 150–155 (2002) 4. Comaniciu, D., Meer, P.: Mean shift: A robust approach toward feature space analysis. IEEE Trans. Pattern Anal. Machine Intell. 24(5), 603–619 (2002) 5. Hoover, A., Jean-Baptiste, G., Jiang, X., Flynn, P., Bunke, H., Goldgof, D., Bowyer, K., Eggert, D., Fitzgibbon, A., Fisher, R.: An experimental comparison of range image segmentation algorithms. IEEE Trans. Pattern Anal. Machine Intell. 18(7), 673–689 (1996) 6. Yim, C., Bovik, A.: Multiresolution 3-D range segmentation using focus cues. IEEE Trans. Image Processing 7(9), 1283–1299 (1998) 7. Matusik, W., Buehler, C., Raskar, R., Gortler, S.J., McMillan, L.: Image-based visual hulls. In: SIGGRAPH, pp. 369–374 (2000) 8. Cheung, K.M.G.: Visual hull construction, alignment and refinement for human kinematic modeling, motion tracking and rendering. PhD thesis, CMU (2003) 9. Brand, M., Kang, K., Cooper, D.: Algebraic solution for the visual hull. In: Proc. Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 30–35 (2004) 10. Franco, J., Boyer, E.: Exact polyhedral visual hulls. In: Proc. Fourteenth British Machine Vision Conference, pp. 329–338 (2003) 11. Laurentini, A.: The visual hull concept for the silhouette-based image understanding. IEEE Trans. Pattern Anal. Machine Intell. 16, 150–162 (1994) 12. Nayar, S., Watanabe, M., Noguchi, M.: Real-time focus range sensor. IEEE Trans. Pattern Anal. Machine Intell. 18, 1186–1198 (1995) 13. Scharstein, D., Szeliski, R.: High-accuracy stereo depth maps using structured light. In: Proc. Conf. Computer Vision and Pattern Recognition (2003) 14. Zhang, L., Snavely, N., Curless, B., Seitz, S.M.: Spacetime faces: high resolution capture for modeling and animation. ACM Trans. Graph 23, 548–558 (2004)
15. Raskar, R., Tan, K.H., Feris, R., Yu, J., Turk, M.: Non-photorealistic camera: depth edge detection and stylized rendering using multi-flash imaging. ACM Trans. Graph. 23(3), 679– 688 (2004) 16. Crispell, D., Lanman, D., Sibley, P.G., Zhao, Y., Taubin, G.: Beyond silhouettes: Surface reconstruction using multi-flash photography. In: Third International Symposium on 3D Data Processing, Visualization, and Transmission, pp. 405–412 (2006) 17. Feris, R., Raskar, R., Chen, L., Tan, K.H., Turk, M.: Discontinuity preserving stereo with small baseline multi-flash illumination. In: Proc. Int’l Conf. Computer Vision, vol. 1, pp. 412–419 (2005) 18. Feris, R., Turk, M., Raskar, R., Tan, K., Ohashi, G.: Exploiting depth discontinuities for vision-based fingerspelling recognition. In: IEEE Workshop on Real-Time Vision for Human-Computer Interaction, IEEE Computer Society Press, Los Alamitos (2004) 19. Koh, T.K., Miles, N., Morgan, S., Hayes-Gill, B.: Image segmentation of overlapping particles in automatic size analysis using multi-flash imaging. In: WACV 2007. Proc. Eighth IEEE Workshop on Applications of Computer Vision, IEEE Computer Society Press, Los Alamitos (2007) 20. Barrow, H., Tenenbaum, J.: Interpreting line drawings as three-dimensional surfaces. Artificial Intelligence 17, 75–116 (1981) 21. Huffman, D.A.: Impossible objects as nonsense sentences. In: Melzer, B., Michie, D. (eds.) Machine Intelligence, vol. 6, pp. 295–323. Edinburgh University Press (1971) 22. Waltz, D.: Understanding line drawings of scenes with shadows. In: Winston, P. (ed.) The Psychology of Computer Vision, pp. 19–91. McGraw-Hill, New York (1975) 23. Malik, J.: Interpreting line drawings of curved objects. Int’l J. Computer Vision 1, 73–103 (1987) 24. Marr, D.: Analysis of occluding contour. Technical Report ADA034010, MIT (1976) 25. Feris, R., Raskar, R., Tan, K.H., Turk, M.: Specular reflection reduction with multi-flash imaging. In: SIBGRAPI, pp. 316–321 (2004)
A Local Probabilistic Prior-Based Active Contour Model for Brain MR Image Segmentation
Jundong Liu 1, Charles Smith 2, and Hima Chebrolu 2
1 School of Electrical Engineering and Computer Science, Ohio University, Athens, OH
2 Department of Neurology, University of Kentucky, Lexington, KY
Abstract. This paper proposes a probabilistic prior-based active contour model for segmenting human brain MR images. Our model is formulated with the maximum a posteriori (MAP) principle and implemented under the level set framework. A probabilistic atlas for the structure of interest, e.g., cortical gray matter or the caudate nucleus, can be seamlessly integrated into the level set evolution procedure to provide crucial guidance in accurately capturing the target. Unlike other region-based active contour models, our solution uses locally varying Gaussians to account for intensity inhomogeneity, so the local variations present in many MR images are better handled. Experiments on whole-brain as well as caudate segmentation demonstrate the improvement made by our model.
1 Introduction
Magnetic resonance images (MRI) of the brain are very important tools in diagnosing and treating various neurodegenerative diseases, including Alzheimer's disease (AD), Parkinson's disease (PD), and multiple sclerosis. Segmentation of the whole brain as well as the subcortical structures from MR images is a critical and fundamental task if 3D MRI data are to be effectively utilized for disease diagnosis and treatment. Numerous segmentation solutions have been proposed in the literature. Among them, region-based active contour models [17,2,3,11,14,5] have recently gained great popularity, mainly due to their demonstrated segmentation robustness. The Chan-Vese piecewise-constant model [2], commonly known as the active contour without edges model, adopts a stopping term based on a simplified version of the Mumford-Shah functional and has the ability to detect object boundaries with or without gradient. Although impressive experimental results have been reported for this model and its variants [11,5] in various applications, several common drawbacks and limitations have to be addressed when they are applied to brain MRI segmentation.
Firstly, in these models, a mixture of global Gaussians (piecewise-constant can be regarded as the degenerate case) has been used as a convenient assumption for modeling the intensity distribution, and global means are employed to discriminate regions from each other. However, "homogeneous regions with distinct means" is rarely an accurate description of brain MRIs, especially before the bias field is removed. Secondly, spatial distribution priors, which are often available and are used extensively in histogram-based models, are normally neglected in region-based active contour models. For whole-brain segmentation, prior knowledge about the organ's location is sometimes a helpful resource for separating certain tissue types from their surroundings. For subcortical segmentation, atlases constructed from training sets are often an indispensable part of defining the structure of interest.
1.1 Our Proposed Solution
This paper proposes a fully automatic whole-brain and subcortical structure segmentation solution. The model consists of two major components: a local likelihood based active contour (LLAC) model and a guiding probabilistic atlas. The former can be regarded as a bridging solution between the Chan-Vese piecewise-constant model [2] and the Chan-Vese [3] (Tsai-Yezzi [14]) piecewise-smooth models; the latter specifies which structure is to be captured. Formulated under a Bayesian a posteriori probability framework, our LLAC model can seamlessly integrate the probabilistic atlas information into the level set evolution procedure. In addition, it relaxes the Gaussian assumption made in many region-based active contour models from "global" to "local", and local means are used as the area representatives. Being better able to account for intensity inhomogeneity, the LLAC model can bring out structures of interest that have low contrast with the surrounding tissues.
2 Methods
Let C be an evolving curve in Ω. Cin denotes the region enclosed by C and Cout denotes the region outside of C. The Chan-Vese (two-phase) piecewise-constant model minimizes the energy functional

F(c1, c2, C) = μ · Length(C) + λ1 ∫_{Cin} |u0 − c1|² dxdy + λ2 ∫_{Cout} |u0 − c2|² dxdy
where c1 and c2 are the averages of u0 inside and outside C, respectively. Note that both c1 and c2 are global values, computed over the entire image, and a global Gaussian distribution has been assumed for each individual class. This assumption, however, is not an accurate depiction of the local image profile of many medical images, including brain MRIs. Piecewise-smooth models [3,14] provide a solution to the intensity variability problem: gradual intensity changes can potentially be handled with [3,14]. However, the high computational cost and the sensitivity to curve initialization pose a barrier to practical applications.
2.1 Our Local Likelihood-Based Active Contour (LLAC) Model
Let S = {in, out} be the two classes of a two-phase model. The probability of pixel (x, y) belonging to in and out is denoted by P(in|(x, y)) and P(out|(x, y)), respectively. Let Pr(in) and Pr(out) be the class prior probabilities at (x, y). Then,

P(in | u0(x, y)) = Pr(in(x,y)) P(u0(x, y) | in) / P(B)
P(out | u0(x, y)) = Pr(out(x,y)) P(u0(x, y) | out) / P(B)
where P(u0(x, y) | in) is the likelihood that a voxel in class in has intensity u0(x, y), and P(B) is a constant. The maximum a posteriori segmentation is achieved when the product, over the image domain, of P(in | u0(x, y)) for pixels inside C and P(out | u0(x, y)) for pixels outside C is maximized. Taking the logarithm, the maximization can be reduced to the minimization of the following energy:

F(C) = μ · Length(C) − ∫_{Cin} log( Pr(in) P(u0(x, y) | in) ) dxdy − ∫_{Cout} log( Pr(out) P(u0(x, y) | out) ) dxdy
Note that our overall model is similar to [11,12], but the setup of the likelihood term is different, as explained next.

Spatial Priors for Whole-Brain Segmentation, Pr(in) and Pr(out): A widely used whole-brain tissue distribution prior is provided by the Montreal Neurological Institute [7]. The MNI prior consists of three probability images containing values in the range of zero to one, representing the prior probability of a voxel being GM, WM or CSF after an image has been normalized to the same space (see Fig. 1). In this paper we are particularly interested in extracting sub-cortical GM, so for demonstration purposes we take the GM and WM prior images as Pr(in) and Pr(out), respectively. For these prior images to be applied, a registration is needed to align the prior with the input image; we used the affine registration routine provided by SPM [13] in all the 3D experiments of this paper.

Spatial Priors for Caudate Segmentation: In this paper we take the caudate as an example to show how our method can be used for subcortical segmentation; the proposed method would, in principle, work for other sub-cortical structures as well. The distribution prior used for the caudate is constructed from 18 T1-weighted MR data sets downloaded from the Internet Brain Segmentation Repository (IBSR) at Massachusetts General Hospital. Each data set contains a whole-brain MRI together with an expert manual segmentation of 43 individual structures (1.5 mm slice thickness). Out of the 18 data sets, the first 9 are used for constructing the distribution atlas, and the other 9 are used as testing cases to evaluate the accuracy of our segmentation model.
Fig. 1. Spatial prior probability images of CSF, GM and WM
To construct the distribution atlas, the 9 training caudate segmentations need to be put into a standard space. The template brain provided by SPM2 [13], obtained from 152 brains at the Montreal Neurological Institute, is used as the standard space. Let fi (1 ≤ i ≤ N = 9) be one of the training images, let si denote the extracted caudate segmentation, and let r denote the standard template. The probabilistic caudate atlas is constructed as follows:
1. For each training image fi in the IBSR data sets, map it to the standard template r using SPM2's normalization routine. A 12-parameter affine transformation is estimated first, followed by a nonlinear warping based on a linear combination of discrete cosine transform (DCT) basis functions. The resulting transformation is denoted Ti.
2. Apply Ti to si to obtain a transformed caudate segmentation si'.
3. Sum the si' in the standard space to get ss.
4. The prior distribution is then obtained as Pr(caudate) = ss / N and Pr(noncaudate) = 1 − Pr(caudate).
When we segment the caudate of a testing image k, an affine transformation from the standard template r to k is estimated using SPM2, and the obtained transformation is applied to the distribution atlas Pr(in) = Pr(caudate) and Pr(out) = Pr(noncaudate) to bring the prior images into alignment with the testing image k. Figure 2 shows a zoom-in view of the atlas (distribution prior image) constructed from the 9 IBSR data sets, viewed from three different axes.
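Steps 3 and 4 amount to a voxel-wise average of the normalized binary segmentations. A minimal numpy sketch of that averaging step is given below; the function name is ours, and the SPM2 normalization of steps 1 and 2 is assumed to have been applied to the masks beforehand.

import numpy as np

def build_caudate_atlas(warped_masks):
    # warped_masks: list of N binary caudate masks si', already mapped into
    # the standard template space (e.g., with SPM2's normalization routine).
    stack = np.stack([m.astype(np.float32) for m in warped_masks])
    pr_caudate = stack.sum(axis=0) / len(warped_masks)   # ss / N
    return pr_caudate, 1.0 - pr_caudate                  # Pr(caudate), Pr(noncaudate)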
Fig. 2. Spatial prior probability image of the caudate, viewed from three axes, constructed from 9 IBSR data sets
Likelihood Terms P(u0(x, y)|in) and P(u0(x, y)|out): Global Gaussians are commonly assumed in many region-based active contour models to model the intensity distribution, but they are often not an accurate description of the local image profile, especially when intensity inhomogeneity is present. A remedy is to relax the global Gaussian mixture assumption and take local intensity variations into consideration. More specifically, local Gaussians (piecewise-constant is the degenerate case) should be used as a better approximation to model the vicinity of each voxel. In the Chan-Vese model, two global means c1 and c2 are computed for Cin and Cout. In our approach, we introduce two functions v1(x, y) and v2(x, y), both defined on the image domain, to represent the mean values of the local pixels inside and outside the moving curve. By "local", we mean that only neighboring pixels are considered. A simple implementation of the neighborhood is a rectangular window W(x, y) of size (2k + 1) × (2k + 1), where k is a constant integer. Therefore,

v1(x, y) = mean( u0 ∈ (Cin ∩ W(x, y)) ),   v2(x, y) = mean( u0 ∈ (Cout ∩ W(x, y)) )

With this new setup, our segmentation model is updated as the minimization of the following energy:

F(v1, v2, C) = μ · Length(C) − ∫_{Cin} [ log(Pr(in)) − log(σ1) − (u0 − v1)²/(2σ1²) ] dxdy − ∫_{Cout} [ log(Pr(out)) − log(σ2) − (u0 − v2)²/(2σ2²) ] dxdy

The variances σ1 and σ2 should also be defined and estimated locally. However, because local variance estimation tends to be very unstable, we use global variances (for the pixels in Cin and Cout) as a uniform approximation.
2.2 Level Set Framework and Gradient Flow
Using the Heaviside function H and the one-dimensional Dirac measure δ [2], the energy functional F(v1, v2, C) can be minimized under the level set framework, where the update is carried out on the level set function φ. Parameterizing the descent direction by an artificial time t ≥ 0, the gradient flow for φ(t, x, y) is obtained from the associated Euler-Lagrange equation as

∂φ/∂t = sign(v1 − v2) · δ(φ) [ μ div(∇φ/|∇φ|) + log(Pr(in)/Pr(out)) + log(σ2/σ1) − (u0 − v1)²/(2σ1²) + (u0 − v2)²/(2σ2²) ],   φ(0, x, y) = φ0(x, y)     (1)
where φ0 is the level set function of the initial contour. This gradient flow is the evolution equation of the level set function in our proposed method. Correspondingly, v1 and v2 are computed as

v1 = ((u0 ∗ H(φ)) ⊗ W) / (H(φ) ⊗ W),   v2 = ((u0 ∗ (1 − H(φ))) ⊗ W) / ((1 − H(φ)) ⊗ W)     (2)
where ⊗ is the convolution operator. Note that the Chan-Vese model can be regarded as a special case of our model, obtained when the window W is made infinitely large. The sign(v1 − v2) term in Eq. (1) is designed to avoid an undesired curve evolution phenomenon that we call local twist: when local twist happens, multiple components of the same class may evolve to opposite sides of φ and therefore be labeled with different classes. sign(v1 − v2) is a simple yet effective way to prevent this from happening; more details can be found in [10]. In practice, the Heaviside function H and the Dirac function δ in Eq. (1) have to be approximated by smoothed versions; we adopt the H2 and δ2 used in [2]. For all the experiments conducted in this paper, we set the size of the window W to 21 × 21.
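Because the window W enters only through the convolutions of Eq. (2), the local means and a single evolution step of Eq. (1) can be implemented with separable box filters. The sketch below is an illustration under stated assumptions rather than the authors' code: the smoothed Heaviside/Dirac pair with ε = 1, the global standard deviations, and the pointwise use of sign(v1 − v2) follow our reading of the text, and the helper names (local_means, evolve_step) are hypothetical.

import numpy as np
from scipy.ndimage import uniform_filter

def local_means(u0, phi, k=10, eps=1e-8):
    # Eq. (2): local inside/outside means with a (2k+1) x (2k+1) box window W.
    H = 0.5 * (1.0 + (2.0 / np.pi) * np.arctan(phi))     # smoothed Heaviside H2 (eps = 1)
    size = 2 * k + 1
    v1 = uniform_filter(u0 * H, size) / (uniform_filter(H, size) + eps)
    v2 = uniform_filter(u0 * (1.0 - H), size) / (uniform_filter(1.0 - H, size) + eps)
    return v1, v2

def curvature(phi, eps=1e-8):
    # div( grad(phi) / |grad(phi)| )
    gy, gx = np.gradient(phi)
    mag = np.sqrt(gx ** 2 + gy ** 2) + eps
    return np.gradient(gx / mag, axis=1) + np.gradient(gy / mag, axis=0)

def evolve_step(phi, u0, pr_in, pr_out, mu=0.2, dt=0.5, k=10, eps=1e-8):
    # One explicit update of Eq. (1), with global variances as in the paper.
    v1, v2 = local_means(u0, phi, k)
    s1 = u0[phi > 0].std() + eps
    s2 = u0[phi <= 0].std() + eps
    delta = 1.0 / (np.pi * (1.0 + phi ** 2))              # smoothed Dirac delta2 (eps = 1)
    data = (np.log((pr_in + eps) / (pr_out + eps)) + np.log(s2 / s1)
            - (u0 - v1) ** 2 / (2 * s1 ** 2) + (u0 - v2) ** 2 / (2 * s2 ** 2))
    force = np.sign(v1 - v2) * delta * (mu * curvature(phi) + data)
    return phi + dt * force

With k = 10 the window is 21 x 21, matching the setting used in the experiments; the same routine degenerates to a Chan-Vese-like update as the window size grows.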
3 Results and Discussions
The first experiment is based on a 3T MR image, shown in Fig. 3, before the bias field is removed. Due to the bias field, this image strongly violates the global Gaussian/mean assumption, so traditional region-based approaches, including the Chan-Vese model, are expected to fail. Fig. 3 shows the result of the Chan-Vese model (left column) and that of our local model (right column); three snapshots of each run are provided. As is evident, the Chan-Vese model has trouble capturing the GM area in the top-left and bottom-right corners, while our model separates the two tissues very accurately. The second experiment is conducted on a low-resolution 1.5T MR image. We compared our solution with SPM [13] and the Chan-Vese model. Fig. 4 shows a single-slice result from all three methods: Fig. 4a is the input image, and 4b, 4c and 4d are the GM segmentations from SPM, Chan-Vese and our model, respectively. The sub-cortical GM tissues in these images have slightly higher intensity values than cortical GM, so the Chan-Vese model, with its piecewise-constant assumption, mis-classifies quite a portion of the putamen as WM. Our model, on the other hand, clearly separates the putamen and thalamus from the surrounding WM. The comparison for the sub-cortical area is highlighted with a red circle in Fig. 4 (the figures are better seen on screen than in black-and-white print). The spatial distribution prior and the local Gaussians both play a role in achieving this improvement. Compared to SPM, our model has the edge in outlining cleaner cortical GM (highlighted with a blue circle; better seen on screen).
Fig. 3. Segmentation comparison of the Chan-Vese model and our model in handling severe intensity inhomogeneity. First row: three snapshots of the execution of the Chan-Vese model; second row: three snapshots for our model. (The figures are better seen on screen than in print.)
Fig. 4. Input image (a) and the GM segmentation results from SPM (b), Chan-Vese (c) and our model (d)
The last group of experiments is for subcortical structure segmentation, carried out on the remaining 9 IBSR data sets mentioned in Section 2.1. To assess the performance of our algorithm we computed the Dice coefficient between the segmentation obtained from our algorithm and the corresponding ground truth.
Table 1. Dice coefficients for all 9 test cases. The summary row at the end of the table displays the overall average.

IBSR Data Set    Dice Coefficient
Patient 10       0.7611
Patient 11       0.7929
Patient 12       0.7039
Patient 13       0.6928
Patient 14       0.8019
Patient 15       0.7342
Patient 16       0.7503
Patient 17       0.6622
Patient 18       0.8201
Average          0.7466
The Dice coefficient measures the similarity of two sets and ranges from 0 for sets that are disjoint to 1 for sets that are identical; it is a special case of the kappa index used for comparing set similarity. It is defined as

K(S1, S2) = 2 |S1 ∩ S2| / (|S1| + |S2|)     (3)

Table 1 shows the Dice coefficients for all test cases. The results are rather stable across the 9 data sets, with an average value of 0.7466. The accuracy is comparable to the results reported in [16].
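A direct implementation of Eq. (3) for binary masks is straightforward; the helper name below is illustrative, not part of the paper's software.

import numpy as np

def dice_coefficient(seg, gt):
    # Eq. (3): K(S1, S2) = 2 |S1 and S2| / (|S1| + |S2|) for binary masks.
    seg, gt = seg.astype(bool), gt.astype(bool)
    denom = seg.sum() + gt.sum()
    return 2.0 * np.logical_and(seg, gt).sum() / denom if denom else 1.0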
4 Conclusion
In this paper, we propose a brain MRI segmentation algorithm based on a local likelihood oriented active contour model. The LLAC model has the advantage of being able to bring out brain structures that have low contrast with the surrounding tissues. The probabilistic atlas essentially works as a mask to capture the structure of interest, so no thresholding step or threshold value is needed. The accuracy of our model may be further boosted if a shape-based atlas, constructed through PCA, is integrated into the level set framework.
References 1. Leemput, K.V., et al.: Automated model-based tissue classification of MR images of the brain. IEEE Trans. on Medical Imaging 18, 897–908 (1999) 2. Chan, T.F., Vese, L.A.: Active contours without edges. IEEE Trans. on Image Processing 10(2), 266–277 (2001) 3. Chan, T.F., Vese, L.A.: A level set algorithm for minimizing the Mumford-Shah functional in image processing. In: 1st IEEE Workshop on Variational and Level Set Methods in Computer Vision, pp. 161–168 (2001)
4. Cocosco, C.A., et al.: BrainWeb: Online interface to a 3D MRI simulated brain database. Neuroimage 5(4) part 2/4, S245 (1997) 5. Cremers, D., Rousson, M., Deriche, R.: A review of statistical approaches to level set segmentation: integrating color, texture, motion and shape. In: IJCV 6. Yang, J., Tagare, H., Staib, L.H., Duncan, J.S.: Segmentation of 3D Deformable Objects with Level Set Based Prior Models. In: ISBI, pp. 85–88 (2004) 7. Evans, A.C., Collins, D.L., Milner, B.: An MRI-based stereotactic atlas from 250 young normal subjects. Society of Neuroscience Abstrasts 18, 408 (1992) 8. Gao, S., Bui, T.D.: Image Segmentation and Selective Smoothing by Using Mumford-Shah Model. IEEE Transactions on Image Processing 14(10), 1537–1549 (2005) 9. Li, C., Liu, J., Fox, M.D.: Segmentation of Edge Preserving Gradient Vector Flow: An Approach Toward Automatically Initializing and Splitting of Snakes. In: CVPR, vol. 1, pp. 162–167 (2008) 10. Liu, J., Chelberg, D., Smith, C., Chebrolu, H.: Distribution-based Level Set Model for Medical Image Segmentation. In: BMVC 2007. British Machine Vision Conference, Warwick, 10-13 September 2007, UK (2007) 11. Paragios, N., Deriche, R.: Coupled Geodesic Active Regions for Image Segmentation: A Level Set Approach. In: Vernon, D. (ed.) ECCV 2000. LNCS, vol. 1843, pp. 224–240. Springer, Heidelberg (2000) 12. Rousson, M., Deriche, R.: A Variational Framework for Active and Adaptative Segmentation of Vector Valued Images, INRIA Technical Report (2002) 13. Mechelli, A., Price, C.J., Friston, K.J., Ashburner, J.: Voxel-Based Morphometry of the Human Brain: Methods and Applications. Current Medical Imaging Reviews, 105–113 (2005) 14. Tsai, A., Yezzi, A., Wells, W., Tempany, C.: Approach to Curve: Evolution for Segmentation of Medical Imagery. IEEE TMI 22(2), 137–154 (2003) 15. Xu, C., Prince, J.L.: Snakes, Shapes, and Gradient Vector Flow. IEEE Transactions on Image Processing 7(3), 359–369 (1998) 16. Zhou, J., Rajapakse, J.C.: Segmentation of subcortical brain structures using fuzzy templates. NeuroImage 28, 915–924 (2005) 17. Zhu, S., Yuille, A.: Region competition: Unifying snakes, region growing, and bayes/MDL for multiband image segmentation. PAMI 18(9), 884–900 (1996)
Author Index
Abe, Shinji I-292 Agrawal, Amit I-945 Ai, Haizhou I-210 Akama, Ryo I-779 Andreopoulos, Alexander I-385 Aoki, Nobuya I-116 Aptoula, Erchan I-935 Arita, Daisaku I-159 Arth, Clemens II-447 Ashraf, Nazim II-63 Åström, Kalle II-549
Babaguchi, Noboru II-651 Banerjee, Subhashis II-85 Ben Ayed, Ismail I-925 Beveridge, J. Ross II-733 Bigorgne, Erwan II-817 Bischof, Horst I-657, II-447 Bouakaz, Sa¨ıda I-678, I-738 Boyer, Edmond II-166, II-580 Brice˜ no, Hector M. I-678, I-738 Brooks, Michael J. I-853, II-227 Byr¨ od, Martin II-549 Cai, Kangying I-779 Cai, Yinghao I-843 Cannons, Kevin I-532 Cha, Seungwook I-200 Chan, Tung-Jung II-631 Chang, Jen-Mei II-733 Chang, Wen-Yan II-621 Chaudhuri, Subhasis I-240 Chebrolu, Hima I-956 Chen, Chu-Song I-905, II-621 Chen, Ju-Chin II-700 Chen, Qian I-565, I-688 Chen, Tsuhan I-220, II-487, II-662 Chen, Wei I-843 Chen, Wenbin II-53 Chen, Ying I-832 Chen, Yu-Ting I-905 Cheng, Jian II-827 Choi, Inho I-698 Choi, Ouk II-269
Chu, Rufeng II-22 Chu, Wen-Sheng Vincnent II-700 Chun, Seong Soo I-200 Chung, Albert C.S. II-672 Chung, Ronald II-301 Cichowski, Alex I-375 Cipolla, Roberto I-335 Courteille, Fr´ed´eric II-196 Cui, Jinshi I-544 Dailey, Matthew N. I-85 Danafar, Somayeh II-457 Davis, Larry S. I-397, II-404 DeMenthon, Daniel II-404 De Mol, Christine II-881 Destrero, Augusto II-881 Detmold, Henry I-375 Di Stefano, Luigi II-517 Dick, Anthony I-375, I-853 Ding, Yuanyuan I-95 Dinh, Viet Cuong I-200 Doermann, David II-404 Donoser, Michael II-447 Dou, Mingsong II-722 Draper, Bruce II-733 Du, Wei I-365 Du, Weiwei II-590 Durou, Jean-Denis II-196 Ejiri, Masakazu I-35 Eriksson, Anders P. II-796 Fan, Kuo-Chin I-169 Farin, Dirk I-789 Foroosh, Hassan II-63 Frahm, Jan-Michael II-353 Fu, Li-Chen II-124 Fu, Zhouyu I-482, II-134 Fujimura, Kikuo I-408, II-32 Fujiwara, Takayuki II-891 Fujiyoshi, Hironobu I-915, II-806 Fukui, Kazuhiro II-467 Funahashi, Takuma II-891 Furukawa, Ryo II-206, II-847
Gao, Jizhou I-127 Gargallo, Pau II-373, II-784 Geurts, Pierre II-611 Gheissari, Niloofar II-457 Girdziuˇsas, Ram¯ unas I-811 Goel, Dhiraj I-220 Goel, Lakshya II-85 Grabner, Helmut I-657 Grabner, Michael I-657 Guillou, Erwan I-678 Gupta, Ankit II-85 Gupta, Gaurav II-394 Gupta, Sumana II-394 Gurdjos, Pierre II-196 Han, Yufei II-1, II-22 Hancock, Edwin R. II-869 Handel, Holger II-258 Hao, Pengwei II-722 Hao, Ying II-12 Hartley, Richard I-13, I-800, II-279, II-322, II-353 Hasegawa, Tsutomu I-628 Hayes-Gill, Barrie I-945 He, Ran I-54, I-728, II-22 H´eas, Patrick I-864 Hill, Rhys I-375 Hiura, Shinsaku I-149 Honda, Kiyoshi I-85 Hong, Ki-Sang II-497 Horaud, Radu II-166 Horiuchi, Takahiko I-708 Hou, Cong I-210 Hsiao, Pei-Yung II-124 Hsieh, Jun-Wei I-169 Hu, Wei I-832 Hu, Weiming I-821, I-832 Hu, Zhanyi I-472 Hua, Chunsheng I-565 huang, Feiyue II-477 Huang, Guochang I-462 Huang, Kaiqi I-667, I-843 Huang, Liang II-680 Huang, Po-Hao I-106 Huang, Shih-Shinh II-124 Huang, Weimin I-875 Huang, Xinyu I-127 Huang, Yonggang II-690 Hung, Y.S. II-186 Hung, Yi-Ping II-621
Ide, Ichiro II-774 Ijiri, Yoshihisa II-680 Ikeda, Sei II-73 Iketani, Akihiko II-73 Ikeuchi, Katsushi II-289 Imai, Akihiro I-596 Ishikawa, Hiroshi II-537 Itano, Tomoya II-206 Iwata, Sho II-570 Jaeggli, Tobias I-608 Jawahar, C.V. I-586 Je, Changsoo II-507 Ji, Zhengqiao II-363 Jia, Yunde I-512, II-641, II-754 Jiao, Jianbin I-896 Jin, Huidong I-482 Jin, Yuxin I-748 Josephson, Klas II-549 Junejo, Imran N. II-63 Kahl, Fredrik I-13, II-796 Kalra, Prem II-85 Kanade, Takeo I-915, II-806 Kanatani, Kenichi II-311 Kanbara, Masayuki II-73 Katayama, Noriaki I-292 Kato, Takekazu I-688 Kawabata, Satoshi I-149 Kawade, Masato II-680 Kawamoto, Kazuhiko I-555 Kawasaki, Hiroshi II-206, II-847 Khan, Sohaib I-647 Kim, Daijin I-698 Kim, Hansung I-758 Kim, Hyeongwoo II-269 Kim, Jae-Hak II-353 Kim, Jong-Sung II-497 Kim, Tae-Kyun I-335 Kim, Wonsik II-560 Kirby, Michael II-733 Kitagawa, Yosuke I-688 Kitahara, Itaru I-758 Klein Gunnewiek, Rene I-789 Kley, Holger II-733 Kogure, Kiyoshi I-758 Koh, Tze K. I-945 Koller-Meier, Esther I-608 Kondo, Kazuaki I-544 Korica-Pehserl, Petra I-657
Author Index Koshimizu, Hiroyasu II-891 Kounoike, Yuusuke II-424 Kozuka, Kazuki II-342 Kuijper, Arjan I-230 Kumano, Shiro I-324 Kumar, Anand I-586 Kumar, Pankaj I-853 Kuo, Chen-Hui II-631 Kurazume, Ryo I-628 Kushal, Avanish II-85 Kweon, In So II-269 Laaksonen, Jorma I-811 Lai, Shang-Hong I-106, I-638 Lambert, Peter I-251 Langer, Michael I-271, II-858 Lao, Shihong I-210, II-680 Lau, W.S. II-186 Lee, Jiann-Der II-631 Lee, Kwang Hee II-507 Lee, Kyoung Mu II-560 Lee, Sang Wook II-507 Lee, Wonwoo II-580 Lef`evre, S´ebastien I-935 Lei, Zhen I-54, II-22 Lenz, Reiner II-744 Li, Baoxin II-155 Li, Heping I-472 Li, Hongdong I-800, II-227 Li, Jiun-Jie I-169 Li, Jun II-722 Li, Ping I-789 Li, Stan Z. I-54, I-728, II-22 Li, Zhenglong II-827 Li, Zhiguo II-901 Liang, Jia I-512, II-754 Liao, ShengCai I-54 Liao, Shu II-672 Lien, Jenn-Jier James I-261, I-314, I-885, II-96, II-700 Lim, Ser-Nam I-397 Lin, Shouxun II-106 Lin, Zhe II-404 Lina II-774 Liu, Chunxiao I-282 Liu, Fuqiang I-355 Liu, Jundong I-956 Liu, Nianjun I-482 Liu, Qingshan II-827, II-901 Liu, Wenyu I-282
Liu, Xiaoming II-662 Liu, Yuncai I-419 Loke, Eng Hui I-430 Lu, Fangfang II-134, II-279 Lu, Hanqing II-827 Lubin, Jeffrey II-414 Lui, Shu-Fan II-96 Luo, Guan I-821 Ma, Yong II-680 Maeda, Eisaku I-324 Mahmood, Arif I-647 Makhanov, Stanislav I-85 Makihara, Yasushi I-452 Manmatha, R. I-586 Mao, Hsi-Shu II-96 Mar´ee, Rapha¨el II-611 Marikhu, Ramesh I-85 Martens, Ga¨etan I-251 Matas, Jiˇr´ı II-236 Mattoccia, Stefano II-517 Maybank, Steve I-821 McCloskey, Scott I-271, II-858 Mekada, Yoshito II-774 Mekuz, Nathan I-492 ´ M´emin, Etienne I-864 Metaxas, Dimitris II-901 Meyer, Alexandre I-738 Michoud, Brice I-678 Miˇcuˇs´ık, Branislav I-65 Miles, Nicholas I-945 Mitiche, Amar I-925 Mittal, Anurag I-397 Mogi, Kenji II-528 Morgan, Steve I-945 Mori, Akihiro I-628 Morisaka, Akihiko II-206 Mu, Yadong II-837 Mudenagudi, Uma II-85 Mukaigawa, Yasuhiro I-544, II-246 Mukerjee, Amitabha II-394 Murai, Yasuhiro I-915 Murase, Hiroshi II-774 Nagahashi, Tomoyuki II-806 Nakajima, Noboru II-73 Nakasone, Yoshiki II-528 Nakazawa, Atsushi I-618 Nalin Pradeep, S. I-522, II-116 Niranjan, Shobhit II-394 Nomiya, Hiroki I-502
Odone, Francesca II-881 Ohara, Masatoshi I-292 Ohta, Naoya II-528 Ohtera, Ryo I-708 Okutomi, Masatoshi II-176 Okutomoi, Masatoshi II-384 Olsson, Carl II-796 Ong, S.H. I-875 Otsuka, Kazuhiro I-324 Pagani, Alain I-769 Paluri, Balamanohar I-522, II-116 Papadakis, Nicolas I-864 Parikh, Devi II-487 Park, Joonyoung II-560 Pehserl, Joachim I-657 Pele, Ofir II-435 Peng, Yuxin I-748 Peterson, Chris II-733 Pham, Nam Trung I-875 Piater, Justus I-365 Pollefeys, Marc II-353 Poppe, Chris I-251 Prakash, C. I-522, II-116 Pujades, Sergi II-373 Puri, Manika II-414 Radig, Bernd II-332 Rahmati, Mohammad II-217 Raskar, Ramesh I-1, I-945 Raskin, Leonid I-442 Raxle Wang, Chi-Chen I-885 Reid, Ian II-601 Ren, Chunjian II-53 Rivlin, Ehud I-442 Robles-Kelly, Antonio II-134 Rudzsky, Michael I-442 Ryu, Hanjin I-200 Sagawa, Ryusuke I-116 Sakakubara, Shizu II-424 Sakamoto, Ryuuki I-758 Sato, Jun II-342 Sato, Kosuke I-149 Sato, Tomokazu II-73 Sato, Yoichi I-324 Sawhney, Harpreet II-414 Seo, Yongduek II-322 Shah, Hitesh I-240, I-522, II-116 Shahrokni, Ali II-601
Shen, Chunhua II-227 Shen, I-fan I-189, II-53 Shi, Jianbo I-189 Shi, Min II-42 Shi, Yu I-718 Shi, Zhenwei I-180 Shimada, Atsushi I-159 Shimada, Nobutaka I-596 Shimizu, Ikuko II-424 Shimizu, Masao II-176 Shinano, Yuji II-424 Shirai, Yoshiaki I-596 Siddiqi, Kaleem I-271, II-858 Singh, Gajinder II-414 Slobodan, Ili´c I-75 Smith, Charles I-956 Smith, William A.P. II-869 ˇ Sochman, Jan II-236 Song, Gang I-189 Song, Yangqiu I-180 Stricker, Didier I-769 Sturm, Peter II-373, II-784 Sugaya, Yasuyuki II-311 Sugimoto, Shigeki II-384 Sugiura, Kazushige I-452 Sull, Sanghoon I-200 Sumino, Kohei II-246 Sun, Zhenan II-1, II-12 Sung, Ming-Chian I-261 Sze, W.F. II-186 Takahashi, Hidekazu II-384 Takahashi, Tomokazu II-774 Takamatsu, Jun II-289 Takeda, Yuki I-779 Takemura, Haruo I-618 Tan, Huachun II-712 Tan, Tieniu I-667, I-843, II-1, II-12, II-690 Tanaka, Hidenori I-618 Tanaka, Hiromi T. I-779 Tanaka, Tatsuya I-159 Tang, Sheng II-106 Taniguchi, Rin-ichiro I-159, I-628 Tao, Hai I-345 Tao, Linmi I-748 Tarel, Jean-Philippe II-817 Tian, Min I-355 Tombari, Federico II-517 Tominaga, Shoji I-708
Author Index Toriyama, Tomoji I-758 Tsai, Luo-Wei I-169 Tseng, Chien-Chung I-314 Tseng, Yun-Jung I-169 Tsotsos, John K. I-385, I-492 Tsui, Timothy I-718 Uchida, Seiichi I-628 Uehara, Kuniaki I-502 Urahama, Kiichi II-590 Utsumi, Akira I-292 Van de Walle, Rik I-251 van den Hengel, Anton I-375 Van Gool, Luc I-608 Verri, Alessandro II-881 Vincze, Markus I-65 Wada, Toshikazu I-565, I-688 Wan, Cheng II-342 Wang, Fei II-1 Wang, Guanghui II-363 Wang, Junqiu I-576 Wang, Lei I-800, II-145 Wang, Liming I-189 Wang, Te-Hsun I-261 Wang, Xiaolong I-303 Wang, Ying I-667 Wang, Yuanquan I-512, II-754 Wang, Yunhong I-462, II-690 Wehenkel, Louis II-611 Wei, Shou-Der I-638 Werman, Michael II-435 Wildenauer, Horst I-65 Wildes, Richard I-532 Wimmer, Matthias II-332 With, Peter H.N. de I-789 Wong, Ka Yan II-764 Woo, Woontack II-580 Woodford, Oliver II-601 Wu, Fuchao I-472 Wu, Haiyuan I-565, I-688 Wu, Jin-Yi II-96 Wu, Q.M. Jonathan II-363 Wu, Yihong I-472 Wuest, Harald I-769 Xu, Gang II-570 Xu, Guangyou I-748, II-477
Xu, Lijie II-32 Xu, Shuang II-641 Xu, Xinyu II-155 Yagi, Yasushi I-116, I-452, I-544, I-576, II-246 Yamaguchi, Osamu II-467 Yamamoto, Masanobu I-430 Yamato, Junji I-324 Yamazaki, Masaki II-570 Yamazoe, Hirotake I-292 Yang, Ruigang I-127 Yang, Ying II-106 Ye, Qixiang I-896 Yin, Xin I-779 Ying, Xianghua I-138 Yip, Chi Lap II-764 Yokoya, Naokazu II-73 Yu, Hua I-896 Yu, Jingyi I-95 Yu, Xiaoyi II-651 Yuan, Ding II-301 Yuan, Xiaotong I-728 Zaboli, Hamidreza II-217 Zaharescu, Andrei II-166 Zha, Hongbin I-138, I-544 Zhang, Changshui I-180 Zhang, Chao II-722 Zhang, Dan I-180 Zhang, Fan I-282 Zhang, Ke I-482 Zhang, Weiwei I-355 Zhang, Xiaoqin I-821 Zhang, Yongdong II-106 Zhang, Yu-Jin II-712 Zhang, Yuhang I-800 Zhao, Qi I-345 Zhao, Xu I-419 Zhao, Youdong II-641 Zhao, Yuming II-680 Zheng, Bo II-289 Zheng, Jiang Yu I-303, II-42 Zhong, H. II-186 Zhou, Bingfeng II-837 Zhou, Xue I-832 Zhu, Youding I-408