Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis, and J. van Leeuwen
2352
Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Tokyo
Anders Heyden Gunnar Sparr Mads Nielsen Peter Johansen (Eds.)
Computer Vision – ECCV 2002 7th European Conference on Computer Vision Copenhagen, Denmark, May 28-31, 2002 Proceedings, Part III
Series Editors Gerhard Goos, Karlsruhe University, Germany Juris Hartmanis, Cornell University, NY, USA Jan van Leeuwen, Utrecht University, The Netherlands Volume Editors Anders Heyden Gunnar Sparr Lund University, Centre for Mathematical Sciences Box 118, 22100 Lund, Sweden E-mail: {Anders.Heyden,Gunnar.Sparr}@math.lth.se Mads Nielsen The IT University of Copenhagen Glentevej 67-69, 2400 Copenhagen NW, Denmark E-mail:
[email protected] Peter Johansen University of Copenhagen Universitetsparken 1, 2100 Copenhagen, Denmark E-mail:
[email protected]
Cataloging-in-Publication Data applied for Die Deutsche Bibliothek - CIP-Einheitsaufnahme Computer vision : proceedings / ECCV 2002, 7th European Conference on Computer Vision, Copenhagen, Denmark, May 28 - 31, 2002. Anders Heyden ... (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Hong Kong ; London ; Milan ; Paris ; Tokyo : Springer Pt. 3 . - 2002 (Lecture notes in computer science ; Vol. 2352) ISBN 3-540-43746-0 CR Subject Classification (1998): I.4, I.3.5, I.5, I.2.9-10 ISSN 0302-9743 ISBN 3-540-43746-0 Springer-Verlag Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. Springer-Verlag Berlin Heidelberg New York a member of BertelsmannSpringer Science+Business Media GmbH http://www.springer.de © Springer-Verlag Berlin Heidelberg 2002 Printed in Germany Typesetting: Camera-ready by author, data conversion by PTP-Berlin, Stefan Sossna e.K. Printed on acid-free paper SPIN: 10870041 06/3142 543210
Preface
Premiering in 1990 in Antibes, France, the European Conference on Computer Vision, ECCV, has been held biennially at venues all around Europe. These conferences have been very successful, making ECCV a major event for the computer vision community. ECCV 2002 was the seventh in the series. The privilege of organizing it was shared by three universities: The IT University of Copenhagen, the University of Copenhagen, and Lund University, with the conference venue in Copenhagen. These universities lie geographically close in the vivid Øresund region, which lies partly in Denmark and partly in Sweden, with the newly built bridge (opened summer 2000) crossing the sound that formerly divided the countries. We are very happy to report that this year’s conference attracted more papers than ever before, with around 600 submissions. Still, together with the conference board, we decided to keep the tradition of holding ECCV as a single-track conference. Each paper was anonymously refereed by three different reviewers. For the final selection, for the first time for ECCV, a system with area chairs was used. These met with the program chairs in Lund for two days in February 2002 to select what became 45 oral presentations and 181 posters. Also at this meeting the selection was made without knowledge of the authors’ identity. The high quality of the scientific program of ECCV 2002 would not have been possible without the dedicated cooperation of the 15 area chairs, the 53 program committee members, and all the other scientists who reviewed the papers. A truly impressive effort was made. The spirit of this process reflects the enthusiasm in the research field, and you will find several papers in these proceedings that define the state of the art in the field. Bjarne Ersbøll as Industrial Relations Chair organized the exhibitions at the conference. Magnus Oskarsson, Sven Spanne, and Nicolas Guilbert helped to make the review process and the preparation of the proceedings function smoothly. Ole Fogh Olsen gave us valuable advice on editing the proceedings. Camilla Jørgensen competently headed the scientific secretariat. Erik Dam and Dan Witzner were responsible for the ECCV 2002 homepage. David Vernon, who chaired ECCV 2000 in Dublin, was extremely helpful during all stages of our preparation for the conference. We would like to thank all these people, as well as numerous others who helped in various respects. A special thanks goes to Søren Skovsgaard at the Congress Consultants, for professional help with all practical matters. We would also like to thank Rachid Deriche and Theo Papadopoulo for making their web-based conference administration system available and adjusting it to ECCV. This was indispensable in handling the large number of submissions and the thorough review and selection procedure. Finally, we wish to thank the IT University of Copenhagen and its president Mads Tofte for supporting the conference all the way from planning to realization. March 2002
Anders Heyden Gunnar Sparr Mads Nielsen Peter Johansen
Organization
Conference Chair Peter Johansen
Copenhagen University, Denmark
Conference Board Hans Burkhardt Bernard Buxton Roberto Cipolla Jan-Olof Eklundh Olivier Faugeras Bernd Neumann Giulio Sandini David Vernon
University of Freiburg, Germany University College London, UK University of Cambridge, UK Royal Institute of Technology, Sweden INRIA, Sophia Antipolis, France University of Hamburg, Germany University of Genova, Italy Trinity College, Dublin, Ireland
Program Chairs Anders Heyden Gunnar Sparr
Lund University, Sweden Lund University, Sweden
Area Chairs Ronen Basri Michael Black Andrew Blake Rachid Deriche Jan-Olof Eklundh Lars Kai Hansen Steve Maybank Theodore Papadopoulo Cordelia Schmid Amnon Shashua Stefano Soatto Bill Triggs Luc van Gool Joachim Weickert Andrew Zisserman
Weizmann Institute, Israel Brown University, USA Microsoft Research, UK INRIA, Sophia Antipolis, France Royal Institute of Technology, Sweden Denmark Technical University, Denmark University of Reading, UK INRIA, Sophia Antipolis, France INRIA, Rhône-Alpes, France The Hebrew University of Jerusalem, Israel University of California, Los Angeles, USA INRIA, Rhône-Alpes, France K.U. Leuven, Belgium & ETH, Zürich, Switzerland Saarland University, Germany University of Oxford, UK
Program Committee Luis Alvarez Padmanabhan Anandan Helder Araujo Serge Belongie Marie-Odile Berger Aaron Bobick Terry Boult Francois Chaumette Laurent Cohen Tim Cootes Kostas Daniilidis Larry Davis Frank Ferrie Andrew Fitzgibbon David J. Fleet David Forsyth Pascal Fua Richard Hartley Vaclav Hlavac Michal Irani Allan Jepson Peter Johansen Fredrik Kahl Sing Bing Kang Ron Kimmel Kyros Kutulakos Tony Lindeberg Jim Little Peter Meer David Murray Nassir Navab Mads Nielsen Patrick Perez Pietro Perona Marc Pollefeys Long Quan Ian Reid Nicolas Rougon José Santos-Victor Guillermo Sapiro Yoichi Sato Bernt Schiele Arnold Smeulders
University of Las Palmas, Spain Microsoft Research, USA University of Coimbra, Portugal University of California, San Diego, USA INRIA, Lorraine, France Georgia Tech, USA Lehigh University, USA INRIA, Rennes, France Université Paris IX Dauphine, France University of Manchester, UK University of Pennsylvania, USA University of Maryland, USA McGill University, Canada University of Oxford, UK Xerox Palo Alto Research Center, USA University of California, Berkeley, USA EPFL, Switzerland Australian National University, Australia Czech Technical University, Czech Republic Weizmann Institute, Israel University of Toronto, Canada Copenhagen University, Denmark Lund University, Sweden Microsoft Research, USA Technion, Israel University of Rochester, USA Royal Institute of Technology, Sweden University of British Columbia, Canada Rutgers University, USA University of Oxford, UK Siemens, USA IT-University of Copenhagen, Denmark Microsoft Research, UK California Institute of Technology, USA K.U. Leuven, Belgium Hong Kong University of Science and Technology, Hong Kong University of Oxford, UK Institut National des Télécommunications, France Instituto Superior Técnico, Lisbon, Portugal University of Minnesota, USA IIS, University of Tokyo, Japan ETH, Zürich, Switzerland University of Amsterdam, The Netherlands
Gerald Sommer Peter Sturm Tomas Svoboda Chris Taylor Phil Torr Panos Trahanias Laurent Younes Alan Yuille Josiane Zerubia Kalle Åström
University of Kiel, Germany INRIA, Rhône-Alpes, France Swiss Federal Institute of Technology, Switzerland University of Manchester, UK Microsoft Research, UK University of Crete, Greece CMLA, ENS de Cachan, France Smith-Kettlewell Eye Research Institute, USA INRIA, Sophia Antipolis, France Lund University, Sweden
Additional Referees Henrik Aanaes Manoj Aggarwal Motilal Agrawal Aya Aner Adnan Ansar Mirko Appel Tal Arbel Okan Arikan Akira Asano Shai Avidan Simon Baker David Bargeron Christian Barillot Kobus Barnard Adrien Bartoli Benedicte Bascle Pierre-Louis Bazin Isabelle Begin Stephen Benoit Alex Berg James Bergen Jim Bergen Marcelo Bertamlmio Rikard Berthilsson Christophe Biernacki Armin Biess Alessandro Bissacco Laure Blanc-Feraud Ilya Blayvas Eran Borenstein Patrick Bouthemy Richard Bowden
Jeffrey E. Boyd Edmond Boyer Yuri Boykov Chen Brestel Lars Bretzner Alexander Brook Michael Brown Alfred Bruckstein Thomas Buelow Joachim Buhmann Hans Burkhardt Bernard Buxton Nikos Canterakis Yaron Caspi Alessandro Chiuso Roberto Cipolla Dorin Comaniciu Kurt Cornelis Antonio Criminisi Thomas E. Davis Nando de Freitas Fernando de la Torre Daniel DeMenthon Xavier Descombes Hagio Djambazian Gianfranco Doretto Alessandro Duci Gregory Dudek Ramani Duraiswami Pinar Duygulu Michael Eckmann Alyosha Efros
Michael Elad Ahmed Elgammal Ronan Fablet Ayman Farahat Olivier Faugeras Paulo Favaro Xiaolin Feng Vittorio Ferrari Frank Ferrie Mario Figueireda Margaret Fleck Michel Gangnet Xiang Gao D. Geiger Yakup Genc Bogdan Georgescu J.-M. Geusebroek Christopher Geyer Peter Giblin Gerard Giraudon Roman Goldenberg Shaogang Gong Hayit Greenspan Lewis Griffin Jens Guehring Yanlin Guo Daniela Hall Tal Hassner Horst Haussecker Ralf Hebrich Yacov Hel-Or Lorna Herda
Shinsaku Hiura Jesse Hoey Stephen Hsu Du Huynh Naoyuki Ichimura Slobodan Ilic Sergey Ioffe Michael Isard Volkan Isler David Jacobs Bernd Jaehne Ian Jermyn Hailin Jin Marie-Pierre Jolly Stiliyan-N. Kalitzin Behrooz Kamgar-Parsi Kenichi Kanatani Danny Keren Erwan Kerrien Charles Kervrann Renato Keshet Ali Khamene Shamim Khan Nahum Kiryati Reinhard Koch Ullrich Koethe Esther B. Koller-Meier John Krumm Hannes Kruppa Murat Kunt Prasun Lala Michael Langer Ivan Laptev Jean-Pierre Le Cadre Bastian Leibe Ricahrd Lengagne Vincent Lepetit Thomas Leung Maxime Lhuillier Weiliang Li David Liebowitz Georg Lindgren David Lowe John MacCormick Henrik Malm
Roberto Manduchi Petros Maragos Eric Marchand Jiri Matas Bogdan Matei Esther B. Meier Jason Meltzer Etienne Mémin Rudolf Mester Ross J. Micheals Anurag Mittal Hiroshi Mo William Moran Greg Mori Yael Moses Jane Mulligan Don Murray Masahide Naemura Kenji Nagao Mirko Navara Shree Nayar Oscar Nestares Bernd Neumann Jeffrey Ng Tat Hieu Nguyen Peter Nillius David Nister Alison Noble Tom O’Donnell Takayuki Okatani Nuria Olivier Ole Fogh Olsen Magnus Oskarsson Nikos Paragios Ioannis Patras Josef Pauli Shmuel Peleg Robert Pless Swaminathan Rahul Deva Ramanan Lionel Reveret Dario Ringach Ruth Rosenholtz Volker Roth Payam Saisan
Garbis Salgian Frank Sauer Peter Savadjiev Silvio Savarese Harpreet Sawhney Frederik Schaffalitzky Yoav Schechner Chrostoph Schnoerr Stephan Scholze Ali Shahrokri Doron Shaked Eitan Sharon Eli Shechtman Jamie Sherrah Akinobu Shimizu Ilan Shimshoni Kaleem Siddiqi Hedvig Sidenbladh Robert Sim Denis Simakov Philippe Simard Eero Simoncelli Nir Sochen Yang Song Andreas Soupliotis Sven Spanne Martin Spengler Alon Spira Thomas Strömberg Richard Szeliski Hai Tao Huseyin Tek Seth Teller Paul Thompson Jan Tops Benjamin J. Tordoff Kentaro Toyama Tinne Tuytelaars Shimon Ullman Richard Unger Raquel Urtasun Sven Utcke Luca Vacchetti Anton van den Hengel Geert Van Meerbergen
Pierre Vandergheynst Zhizhou Wang Baba Vemuri Frank Verbiest Maarten Vergauwen Jaco Vermaak Mike Werman David Vernon Thomas Vetter
Rene Vidal Michel Vidal-Naquet Marta Wilczkowiak Ramesh Visvanathan Dan Witzner Hansen Julia Vogel Lior Wolf Bob Woodham Robert J. Woodham
Chenyang Xu Yaser Yacoob Anthony Yezzi Ramin Zabih Hugo Zaragoza Lihi Zelnik-Manor Ying Zhu Assaf Zomet
Table of Contents, Part III
Shape 3D Statistical Shape Models Using Direct Optimisation of Description Length . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . R.H. Davies, C.J. Twining, T.F. Cootes, J.C. Waterton, C.J. Taylor
3
Approximate Thin Plate Spline Mappings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 G. Donato, S. Belongie DEFORMOTION: Deforming Motion, Shape Average and the Joint Registration and Segmentation of Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 S. Soatto, A.J. Yezzi Region Matching with Missing Parts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 A. Duci, A.J. Yezzi, S. Mitter, S. Soatto
Stereoscopic Vision I What Energy Functions Can Be Minimized via Graph Cuts? . . . . . . . . . . . . . . . . . 65 V. Kolmogorov, R. Zabih Multi-camera Scene Reconstruction via Graph Cuts . . . . . . . . . . . . . . . . . . . . . . . . 82 V. Kolmogorov, R. Zabih A Markov Chain Monte Carlo Approach to Stereovision . . . . . . . . . . . . . . . . . . . . . 97 J. Sénégas A Probabilistic Theory of Occupancy and Emptiness . . . . . . . . . . . . . . . . . . . . . . . 112 R. Bhotika, D.J. Fleet, K.N. Kutulakos
Texture Shading and Colour / Grouping and Segmentation / Object Recognition Texture Similarity Measure Using Kullback-Leibler Divergence between Gamma Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 J.R. Mathiassen, A. Skavhaug, K. Bø All the Images of an Outdoor Scene . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 S.G. Narasimhan, C. Wang, S.K. Nayar Recovery of Reflectances and Varying Illuminants from Multiple Views . . . . . . . . 163 Q.-T. Luong, P. Fua, Y. Leclerc
Composite Texture Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180 A. Zalesny, V. Ferrari, G. Caenen, D. Auf der Maur, L. Van Gool Constructing Illumination Image Basis from Object Motion . . . . . . . . . . . . . . . . . . 195 A. Nakashima, A. Maki, K. Fukui Diffuse-Specular Separation and Depth Recovery from Image Sequences . . . . . . . 210 S. Lin, Y. Li, S.B. Kang, X. Tong, H.-Y. Shum Shape from Texture without Boundaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225 D.A. Forsyth Statistical Modeling of Texture Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240 Y.N. Wu, S.C. Zhu, C.-e. Guo Classifying Images of Materials: Achieving Viewpoint and Illumination Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255 M. Varma, A. Zisserman Estimation of Multiple Illuminants from a Single Image of Arbitrary Known Geometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272 Y. Wang, D. Samaras The Effect of Illuminant Rotation on Texture Filters: Lissajous’s Ellipses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289 M. Chantler, M. Schmidt, M. Petrou, G. McGunnigle On Affine Invariant Clustering and Automatic Cast Listing in Movies . . . . . . . . . . 304 A. Fitzgibbon, A. Zisserman Factorial Markov Random Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321 J. Kim, R. Zabih Evaluation and Selection of Models for Motion Segmentation . . . . . . . . . . . . . . . . 335 K. Kanatani Surface Extraction from Volumetric Images Using Deformable Meshes: A Comparative Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350 J. Tohka DREAM2 S: Deformable Regions Driven by an Eulerian Accurate Minimization Method for Image and Video Segmentation (Application to Face Detection in Color Video Sequences) . . . . . . . . . . . . . . . . . . 365 S. Jehan-Besson, M. Barlaud, G. Aubert Neuro-Fuzzy Shadow Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381 B.P.L. Lo, G.-Z. Yang Parsing Images into Region and Curve Processes . . . . . . . . . . . . . . . . . . . . . . . . . . 393 Z. Tu, S.-C. Zhu
Yet Another Survey on Image Segmentation: Region and Boundary Information Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408 J. Freixenet, X. Mu˜noz, D. Raba, J. Mart´ı, X. Cuf´ı Perceptual Grouping from Motion Cues Using Tensor Voting in 4-D . . . . . . . . . . . 423 M. Nicolescu, G. Medioni Deformable Model with Non-euclidean Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . 438 B. Taton, J.-O. Lachaud Finding Deformable Shapes Using Loopy Belief Propagation . . . . . . . . . . . . . . . . . 453 J.M. Coughlan, S.J. Ferreira Probabilistic and Voting Approaches to Cue Integration for Figure-Ground Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469 E. Hayman, J.-O. Eklundh Bayesian Estimation of Layers from Multiple Images . . . . . . . . . . . . . . . . . . . . . . . 487 Y. Wexler, A. Fitzgibbon, A. Zisserman A Stochastic Algorithm for 3D Scene Segmentation and Reconstruction . . . . . . . . 502 F. Han, Z. Tu, S.-C. Zhu Normalized Gradient Vector Diffusion and Image Segmentation . . . . . . . . . . . . . . 517 Z. Yu, C. Bajaj Spectral Partitioning with Indefinite Kernels Using the Nystr¨om Extension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 531 S. Belongie, C. Fowlkes, F. Chung, J. Malik A Framework for High-Level Feedback to Adaptive, Per-Pixel, Mixture-of-Gaussian Background Models . . . . . . . . . . . . . . . . . . . . . . . . 543 M. Harville Multivariate Saddle Point Detection for Statistical Clustering . . . . . . . . . . . . . . . . . 561 D. Comaniciu, V. Ramesh, A. Del Bue Parametric Distributional Clustering for Image Segmentation . . . . . . . . . . . . . . . . . 577 L. Hermes, T. Z¨oller, J.M. Buhmann Probabalistic Models and Informative Subspaces for Audiovisual Correspondence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 592 J.W. Fisher, T. Darrell Volterra Filtering of Noisy Images of Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 604 J. August Image Segmentation by Flexible Models Based on Robust Regularized Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 621 M. Rivera, J. Gee
Principal Component Analysis over Continuous Subspaces and Intersection of Half-Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 635 A. Levin, A. Shashua On Pencils of Tangent Planes and the Recognition of Smooth 3D Shapes from Silhouettes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 651 S. Lazebnik, A. Sethi, C. Schmid, D. Kriegman, J. Ponce, M. Hebert Estimating Human Body Configurations Using Shape Context Matching . . . . . . . 666 G. Mori, J. Malik Probabilistic Human Recognition from Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 681 S. Zhou, R. Chellappa SoftPOSIT: Simultaneous Pose and Correspondence Determination . . . . . . . . . . . . 698 P. David, D. DeMenthon, R. Duraiswami, H. Samet A Pseudo-Metric for Weighted Point Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 715 P. Giannopoulos, R.C. Veltkamp Shock-Based Indexing into Large Shape Databases . . . . . . . . . . . . . . . . . . . . . . . . . 731 T.B. Sebastian, P.N. Klein, B.B. Kimia EigenSegments: A Spatio-Temporal Decomposition of an Ensemble of Images . . 747 S. Avidan On the Representation and Matching of Qualitative Shape at Multiple Scales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 759 A. Shokoufandeh, S. Dickinson, C. J¨onsson, L. Bretzner, T. Lindeberg Combining Simple Discriminators for Object Discrimination . . . . . . . . . . . . . . . . . 776 S. Mahamud, M. Hebert, J. Lafferty Probabilistic Search for Object Segmentation and Recognition . . . . . . . . . . . . . . . . 791 U. Hillenbrand, G. Hirzinger Real-Time Interactive Path Extraction with On-the-Fly Adaptation of the External Forces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 807 O. G´erard, T. Deschamps, M. Greff, L.D. Cohen Matching and Embedding through Edit-Union of Trees . . . . . . . . . . . . . . . . . . . . . 822 A. Torsello, E.R. Hancock A Comparison of Search Strategies for Geometric Branch and Bound Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 837 T. M. Breuel Face Recognition from Long-Term Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . 851 G. Shakhnarovich, J.W. Fisher, T. Darrell
Stereoscopic Vision II Helmholtz Stereopsis: Exploiting Reciprocity for Surface Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 869 T. Zickler, P.N. Belhumeur, D.J. Kriegman Minimal Surfaces for Stereo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 885 C. Buehler, S.J. Gortler, M.F. Cohen, L. McMillan Finding the Largest Unambiguous Component of Stereo Matching . . . . . . . . . . . . 900 R. Šára
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 915
Table of Contents, Part I
Active and Real-Time Vision Tracking with the EM Contour Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.E.C. Pece, A.D. Worrall
3
M2Tracker: A Multi-view Approach to Segmenting and Tracking People in a Cluttered Scene Using Region-Based Stereo . . . . . . . . . . . . . . . . . . . . . 18 A. Mittal, L.S. Davis
Image Features Analytical Image Models and Their Applications . . . . . . . . . . . . . . . . . . . . . . . . . . 37 A. Srivastava, X. Liu, U. Grenander Time-Recursive Velocity-Adapted Spatio-Temporal Scale-Space Filters . . . . . . . . 52 T. Lindeberg Combining Appearance and Topology for Wide Baseline Matching . . . . . . . . . . . . 68 D. Tell, S. Carlsson Guided Sampling and Consensus for Motion Estimation . . . . . . . . . . . . . . . . . . . . . 82 B. Tordoff, D.W. Murray
Image Features / Visual Motion Fast Anisotropic Gauss Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 J.-M. Geusebroek, A.W.M. Smeulders, J. van de Weijer Adaptive Rest Condition Potentials: Second Order Edge-Preserving Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 M. Rivera, J.L. Marroquin An Affine Invariant Interest Point Detector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 K. Mikolajczyk, C. Schmid Understanding and Modeling the Evolution of Critical Points under Gaussian Blurring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 A. Kuijper, L. Florack Image Processing Done Right . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 J.J. Koenderink, A.J. van Doorn Multimodal Data Representations with Parameterized Local Structures . . . . . . . . . 173 Y. Zhu, D. Comaniciu, S. Schwartz, V. Ramesh
The Relevance of Non-generic Events in Scale Space Models . . . . . . . . . . . . . . . . 190 A. Kuijper, L. Florack The Localized Consistency Principle for Image Matching under Non-uniform Illumination Variation and Affine Distortion . . . . . . . . . . . . . . . . . . . 205 B. Wang, K.K. Sung, T.K. Ng Resolution Selection Using Generalized Entropies of Multiresolution Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220 E. Hadjidemetriou, M.D. Grossberg, S.K. Nayar Robust Computer Vision through Kernel Density Estimation . . . . . . . . . . . . . . . . . 236 H. Chen, P. Meer Constrained Flows of Matrix-Valued Functions: Application to Diffusion Tensor Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251 C. Chefd’hotel, D. Tschumperl´e, R. Deriche, O. Faugeras A Hierarchical Framework for Spectral Correspondence . . . . . . . . . . . . . . . . . . . . . 266 M. Carcassoni, E.R. Hancock Phase-Based Local Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282 G. Carneiro, A.D. Jepson What Is the Role of Independence for Visual Recognition? . . . . . . . . . . . . . . . . . . . 297 N. Vasconcelos, G. Carneiro A Probabilistic Multi-scale Model for Contour Completion Based on Image Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312 X. Ren, J. Malik Toward a Full Probability Model of Edges in Natural Images . . . . . . . . . . . . . . . . . 328 K.S. Pedersen, A.B. Lee Fast Difference Schemes for Edge Enhancing Beltrami Flow . . . . . . . . . . . . . . . . . 343 R. Malladi, I. Ravve A Fast Radial Symmetry Transform for Detecting Points of Interest . . . . . . . . . . . . 358 G. Loy, A. Zelinsky Image Features Based on a New Approach to 2D Rotation Invariant Quadrature Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369 M. Felsberg, G. Sommer Representing Edge Models via Local Principal Component Analysis . . . . . . . . . . . 384 P.S. Huggins, S.W. Zucker Regularized Shock Filters and Complex Diffusion . . . . . . . . . . . . . . . . . . . . . . . . . 399 G. Gilboa, N.A. Sochen, Y.Y. Zeevi
Multi-view Matching for Unordered Image Sets, or “How Do I Organize My Holiday Snaps?” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414 F. Schaffalitzky, A. Zisserman Parameter Estimates for a Pencil of Lines: Bounds and Estimators . . . . . . . . . . . . . 432 G. Speyer, M. Werman Multilinear Analysis of Image Ensembles: TensorFaces . . . . . . . . . . . . . . . . . . . . . 447 M.A.O. Vasilescu, D. Terzopoulos ‘Dynamism of a Dog on a Leash’ or Behavior Classification by Eigen-Decomposition of Periodic Motions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461 R. Goldenberg, R. Kimmel, E. Rivlin, M. Rudzsky Automatic Detection and Tracking of Human Motion with a View-Based Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476 R. Fablet, M.J. Black Using Robust Estimation Algorithms for Tracking Explicit Curves . . . . . . . . . . . . . 492 J.-P. Tarel, S.-S. Ieng, P. Charbonnier On the Motion and Appearance of Specularities in Image Sequences . . . . . . . . . . . 508 R. Swaminathan, S.B. Kang, R. Szeliski, A. Criminisi, S.K. Nayar Multiple Hypothesis Tracking for Automatic Optical Motion Capture . . . . . . . . . . 524 M. Ringer, J. Lasenby Single Axis Geometry by Fitting Conics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 537 G. Jiang, H.-t. Tsui, L. Quan, A. Zisserman Computing the Physical Parameters of Rigid-Body Motion from Video . . . . . . . . . 551 K.S. Bhat, S.M. Seitz, J. Popovi´c, P.K. Khosla Building Roadmaps of Local Minima of Visual Models . . . . . . . . . . . . . . . . . . . . . 566 C. Sminchisescu, B. Triggs A Generative Method for Textured Motion: Analysis and Synthesis . . . . . . . . . . . . 583 Y. Wang, S.-C. Zhu Is Super-Resolution with Optical Flow Feasible? . . . . . . . . . . . . . . . . . . . . . . . . . . . 599 W.Y. Zhao, H.S. Sawhney New View Generation with a Bi-centric Camera . . . . . . . . . . . . . . . . . . . . . . . . . . . 614 D. Weinshall, M.-S. Lee, T. Brodsky, M. Trajkovic, D. Feldman Recognizing and Tracking Human Action . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 629 J. Sullivan, S. Carlsson
Towards Improved Observation Models for Visual Tracking: Selective Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645 J. Vermaak, P. P´erez, M. Gangnet, A. Blake Color-Based Probabilistic Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 661 P. P´erez, C. Hue, J. Vermaak, M. Gangnet Dense Motion Analysis in Fluid Imagery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 676 ´ M´emin, P. P´erez T. Corpetti, E. A Layered Motion Representation with Occlusion and Compact Spatial Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 692 A.D. Jepson, D.J. Fleet, M.J. Black Incremental Singular Value Decomposition of Uncertain Data with Missing Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 707 M. Brand Symmetrical Dense Optical Flow Estimation with Occlusions Detection . . . . . . . . 721 L. Alvarez, R. Deriche, T. Papadopoulo, J. S´anchez Audio-Video Sensor Fusion with Probabilistic Graphical Models . . . . . . . . . . . . . . 736 M.J. Beal, H. Attias, N. Jojic
Visual Motion Increasing Space-Time Resolution in Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 753 E. Shechtman, Y. Caspi, M. Irani Hyperdynamics Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 769 C. Sminchisescu, B. Triggs Implicit Probabilistic Models of Human Motion for Synthesis and Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 784 H. Sidenbladh, M.J. Black, L. Sigal Space-Time Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 801 L. Torresani, C. Bregler
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 813
Table of Contents, Part II
Surface Geometry A Variational Approach to Recovering a Manifold from Sample Points . . . . . . . . . J. Gomes, A. Mojsilovic
3
A Variational Approach to Shape from Defocus . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 H. Jin, P. Favaro Shadow Graphs and Surface Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Y. Yu, J.T. Chang Specularities Reduce Ambiguity of Uncalibrated Photometric Stereo . . . . . . . . . . . 46 O. Drbohlav, R. Šára
Grouping and Segmentation Pairwise Clustering with Matrix Factorisation and the EM Algorithm . . . . . . . . . . 63 A. Robles-Kelly, E.R. Hancock Shape Priors for Level Set Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 M. Rousson, N. Paragios Nonlinear Shape Statistics in Mumford–Shah Based Segmentation . . . . . . . . . . . . 93 D. Cremers, T. Kohlberger, C. Schn¨orr Class-Specific, Top-Down Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 E. Borenstein, S. Ullman
Structure from Motion / Stereoscopic Vision / Surface Geometry / Shape Quasi-Dense Reconstruction from Image Sequence . . . . . . . . . . . . . . . . . . . . . . . . . 125 M. Lhuillier, L. Quan Properties of the Catadioptric Fundamental Matrix . . . . . . . . . . . . . . . . . . . . . . . . . 140 C. Geyer, K. Daniilidis Building Architectural Models from Many Views Using Map Constraints . . . . . . . 155 D.P. Robertson, R. Cipolla Motion – Stereo Integration for Depth Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 170 C. Strecha, L. Van Gool
Lens Distortion Recovery for Accurate Sequential Structure and Motion Recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186 K. Cornelis, M. Pollefeys, L. Van Gool Generalized Rank Conditions in Multiple View Geometry with Applications to Dynamical Scenes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201 K. Huang, R. Fossum, Y. Ma Dense Structure-from-Motion: An Approach Based on Segment Matching . . . . . . 217 F. Ernst, P. Wilinski, K. van Overveld Maximizing Rigidity: Optimal Matching under Scaled-Orthography . . . . . . . . . . . 232 J. Maciel, J. Costeira Dramatic Improvements to Feature Based Stereo . . . . . . . . . . . . . . . . . . . . . . . . . . . 247 V.N. Smelyansky, R.D. Morris, F.O. Kuehnel, D.A. Maluf, P. Cheeseman Motion Curves for Parametric Shape and Motion Estimation . . . . . . . . . . . . . . . . . 262 P.-L. Bazin, J.-M. V´ezien Bayesian Self-Calibration of a Moving Camera . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277 G. Qian, R. Chellappa Balanced Recovery of 3D Structure and Camera Motion from Uncalibrated Image Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294 B. Georgescu, P. Meer Linear Multi View Reconstruction with Missing Data . . . . . . . . . . . . . . . . . . . . . . . 309 C. Rother, S. Carlsson Model-Based Silhouette Extraction for Accurate People Tracking . . . . . . . . . . . . . 325 R. Plaenkers, P. Fua On the Non-linear Optimization of Projective Motion Using Minimal Parameters . 340 A. Bartoli Structure from Many Perspective Images with Occlusions . . . . . . . . . . . . . . . . . . . 355 D. Martinec, T. Pajdla Sequence-to-Sequence Self Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370 L. Wolf, A. Zomet Structure from Planar Motions with Small Baselines . . . . . . . . . . . . . . . . . . . . . . . . 383 R. Vidal, J. Oliensis Revisiting Single-View Shape Tensors: Theory and Applications . . . . . . . . . . . . . . 399 A. Levin, A. Shashua
Tracking and Rendering Using Dynamic Textures on Geometric Structure from Motion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415 D. Cobzas, M. Jagersand Sensitivity of Calibration to Principal Point Position . . . . . . . . . . . . . . . . . . . . . . . . 433 R.I. Hartley, R. Kaucic Critical Curves and Surfaces for Euclidean Reconstruction . . . . . . . . . . . . . . . . . . . 447 F. Kahl, R. Hartley View Synthesis with Occlusion Reasoning Using Quasi-Sparse Feature Correspondences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463 D. Jelinek, C.J. Taylor Eye Gaze Correction with Stereovision for Video-Teleconferencing . . . . . . . . . . . . 479 R. Yang, Z. Zhang Wavelet-Based Correlation for Stereopsis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495 M. Clerc Stereo Matching Using Belief Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510 J. Sun, H.-Y. Shum, N.-N. Zheng Symmetric Sub-pixel Stereo Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525 R. Szeliski, D. Scharstein New Techniques for Automated Architectural Reconstruction from Photographs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541 T. Werner, A. Zisserman Stereo Matching with Segmentation-Based Cooperation . . . . . . . . . . . . . . . . . . . . . 556 Y. Zhang, C. Kambhamettu Coarse Registration of Surface Patches with Local Symmetries . . . . . . . . . . . . . . . 572 J. Vanden Wyngaerd, L. Van Gool Multiview Registration of 3D Scenes by Minimizing Error between Coordinate Frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 587 G.C. Sharp, S.W. Lee, D.K. Wehe Recovering Surfaces from the Restoring Force . . . . . . . . . . . . . . . . . . . . . . . . . . . . 598 G. Kamberov, G. Kamberova Interpolating Sporadic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613 L. Noakes, R. Kozera Highlight Removal Using Shape-from-Shading . . . . . . . . . . . . . . . . . . . . . . . . . . . . 626 H. Ragheb, E.R. Hancock
A Reflective Symmetry Descriptor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 642 M. Kazhdan, B. Chazelle, D. Dobkin, A. Finkelstein, T. Funkhouser Gait Sequence Analysis Using Frieze Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 657 Y. Liu, R. Collins, Y. Tsin Feature-Preserving Medial Axis Noise Removal . . . . . . . . . . . . . . . . . . . . . . . . . . . 672 R. Tam, W. Heidrich Hierarchical Shape Modeling for Automatic Face Localization . . . . . . . . . . . . . . . . 687 C. Liu, H.-Y. Shum, C. Zhang Using Dirichlet Free Form Deformation to Fit Deformable Models to Noisy 3-D Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 704 S. Ilic, P. Fua Transitions of the 3D Medial Axis under a One-Parameter Family of Deformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 718 P. Giblin, B.B. Kimia Learning Shape from Defocus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 735 P. Favaro, S. Soatto A Rectilinearity Measurement for Polygons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 746 ˇ c, P.L. Rosin J. Zuni´ Local Analysis for 3D Reconstruction of Specular Surfaces – Part II . . . . . . . . . . . 759 S. Savarese, P. Perona Matching Distance Functions: A Shape-to-Area Variational Approach for Global-to-Local Registration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 775 N. Paragios, M. Rousson, V. Ramesh Shape from Shading and Viscosity Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 790 E. Prados, O. Faugeras, E. Rouy Model Acquisition by Registration of Multiple Acoustic Range Views . . . . . . . . . . 805 A. Fusiello, U. Castellani, L. Ronchetti, V. Murino
Structure from Motion General Trajectory Triangulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 823 J.Y. Kaminski, M. Teicher Surviving Dominant Planes in Uncalibrated Structure and Motion Recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 837 M. Pollefeys, F. Verbiest, L. Van Gool A Bayesian Estimation of Building Shape Using MCMC . . . . . . . . . . . . . . . . . . . . 852 A.R. Dick, P.H.S. Torr, R. Cipolla
Structure and Motion for Dynamic Scenes – The Case of Points Moving in Planes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 867 P. Sturm What Does the Scene Look Like from a Scene Point? . . . . . . . . . . . . . . . . . . . . . . . 883 M. Irani, T. Hassner, P. Anandan
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 899
3D Statistical Shape Models Using Direct Optimisation of Description Length
Rhodri H. Davies¹, Carole J. Twining¹, Tim F. Cootes¹, John C. Waterton², and Chris J. Taylor¹
¹ Division of Imaging Science, Stopford Building, Oxford Road, University of Manchester, Manchester, M13 9PT, UK
[email protected]
² AstraZeneca, Alderley Park, Macclesfield, Cheshire, SK10 4TG, UK
Abstract. We describe an automatic method for building optimal 3D statistical shape models from sets of training shapes. Although shape models show considerable promise as a basis for segmenting and interpreting images, a major drawback of the approach is the need to establish a dense correspondence across a training set of example shapes. It is important to establish the correct correspondence, otherwise poor models can result. In 2D, this can be achieved using manual ‘landmarks’, but in 3D this becomes impractical. We show it is possible to establish correspondences automatically, by casting the correspondence problem as one of finding the ‘optimal’ parameterisation of each shape in the training set. We describe an explicit representation of surface parameterisation, that ensures the resulting correspondences are legal, and show how this representation can be manipulated to minimise the description length of the training set using the model. This results in compact models with good generalisation properties. Results are reported for two sets of biomedical shapes, showing significant improvement in model properties compared to those obtained using a uniform surface parameterisation.
1 Introduction
Statistical models of shape show considerable promise as a basis for segmenting and interpreting images in 2D [5]. The basic idea is to establish, from a training set, the pattern of ‘legal’ variation in the shapes and spatial relationships of structures for a given class of images. Statistical analysis is used to give an efficient parameterisation of this variability, providing a compact representation of shape and allowing shape constraints to be applied effectively during image interpretation [6]. A key step in building a model involves establishing a dense correspondence between shape boundaries/surfaces over a reasonably large set of training images. It is important to establish the ‘correct’ correspondences, otherwise an inefficient parameterisation of shape can result, leading to difficulty in defining shape constraints effectively. In 2D, correspondence is often established using manually defined ‘landmarks’ but this is a time-consuming, error-prone
and subjective process. In principle, the method extends to 3D, but in practice, manual landmarking becomes impractical. In this paper we show how an ‘optimal’ model can be built by automatically defining correspondences across a training set of 3D shapes. Several previous attempts have been made to build 3D statistical shape models [3,4,10,12,13,14,21,25]. The problem of establishing dense correspondence over a set of training shapes can be posed as that of defining a parameterisation for each of the training set, assuming correspondence between equivalently parameterised points. Kelemen et al. [14] and Hill et al. [12] use different arbitrary parameterisations of the training shapes. Christensen et al. [7], Szekely and Lavalle [23] and Rueckert et al. [21] describe methods for warping the space in which the shapes are embedded. Models can then be built from the resulting deformation field [10,13,21]. Brett and Taylor [3,4] and Wang et al. [25] use shape ‘features’ (e.g. regions of high curvature) to establish point correspondences. The correspondences found using the methods described above are not, in any obvious sense, the correct ones. We show in section 2 (fig. 1) that unsatisfactory models can result if correspondences are chosen inappropriately. We start from the position that the correct correspondences are, by definition, those that build the ‘best’ model. We define the ‘best’ model as that with optimal compactness, specificity and generalisation ability. We have shown elsewhere [8] that a model with these properties can be obtained using an objective function based on the minimum description length principle [18]. We have also described a method that uses the objective function to build models in 2D that are better than the best models we could build using manual landmarks [9]. The representation of the parameterisation did not, however, extend to surfaces in 3D. In this paper we describe the derivation of the objective function. We also describe a novel method of representing and manipulating the parameterisation of a surface, allowing the construction of shape models in 3D. The method ensures that only valid correspondences can be represented. In the remainder of the paper, we establish notation and outline the model-building problem. We then provide a summary of the derivation of the minimum description length objective function. We show how a set of surface parameterisations can be represented explicitly and manipulated to build an optimal model. Finally, we present qualitative and quantitative results of applying the method to surfaces obtained from 3D MR images of brain ventricles and rat kidneys.
2 Statistical Shape Models
A 2d (3d) statistical shape model is built from a training set of example outlines (surfaces), aligned to a common coordinate frame. Each shape, Si , (i = 1, . . . ns ), can (without loss of generality) be represented by a set of n points regularly sampled on the shape, as defined by some parameterisation φi . This allows each shape Si to be represented by an np -dimensional shape vector xi , formed by concatenating the coordinates of its sample points. Using Principal Component
analysis, each shape vector can be expressed using a linear model of the form:

$$x_i = \bar{x} + \mathbf{P}\mathbf{b}_i = \bar{x} + \sum_m \mathbf{p}^m b_i^m, \qquad (1)$$

where $\bar{x}$ is the mean shape vector, $\mathbf{P} = \{\mathbf{p}^m\}$ are the eigenvectors of the covariance matrix (with corresponding eigenvalues $\{\lambda^m\}$) that describe a set of orthogonal modes of shape variation, and $\mathbf{b} = \{b^m\}$ are shape parameters that control the modes of variation. Since our training shapes are continuous, we are interested in the limit $n_p \to \infty$. This leads to an infinitely large covariance matrix, but we note that there can only be, at most, $n_s - 1$ eigenvalues that are not identically zero (although they may be computationally zero). This means that in the summation above, the index $m$ only takes values in the range 1 to $n_s - 1$. To calculate the non-zero eigenvalues, we consider the $n_p \times n_s$ data matrix $\mathbf{W}$ constructed from the set of vectors $\{(x_i - \bar{x}) : i = 1, \ldots, n_s\}$. The $n_p \times n_p$ covariance matrix is given by $\mathbf{D} = \frac{1}{n_p n_s}\mathbf{W}\mathbf{W}^T$, with eigenvectors and eigenvalues $\{\mathbf{p}^m, \lambda^m\}$, thus:

$$\mathbf{D}\mathbf{p}^m = \lambda^m \mathbf{p}^m. \qquad (2)$$

If we define $\{\mathbf{p}'^m, \lambda'^m\}$ to be the eigenvectors and eigenvalues of the $n_s \times n_s$ matrix $\mathbf{D}' = \frac{1}{n_p n_s}\mathbf{W}^T\mathbf{W}$, then:

$$\mathbf{D}'\mathbf{p}'^m = \lambda'^m \mathbf{p}'^m. \qquad (3)$$

From (2): $\mathbf{D}\mathbf{p}^m = \lambda^m \mathbf{p}^m \Rightarrow \frac{1}{n_p n_s}\mathbf{W}\mathbf{W}^T\mathbf{p}^m = \lambda^m \mathbf{p}^m$; pre-multiplying by $\mathbf{W}^T$:

$$\mathbf{D}'(\mathbf{W}^T\mathbf{p}^m) = \lambda^m (\mathbf{W}^T\mathbf{p}^m). \quad \text{Similarly:} \quad \mathbf{D}(\mathbf{W}\mathbf{p}'^m) = \lambda'^m (\mathbf{W}\mathbf{p}'^m). \qquad (4)$$

Therefore, for all $\lambda^m \neq 0$, we can assign indices such that:

$$\lambda^m = \lambda'^m \quad \text{and} \quad \mathbf{p}^m = \mathbf{W}\mathbf{p}'^m. \qquad (5)$$
Thus the $n_s - 1$ eigenvalues of $\mathbf{D}$, which are not identically zero, can be obtained directly from $\mathbf{D}'$, and the eigenvectors are a weighted sum of the training shapes. As shown in [15], in the limit $n_p \to \infty$ the $ij$th element of $\mathbf{D}'$ is given by the inner product of shapes $i$ and $j$:

$$D'_{ij} = \int dt\,\big(S_i(\phi_i(t)) - \bar{S}(t)\big) \cdot \big(S_j(\phi_j(t)) - \bar{S}(t)\big), \qquad (6)$$

where $\bar{S} = \frac{1}{n_s}\sum_{i=1}^{n_s} S_i$ is the mean shape and $S_i(\phi_i)$ is a continuous representation of $S_i$ parameterised by $\phi_i$. The integral can be evaluated numerically.
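As an illustration of this construction, the following sketch (ours, not code from the paper) builds the linear shape model from an array of sampled shape vectors using the small $n_s \times n_s$ matrix $\mathbf{D}'$ of (3)–(5); all variable names are our own.

```python
import numpy as np

def build_shape_model(X):
    """Build a linear shape model from an (n_s, n_p) array of shape vectors.

    Uses the small n_s x n_s matrix D' = W^T W / (n_p n_s), so only the
    n_s - 1 non-zero eigenvalues/eigenvectors are ever computed (eqs. 2-5).
    """
    n_s, n_p = X.shape
    x_bar = X.mean(axis=0)                    # mean shape
    W = (X - x_bar).T                         # n_p x n_s matrix of deviations
    D_small = W.T @ W / (n_p * n_s)           # n_s x n_s matrix D'
    lam, P_small = np.linalg.eigh(D_small)    # eigenvalues in ascending order
    order = np.argsort(lam)[::-1][: n_s - 1]  # keep the n_s - 1 leading modes
    lam = lam[order]
    P = W @ P_small[:, order]                 # p^m = W p'^m  (eq. 5)
    P /= np.linalg.norm(P, axis=0)            # orthonormalise the modes
    return x_bar, P, lam
```

The shape parameters of a training example are then recovered as $\mathbf{b}_i = \mathbf{P}^T(x_i - \bar{x})$.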
Fig. 1. The first mode of variation (±3σ) of two shape models built from the training set of hand outlines but parameterised differently. Model A was parameterised using manual ‘landmarks’ and model B was parameterised using arc-length parameterisation. The figure demonstrates that model B can represent invalid shape instances.
New examples of the class of shapes can be generated by choosing values of {bm } within the range found in the training set. The utility of the linear model of shape shown in (1) depends on the appropriateness of the set of parameterisations {φi } that are chosen. An inappropriate choice can result in the need for a large set of modes (and corresponding shape parameters) to approximate the training shapes to a given accuracy and may lead to ‘legal’ values of {bm } generating ‘illegal’ shape instances. For example, figure 1 shows two 2D models generated from a set of 17 hand outlines. Model A uses a set of parameterisations of the outlines that cause ‘natural’ landmarks such as the tips of the fingers to correspond. Model B uses one such correspondence but then uses a simple path length parameterisation to position the other sample points. The variances of the three most significant modes of models A and B are (1.06, 0.58, 0.30) and (2.19, 0.78, 0.54) respectively. This suggests that model A is more compact than model B. All the example shapes generated by model A using values of {bm } within the range found in the training set are ‘legal’ examples of hands, whilst model B generates implausible examples and is thus of limited utility for imposing shape constraints when the model is used for image search, see fig. 1. The set of parameterisations used for model A were obtained by marking ‘natural’ landmarks manually on each training example, then using simple path length parameterisation to sample a fixed number of equally spaced points between them. This manual mark-up is a time-consuming and subjective process. In principle, the modelling approach extends to 3D, but in practice, manual landmarking becomes impractical. We propose to overcome this problem by automatically defining correspondences between a training set of example shapes.
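To make the role of the shape parameters concrete, here is a hedged sketch (ours, not the authors’) of generating new shape instances from the model of (1), restricting each $b^m$ to $\pm 3$ standard deviations as in the hand example above:

```python
import numpy as np

def sample_shape(x_bar, P, lam, b=None, limit=3.0):
    """Generate a shape instance x = x_bar + P b, clamping each parameter
    b_m to +/- limit * sqrt(lambda_m) so only 'legal' shapes are produced."""
    if b is None:
        b = np.random.randn(len(lam)) * np.sqrt(lam)
    b = np.clip(b, -limit * np.sqrt(lam), limit * np.sqrt(lam))
    return x_bar + P @ b
```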
3 An Information Theoretic Objective Function
We wish to define a criterion for selecting the set of parameterisations {φi } that are used to construct a statistical shape model from a set of training boundaries {Si }. Our aim is to choose {φi } so as to obtain the ‘best possible’ model. Ideally, we would like a model with optimal:
– Generalisation ability: the model can describe any instance of the object, not just those seen in the training set;
– Specificity: the model can only represent valid instances of the object;
– Compactness: the variation is explained with few parameters.

To achieve this, we follow the principle of Occam’s razor, which can be paraphrased: ‘simple descriptions generalise best’. As a quantitative measure of ‘simplicity’, we choose to apply the Minimum Description Length (MDL) principle [18,19]. The MDL principle is based on the idea of transmitting a data set as an encoded message, where the code is based on some pre-arranged set of parametric statistical models. The full transmission then has to include not only the encoded data values, but also the coded model parameter values. Thus MDL balances the model complexity, expressed in terms of the cost of sending the model parameters, against the quality of fit between the model and the data, expressed in terms of the coding length of the data. Comparison of description lengths calculated using models from different classes can be used as a way of solving the model selection problem [11]. However, our emphasis here is not on selecting the class of model, but on using the description length for a single class of model as an objective function for optimisation of correspondence between the shapes. We will use the simple two-part coding formulation of MDL. Although this does not give us a coding of the absolute minimum length [20], it does give us a functional form which is computationally simple to evaluate, and hence suitable for use as an objective function for numerical optimisation.

3.1 The Model
Our training set of $n_s$ shapes is sampled according to the parameterisations $\{\phi_i\}$ to give a set of $n_p$-dimensional shape vectors $\{x_i\}$. We choose to model this set of shape vectors using a multivariate Gaussian model. The initial step in constructing such a model is to change to a coordinate system whose axes are aligned with the principal axes of the data set. This corresponds to the orientation of the linear model defined earlier (1):

$$x_i = \bar{x} + \sum_{m=1}^{n_s - 1} \mathbf{p}^m b_i^m. \qquad (7)$$

The $n_s - 1$ mutually-orthogonal eigenvectors $\{\mathbf{p}^m\}$ span the subspace which contains the training set, and, by appropriate scaling, can be transformed into an orthonormal basis set for this subspace. We will also order these vectors in terms of non-decreasing eigenvalue to give us our final orthonormal basis set $\{\tilde{\mathbf{p}}^m\}$. To transmit a shape $x_i$ using this model, we first have to transmit the mean shape $\bar{x}$, then the deviations from the mean shape, which can be written thus:

$$y_i^m \equiv \tilde{\mathbf{p}}^m \cdot (x_i - \bar{x}). \qquad (8)$$
We will assume that the code length for the transmission of the mean shape $\bar{x}$ is a constant for a given training set and number of sample points. Furthermore, the code length for the transmission of the set of $n_s - 1$, $n_p$-dimensional basis vectors $\{\tilde{\mathbf{p}}^m\}$ is also constant for a given training set and number of sample points, and need not be considered further.

3.2 The Description Length
For each direction $\tilde{\mathbf{p}}^m$, we now have to transmit the set of values $Y^m \equiv \{y_i^m : i = 1 \text{ to } n_s\}$. Since we have aligned our coordinate axes with the principal axes of the data, and aligned our origin with the mean shape, each direction can now be modelled using a one-dimensional, centred Gaussian. In the Appendix, we derive an expression for the description length of one-dimensional, bounded and quantised data, coded using a centred Gaussian model. To utilise this result, we first have to calculate a strict upper-bound $R$ on the range of our data and also estimate a suitable value for the data quantisation parameter $\Delta$. Suppose that, for our original shape data, we know that the coordinates of our sample points are strictly bounded thus:

$$-\frac{r}{2} \leq x_{i\alpha} \leq \frac{r}{2} \quad \text{for all } \alpha = 1 \text{ to } n_p,\; i = 1 \text{ to } n_s. \qquad (9)$$

Then, the strict bound for the coordinates $\{y_i^m\}$ is given by:

$$R = r\sqrt{n_p}, \quad \text{so that } |y_i^m| \leq R \text{ for all } i, m. \qquad (10)$$

The data quantisation parameter $\Delta$ can be determined by quantising the coordinates of our original sample points. Comparison of the original shape and the quantised shape then allows a maximum permissible value of $\Delta$ to be determined. For example, for boundaries obtained from voxelated images, $\Delta$ will typically be of the order of the voxel size. This also determines our lower bound on the modelled variance, $\sigma_{\min} \equiv 2\Delta$. The parameters $R$ and $\Delta$ are held constant for a given training set, hence we need not consider the description length for the transmission of these values. Our original data values $Y^m$ are now replaced by their quantised¹ values $\hat{Y}^m$. The variance of the quantised data is then calculated thus:

$$(\sigma^m)^2 = \frac{1}{n_s}\sum_{i=1}^{n_s} (\hat{y}_i^m)^2. \qquad (11)$$

The description length $D^m$ for each direction is then given by (see Appendix):

– If $\sigma^m \geq \sigma_{\min}$: $D^m = D^{(1)}(\sigma^m, n_s, R, \Delta)$
– If $\sigma^m < \sigma_{\min}$ but the range of $\hat{Y}^m \geq \Delta$: $D^m = D^{(2)}(\sigma^m, n_s, R, \Delta)$
– Else: $D^m = 0$.

¹ We will use $\hat{a}$ to denote the quantised value of a continuum variable $a$.
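The per-direction coding rule above can be written as a small dispatch function; this is only an illustrative sketch of ours, where D1 and D2 stand for the Appendix expressions $D^{(1)}$ and $D^{(2)}$, which are not reproduced here.

```python
def direction_description_length(sigma_m, Y_hat_m, n_s, R, delta, D1, D2):
    """Description length D^m for one principal direction.

    D1 and D2 are callables implementing D^(1) and D^(2); sigma_min = 2*delta.
    """
    sigma_min = 2.0 * delta
    data_range = max(Y_hat_m) - min(Y_hat_m)
    if sigma_m >= sigma_min:
        return D1(sigma_m, n_s, R, delta)
    elif data_range >= delta:
        return D2(sigma_m, n_s, R, delta)
    else:
        return 0.0
```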
3.3 The Objective Function
Let us define $n_g$ to be the number of directions for which the first of the above criteria holds, and $n_{\min}$ the number which satisfy the second. Then, since the directions are ordered in terms of non-increasing eigenvalue/variance, the total description length for our training set, and our objective function, can be written thus:

$$\mathcal{F} = \sum_{p=1}^{n_g} D^{(1)}(\sigma^p, n_s, R, \Delta) + \sum_{q=n_g+1}^{n_g+n_{\min}} D^{(2)}(\sigma^q, n_s, R, \Delta). \qquad (12)$$

We now consider the form of this objective function. For the linear model defined earlier (1):

$$n_p \lambda^m = \frac{1}{n_s}\sum_{i=1}^{n_s} (y_i^m)^2. \qquad (13)$$

In the limit $\Delta \to 0$, the quantised values in $\hat{Y}$ approach their continuum values, so that:

$$\sigma^m \to \sqrt{n_p \lambda^m}. \qquad (14)$$

If we also consider the limit where $n_s$ is sufficiently large, it can be seen that the functions $D^{(1)}$ and $D^{(2)}$ can be written in the form:

$$D^{(1)}(\sigma^m, n_s, R, \Delta) \approx f(R, \Delta, n_s) + (n_s - 2)\ln \sigma^m,$$
$$D^{(2)}(\sigma^m, n_s, R, \Delta) \approx f(R, \Delta, n_s) + (n_s - 2)\ln \sigma_{\min} + \frac{(n_s + 3)}{2}\left[\left(\frac{\sigma^m}{\sigma_{\min}}\right)^2 - 1\right], \qquad (15)$$
(15) −1
where $f$ is some function which depends only on $R$, $\Delta$ and $n_s$. So, in this dual limit, the part of the objective function which depends on the $\{\sigma^m\}$ contains terms similar to the determinant of the covariance matrix (that is, $\ln \lambda^m$ summed over $m$) used by Kotcheff and Taylor [15]. However, our objective function is well-defined even in the limit $\lambda^m \to 0$, where in fact such a direction makes no contribution to the objective function, whereas in the form used previously, without the addition of artificial correction terms, it would have an infinitely large contribution.
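Under the large-$n_s$, small-$\Delta$ approximations of (13)–(15), the objective can be evaluated directly from the model eigenvalues. The following sketch is our own reading of those formulas (the common term $f(R, \Delta, n_s)$ is dropped because it does not depend on the parameterisation):

```python
import numpy as np

def mdl_objective(lam, n_p, n_s, delta):
    """Approximate description-length objective F from eigenvalues {lambda_m},
    using the approximations (15); constant terms f(R, delta, n_s) are omitted."""
    sigma_min = 2.0 * delta
    F = 0.0
    for lam_m in lam:
        sigma_m = np.sqrt(n_p * lam_m)            # eq. (14)
        if sigma_m >= sigma_min:
            F += (n_s - 2) * np.log(sigma_m)      # D^(1) term
        elif sigma_m > 0.0:
            F += (n_s - 2) * np.log(sigma_min) \
                 + 0.5 * (n_s + 3) * ((sigma_m / sigma_min) ** 2 - 1.0)  # D^(2) term
        # directions with sigma_m == 0 contribute nothing
    return F
```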
4 Manipulating the Parameterisation of a Surface
In order to build a statistical shape model, we need to sample a number of corresponding points on each shape. As demonstrated in section 2, the choice of correspondences determines the quality of the model. We have chosen to cast this correspondence problem as that of defining the parameterisation φ_i of each shape so as to minimise the value of F in (12). We propose to solve for {φ_i} using numerical optimisation, which requires that we formulate an explicit (and ideally, compact) representation of the φ_i that can be manipulated. We would also like the computational complexity to be minimal (i.e. it should only involve the evaluation of elementary functions).
Each surface in our training set is originally represented as a triangular mesh that is topologically equivalent to a sphere. We obtain an initial parameterisation by mapping each surface mesh to a unit sphere, where the mapping must be such that there is no folding or tearing. Each mapped mesh can then be represented thus:

S_i = S_i(θ, ψ), S_i ⊂ R^3   (16)

where S_i is the set of original positions of the mesh vertices for the ith surface in Euclidean space, and (θ, ψ) are the spherical polar coordinates of each mapped vertex. Various approaches have been described to achieve such mappings [1,2,24]. Since we intend to optimise the parameterisations, the final result should not depend significantly on this initial mapping. We have used the method described by Brechbulher [2] to produce the results reported below.

Changes in parameterisation of a given surface correspond to altering the positions of the mapped vertices on the sphere. That is: S_i → S_i', θ → θ', ψ → ψ', where

S_i(θ, ψ) = S_i'(θ', ψ') and θ' = φ_i^θ(θ, ψ), ψ' = φ_i^ψ(θ, ψ).   (17)

Note that we have separate parameterisation functions φ_i = (φ_i^θ, φ_i^ψ) for each surface. Valid parameterisation functions φ_i correspond to exact homeomorphic mappings of the sphere. That is, mappings that are continuous, one-to-one and onto. In the following sections, we present a number of such mappings.

4.1 Symmetric Theta Transformations
Consider an arbitrary point P on the unit sphere. For simplicity, assume that the spherical polar co-ordinates (θ, ψ) on the sphere have been redefined so that P corresponds to the point θ = 0. Let us first consider a rotationally symmetric mapping that reparameterises the θ coordinate:

θ → f(θ).   (18)

For the mapping to be homeomorphic and continuous with the identity, f must be a differentiable non-decreasing monotonic function over the range 0 < θ < π, with f(0) = 0, f(π) = π. Any such monotonic function f can be rewritten in terms of the cumulative distribution function of some density function ρ(θ), defined over the range 0 ≤ θ ≤ π. As our normalised density function, we take a constant term plus a wrapped Cauchy distribution. The wrapped Cauchy [16] is a normalisable, uni-modal distribution for circular data, of variable width, which has an analytic indefinite integral:

ρ(θ) = \frac{1}{N}\left[1 + A\,\frac{1 − α^2}{1 + α^2 − 2α\cos θ}\right]   (19)

where N = π[1 + A]. Hence:

f(θ) = π\int_0^θ ρ(s)\,ds = \frac{1}{1 + A}\left[θ + A\arccos\left(\frac{(1 + α^2)\cos θ − 2α}{1 + α^2 − 2α\cos θ}\right)\right]   (20)
Fig. 2. Left: Original sphere. Right: Sphere after asymmetric θ transformation.
where α (α ≡ e^{−a}, a ∈ R) is the width, and A (A ≥ 0) is the amplitude of the Cauchy. The constant term is included so that f(θ) = θ when A = 0, i.e. the parameterisation is unchanged when the Cauchy has zero magnitude.
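As a concrete illustration (our own sketch, not code from the paper), the reparameterisation (20) can be evaluated directly; the function name and the numerical clipping of the arccos argument are assumptions added for robustness.

```python
import numpy as np

def symmetric_theta_transform(theta, A, a):
    """Reparameterise theta according to Eqs. (19)-(20).

    theta : polar angle(s) in [0, pi], measured from the kernel centre P
    A     : kernel amplitude (A >= 0); A = 0 leaves theta unchanged
    a     : kernel width parameter, alpha = exp(-a)
    """
    alpha = np.exp(-a)
    cos_t = np.cos(theta)
    arg = ((1 + alpha**2) * cos_t - 2 * alpha) / (1 + alpha**2 - 2 * alpha * cos_t)
    arg = np.clip(arg, -1.0, 1.0)       # guard against round-off outside [-1, 1]
    return (theta + A * np.arccos(arg)) / (1.0 + A)

# Sanity checks: identity at A = 0, and f(0) = 0, f(pi) = pi for any A.
theta = np.linspace(0.0, np.pi, 5)
assert np.allclose(symmetric_theta_transform(theta, 0.0, 1.0), theta)
assert np.isclose(symmetric_theta_transform(np.pi, 2.0, 1.0), np.pi)
```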
4.2 Asymmetric Theta Transformations
We can also perform non-symmetric transformations:

θ → f(θ, ψ).   (21)

Define ψ' to be the ψ coordinate redefined so that a point Q (Q ≠ P) corresponds to ψ' = 0. An asymmetric transformation around the point P, towards a point Q, can be achieved using (20) and making the amplitude A a smooth function of the ψ' coordinate:

A → A(ψ').   (22)

One such way to do this is to use the wrapped Cauchy distribution to obtain:

A(ψ') = A_0\left[\frac{1 − β^2}{1 + β^2 − 2β\cos ψ'} − \frac{1 − β^2}{(1 + β)^2}\right]   (23)

where β (β ≡ e^{−b}) is the width of the subsidiary Cauchy. We have chosen the formulation such that A(ψ') has a minimum value of zero. An example of an asymmetric transformation is shown in figure 2.
4.3 Shear and Twist
We also consider transformations of the ψ coordinate. This is equivalent to shearing and twisting the sphere about the axis defined by the point P. So, for example, we could consider a reparameterisation of the form:

ψ → ψ + g(θ)   (24)

where

g(θ) = \frac{B}{2π}\,\frac{1 − γ^2}{1 + γ^2 − 2γ\cos(θ − θ_0)}   (25)

where B is the amplitude, γ (γ ≡ e^{−c}, c ∈ R) is the width and θ_0 is the position of the centre. This transformation is continuous with the identity at B = 0 (i.e. the transformation has no effect when B = 0). It can also be localised about θ = θ_0 in the limit of zero width. An example of such a transformation is shown in Figure 3. Figure 4 shows an example of applying a combination of all the transformations described above.
5 Optimising the Parameterisations
We now wish to manipulate {φ_i} so as to minimise F in (12). We have found that, for the objects modelled in this paper, the symmetric theta transformations alone provide a sufficient group of reparameterisations. To manipulate the parameterisations, we fix the position P and width a of each of the Cauchy kernels and vary its magnitude A. To select the positions P of the kernels, we (approximately) uniformly sample the sphere and centre the kernels at these sample points. It is not strictly possible to position an arbitrary number of equidistant points on the sphere, but a good approximation can be obtained by recursive subdivision of a polyhedron (initially an octahedron) and projecting the points onto the sphere surface. At each level of recursion, each triangle is divided into 4 smaller triangles by placing 3 new vertices halfway along each edge; this gives 12 × 4^{k−2} new kernels on the kth level of recursion.

We choose to use a multiresolution approach to the optimisation. The basic idea is to begin with broad Cauchies and to iteratively refine the parameterisation by introducing additional, narrower Cauchies between the existing ones. We have found, by cross validation, that the best value for the width parameter is a_k = 1/2^{k−2}, i.e. the widths are halved at each level of recursion. A local optimisation algorithm is employed to find the optimum magnitude of each kernel. At each level of recursion, we have 12 × 4^{k−2} kernels for each shape, creating a 12n_s × 4^{k−2}-dimensional configuration space. This is generally too difficult to optimise robustly and reliably. We have found, however, that optimising the magnitude of a single kernel in turn on each shape gives better results and is substantially quicker. We used the Nelder-Mead simplex algorithm [17] to obtain the results reported in section 6.

Fig. 3. Left: Original sphere. Right: Sphere after shear transformation.

Fig. 4. Left: Original sphere. Right: Sphere after combined transformation.
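The kernel placement just described can be sketched as follows (our own code, not the authors'): recursive subdivision of an octahedron with projection onto the sphere, which reproduces the 6, 18, 66, ... vertex counts quoted above.

```python
import numpy as np

def subdivide_octahedron(levels):
    """Return approximately uniform kernel centres on the unit sphere.

    Starts from an octahedron and, at each level, splits every triangle
    into four by inserting edge midpoints projected onto the sphere.
    """
    verts = [np.array(v, float) for v in
             [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]]
    faces = [(0, 2, 4), (2, 1, 4), (1, 3, 4), (3, 0, 4),
             (2, 0, 5), (1, 2, 5), (3, 1, 5), (0, 3, 5)]
    for _ in range(levels):
        cache, new_faces = {}, []
        def midpoint(i, j):
            key = (min(i, j), max(i, j))
            if key not in cache:
                m = verts[i] + verts[j]
                verts.append(m / np.linalg.norm(m))   # project onto the sphere
                cache[key] = len(verts) - 1
            return cache[key]
        for (a, b, c) in faces:
            ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
            new_faces += [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]
        faces = new_faces
    return np.array(verts)

centres = subdivide_octahedron(2)
print(len(centres))   # 6, 18, 66, ... as the number of levels grows
```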
6 Results
We present qualitative and quantitative results of applying our method to a training set of 16 rat kidneys and 8 anterior horns of brain ventricles. In each case the shapes were segmented by hand from a set of 3D magnetic resonance images. The algorithm was run for three levels of recursion, giving a total of 66 kernels per shape. We compare the results to models built by uniformly sampling the surface. In figures 5 and 6, we show qualitative results by displaying the variation captured by the first three modes of each model (b_m varied by ±2 standard deviations seen across the training set). We show quantitative results in table 1, tabulating the variance explained by each mode, the total variance and the value of F.
Fig. 5. The first three modes of variation ±2λm of the automatically produced model of the brain ventricle
Fig. 6. The first three modes of variation ±2λm of the automatically produced model of the rat kidneys
Table 1. A quantitative comparison of each model showing the variance explained by each mode. F is the value of the objective function and VT is the total variance.

Kidneys:
  Mode        1     2     3     4     5     6    VT     F
  Automatic  3.06  2.75  1.35  0.94  0.76  0.46  10.51  1396
  Uniform    3.23  3.01  1.70  1.08  0.87  0.57  11.98  1417

Ventricles:
  Mode        1     2     3     4     5     6    VT     F
  Automatic 11.98  1.41  1.16  0.69  0.36  0.22  15.90   366
  Uniform   12.58  1.77  1.49  0.79  0.45  0.24  17.46   371
Fig. 7. A plot showing the cumulative variance described by each mode of the model. This is a measure of the compactness of each model.
In both cases F is substantially better for the optimised model. Figure 7 shows the cumulative variance plotted against the number of modes used for each model; this measures the compactness of the model. The plots show that, for the entire range of modes, the optimised models are substantially more compact than those obtained by uniformly sampling the surface. To test the generalisation ability of the models, we performed leave-one-out tests on each model. In figure 8 we show the approximation error for representing an unseen shape as a function of the number of modes used in the reconstruction. The optimised models perform substantially better than the uniformly sampled models whatever the number of modes used, demonstrating superior generalisation ability.
7 Discussion and Conclusions
We have described a method for building 3D statistical shape models by automatically establishing optimal correspondences between sets of shapes. We have shown that the method produces models that are more compact than those based on uniformly-sampled shapes and have substantially better generalisation ability. We have described a novel method of reparameterising closed surfaces. The method guarantees that the resulting reparameterisation is homeomorphic - an essential constraint. With a suitable choice of boundary conditions, the representation can also be applied to reparameterise open surfaces. Although we have only reported results on a relatively small set of example shapes, we believe the method scales easily to deal with sets of 40-50 shapes. The local optimisation algorithm employed will, however, be substantially slower. We are currently looking for more robust and faster methods of finding the minimum. The MDL objective function and optimisation approach provides a unified framework for dealing with all aspects of model building. For example, we plan to investigate including the alignment of the shapes in the optimisation.

Fig. 8. Leave one out tests on the models. The plot shows the number of modes used against the mean squared approximation error. This measures the ability of each model to generalise to unseen examples.
Appendix: Description Length for One-Dimensional Centred Gaussian Models

In this Appendix, we show how to construct an expression for the description length required to send a one-dimensional, centred data set using a Gaussian model. Our data model is the family of centred Gaussian distributions:

ρ(y; σ) = \frac{1}{σ\sqrt{2π}}\exp\left(−\frac{y^2}{2σ^2}\right).   (26)

Following the two-part coding scheme [19], the total description length is computed as the sum of two parts; the description length for sending the value and accuracy of the parameter σ, and the description length for coding the data according to this model. To calculate these description lengths, we use the fundamental result that the ideal-coding codeword length for a discrete event A, encoded using a statistical model with associated event probabilities P(A), is given by the Shannon coding codeword length [22]:

L(A; P) = −\log_2 P(A) bits, or: −\ln P(A) nats.²   (27)
We take our centred data set to be Y = {y_i : i = 1 to n_s}, where the data is known to lie within a strictly bounded region. To reduce the continuum values {y_i} to a set of discrete events, we quantize the data values using a parameter ∆, so that Y → \hat{Y} = {\hat{y}_i : i = 1 to n_s}³, where for any quantized value \hat{y} from any possible data set:

−R < \hat{y} < R and \hat{y} = m∆, m ∈ Z.   (28)
Coding the Parameters. We will assume that the parameter σ is described to an accuracy δ. We will also assume that our now quantized parameter \hat{σ} is bounded, hence has the allowed values:

\hat{σ} = nδ, n ∈ N and σ_min ≤ \hat{σ} ≤ σ_max.   (29)

Given the absence of any prior knowledge, we assume a flat distribution for \hat{σ} over this range, which gives us the codeword length:

L_{\hat{σ}} = \ln\left(\frac{σ_max − σ_min}{δ}\right).   (30)

Note that our receiver cannot decrypt the value of \hat{σ} without knowing the value of δ. If we assume that the quantization parameter δ is of the form:

δ = 2^{±k}, k ∈ N   (31)

then it can easily be seen that it can be coded directly with a codeword length:

L_δ = 1 + |\log_2 δ| bits ≈ 1 + |\ln δ| nats   (32)
where the additional bit/nat codes for the sign in the exponent of δ. The total code length for transmitting the parameters is then given by L_{\hat{σ}} + L_δ.

Coding the Data. For our Gaussian data model, the probability P(\hat{y}) associated with a bin centred at \hat{y} is:

P(\hat{y}) = \int_{\hat{y}−∆/2}^{\hat{y}+∆/2} ρ(k; \hat{σ})\,dk ≈ \frac{∆}{\hat{σ}\sqrt{2π}}\exp\left(−\frac{\hat{y}^2}{2\hat{σ}^2}\right).   (33)

² In what follows, we will restrict ourselves to natural logarithms, and codeword lengths in units of nats. However, expressions can be easily converted to the more familiar binary lengths by noting that 1 bit ≡ ln 2 nats.
³ We will use \hat{α} to denote the quantized value of the continuum value α.
It can be shown numerically that this is a very good approximation (mean fractional error ≤ 1.0% ± 0.8%) for all values \hat{σ} ≥ 2∆, hence we will take σ_min = 2∆. The code length for the data is then:

L_data = −\sum_{i=1}^{n_s} \ln P(\hat{y}_i) = −n_s \ln ∆ + \frac{n_s}{2}\ln(2π\hat{σ}^2) + \frac{1}{2\hat{σ}^2}\sum_{i=1}^{n_s}\hat{y}_i^2   (34)
The variance of the quantized data is:

σ^2 = \frac{1}{n_s}\sum_{i=1}^{n_s}\hat{y}_i^2 and σ_max = R.   (35)
In general σ differs from the nearest quantized value \hat{σ} thus:

\hat{σ} = σ + d_σ, |d_σ| ≤ δ/2.   (36)

So, averaging over an ensemble of data sets, and assuming a flat distribution for d_σ over this range, we find:

⟨d_σ^2⟩ = \frac{δ^2}{12}, \quad ⟨\frac{1}{\hat{σ}^2}⟩ ≈ \frac{1}{σ^2}\left(1 + \frac{δ^2}{4σ^2}\right), \quad ⟨\ln \hat{σ}^2⟩ ≈ \ln σ^2 − \frac{δ^2}{12σ^2}.   (37)
Substituting these expressions into (34) and using (35) gives the following expression for the Description Length of the data:

L_data = −n_s \ln ∆ + \frac{n_s}{2}\ln(2πσ^2) + \frac{n_s}{2} + \frac{n_s δ^2}{12σ^2}.   (38)

The total Description Length for the parameters and the data is then:

L^{(1)} = 1 + \ln\left(\frac{σ_max − σ_min}{δ}\right) + |\ln δ| − n_s \ln ∆ + \frac{n_s}{2}\ln(2πσ^2) + \frac{n_s}{2} + \frac{n_s δ^2}{12σ^2}.   (39)
By differentiating w.r.t. δ and setting the derivative to zero, we find that the optimum parameter accuracy δ is:

δ^*(σ, n_s) = \min\left(1, σ\sqrt{\frac{12}{n_s}}\right)   (40)

which then allows us to write the above expression as:

L^{(1)} = D^{(1)}(σ, n_s, R, ∆).   (41)
In the case where σ < σ_min, but the quantized data occupies more than one bin, we will model the data using a Gaussian of width σ_min and a quantization parameter δ = δ^*(σ_min, n_s). An analogous derivation to that given above then gives us the Description Length:

L^{(2)} = 1 + \ln\left(\frac{σ_max − σ_min}{δ}\right) + |\ln δ| − n_s \ln ∆ + \frac{n_s}{2}\ln(2πσ_min^2) − \frac{n_s δ^2}{24σ_min^2} + \frac{n_s σ^2}{2σ_min^2}\left(1 + \frac{δ^2}{4σ_min^2}\right) ≡ D^{(2)}(σ, n_s, R, ∆).   (42)
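For reference, the description lengths derived in this Appendix can be transcribed directly into code; the following sketch (ours, with our own function names) implements δ* of (40) and the two description lengths (39) and (42).

```python
import numpy as np

def delta_star(sigma, n_s):
    """Optimum parameter accuracy, Eq. (40)."""
    return min(1.0, sigma * np.sqrt(12.0 / n_s))

def description_length_1(sigma, n_s, R, Delta):
    """D^(1) of Eq. (39), for sigma >= sigma_min = 2 * Delta."""
    sigma_min, sigma_max = 2.0 * Delta, R
    d = delta_star(sigma, n_s)
    return (1.0 + np.log((sigma_max - sigma_min) / d) + abs(np.log(d))
            - n_s * np.log(Delta)
            + 0.5 * n_s * np.log(2.0 * np.pi * sigma**2)
            + 0.5 * n_s
            + n_s * d**2 / (12.0 * sigma**2))

def description_length_2(sigma, n_s, R, Delta):
    """D^(2) of Eq. (42), for sigma < sigma_min but data spanning > one bin."""
    sigma_min, sigma_max = 2.0 * Delta, R
    d = delta_star(sigma_min, n_s)
    return (1.0 + np.log((sigma_max - sigma_min) / d) + abs(np.log(d))
            - n_s * np.log(Delta)
            + 0.5 * n_s * np.log(2.0 * np.pi * sigma_min**2)
            - n_s * d**2 / (24.0 * sigma_min**2)
            + (n_s * sigma**2 / (2.0 * sigma_min**2))
              * (1.0 + d**2 / (4.0 * sigma_min**2)))
```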
The only remaining case is where all the quantized data lies in one bin at the origin. This requires no further information to describe it fully, hence has a description length of zero.

Acknowledgements. The authors would like to thank Alan Brett and Johann Kim for their contribution to this work. Rhodri Davies would like to acknowledge the BBSRC and AstraZeneca⁴ for their financial support.
References 1. Angenent, S., S. Haker, A. Tannenbaum and R. Kikinis: “On the laplace-beltrami operator and brain surface flattening”. IEEE Trans. Medical Imaging, 18: p. 700711. 1999 2. Brechbulher, C., G. Gerig and O. Kubler, “Parameterisation of closed surfaces for 3-D shape description”, in Computer Vision and Image Understanding, 61(2), p154-170, 1995 3. Brett A.D. and C.J. Taylor, “A Method of automated landmark generation for automated 3D PDM construction”, in Image and Vision Computing, 18(9), p. 739-748, 2000 4. Brett A.D. and C.J. Taylor, “Construction of 3D Shape Models of Femoral Articular Cartialge Using Harmonic Maps”, in MICCAI’00, p. 1205-1214, 2000 5. Cootes, T., A. Hill, C. Taylor, and J. Haslam, “The use of Active shape models for locating structures in medical images”. Image and Vision Computing, 12: p. 355-366. 1994. 6. Cootes, T., C. Taylor, D. Cooper and J. Graham, “Active shape models - their training and application”. Computer Vision and Image Understanding, 61: p. 3859. 1995. 7. Christensen, G. E., S. C. Joshi, M.I. Miller, “Volumetric Transformation of Brain Anatomy”, in IEEE Trans. Medical Imaging, 16, p864-877, 1997 8. Davies, Rh. H., C.J. Twining, T.F. Cootes, J.C. Waterton and C.J. Taylor, “A Minimum Description Length Approach to Statistical Shape Modelling”, in IEEE trans. Medical Imaging, To appear 9. Davies, Rh. H., T.F. Cootes, C.J. Twining and C.J. Taylor, “An Information Theoretic Approach to Statistical Shape Modelling”, in British Machine Vision Conference - BMVC’01, p 3-12, 2001 10. Fluete, M., S. Lavalee, “Building a Complete Surface Model from Sparse Data Using Statistical Shape Models: Application to Computer Assisted Knee Surgery”, in MICCAI’98, p878-887, 1998 11. M. H. Hansen and B. Yu, “Model Selection and the Principle of Minimum Description Length”, Technical Memorandum, Bell Labs, Murray Hill, N.J. 1998. 12. Hill,A., A. Thornham and C.J. Taylor, “Model based interpretation of 3D medical images”, in British Machine Vision Conference, BMVC’93, p339-348, 1993 13. Joshi et. al , “Gaussian Random Fields on Sub-Manifolds for Characterizing Brain Surfaces”, in IPMI’97, p381-386, 1997 14. Kelemen, A., G. Szekely, and G. Gerig, “Elastic model-based segmentation of 3-D neuroradiological data sets”. IEEE Transactions On Medical Imaging, 18(10): p. 828-839. 1999. 4
⁴ AstraZeneca, Alderley Park, Macclesfield, Cheshire, UK
15. Kotcheff, A.C.W. and C.J. Taylor, ”Automatic Construction of Eigenshape Models by Direct Optimisation”. Medical Image Analysis, 2: p. 303-314. 1998. 16. Mardia, K. V., Statistics of Directional Data, Academic Press,London,1972 17. Press, W.H., S. A. Teukolsky, W. T. Vetterling, B. P. Flannery, Numerical Recipes in C, Cambridge University Press; 1993 18. Rissanen, J. R. ”A universal prior for integers and estimation by minimum description length”, Annals of Statistics, vol. 11, no. 2, pp. 416-431, 1983. 19. Rissanen, J. R. Stochastic Complexity in Statistical Inquiry, World Scientific Series in Computer Science, Vol. 15, World Scientific Publishing Co., Singapore, 1989. 20. Rissanen, J. R. “Fisher Information and Stochastic Complexity”, IEEE Transactions on Information Theory, vol. 42, no. 1, pp. 40-47, 1996. 21. Rueckert, D., Frangi, F., Schnabel, J.A., “Automatic construction of 3D statistical deformation models using non-rigid registration”, MICCAI’01 p77-84. 2001. 22. Shannon, C. E., “A mathematical theory of communication,”, Bell System Technical Journal, vol. 27, pp. 379-423 and 623-656, 1948. 23. Szeleky, G., S. Lavalee, “Matching 3-D Anatomical Surface with Non-Rigid Deformations using Octree-Splines”, in International Journal of Computer Vision, 18(2), p171-186, 1996 24. D. Tosun and J.L. Prince, “Hemispherical map for the human brain cortex,” Proc. SPIE Medical Imaging 2001, p 290-300. 2001. 25. Wang, Y., B. S. Peterson, and L. H. Staib. “Shape-based 3D surface correspondence using geodesics and local geometry”. CVPR 2000, v. 2: p. 644-51. 2000.
Approximate Thin Plate Spline Mappings

Gianluca Donato¹ and Serge Belongie²

¹ Digital Persona, Inc., Redwood City, CA 94063
[email protected]
² U.C. San Diego, La Jolla, CA 92093-0114
[email protected]
Abstract. The thin plate spline (TPS) is an effective tool for modeling coordinate transformations that has been applied successfully in several computer vision applications. Unfortunately the solution requires the inversion of a p × p matrix, where p is the number of points in the data set, thus making it impractical for large scale applications. As it turns out, a surprisingly good approximate solution is often possible using only a small subset of corresponding points. We begin by discussing the obvious approach of using the subsampled set to estimate a transformation that is then applied to all the points, and we show the drawbacks of this method. We then proceed to borrow a technique from the machine learning community for function approximation using radial basis functions (RBFs) and adapt it to the task at hand. Using this method, we demonstrate a significant improvement over the naive method. One drawback of this method, however, is that is does not allow for principal warp analysis, a technique for studying shape deformations introduced by Bookstein based on the eigenvectors of the p × p bending energy matrix. To address this, we describe a third approximation method based on a classic matrix completion technique that allows for principal warp analysis as a by-product. By means of experiments on real and synthetic data, we demonstrate the pros and cons of these different approximations so as to allow the reader to make an informed decision suited to his or her application.
1 Introduction
The thin plate spline (TPS) is a commonly used basis function for representing coordinate mappings from R^2 to R^2. Bookstein [3] and Davis et al. [5], for example, have studied its application to the problem of modeling changes in biological forms. The thin plate spline is the 2D generalization of the cubic spline. In its regularized form the TPS model includes the affine model as a special case. One drawback of the TPS model is that its solution requires the inversion of a large, dense matrix of size p × p, where p is the number of points in the data set. Our goal in this paper is to present and compare three approximation methods that address this computational problem through the use of a subset of corresponding points. In doing so, we highlight connections to related approaches in the area of Gaussian RBF networks that are relevant to the TPS mapping problem.
Finally, we discuss a novel application of the Nyström approximation [1] to the TPS mapping problem. Our experimental results suggest that the present work should be particularly useful in applications such as shape matching and correspondence recovery (e.g. [2,7,4]) as well as in graphics applications such as morphing.
2 Review of Thin Plate Splines
Let v_i denote the target function values at locations (x_i, y_i) in the plane, with i = 1, 2, ..., p. In particular, we will set v_i equal to the target coordinates (x_i', y_i') in turn to obtain one continuous transformation for each coordinate. We assume that the locations (x_i, y_i) are all different and are not collinear. The TPS interpolant f(x, y) minimizes the bending energy

I_f = \iint_{R^2} (f_{xx}^2 + 2f_{xy}^2 + f_{yy}^2)\,dx\,dy
and has the form

f(x, y) = a_1 + a_x x + a_y y + \sum_{i=1}^{p} w_i U(\|(x_i, y_i) − (x, y)\|)

where U(r) = r^2 \log r. In order for f(x, y) to have square integrable second derivatives, we require that

\sum_{i=1}^{p} w_i = 0 and \sum_{i=1}^{p} w_i x_i = \sum_{i=1}^{p} w_i y_i = 0.

Together with the interpolation conditions, f(x_i, y_i) = v_i, this yields a linear system for the TPS coefficients:

\begin{bmatrix} K & P \\ P^T & O \end{bmatrix}\begin{bmatrix} w \\ a \end{bmatrix} = \begin{bmatrix} v \\ o \end{bmatrix}   (1)

where K_{ij} = U(\|(x_i, y_i) − (x_j, y_j)\|), the ith row of P is (1, x_i, y_i), O is a 3 × 3 matrix of zeros, o is a 3 × 1 column vector of zeros, w and v are column vectors formed from w_i and v_i, respectively, and a is the column vector with elements a_1, a_x, a_y. We will denote the (p + 3) × (p + 3) matrix of this system by L; as discussed e.g. in [7], L is nonsingular. If we denote the upper left p × p block of L^{−1} by L_p^{−1}, then it can be shown that

I_f ∝ v^T L_p^{−1} v = w^T K w.
When there is noise in the specified values v_i, one may wish to relax the exact interpolation requirement by means of regularization. This is accomplished by minimizing

H[f] = \sum_{i=1}^{n} (v_i − f(x_i, y_i))^2 + λ I_f.
The regularization parameter λ, a positive scalar, controls the amount of smoothing; the limiting case of λ = 0 reduces to exact interpolation. As demonstrated in [9,6], we can solve for the TPS coefficients in the regularized case by replacing the matrix K by K + λI, where I is the p × p identity matrix.
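As a concrete illustration of the exact solve (a sketch of ours, not code from the paper), the following builds the (p+3) × (p+3) system of Equation (1), optionally with the regularization K → K + λI, and evaluates the fitted mapping; one such fit is needed per target coordinate.

```python
import numpy as np

def fit_tps(points, values, lam=0.0):
    """Solve the (p+3) x (p+3) TPS system of Eq. (1), optionally regularized.

    points : (p, 2) array of (x_i, y_i)
    values : (p,) array of target values v_i (one coordinate at a time)
    lam    : regularization parameter lambda (K -> K + lambda * I)
    Returns (w, a) with w the p nonlinear coefficients and a = (a_1, a_x, a_y).
    """
    p = len(points)
    r2 = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    with np.errstate(divide='ignore', invalid='ignore'):
        K = np.where(r2 > 0, 0.5 * r2 * np.log(r2), 0.0)   # U(r) = r^2 log r
    P = np.hstack([np.ones((p, 1)), points])
    L = np.zeros((p + 3, p + 3))
    L[:p, :p] = K + lam * np.eye(p)
    L[:p, p:] = P
    L[p:, :p] = P.T
    rhs = np.concatenate([values, np.zeros(3)])
    sol = np.linalg.solve(L, rhs)
    return sol[:p], sol[p:]

def tps_eval(points, w, a, query):
    """Evaluate f(x, y) = a1 + ax*x + ay*y + sum_i w_i U(||(x_i, y_i) - (x, y)||)."""
    r2 = np.sum((query[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    with np.errstate(divide='ignore', invalid='ignore'):
        U = np.where(r2 > 0, 0.5 * r2 * np.log(r2), 0.0)
    return a[0] + query @ a[1:] + U @ w
```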
3 Approximation Techniques
Since inverting L is an O(p^3) operation, solving for the TPS coefficients can be very expensive when p is large. We will now discuss three different approximation methods that reduce this computational burden to O(m^3), where m can be as small as 0.1p. The corresponding savings factors in memory (5x) and processing time (1000x) thus make TPS methods tractable when p is very large. In the discussion below we use the following partition of the K matrix:

K = \begin{bmatrix} A & B \\ B^T & C \end{bmatrix}   (2)

with A ∈ R^{m×m}, B ∈ R^{m×n}, and C ∈ R^{n×n}. Without loss of generality, we will assume the p points are labeled in random order, so that the first m points represent a randomly selected subset.

3.1 Method 1: Simple Subsampling
The simplest approximation technique is to solve for the TPS mapping between a randomly selected subset of the correspondences. This amounts to using A in place of K in Equation (1). We can then use the recovered coefficients to extrapolate the TPS mapping to the remaining points. The result of applying this approximation to some sample shapes is shown in Figure 1. In this case, certain parts were not sampled at all, and as a result the mapping in those areas is poor.

3.2 Method 2: Basis Function Subset
An improved approximation can be obtained by using a subset of the basis functions with all of the target values. Such an approach appears in [10,6] and Section 3.1 of [8] for the case of Gaussian RBFs. In the TPS case, we need to account for the affine terms, which leads to a modified set of linear equations. Starting from the cost function

R[\tilde{w}, a] = \frac{1}{2}\|v − \tilde{K}\tilde{w} − Pa\|^2 + \frac{λ}{2}\tilde{w}^T A \tilde{w},
Fig. 1. Thin plate spline (TPS) mapping example. (a,b) Template and target synthetic fish shapes, each consisting of 98 points. (Correspondences between the two shapes are known.) (c) TPS mapping of (a) onto (b) using the subset of points indicated by circles (Method 1). Corresponding points are indicated by connecting line segments. Notice the quality of the mapping is poor where the samples are sparse. An improved approximation can be obtained by making use of the full set of target values; this is illustrated in (d), where we have used Method 2 (discussed in Section 3.2). A similar mapping is found for the same set of samples using Method 3 (see Section 3.3). In (e-h) we observe the same behavior for a pair of handwritten digits, where the correspondences (89 in all) have been found using the method of [2].
we minimize it by setting ∂R/∂\tilde{w} and ∂R/∂a to zero, which leads to the following (m + 3) × (m + 3) linear system,

\begin{bmatrix} \tilde{K}^T\tilde{K} + λA & \tilde{K}^T P \\ P^T\tilde{K} & P^T P \end{bmatrix}\begin{bmatrix} \tilde{w} \\ a \end{bmatrix} = \begin{bmatrix} \tilde{K}^T v \\ P^T v \end{bmatrix}   (3)

where \tilde{K} = [A\ B]^T is formed from the first m columns of K, \tilde{w} is an m × 1 vector of TPS coefficients, and the rest of the entries are as before. Thus we seek weights for the reduced set of basis functions that take into account the full set of p target values contained in v. If we call \tilde{P} the first m rows of P and \tilde{I} the first m columns of the p × p identity matrix, then under the assumption \tilde{P}^T\tilde{w} = 0, Equation (3) is equivalent to

\begin{bmatrix} \tilde{K} + λ\tilde{I} & P \\ \tilde{P}^T & O \end{bmatrix}\begin{bmatrix} \tilde{w} \\ a \end{bmatrix} = \begin{bmatrix} v \\ o \end{bmatrix}

which corresponds to the regularized version of Equation (1) when using the subsampled \tilde{K} and \tilde{P}^T in place of K and P^T. The application of this technique to the fish and digit shapes is shown in Figure 1(d,h).
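A direct transcription of the reduced system (3) might look as follows (our sketch; the kernel construction duplicates the one above and all helper names are ours).

```python
import numpy as np

def fit_tps_basis_subset(points, values, m, lam=0.0):
    """Method 2: keep only the first m basis functions but all p targets.

    Solves the (m+3) x (m+3) system of Eq. (3). Assumes `points` is already
    in random order, so points[:m] is the random subset; K, A and P are
    built as in Eqs. (1)-(2).
    """
    p = len(points)
    r2 = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    with np.errstate(divide='ignore', invalid='ignore'):
        K = np.where(r2 > 0, 0.5 * r2 * np.log(r2), 0.0)
    P = np.hstack([np.ones((p, 1)), points])
    K_tilde = K[:, :m]                 # first m columns of K
    A = K[:m, :m]
    lhs = np.block([[K_tilde.T @ K_tilde + lam * A, K_tilde.T @ P],
                    [P.T @ K_tilde,                 P.T @ P]])
    rhs = np.concatenate([K_tilde.T @ values, P.T @ values])
    sol = np.linalg.solve(lhs, rhs)
    return sol[:m], sol[m:]            # (w_tilde, a)
```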
3.3 Method 3: Matrix Approximation
The essence of Method 2 was to use a subset of exact basis functions to approximate a full set of target values. We now consider an approach that uses a full set of approximate basis functions to approximate the full set of target values. The approach is based on a technique known as the Nyström method. The Nyström method provides a means of approximating the eigenvectors of K without using C. It was originally developed in the late 1920s for the numerical solution of eigenfunction problems [1] and was recently used in [11] for fast approximate Gaussian process regression and in [8] (implicitly) to speed up several machine learning techniques using Gaussian kernels. Implicit to the Nyström method is the assumption that C can be approximated by B^T A^{−1} B, i.e.

\hat{K} = \begin{bmatrix} A & B \\ B^T & B^T A^{−1} B \end{bmatrix}   (4)

If rank(K) = m and the m rows of the submatrix [A\ B] are linearly independent, then \hat{K} = K. In general, the quality of the approximation can be expressed as the norm of the difference C − B^T A^{−1} B, the Schur complement of K. Given the m × m diagonalization A = UΛU^T, we can proceed to find the approximate eigenvectors of K:

\hat{K} = \tilde{U}Λ\tilde{U}^T, with \tilde{U} = \begin{bmatrix} U \\ B^T UΛ^{−1} \end{bmatrix}   (5)

Note that in general the columns of \tilde{U} are not orthogonal. To address this, first define Z = \tilde{U}Λ^{1/2} so that \hat{K} = ZZ^T. Let QΣQ^T denote the diagonalization of Z^T Z. Then the matrix V = ZQΣ^{−1/2} contains the leading orthonormalized eigenvectors of \hat{K}, i.e. \hat{K} = VΣV^T, with V^T V = I.

From the standard formula for the partitioned inverse of L, we have

L^{−1} = \begin{bmatrix} K^{−1} + K^{−1}PS^{−1}P^T K^{−1} & −K^{−1}PS^{−1} \\ −S^{−1}P^T K^{−1} & S^{−1} \end{bmatrix}, \quad S = −P^T K^{−1} P

and thus

\begin{bmatrix} w \\ a \end{bmatrix} = L^{−1}\begin{bmatrix} v \\ o \end{bmatrix} = \begin{bmatrix} (I + K^{−1}PS^{−1}P^T)K^{−1}v \\ −S^{−1}P^T K^{−1}v \end{bmatrix}

Using the Nyström approximation to K, we have \hat{K}^{−1} = VΣ^{−1}V^T and

\hat{w} = (I + VΣ^{−1}V^T P\hat{S}^{−1}P^T)VΣ^{−1}V^T v, \quad \hat{a} = −\hat{S}^{−1}P^T VΣ^{−1}V^T v

with \hat{S} = −P^T VΣ^{−1}V^T P, which is 3 × 3. Therefore, by computing matrix-vector products in the appropriate order, we can obtain estimates of the TPS coefficients without ever having to invert or store a large p × p matrix.
Fig. 2. Grids used for experimental testing. (a) Reference point set S1 : 12 × 12 points on the interval [0, 128]×[0, 128]. (b,c) Warped point sets S2 and S3 with bending energy 0.3 and 0.8, respectively. To test the quality of the different approximation methods, we used varying percentages of points to estimate the TPS mapping from S1 to S2 and from S1 to S3 .
For the regularized case, one can proceed in the same manner, using (VΣV^T + λI)^{−1} = V(Σ + λI)^{−1}V^T. Finally, the approximate bending energy is given by

w^T\hat{K}w = (V^T w)^T Σ (V^T w)

Note that this bending energy is the average of the energies associated to the x and y components as in [3]. Let us briefly consider what \hat{w} represents. The first m components roughly correspond to the entries in \tilde{w} for Method 2; these in turn correspond to the columns of \hat{K} (i.e. \tilde{K}) for which exact information is available. The remaining entries weight columns of \hat{K} with (implicitly) filled-in values for all but the first m entries. In our experiments, we have observed that the latter values of \hat{w} are nonzero, which indicates that these approximate basis functions are not being disregarded. Qualitatively, the approximation quality of methods 2 and 3 is very similar, which is not surprising since they make use of the same basic information. The pros and cons of these two methods are investigated in the following section.
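The Nyström construction can be written compactly; the sketch below (ours) implements Eqs. (4)-(5) together with the orthonormalization step, and is demonstrated on a positive-definite Gaussian kernel for simplicity. Applying it verbatim to the (only conditionally positive definite) TPS kernel would require handling negative eigenvalues of A, a point this sketch does not attempt; names such as nystrom_eig are ours.

```python
import numpy as np

def nystrom_eig(K, m):
    """Nystrom approximation of the leading eigenvectors of a kernel matrix,
    following Eqs. (4)-(5) and the orthonormalization step of Section 3.3.

    Assumes K is symmetric positive (semi)definite; numerically zero
    eigenvalues of the m x m block are discarded. Returns V (orthonormal
    columns) and sig such that K ~= V diag(sig) V^T.
    """
    A, B = K[:m, :m], K[:m, m:]
    lam, U = np.linalg.eigh(A)                 # A = U Lambda U^T
    keep = lam > 1e-12 * lam.max()             # guard against tiny eigenvalues
    lam, U = lam[keep], U[:, keep]
    U_tilde = np.vstack([U, B.T @ U / lam])    # Eq. (5)
    Z = U_tilde * np.sqrt(lam)                 # K_hat = Z Z^T
    sig, Q = np.linalg.eigh(Z.T @ Z)           # Z^T Z = Q Sigma Q^T
    V = Z @ Q / np.sqrt(sig)                   # orthonormalized eigenvectors
    return V, sig

# Toy check on a Gaussian kernel (positive definite, unlike the TPS kernel).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
K = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))
V, sig = nystrom_eig(K, m=40)
err = np.linalg.norm(K - V @ np.diag(sig) @ V.T) / np.linalg.norm(K)
print(err)   # relative error of the Nystrom reconstruction
```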
4 Experiments

4.1 Synthetic Grid Test
In order to compare the above three approximation methods, we ran a set of experiments based on warped versions of the cartesian grid shown in Figure 2(a). The grid consists of 12 × 12 points in a square of dimensions 128 × 128. Call this set of points S_1. Using the technique described in Appendix A, we generated point sets S_2 and S_3 by applying random TPS warps with bending energy 0.3 and 0.8, respectively; see Figure 2(b,c). We then studied the quality of each approximation method by varying the percentage of random samples used to estimate the (unregularized) mapping of S_1 onto S_2 and S_3, and measuring the mean squared error (MSE) in the estimated coordinates. The results are plotted in Figure 3. The error bars indicate one standard deviation over 20 repeated trials.
Fig. 3. Comparison of approximation error. Mean squared error in position between points in the target grid and corresponding points in the approximately warped reference grid is plotted vs. percentage of randomly selected samples used. Performance curves for each of the three methods are shown in (a) for If = 0.3 and (b) for If = 0.8.
4.2 Approximate Principal Warps
In [3] Bookstein develops a multivariate shape analysis framework based on eigenvectors of the bending energy matrix L_p^{−1}KL_p^{−1} = L_p^{−1}, which he refers to as principal warps. Interestingly, the first 3 principal warps always have eigenvalue zero, since any warping of three points in general position (a triangle) can be represented by an affine transform, for which the bending energy is zero. The shape and associated eigenvalue of the remaining principal warps lend insight into the bending energy “cost” of a given mapping in terms of that mapping's projection onto the principal warps. Through the Nyström approximation in Method 3, one can produce approximate principal warps using \hat{L}_p^{−1} as follows:

\hat{L}_p^{−1} = \hat{K}^{−1} + \hat{K}^{−1}PS^{−1}P^T\hat{K}^{−1}
            = VΣ^{−1}V^T + VΣ^{−1}V^T PS^{−1}P^T VΣ^{−1}V^T
            = V(Σ^{−1} + Σ^{−1}V^T PS^{−1}P^T VΣ^{−1})V^T
            ≜ V\hat{Λ}V^T
Fig. 4. Approximate principal warps for the fish shape. From left to right and top to bottom, the surfaces are ordered by eigenvalue in increasing order. The first three principal warps, which represent the affine component of the transformation and have eigenvalue zero, are not shown.
where

\hat{Λ} ≜ Σ^{−1} + Σ^{−1}V^T PS^{−1}P^T VΣ^{−1} = WDW^T.

To obtain orthogonal eigenvectors we proceed as in section 3.3 to get

\hat{Λ} = \hat{W}\hat{Σ}\hat{W}^T

where \hat{W} ≜ WD^{1/2}Q\hat{Σ}^{−1/2} and Q\hat{Σ}Q^T is the diagonalization of D^{1/2}W^T W D^{1/2}. Thus we can write

\hat{L}_p^{−1} = V\hat{W}\hat{Σ}\hat{W}^T V^T

An illustration of approximate principal warps for the fish shape is shown in Figure 4, wherein we have used m = 15 samples. As in [3], the principal warps are visualized as continuous surfaces, where the surface is obtained by applying a warp to the coordinates in the plane using a given eigenvector of \hat{L}_p^{−1} as the nonlinear spline coefficients; the affine coordinates are set to zero.
Fig. 5. Exact principal warps for the fish shape.
The corresponding exact principal warps are shown in Figure 5. In both cases, warps 4 through 12 are shown, sorted in ascending order by eigenvalue. Given a rank m Nyström approximation, at most m − 3 principal warps with nonzero eigenvalue are available. These correspond to the principal warps at the “low frequency” end, meaning that very localized warps, e.g. pronounced stretching between adjacent points in the target shape, will not be captured by the approximation.

4.3 Discussion
We now discuss the relative merits of the above three methods. From the synthetic grid tests we see that Method 1, as expected, has the highest MSE. Considering that the spacing between neighboring points in the grid is about 10, it is noteworthy, however, that all three methods achieve an MSE of less than 2 at 30% subsampling. Thus while Method 1 is not optimal in the sense of MSE, its performance is likely to be reasonable for some applications, and it has the advantage of being the least expensive of the three methods. In terms of MSE, Methods 2 and 3 perform roughly the same, with Method 2 holding a slight edge, more so at 5% for the second warped grid. Method 3 has a disadvantage built in relative to Method 2, due to the orthogonalization
Fig. 6. Comparison of Method 2 (a) and 3 (b) for poorly chosen sample locations. (The performance of Method 1 was terrible and is not shown.) Both methods perform well considering the location of the samples. Note that the error is slightly lower for Method 3, particularly at points far away from the samples.
step; this leads to an additional loss in significant figures and a slight increase in MSE. In this regard Method 2 is the preferred choice. While Method 3 is comparatively expensive and has slightly higher MSE than Method 2, it has the benefit of providing approximate eigenvectors of the bending energy matrix. Thus with Method 3 one has the option of studying shape transformations using principal warp analysis. As a final note, we have observed that when the samples are chosen badly, e.g. crowded into a small area, Method 3 performs better than Method 2. This is illustrated in Figure 6, where all of the samples have been chosen at the back of the tail fin. Larger displacements between corresponding points are evident near the front of the fish for Method 2. We have also observed that the bending energy estimate of Method 2 (\tilde{w}^T A\tilde{w}) exhibits higher variance than that of Method 3; e.g. at a 20% sampling rate on the fish shapes warped using I_f = 0.3 over 100 trials, Method 2 estimates I_f to be 0.29 with σ = 0.13 whereas Method 3 gives 0.25 and σ = 0.06. We conjecture that this advantage arises from the presence of the approximate basis functions in the Nyström approximation, though a rigorous explanation is not known to us.
5 Conclusion
We have discussed three approximate methods for recovering TPS mappings between 2D pointsets that greatly reduce the computational burden. An experimental comparison of the approximation error suggests that the two methods that use only a subset of the available correspondences but take into account the full set of target values perform very well. Finally, we observed that the method based on the Nyström approximation allows for principal warp analysis and performs better than the basis-subset method when the subset of correspondences is chosen poorly.
Acknowledgments. The authors wish to thank Charless Fowlkes, Jitendra Malik, Andrew Ng, Lorenzo Torresani, Yair Weiss, and Alice Zheng for helpful discussions. We would also like to thank Haili Chui and Anand Rangarajan for useful insights and for providing the fish datasets.
Appendix: Generating Random TPS Transformations

To produce a random TPS transformation with bending energy ν, first choose a set of p reference points (e.g. on a grid) and form L_p^{−1}. Now generate a random vector u, set its last three components to zero, and normalize it. Compute the diagonalization L_p^{−1} = UΛU^T, with the eigenvalues sorted in descending order. Finally, compute w = \sqrt{ν}UΛ^{1/2}u. Since I_f is unaffected by the affine terms, their values are arbitrary; we set translation to (0, 0) and scaling to (1, 0) and (0, 1).
References 1. C. T. H. Baker. The numerical treatment of integral equations. Oxford: Clarendon Press, 1977. 2. S. Belongie, J. Malik, and J. Puzicha. Matching shapes. In Proc. 8th Int’l. Conf. Computer Vision, volume 1, pages 454–461, July 2001. 3. F. L. Bookstein. Principal warps: thin-plate splines and decomposition of deformations. IEEE Trans. Pattern Analysis and Machine Intelligence, 11(6):567–585, June 1989. 4. H. Chui and A. Rangarajan. A new algorithm for non-rigid point matching. In Proc. IEEE Conf. Comput. Vision and Pattern Recognition, pages 44–51, June 2000. 5. M.H. Davis, A. Khotanzad, D. Flamig, and S. Harms. A physics-based coordinate transformation for 3-d image matching. IEEE Trans. Medical Imaging, 16(3):317– 328, June 1997. 6. F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks architectures. Neural Computation, 7(2):219–269, 1995. 7. M. J. D. Powell. A thin plate spline method for mapping curves into curves in two dimensions. In Computational Techniques and Applications (CTAC95), Melbourne, Australia, 1995. 8. A.J. Smola and B. Sch¨ olkopf. Sparse greedy matrix approximation for machine learning. In ICML, 2000. 9. G. Wahba. Spline Models for Observational Data. SIAM, 1990. 10. Y. Weiss. Smoothness in layers: Motion segmentation using nonparametric mixture estimation. In Proc. IEEE Conf. Comput. Vision and Pattern Recognition, pages 520–526, 1997. 11. C. Williams and M. Seeger. Using the Nystr¨ om method to speed up kernel machines. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13: Proceedings of the 2000 Conference, pages 682–688, 2001.
DEFORMOTION
Deforming Motion, Shape Average and the Joint Registration and Segmentation of Images

Stefano Soatto¹ and Anthony J. Yezzi²

¹ Department of Computer Science, University of California, Los Angeles, Los Angeles – CA 90095
[email protected]
² Department of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta – GA 30332
[email protected]
Abstract. What does it mean for a deforming object to be “moving” (see Fig. 1)? How can we separate the overall motion (a finite-dimensional group action) from the more general deformation (a diffeomorphism)? In this paper we propose a definition of motion for a deforming object and introduce a notion of “shape average” as the entity that separates the motion from the deformation. Our definition allows us to derive novel and efficient algorithms to register non-equivalent shapes using region-based methods, and to simultaneously approximate and register structures in grey-scale images. We also extend the notion of shape average to that of a “moving average” in order to track moving and deforming objects through time.
Fig. 1. A jellyfish is “moving while deforming.” What exactly does this mean? How can we separate its “global” motion from its “local” deformation?
1 Introduction
Consider a sheet of paper falling. If it were a rigid object, one could describe its motion by providing the coordinates of one particle and the orientation of an orthonormal reference frame attached to that particle. That is, 6 numbers
This research is supported in part by NSF grant IIS-9876145, ARO grant DAAD1999-1-0139 and Intel grant 8029.
would be sufficient to describe the object at any instant of time. However, being a non-rigid object, in order to describe it at any instant of time one should really specify the trajectory of each individual particle on the sheet. That is, if γ_0 represents the initial collection of particles, one could provide a function f that describes how the entire set of particles evolves in time: γ_t = f(γ_0, t). Indeed, if each particle can move independently, there may be no notion of “overall motion,” and a more appropriate description of f is that of a “deformation” of the sheet. That includes as a special case a rigid motion, described collectively by a rotation matrix¹ R(t) ∈ SO(3) and a translation vector T(t) ∈ R^3, so that γ_t = f(γ_0, t) = R(t)γ_0 + T(t) with R(t) and T(t) independent of the particle in γ_0. In practice, however, that is not how one usually describes a sheet of paper falling. Instead, one may say that the sheet is “moving” downwards along the vertical direction while “locally deforming.” The jellyfish in Fig. 1 is just another example to illustrate the same issue. But what does it even mean for a deforming object to be “moving”?

From a mathematical standpoint, rigorously defining a notion of motion for deforming objects presents a challenge. In fact, if we describe the deformation f as the composition of a rigid motion (R(t), T(t)) and a “local deformation” function h(·, t), so that γ_t = h(R(t)γ_0 + T(t), t), we can always find infinitely many different choices \tilde{h}(·, t), \tilde{R}(t), \tilde{T}(t) that give rise to the same overall deformation f: γ_t = f(γ_0, t) = h(R(t)γ_0 + T(t), t) = \tilde{h}(\tilde{R}(t)γ_0 + \tilde{T}(t), t), by simply choosing \tilde{h}(γ, t) ≐ h(R\tilde{R}^T(γ − \tilde{T}) + T, t) for any rigid motion (\tilde{R}, \tilde{T}). Therefore, we could describe the motion of our sheet with (R, T) as well as with (\tilde{R}, \tilde{T}), which is arbitrary, and in the end we would have failed in defining a notion of “motion” that is unique to the event observed. So, how can we define a notion of motion for a deforming object in a mathematically sound way that reflects our intuition? For instance, in Fig. 6, how do we describe the “motion” of a jellyfish? Or in Fig. 5 the “motion” of a storm? In neuroanatomy, how can we “register” a database of images of a given structure, say the corpus callosum (Fig. 9), by “moving” them to a common reference frame? All these questions ultimately boil down to an attempt to separate the overall motion from the more general deformation.

Before proceeding, note that this is not always possible or even meaningful. In order to talk about the “motion” of an object, one must assume that “something” of the object is preserved as it deforms. For instance, it may not make sense to try to capture the “motion” of a swarm of bees, or of a collection of particles that indeed all move independently. What we want to capture mathematically is the notion of overall motion when indeed there is one that corresponds to our intuition! The key to this paper is the observation that the notion of motion and the notion of shape are very tightly coupled. Indeed, we will see that the shape average is exactly what allows separating the motion from the deformation.
¹ SO(3) denotes the set of 3 × 3 orthogonal matrices with unit determinant.
1.1 Prior Related Work
The study of shape spans at least a hundred years of research in different communities from mathematical morphology to statistics, geology, neuroanatomy, paleontology, astronomy etc. In statistics, the study of “Shape Spaces” was championed by Kendall, Mardia and Carne among others [15,20,8,24,9]. Shapes are defined as the equivalence classes of N points in RM under the similarity group2 , RM N /{SE(M ) × R}. Although the framework clearly distinguishes the notion of “motion” (along the fibers) from the “deformation” (across fibers), the analytical tools are essentially tied to the point-wise representation. One of our goals in this paper is to extend the theory to smooth curves, surfaces and other geometric objects that do not have distinct “landmarks.” In computer vision, a wide literature exists for the problem of “matching” or “aligning” objects based on their images, and space limitations do not allow us to do justice to the many valuable contributions. We refer the reader to [32] for a recent survey. A common approach consists of matching collections of points organized in graphs or trees (e.g. [19,11]). Belongie et al. [4] propose comparing planar contours based on their “shape context.” The resulting match is based on “features” rather than on image intensity directly, similar to [10]. “Deformable Templates,” pioneered by Grenander [12], do not rely on “features” or “landmarks;” rather, images are directly deformed by a (possibly infinite-dimensional) group action and compared for the best match in an “image-based” approach [34]. There, the notion of “motion” (or “alignment” or “registration”) coincides with that of deformation, and there is no clear distinction between the two [5]. Another line of work uses variational methods and the solution of partial differential equations (PDEs) to model shape and to compute distances and similarity. In this framework, not only can the notion of alignment or distance be made precise [3,33,25,28], but quite sophisticated theories that encompass perceptually relevant aspects can be formalized in terms of the properties of the evolution of PDEs (e.g. [18,16]). Kimia et al. [16] describes a scale-space that corresponds to various stages of evolution of a diffusing PDE, and a “reacting” PDE that splits “salient parts” of planar contours by generating singularities. The variational framework has also proven very effective in the analysis of medical images [23,30,22]. Although most of these ideas are developed in a deterministic setting, many can be transposed to a probabilistic context [35,7,31]. Scalespace is a very active research area, and some of the key contributions as they relate to the material of this paper can be found in [14,29,17,1,2] and references therein. The “alignment,” or “registration,” of curves has also been used to define a notion of “shape average” by several authors (see [21] and references therein). The shape average, or “prototype,” can then be used for recognition in a nearestneighbor classification framework, or to initialize image-based segmentation by providing a “prior.” Leventon et al. [21] perform principal component analysis 2
² SE(M) indicates the Euclidean group of dimension M.
in the aligned frames to regularize the segmentation of regions with low contrast in brain images. Also related to this paper is the recent work of Paragios and Deriche, where active regions are tracked as they “move.” In [27] the notion of motion is not made distinct from the general deformation, and therefore what is being tracked is a general (infinite-dimensional) deformation. Our aim is to define tracking as a trajectory on a finite-dimensional group, despite infinite-dimensional deformations. Substantially different in methods, but related in the intent, is the work on stochastic filters for contour tracking and snakes (see [6] and references therein). Our framework is designed for objects that undergo a distinct overall “global” motion while “locally” deforming. Under these assumptions, our contribution consists of a novel definition of motion for a deforming object and a corresponding definition of shape average (Sect. 2). Our definition allows us to derive novel and efficient algorithms to register non-identical (or non-equivalent) shapes using region-based methods (Sect. 5). We use our algorithms to simultaneously approximate and register structures in images, or to simultaneously segment and calibrate images (Sect. 6). In the context of tracking, we extend our definition to a novel notion of “moving average” of shape, and use it to perform tracking for deforming objects (Sect. 4). Our definitions do not rely on a particular representation of objects (e.g. explicit vs. implicit, parametric vs. non-parametric), nor on the particular choice of group (e.g. affine, Euclidean), nor are they restricted to a particular modeling framework (e.g. deterministic, energy-based vs. probabilistic). For the implementation of our algorithms on deforming contours, we have chosen an implicit non-parametric representation in terms of level sets, and we have implemented numerical algorithms for integrating partial differential equations to converge to the steady-state of an energy-based functional. However, these choices can be easily changed without altering the nature of the contribution of this paper. Naturally, since shape and motion are computed as the solution of a nonlinear optimization problem, the algorithms we propose are only guaranteed to converge to local minima and, in general, no conclusions can be drawn on uniqueness. Indeed, it is quite simple to generate pathological examples where the setup we have proposed fails. In the experimental section we will highlight the limitations of the approach when used beyond the assumptions for which it is designed.
2 Defining Motion and Shape Average
The key idea underlying our framework is that the notion of motion throughout a deformation is very tightly coupled with the notion of shape average. In particular, if a deforming object is recognized as moving, there must be an underlying object (which will turn out to be the shape average) moving with the same motion, from which the original object can be obtained with minimal deformations. Therefore, we will model a general deformation as the composition of a group action g on a particular object, on top of which a local deformation is applied. The shape average is defined as the one that minimizes such deformations.
Fig. 2. A model (commutative diagram) of a deforming contour.
Let γ_1, γ_2, ..., γ_n be n “shapes” (we will soon make the notion precise in Def. 1). Let the map between each pair of shapes be T_ij:

γ_i = T_ij γ_j, i, j = 1 ... n.   (1)

It comprises the action of a group g ∈ G (e.g. the Euclidean group on the plane G = SE(2)) and a more general transformation h that belongs to a pre-defined class H (for instance diffeomorphisms). The deformation h is not arbitrary, but depends upon another “shape” µ, defined in such a way that

γ_i = h_i ∘ g_i(µ), i = 1 ... n.   (2)

Therefore, in general, following the commutative diagram of Fig. 2, we have that

T_ij ≐ h_i ∘ g_i ∘ g_j^{−1}(µ) ∘ h_j^{−1}   (3)

so that g = g_i g_j^{−1} and h is a transformation that depends on h_i, h_j and µ. Given two or more “shapes” and a cost functional E : H → R^+ defined on the set of diffeomorphisms, the motion g_t and the shape average are defined as the minimizers of \sum_{t=1}^{n} E(h_t) subject to γ_t = h_t ∘ g_t(µ). Note that the only factors which determine the cost of h_t are the “shapes” before and after the transformation, µ_i ≐ g_i(µ) and γ_i, so that we can write, with an abuse of notation, E(h(µ_i, γ_i)) ≐ E(µ_i, γ_i). We are therefore ready to define our notion of motion during a deformation.

Definition 1. Let γ_1, ..., γ_n be smooth boundaries of closed subsets of a differentiable manifold embedded in R^N, which we call pre-shapes. Let H be a class of diffeomorphisms acting on γ_i, and let E : H → R^+ be a positive, real-valued functional. Consider now a group G acting on γ_i via g(γ_i). We say that \hat{g}_1, ..., \hat{g}_n is a motion undergone by γ_i, i = 1 ... n if there exists a pre-shape \hat{µ} such that

\hat{g}_1, ..., \hat{g}_n, \hat{µ} = \arg\min_{g_t, µ} \sum_{i=1}^{n} E(h_i) subject to γ_i = h_i ∘ g_i(µ), i = 1 ... n   (4)
The pre-shape µ ˆ is called the shape average relative to the group G, or Gaverage, and the quantity gˆi−1 (γi ) is called the shape of γi . Remark 1 (Invariance) In the definition above, one will notice that the shape average is actually a pre-shape, and that there is an arbitrary choice of group action g0 that, if applied to γi and µ, leaves the definition unchanged (the functional E is invariant with respect to g0 because T (g ◦ g0 , h ◦ g0 ) = T (g, h) ∀ g0 ). For the case of the Euclidean group SE(N ), a way to see this is to notice that the reference frame where µ is described is arbitrary. Therefore, one may choose, for instance, µ = h−1 1 (γ1 ). Remark 2 (Symmetries) In Def. 1 we have purposefully avoided to use the article “the” for the minimizing value of the group action gˆt . It is in fact possible that the minimum of (4) not be unique. A particular case when this occurs is when the pre-shape γ is (symmetric, or) invariant with respect to a particular element of the group G, or to an entire subgroup. Notice, however, that the notion of shape average is still well-defined even when the notion of motion is not unique. This is because any element in the symmetry group suffices to register the pre-shapes, and therefore compute the shape average (Fig. 3).
3 Shape and Deformation of a Planar Contour
In this section we consider the implementation of the program above for a simple case: two closed planar contours, γ_1 and γ_2, where we choose as cost functional for the deformations h_1, h_2 either the set-symmetric difference ∆ of their interior (the union minus the intersection), or what we call the signed distance score³ ψ:

ψ(µ, γ) ≐ \int_{\bar{µ}} ζ(γ)\,dx   (5)

where \bar{µ} denotes the interior of the contour µ and ζ is the signed distance function of the contour γ; dx is the area form on the plane. In either case, since we have an arbitrary choice of the global reference frame, we can choose g_1 = e, the group identity. We also call g ≐ g_2, so that µ_2 ≐ g(µ). The problem of defining the motion and shape average can then be written as

\hat{g}, \hat{µ} = \arg\min_{g, µ} \sum_{i=1}^{2} E(h_i) subject to γ_1 = h_1(µ); γ_2 = h_2 ∘ g(µ).   (6)
³ The rationale behind this score is that one wants to make the signed distance function as positive as possible outside the contour to be matched, and as negative as possible inside. This score can be interpreted as a weighted Monge-Kantorovic functional where the mass of a curve is weighted by its distance from the boundary.
As we have anticipated, we choose either E(h_i) = ∆(g_i(µ), γ_i) or E(h_i) = ψ(g_i(µ), γ_i). Therefore, abusing the notation as anticipated before Def. 1, we can write the problem above as an unconstrained minimization

\hat{g}, \hat{µ} = \arg\min_{g, µ} φ(γ_1, γ_2) where φ(γ_1, γ_2) ≐ E(µ, γ_1) + E(g(µ), γ_2)   (7)

and E is either ∆ or ψ. The estimate \hat{g} defines the motion between γ_1 and γ_2, and the estimate \hat{µ} defines the average of the two contours. If one thinks of contours and their interior, represented by a characteristic function χ, as a binary image, then the cost functional above is just a particular case of a more general cost functional where each term is obtained by integrating a function inside and a function outside the contours

φ = \sum_{i=1}^{2}\left[\int_{\bar{µ}_{in}} f_{in}(x, γ_i)\,dx + \int_{\bar{µ}_{out}} f_{out}(x, γ_i)\,dx\right]   (8)
where the bar in µ ¯ indicates that the integral is computed on a region inside or outside µ and we have emphasized the fact that the function f depends upon the contour γi . For instance, for the case of the set symmetric difference we have fin = (χγ − 1) and fout = χγ . To solve the problem, therefore, we need to minimize the following functional fin (x, γ1 ) + fin (g(x), γ2 )|Jg |dx + fout (x, γ1 ) + fout (g(x), γ2 )|Jg |dx µ ¯ in
µ ¯ out
(9)
where |Jg | is the determinant of the Jacobian of the group action g. This makes it easy to compute the component of the first variation of φ along the normal direction to the contour µ, so that we can impose ∇µ φ · N = 0
(10)
to derive the first-order necessary condition. If we choose G = SE(2), an isometry, it can be easily shown that ∇µ φ = fin (x, γ1 ) − fout (x, γ1 ) + fin (g(x), γ2 ) − fout (g(x), γ2 ) 3.1
(11)
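The two scores can be evaluated directly on discrete binary masks. The following is a small numerical sketch of ours (not the authors' implementation): `symmetric_difference` computes ∆ as the area of the XOR region, and `psi_score` integrates the signed distance of one contour over the interior of the other as in Eq. (5). The function names, the sign convention of the distance, and the toy shapes are all our own choices.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def symmetric_difference(mask_a, mask_b):
    # area of (A union B) minus (A intersection B), i.e. the XOR region
    return float(np.logical_xor(mask_a, mask_b).sum())

def signed_distance(mask):
    # negative inside the region, positive outside (one common sign convention)
    inside = distance_transform_edt(mask)
    outside = distance_transform_edt(~mask)
    return outside - inside

def psi_score(mask_mu, mask_gamma):
    # Eq. (5): integrate the signed distance of gamma over the interior of mu
    return float(signed_distance(mask_gamma)[mask_mu].sum())

# toy example: two overlapping discs (boolean masks)
yy, xx = np.mgrid[0:128, 0:128]
gamma1 = (xx - 60) ** 2 + (yy - 64) ** 2 < 30 ** 2
gamma2 = (xx - 70) ** 2 + (yy - 64) ** 2 < 30 ** 2
print(symmetric_difference(gamma1, gamma2), psi_score(gamma1, gamma2))
```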
3.1 Representation of Motions and Their Variation
For the specific case of matrix Lie groups (e.g. G = SE(2)), there exist twist coordinates ξ that can be represented as a skew-symmetric matrix $\widehat\xi$ so that⁴
$$g = e^{\widehat\xi} \qquad \text{and} \qquad \frac{\partial g}{\partial \xi_i} = \frac{\partial \widehat\xi}{\partial \xi_i}\, g \qquad (12)$$
⁴ The “widehat” notation $\widehat{\ }$, which indicates a lifting to the Lie algebra, should not be confused with the “hat” $\hat{\ }$, which indicates an estimated quantity.
where the matrix $\partial\widehat\xi/\partial\xi_i$ is composed of zeros and ones and the matrix exponential can be computed in closed form using Rodrigues' formula. To compute the variation of the functional φ with respect to the group action g, we notice that the first two terms in φ do not contribute since they are independent of g. The second two terms are of the generic form $A(g) \doteq \int_{g(\bar\mu)} f(x)\,dx$. Therefore, we consider the variation of A with respect to the components of the twist $\xi_i$, $\frac{\partial A}{\partial \xi_i}$, which we will eventually use to compute the gradient with respect to the natural connection, $\nabla_G \phi = \frac{\partial \phi}{\partial \xi}\, g$. We first rewrite A(g) using the change of measure $\int_{g(\bar\mu)} f(x)\,dx = \int_{\bar\mu} f \circ g(x)\,|J_g|\,dx$, which leads to
$$\frac{\partial A(g)}{\partial \xi_i} = \int_{\bar\mu} \frac{\partial}{\partial \xi_i}\big(f \circ g(x)\big)\,|J_g|\,dx + \int_{\bar\mu} \big(f \circ g(x)\big)\,\frac{\partial}{\partial \xi_i}|J_g|\,dx,$$
and we note that the Euclidean group is an isometry, so the determinant of the Jacobian is one and the second integral is zero. The last equation can be re-written, using Green's theorem, as $\int_{g(\mu)} f(x)\,\big\langle \frac{\partial g}{\partial \xi_i} \circ g^{-1}(x),\, N \big\rangle\,ds = \int_{\mu} f \circ g(x)\,\big\langle \frac{\partial g}{\partial \xi_i},\, g_* N \big\rangle\,ds$, where $g_*$ indicates the push-forward. Notice that g is an isometry and therefore it does not affect the arc length; we then have
$$\frac{\partial A(g)}{\partial \xi_i} = \int_{\mu} f(g(x))\,\Big\langle \frac{\partial \widehat\xi}{\partial \xi_i}\, g,\; g_* N \Big\rangle\,ds \qquad (13)$$
After collecting all the partial derivatives into an operator $\frac{\partial \phi}{\partial \xi}$, we can write the evolution of the group action.

3.2 Evolution

The algorithm for evolving the contour and the group action consists of a two-step process where an initial estimate of the contour $\hat\mu = \gamma_1$ is provided, along with an initial estimate of the motion $\hat g = e$. The contour and motion are then updated in an alternating minimization where motion is updated according to
$$\frac{d\hat g}{dt} = \hat g\,\frac{\partial \phi}{\partial \xi} \qquad (14)$$
Notice that this is valid not just for SE(2), but for any (finite-dimensional) matrix Lie group, although there may not be a closed-form solution for the exponential map like in the case of SE(3) and its subgroups. In practice, the group evolution (14) can be implemented in local (exponential) coordinates by evolving ξ, defined by $g = e^{\widehat\xi}$, via $\frac{d\xi}{dt} = \frac{\partial \phi}{\partial \xi}$. In the level set framework, the derivative of the cost function φ with respect to the coordinates of the group action $\xi_i$ can be computed as the collection of two terms, one for $f_{in}$ and one for $f_{out}$, where
$$\frac{\partial \phi}{\partial \xi_i} = \int_{g(\gamma_{1,2})} \Big\langle \frac{\partial g(x)}{\partial \xi_i},\ f_{\{in,out\}}(g(x), \gamma_{1,2})\,J(g_* T) \Big\rangle\,ds.$$
As we have anticipated in Eq. (11), the contour $\hat\mu$ evolves according to
$$\frac{d\hat\mu}{dt} = \big(f_{in}(x, \gamma_1) - f_{out}(x, \gamma_1) + f_{in}(g(x), \gamma_2) - f_{out}(g(x), \gamma_2)\big)\,N \qquad (15)$$
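The alternating scheme above can be sketched numerically for two binary masks, G = SE(2) and the set-symmetric difference score. In the sketch below (ours, not the authors' code) the analytic twist gradient feeding Eq. (14) is replaced by a finite-difference gradient in the pose parameters (rotation angle plus translation), the contour step is a pixel-wise descent consistent with Eq. (15) for this score, and $\gamma_2$ is compared to µ after being pulled back by g, which is equivalent for an isometry. All names, step sizes and the center-based parameterization of g are assumptions.

```python
import numpy as np
from scipy.ndimage import affine_transform

def warp(mask, pose):
    """Return mask(g(x)) on the pixel grid, for the rigid map g(x) = R(x - c) + c + t."""
    theta, t0, t1 = pose                      # rotation angle and translation in (row, col)
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    center = np.array(mask.shape) / 2.0
    offset = center - R @ center + np.array([t0, t1])
    return affine_transform(np.asarray(mask, float), R, offset=offset, order=1)

def cost(phi, pose, gamma1, gamma2):
    mu = (phi < 0).astype(float)              # interior of the current average
    return (np.abs(mu - np.asarray(gamma1, float)).sum()
            + np.abs(mu - warp(gamma2, pose)).sum())

def alternate(gamma1, gamma2, iters=200, step_pose=1e-3, step_mu=0.3):
    phi = np.where(gamma1, -1.0, 1.0)         # initial estimate: mu-hat = gamma1
    pose = np.zeros(3)                        # initial estimate: g-hat = identity
    eps = np.array([1e-2, 0.5, 0.5])          # finite-difference steps (arbitrary)
    for _ in range(iters):
        # pose step: finite-difference surrogate for the twist gradient of Eq. (14)
        grad = np.zeros(3)
        for i in range(3):
            d = np.zeros(3)
            d[i] = eps[i]
            grad[i] = (cost(phi, pose + d, gamma1, gamma2)
                       - cost(phi, pose - d, gamma1, gamma2)) / (2 * eps[i])
        pose -= step_pose * grad
        # contour step: pixel-wise descent consistent with Eq. (15) for the
        # set-symmetric difference (each shape casts a +/-1 "vote" at every pixel)
        g2 = warp(gamma2, pose)
        speed = (1.0 - 2.0 * np.asarray(gamma1, float)) + (1.0 - 2.0 * g2)
        phi += step_mu * speed
    return phi < 0, pose
```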
As we have already pointed out, the derivation can be readily extended to surfaces in space and to multiple objects, as we show in Sect. 6.

3.3 Distance between Shapes

The definition of motion $\hat g$ and shape average $\hat\mu$ as a minimizer of (6) suggests defining the distance⁵ between two shapes as the “energy” necessary to deform one into the other via the average shape:
$$d(\gamma_i, \gamma_j) \doteq E\big(\gamma_i, T(\hat g, \hat h)\gamma_j\big). \qquad (16)$$
For instance, for the case of the set-symmetric difference of two contours, we have
$$d_\Delta(\gamma_1, \gamma_2) \doteq \int \chi_{\hat{\bar\mu}\,\triangle\,\bar\gamma_1} + \chi_{\hat g(\hat{\bar\mu})\,\triangle\,\bar\gamma_2}\;dx \qquad (17)$$
and for the signed distance transform we have
$$d_\psi(\gamma_1, \gamma_2) \doteq \int_{\hat{\bar\mu}} \zeta(\gamma_1)\,dx + \int_{\hat g(\hat{\bar\mu})} \zeta(\gamma_2)\,dx. \qquad (18)$$
In either case, a gradient flow algorithm based on Eq. (14) and (15), when it converges to a global minimum, returns an average shape and a set of group elements gi which minimize the sums of the distances between the contours γi and any other common contour modulo the chosen group.
4 Moving Average and Tracking
The discussion above assumes that an unsorted collection of shapes is available, where the deformation between any two shapes is “small” (modulo G), so that the whole collection can be described by a single average shape. Consider however the situation where an object is evolving in time, for instance Fig. 5. While the deformation between adjacent time instants could be captured by a group action and a small deformation, as time goes by the object may change so drastically that talking about a global time average may not make sense. One way to approach this issue is by defining a notion of “moving average”, similarly to what is done in time series analysis. In classical linear time series, however, the uncertainty is modeled via additive noise. In our case, the uncertainty is an infinite-dimensional deformation h that acts on the measured contour. So the model becomes
$$\begin{cases} \mu(t+1) = g(t)\,\mu(t) \\ \gamma(t) = h(\mu(t)) \end{cases} \qquad (19)$$
⁵ Here we use the term distance informally, since we do not require that it satisfies the triangle inequality. The term pseudo-distance would be more appropriate.
where µ(t) represents the moving average of order k = 1. A similar model can be used to define moving averages of higher order k > 1. The procedure described in Sect. 3, initialized with µ(0) = γ1, provides an estimate of the moving average of order 1, as well as the tracking of the trajectory g(t) in the group G, which in (19) is represented as the model parameter. Note that the procedure in Sect. 3 simultaneously estimates the state µ(t) and identifies the parameters g(t) of the model (19). It does so, however, without imposing restrictions on the evolution of g(t). If one wants to impose additional constraints on the motion parameters, one can augment the state of the model to include the parameters g, for instance $g(t+1) = e^{\widehat\xi(t)}\, g(t)$. This, however, is beyond the scope of this paper. In Fig. 5 we show the results of tracking a storm with a moving average of order one.
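As a toy illustration of the state equation in (19), the sketch below (ours) restricts G to pure translations estimated from centroid shifts, so that µ(t + 1) = g(t)µ(t) can be propagated frame by frame; the residual between the propagated average and the measured contour is what the deformation h would absorb. It follows the model literally and does not reproduce the full estimation procedure of Sect. 3.

```python
import numpy as np

def centroid(mask):
    ys, xs = np.nonzero(mask)
    return np.array([ys.mean(), xs.mean()])

def translate(mask, t):
    """Shift a binary mask by an integer-rounded translation t = (dy, dx)."""
    out = np.zeros_like(mask)
    dy, dx = np.round(t).astype(int)
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    ys, xs = ys + dy, xs + dx
    keep = (ys >= 0) & (ys < h) & (xs >= 0) & (xs < w)
    out[ys[keep], xs[keep]] = True
    return out

def track(frames):
    mu = frames[0].copy()                    # mu(0) = gamma(0)
    poses = []
    for gamma in frames[1:]:
        g = centroid(gamma) - centroid(mu)   # estimate g(t) as a centroid shift
        mu = translate(mu, g)                # mu(t+1) = g(t) mu(t), Eq. (19)
        poses.append(g)                      # gamma \ mu is left to the deformation h
    return mu, poses
```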
5 Simultaneous Approximation and Registration of Non-equivalent Shapes
So far we have assumed that the given shapes are obtained by moving and deforming a common underlying “template” (the average shape). Even though the given shapes are not equivalent (i.e. there is no group action g that maps one exactly onto the other), g is found as the one that minimizes the cost of the deviation from such an equivalence. In the algorithm proposed in Eqs. (14)-(15), however, there is no explicit requirement that the deformation between the given shapes be small. Therefore, the procedure outlined can be seen as an algorithm to register shapes that are not equivalent under the group action. A registration is a group element $\hat g$ that minimizes the cost functional (4). To illustrate this fact, consider the two considerably different shapes $\gamma_1, \gamma_2$ shown in Fig. 7. The simultaneous estimation of their average µ, for instance relative to the affine group, and of the affine motions $g_1, g_2$ that best match the shape average onto the original shapes provides a registration that maps $\gamma_1$ onto $\gamma_2$ and vice-versa: $g = g_2 g_1^{-1}$. If instead of considering the images in Fig. 7 as binary images that represent the contours, we consider them as gray-scale images, then the procedure outlined, for the case where the score is computed using the set-symmetric difference, provides a way to simultaneously segment the two images and register them. This idea is illustrated in Fig. 9 for true gray-scale (magnetic resonance) images of brain sections.
6 Experiments
Fig. 3 illustrates the difference between the motion and shape average computed under the Euclidean group and under the affine group. The three examples show the two given shapes $\gamma_i$, the mean shape registered to the original shapes, $g_i(\mu)$, and the mean shape µ. Notice that affine registration makes it possible to simultaneously capture the square and the rectangle, whereas the Euclidean average cannot be registered to either one, and is therefore only an approximation.
Fig. 3. Euclidean (top) vs. affine (bottom) registration and average. For each pair of objects γ1 , γ2 , the registration g1 (µ), g2 (µ) relative to the Euclidean motion and affine motion is shown, together with the Euclidean average and affine average µ. Note that the affine average can simultaneously “explain” a square and a rectangle, whereas the Euclidean average cannot.
Fig. 4 compares the effect of choosing the signed distance score (left) and the set-symmetric difference (right) in the computation of the motion and average shape. The first choice results in an average that captures the common features of the original shapes, whereas the second captures more of the features in each one. Depending on the application, one may prefer one or the other.
Fig. 4. Signed distance score (left) vs. set-symmetric difference (right). Original contours (γ1 on the top, γ2 on the bottom), registered shape gi(µ) and shape average µ. Note that the original objects are not connected, but are composed of a circle and a square. The choice of pseudo-distance between contours influences the resulting average. The signed distance captures more of the features that are common to the two shapes, whereas the symmetric difference captures the features of both.
Fig. 5. Storm. (First row) a collection of images © EUMETSAT 2001; (second row) affine motion of the storm based on two adjacent time instants, superimposed on the original images; (bottom) moving average of order 1.
Fig. 6. Jellyfish. Affine registration (top), moving average and affine motion (bottom) for the jellyfish in Fig. 1. (Bottom right) trajectory of the jellyfish (affine component of the group).
Fig. 5 shows the results of tracking a storm. The affine moving average is computed, and the resulting affine motion is displayed. The same is done for the jellyfish in Fig. 6. Figs. 7 and 8 are meant to challenge the assumptions underlying our method. The pairs of objects chosen, in fact, are not simply local deformations of one another. Therefore, the notion of shape average is not meaningful per se in this context, but serves to compute the change of (affine) pose between the two shapes (Fig. 7). Nevertheless, it is interesting to observe how the shape average allows registering even apparently disparate shapes. Fig. 8 shows a representative example from an extensive set of experiments. In some cases the shape average contains disconnected components, in others it includes small parts that are shared by the original dataset, whereas in yet others it removes parts that are not consistent among the initial shapes (e.g. the tails). Notice that our framework is not meant to capture such a wide range of variations. In particular, it does
Fig. 7. Registering non-equivalent shapes. Left to right: two binary images representing two different shapes; affine registration; corresponding affine shape; approximation of the original shapes using the registration of the shape average based on the set-symmetric difference. Results for the signed distance score are shown in Fig. 8. This example is shown to highlight the limitations of our method.
Fig. 8. Biological shapes. For the signed distance score, we show the original shape with the affine shape average registered and superimposed. It is interesting to notice that different “parts” are captured in the average only if they are consistent in the two shapes being matched and, in some cases, the average shape is disconnected.
not possess a notion of “parts” and it is neither hierarchical nor compositional. In the context of non-equivalent shapes (shapes for which there is no group action mapping one exactly onto the other), the average shape serves purely as a support to define and compute motion in a collection of images of a given deforming shape. Fig. 9 shows the results of simultaneously segmenting and computing the average motion and registration for 4 images from a database of magnetic resonance images of the corpus callosum.
Fig. 9. Corpus Callosum. (Top row) a collection of (MR) images from different patients (courtesy of J. Dutta), further translated, rotated and distorted to emphasize their misalignment (second row). Aligned contour (second row, dark gray, superimposed on the original regions) and shape average (bottom) corresponding to the affine group.
Finally, Fig. 10 shows an application of the same technique to simultaneously register and average two 3D surfaces. In particular, two 3D models in different
Fig. 10. 3D Averaging and registration (left) two images of 3D models in different poses (center) registered average (right) affine average. Note that the original 3D surfaces are not equivalent. The technique presented allows “stitching” and registering different 3D models in a natural way.
poses are shown. Our algorithm can be used to register the surfaces and average them, thus providing a natural framework to integrate surface and volume data. We wish to thank S. Belongie and B. Kimia for test data and suggestions.
Region Matching with Missing Parts

Alessandro Duci¹,⁴, Anthony J. Yezzi², Sanjoy Mitter³, and Stefano Soatto⁴

¹ Scuola Normale Superiore, Pisa, Italy 56100, [email protected]
² Georgia Institute of Technology, Atlanta, GA 30332, [email protected]
³ Massachusetts Institute of Technology, Cambridge, MA 02139, [email protected]
⁴ University of California at Los Angeles, Los Angeles, CA 90095, [email protected]
Abstract. We present a variational approach to the problem of registering planar shapes despite missing parts. Registration is achieved through the evolution of a partial differential equation that simultaneously estimates the shape of the missing region, the underlying “complete shape” and the collection of group elements (Euclidean or affine) corresponding to the registration. Our technique applies both to shapes, for instance represented as characteristic functions (binary images), and to grayscale images, where all intensity levels evolve simultaneously in a partial differential equation. It can therefore be used to perform “region inpainting” and to register collections of images despite occlusions. The novelty of the approach lies in the fact that, rather than estimating the missing region in each image independently, we pose the problem as a joint registration with respect to an underlying “complete shape” from which the complete version of the original data is obtained via a group action.

Keywords: shape, variational, registration, missing part, inpainting
1 Introduction
Consider different images of the same scene, taken for instance from a moving camera, where one or more of the images have been corrupted, so that an entire part is missing. This problem arises, for instance, in image registration with missing data, in the presence of occlusions, in shape recognition when one or more parts of an object may be absent in each view, and in “movie inpainting” where one or more frames are damaged and one wants to “transfer” adjacent frames to fill in the damaged part. We consider a simplified version of the problem, where we have a compact region in each image i, bounded by a closed planar contour, γi , and a region of the
This research is sponsored in part by ARO grant DAAD19-99-1-0139 and Intel grant 8029.
image, with support described by a characteristic function $\chi_i$, is damaged. We do not know a priori what the region $\chi_i$ is, and we do not know the transformation mapping one image onto the other. However, we make the assumption that such a transformation can be well approximated by a finite-dimensional group $g_i$, for instance the affine or the projective group. In addition, we do not know the value of the image in the missing region. Therefore, given a sequence of images, one has to simultaneously infer the missing regions $\chi_i$ as well as the transformations $g_i$ and the occluded portions of each contour $\gamma_i$. We propose an algorithm that stems from a simple generative model, where an unknown contour $\mu_0$, the “complete shape,” is first transformed by a group action $g_i$, and then occluded by a region $\chi_i$ (Figure 1). Therefore, one only estimates the complete shape and the group actions:
$$\hat g_1, \ldots, \hat g_k, \hat\mu_0, \hat\chi_1, \ldots, \hat\chi_k = \arg\min_{g_i,\,\mu_0} \sum_{i=1}^{k} \phi\big(\chi_i(\gamma_i),\ \chi_i \circ g_i(\mu_0)\big) \qquad (1)$$
for a given discrepancy measure φ. A simpler case is when the occlusion occurs at the same location in all shapes; in this case, there is only one indicator function χ0 that acts on the complete shape µ0 .
Fig. 1. A contour undergoes a global motion and local occlusions.
2 Background and Prior Work
The analysis of “Shape Spaces” was pioneered in Statistics by Kendall, Mardia and Carne among others [12,17,6,20]. Shapes are defined as the equivalence classes of points modulo the similarity group, $\mathbb{R}^{MN}/SE(M) \times \mathbb{R}$. These tools have proven useful in contexts where N distinct “landmarks” are available, for instance in comparing biological shapes with N distinct “parts.” Various extensions to missing points have been proposed, mostly using expectation-maximization (EM), alternating between computing the sufficient statistics of the missing data and performing shape analysis in the standard framework of shape spaces. However, since this framework is essentially tied to representing shapes as collections of points, these methods do not extend to the case of continuous curves and surfaces in a straightforward way, and we will therefore not pursue them further here. In computational vision, a wide literature exists for the problem of “matching” or “aligning” discrete representations of collections of points, for instance organized in graphs or trees [16,8]. A survey of shape matching algorithms is presented in [29]. Just to highlight some representative algorithms, Belongie et al. [2] propose comparing planar contours based on the “shape context” of each point along the contour. This work is positioned somewhere in between landmark or feature-based approaches and image-based ones, similarly to [7]. Kang et al. [11] have recently approached the multiple image inpainting problem using a shape context representation. Deformable templates, pioneered by Grenander [9], do not rely on a pointwise representation; rather, images are deformed under the action of a group (possibly infinite-dimensional) and compared for the best match in an image-based approach [32,3]. Grenander's work sparked a current that has been particularly successful in the analysis of medical images, for instance [10]. In this work we would like to retain some of the features of deformable templates, but extend them to modeling missing parts. A somewhat different line of work is based on variational methods and the solution of partial differential equations (PDEs) to deform planar contours and quantify their “distance.” Not only can the notion of alignment or distance be made precise [1,31,21,14,26], but quite sophisticated theories of shape, that encompass perceptually relevant aspects, can be formalized in terms of the properties of the evolution of PDEs (e.g. [15,13]). The variational framework has also been proven very effective in the analysis of medical images [19,28,18]. Zhu et al. [33] have also extended some of these ideas to a probabilistic context. Other techniques rely on matching different representations, for instance skeletons [13], that are somewhat robust to missing parts. In [27] a similar approach is derived using a generic representation of 2-D shape in the form of structural descriptions from the shocks of a curve evolution process, acting on bounding contours. The possibility of performing multiple registration by finding a mean shape and a rigid transformation was studied by Pennec [24] in the case of 3D landmarks. Leung, Burl and Perona [5] described an algorithm for locating quasi-frontal
views of human faces in cluttered scenes that can handle partial occlusions. The algorithm is based on coupling a set of local feature detectors with a statistical model of the mutual distances between facial features. In this work, we intend to extend these techniques to situations where parts of the image cannot be used for matching (see [30]) and at the same time landmark approaches fail. In the paper of Berger and Gerig [4], a deformable area-based template matching is applied to low contrast medical images. In particular, they use a least squares template matching (LSM) with an automatic quality control of the resulting match. Nastar, Moghaddam and Pentland [22] proposed to use a statistical learning method for image matching and interpolation of missing data. Their approach is based on the idea of modeling the image as a deformable intensity surface represented by a 3D mesh and using principal component analysis to provide a priori knowledge about object-specific deformations. Rangarajan, Chui and Mjolsness [25] defined a novel distance measure for non-rigid image matching where probabilistic generative models are constructed for the non-rigid matching of point-sets.

2.1 Contributions of This Paper
This work presents a framework and an algorithm to match regions despite missing parts. To the best of our knowledge, work in this area, using region-based variational methods, is novel. Our framework relies on the notion of “complete shape” which is inferred simultaneously with the group actions that map the complete shape onto the incomplete ones. The complete shape and the registration parameters are defined as the ones that minimize a cost functional, and are computed using an alternating minimization approach where a partial differential equation is integrated using level set methods [23].
3 Matching with Missing Parts
The formulation of the problem and the derivation of the evolution equations are introduced in this section for the case of a planar shape under isometric transformations (rigid motions). In this case, the dimension of the space is 2 and the determinant of the Jacobian of the group is J(g) = 1 for all the elements g of the group G. The main assumption about the shape is that it must be a regular domain, that is, an open and bounded subset of $\mathbb{R}^2$ with a finite number of connected components and a piece-wise $C^\infty$ boundary. This regularity is required to avoid singular pathologies and to make the computation possible. The main notation that will be used in this section is listed below.

Notation
– $\bar\gamma_i$, $\bar\mu$ are regular domains in $\mathbb{R}^2$
– $\gamma_i$, µ are the boundaries of $\bar\gamma_i$, $\bar\mu$
– $\chi(\gamma)$ is the characteristic function of the set $\bar\gamma$
– $A(\gamma)$ is the area (volume) of the region $\bar\gamma$
– $\langle\cdot,\cdot\rangle$ is the usual inner product
3.1 Formulation of the Problem
Let $\bar\gamma_1, \ldots, \bar\gamma_k$ be regular domains of $\mathbb{R}^2$, all obtained from the same regular domain $\bar\mu \subset \mathbb{R}^2$ by composition with characteristic functions $\chi_1, \ldots, \chi_k : \mathbb{R}^2 \to \mathbb{R}$ and actions of Lie group elements $g_1, \ldots, g_k \in G$. We want to find the best solution in the sense expressed by the functional
$$\phi = \sum_{i=1}^{k} A\big(\bar\gamma_i \setminus g_i(\bar\mu)\big) + \alpha\, A(\bar\mu) \qquad (2)$$
where A denotes the area, α is a design constant, $\bar\mu$, $\chi_i$ and $g_i$ are the unknowns, and the sets $\bar\gamma_i$ and the structure of G are given. The rationale behind the choice of the cost function φ is that one wants to maximize the overlap between the incomplete shapes and the registered complete shape (first term) while keeping the complete shape as small as possible (second term). This is equivalent to minimizing the area of the $\bar\gamma_i$ that is not covered by the image of the complete shape after the application of the group action $g_i$. At the same time, one needs to minimize a quantity related to the complete shape (e.g. the area) to constrain the solution to be non-singular. Without the second term, it is always possible to choose a compact complete shape that covers all the incomplete ones (e.g. a big square) and minimizes the first term.
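The cost (2) can be read off directly for binary masks. The sketch below (ours) evaluates φ for a candidate complete shape µ and a list of affine poses, resampling $g_i(\mu)$ on the pixel grid; the (A, t) encoding of each group element and the default value of α are assumptions.

```python
import numpy as np
from scipy.ndimage import affine_transform

def apply_group(mu, A, t):
    """Return the mask of g(mu) for the affine map g(x) = A x + t (x in (row, col))."""
    # affine_transform pulls back: output(o) = input(M o + off); to picture g(mu) we
    # therefore pass the inverse map, M = A^{-1}, off = -A^{-1} t.
    Ainv = np.linalg.inv(A)
    return affine_transform(mu.astype(float), Ainv, offset=-Ainv @ t, order=0) > 0.5

def phi(gammas, mu, poses, alpha=0.1):
    total = alpha * float(mu.sum())                          # alpha * A(mu)
    for gamma, (A, t) in zip(gammas, poses):
        gi_mu = apply_group(mu, A, t)
        total += float(np.logical_and(gamma, ~gi_mu).sum())  # A(gamma_i \ g_i(mu))
    return total
```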
3.2 Minimization with Respect to Shape
The functional φ can be written in integral form
$$\phi = \sum_{i=1}^{k} \int_{\bar\gamma_i} \big(1 - \chi_{g_i\bar\mu}\big)\,dx + \alpha \int_{\bar\mu} dx \qquad (3)$$
and using the characteristic function notation
$$\phi = \sum_{i=1}^{k} \int \chi(\gamma_i)\big(1 - \chi(g_i\mu)\big)\,dx + \alpha \int \chi(\mu)\,dx \qquad (4)$$
$$\phantom{\phi} = \sum_{i=1}^{k} \int \chi(\gamma_i)\,dx - \sum_{i=1}^{k} \int \chi(\gamma_i)\,\chi(g_i\mu)\,dx + \alpha \int \chi(\mu)\,dx. \qquad (5)$$
Since the first term of φ is independent of µ and g, and remembering that the $g_i$ are isometries, the problem of minimizing φ is equivalent to that of finding the minimum of the energy
$$E(g_i, \mu) = \int_{\bar\mu} \Big(\alpha - \sum_{i=1}^{k} \chi(g_i^{-1}\gamma_i)\Big)\,dx. \qquad (6)$$
One can show that the first variation of this integral along the normal direction of the contour is simply its integrand, by using the divergence theorem, and therefore conclude that a gradient flow that minimizes the energy E with respect to the shape of the contour µ is given by
$$\frac{\partial \mu}{\partial t} = \Big(\alpha - \sum_{i=1}^{k} \chi(g_i^{-1}\gamma_i)\Big)\,N \qquad (7)$$
where N is the normal vector field of the contour µ.
3.3 Minimization with Respect to the Group Action
In order to compute the variation of the functional φ with respect to the group actions $g_i$, we first notice that there is only one term that depends on $g_i$ in equation (6). Therefore, we are left with having to compute the variation of
$$-\int_{\bar\mu} \chi(g_i^{-1}\gamma_i)\,dx. \qquad (8)$$
In order to simplify the notation, we note that the term above is of the generic form
$$W(g) \doteq \int_{\bar\mu} f(g(x))\,dx \qquad (9)$$
with $f = \chi(\gamma_i)$. Therefore, we consider the variation of W with respect to the components of the exponential coordinates¹ $\xi_i$ of the group $g_i = e^{\widehat\xi_i}$:
$$\frac{\partial W}{\partial \xi_i} = \frac{\partial}{\partial \xi_i} \int_{\bar\mu} f(g(x))\,dx \qquad (10)$$
$$\phantom{\frac{\partial W}{\partial \xi_i}} = \int_{\bar\mu} \Big\langle \nabla f(g(x)),\ \frac{\partial}{\partial \xi_i} g(x) \Big\rangle\,dx. \qquad (11)$$
Using Green's theorem it is possible to write the variation as an integral along the contour of µ and one over $g(\bar\mu)$ with a divergence integrand:
$$\frac{\partial W}{\partial \xi_i}(g) = \int_{\partial g(\bar\mu)} f(y)\,\Big\langle \frac{\partial}{\partial \xi_i} g(g^{-1}(y)),\ N \Big\rangle\,ds - \int_{g(\bar\mu)} f(y)\,\nabla_y \cdot \Big(\frac{\partial}{\partial \xi_i} g(g^{-1}(y))\Big)\,dy. \qquad (12)$$
Therefore, the derivative with respect to the group action is
$$\frac{\partial \phi}{\partial \xi_i} = \int_{\bar\mu \cap g_i^{-1}(\gamma_i)} \Big\langle \frac{\partial}{\partial \xi_i} g_i(x),\ g_i^* N_i \Big\rangle\,ds. \qquad (13)$$
¹ Every finite-dimensional Lie group admits exponential coordinates. For the simple case of the isometries of the plane, the exponential coordinates can be computed in closed form using Rodrigues' formula.
where N is the normal vector field to the boundary of $g_i^{-1}(\gamma_i)$ and $g_i^*$ is the push-forward induced by the map $g_i$. Detailed calculations are reported in Appendix A.

3.4 Evolution Equations
Within the level set framework, a function ψ is evolved instead of the contour µ. The function ψ is negative inside µ, positive outside, and zero on the contour. The evolution of ψ depends on the velocity of µ via the Hamilton-Jacobi equation
$$\begin{cases} u_t + \Big(\alpha - \sum_{i=1}^{k} \chi(g_i^{-1}\gamma_i)\Big)\,|\nabla u| = 0, \\ u(0, x) = \psi^{(t)}(x). \end{cases} \qquad (14)$$
The evolution equations follow:
$$\psi^{(t+1)}(x) = u(1, x\,|\,\psi^{(t)}) \qquad (15)$$
$$\bar\mu^{(t)} = \{x : \psi(x) < 0\} \qquad (16)$$
$$\xi_i^{(t+1)} = \xi_i^{(t)} - \beta_\xi \int_{\bar\mu \cap g_i^{-1}(\gamma_i)} \Big\langle \frac{\partial}{\partial \xi_i} g_i(x),\ g_i^* N_i \Big\rangle\,ds \qquad (17)$$
where $\beta_\xi$ is a step parameter and $u(\cdot, \cdot\,|\,\psi^{(t)})$ is the solution of (14) with initial condition $u(0, x) = \psi^{(t)}(x)$.
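A minimal discretization of (14)-(15) is sketched below (our own): the speed $\alpha - \sum_i \chi(g_i^{-1}\gamma_i)$ advects ψ through a first-order upwind (Godunov-type) scheme. The implementation described in Sect. 4 reports a first-order central scheme and an ultra narrow band, neither of which is reproduced here, and the warped masks $\chi(g_i^{-1}\gamma_i)$ are assumed to be computed elsewhere (e.g. as in the sketch after Eq. (2)).

```python
import numpy as np

def upwind_grad_mag(psi, F):
    # Godunov-style upwind gradient magnitude for the term F * |grad psi|
    dxm = np.diff(psi, axis=1, prepend=psi[:, :1]); dxp = np.diff(psi, axis=1, append=psi[:, -1:])
    dym = np.diff(psi, axis=0, prepend=psi[:1, :]); dyp = np.diff(psi, axis=0, append=psi[-1:, :])
    pos = np.sqrt(np.maximum(dxm, 0)**2 + np.minimum(dxp, 0)**2 +
                  np.maximum(dym, 0)**2 + np.minimum(dyp, 0)**2)
    neg = np.sqrt(np.minimum(dxm, 0)**2 + np.maximum(dxp, 0)**2 +
                  np.minimum(dym, 0)**2 + np.maximum(dyp, 0)**2)
    return np.where(F > 0, pos, neg)

def level_set_step(psi, warped_gammas, alpha=0.5, dt=0.4):
    # speed of Eq. (14): alpha minus the number of warped shapes covering each pixel
    F = alpha - np.sum([g.astype(float) for g in warped_gammas], axis=0)
    return psi - dt * F * upwind_grad_mag(psi, F)   # one step of psi^{(t+1)}, Eq. (15)
```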
3.5 Generalization to Graylevel Images
There are many possible generalizations of the above formulas. Here we present the case of gray level images. The case of color images is very similar, so the equations will not be stated. The main idea is to work with a generic level set between the minimum and the maximum intensity levels and write the functional to match all the level sets simultaneously using the partial differential equation in Eq. (25). To derive this equation, let k images of the same object $I_i$, $i = 1, \ldots, k$, be given and let
$$\bar\gamma_i^\delta = \{x : I_i < \delta\} \qquad (18)$$
be the generic δ-underlevel of $I_i$, where δ is a positive constant. Then let
$$\mu : \mathbb{R}^2 \to \mathbb{R} \qquad (19)$$
be a function that represents the complete image intensity (the choice of the name µ is not casual, since this will turn out to be the complete shape for the case of greylevel images) and
$$\bar\mu^\delta = \{x : \mu < \delta\}, \qquad \mu^\delta = \partial\bar\mu^\delta \qquad (20)$$
respectively its underlevel and the boundary of the underlevel. From the same considerations developed for the case of a contour, we write the functional φ as
$$\phi = \sum_{i=1}^{k}\sum_{j=1}^{n} A\big(\bar\gamma_i^\delta \setminus g_i(\bar\mu^j)\big) + \alpha \sum_{j=1}^{n} A(\mu^j) \qquad (21)$$
where j is a discretization of the level set δ,
$$\phi = \sum_{i=1}^{k}\sum_{j=1}^{n} \int \chi(\gamma_i^\delta)\big(1 - \chi(g_i\mu^j)\big)\,dx + \alpha \sum_{j=1}^{n} \int \chi(\mu^j)\,dx \qquad (22)$$
and in the same way the derivatives of the functional can be obtained as
$$\frac{\partial \mu^j}{\partial t} = \Big(\alpha - \sum_{i=1}^{k} \chi(g_i^{-1}\gamma_i^\delta)\Big)\,N^j, \qquad (23)$$
$$\frac{\partial \phi}{\partial \xi_i} = -\sum_{j=1}^{n} \int_{\bar\mu^j \cap g_i^{-1}(\gamma_i^\delta)} \Big\langle \frac{\partial}{\partial \xi_i} g_i(x),\ g_i^* N_i \Big\rangle\,ds. \qquad (24)$$
After letting j go to the limit, Eq. (23) gives the Hamilton-Jacobi equation for the function µ:
$$\mu_t(t, x) + \Big(\sum_{i=1}^{k} H\big(I_i(g_i(x)) - \mu(t, x)\big) - \alpha\Big)\,|\nabla\mu(t, x)| = 0 \qquad (25)$$
where
$$H(s) = \begin{cases} 1 & \text{if } s > 0, \\ 0 & \text{if } s \le 0. \end{cases} \qquad (26)$$
Therefore, the evolution equation for both the function µ and the parameters $g_i$ is given by
$$\begin{cases} \mu^{(t+1)}(x) = \mu^{(t)}(x) - \beta_\mu \Big(\sum_{i=1}^{k} H\big(I_i(g_i^{(t)}(x)) - \mu^{(t)}(x)\big) - \alpha\Big)\,|\nabla\mu^{(t)}(x)| \\[4pt] \xi_i^{(t+1)} = \xi_i^{(t)} - \beta_\xi \int H\big(I_i(g_i^{(t)}) - \mu^{(t)}\big)\,\Big\langle (\nabla\gamma_i)(g_i^{(t)}(x)),\ \frac{\partial g_i^{(t)}}{\partial \xi_i} \Big\rangle\,dx \end{cases} \qquad (27)$$
where $\beta_\mu$ and $\beta_\xi$ are step parameters.
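The µ-update in (27) is a per-pixel operation once the Heaviside votes are accumulated. The sketch below (ours) implements that first line with the warps $g_i$ taken as the identity purely for brevity; the step sizes are arbitrary.

```python
import numpy as np

def heaviside(s):
    return (s > 0).astype(float)          # Eq. (26)

def mu_step(mu, images, alpha=0.5, beta_mu=0.2):
    # first line of (27) with g_i = identity: each image votes through H(I_i - mu)
    gy, gx = np.gradient(mu)
    grad_mag = np.sqrt(gx**2 + gy**2)
    votes = np.sum([heaviside(I - mu) for I in images], axis=0)
    return mu - beta_mu * (votes - alpha) * grad_mag
```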
4 Experiments
A numerical implementation of the evolution equations (15),(16),(17) has been written within the level set framework proposed by Osher and Sethian [23] using
Fig. 2. Hands. (Top) a collection of images of the same hand in different poses with different missing parts. The support of the missing parts is unknown. (Middle) similarity group, visualized as a “registered” image. (Bottom) estimated template corresponding to the similarity group (“complete shape”).
Fig. 3. Hands Evolution. (Top) evolution of the complete shape for t = 0, . . . , 20. (Bottom) evolution of g2 (µ) for t = 0, . . . , 20.
an ultra narrow band algorithm and an alternating minimization scheme. A set of common shapes (hands, leaves, mice, letters) has been chosen and converted into binary images of 256 × 256 pixels. For each image of this set, a group of binary images with missing parts and with different poses has been generated (Figures 2 and 4, curves $\gamma_1, \ldots, \gamma_5$). The following level set evolution equation has been used,
$$\mu_t + \Big(\alpha - \sum_{i=1}^{k} \chi(g_i^{-1}\gamma_i)\Big)\,|\nabla\mu| = 0 \qquad (28)$$
with a first-order central scheme approximation. The evolution of the pose parameters has been carried out using the integral (17) with the following approximation of the arc length ds:
$$ds \approx |\nabla\mu|\,dx. \qquad (29)$$
Fig. 4. Letter “A.” (Top) a collection of images of the letter “A” in different poses with different missing parts. The support of the missing parts is unknown. (Middle) similarity group (“registration”). (Bottom) estimated template corresponding to the similarity group (“complete shape”).
Fig. 5. Letter “A” Evolution. (Top) evolution of the complete shape for t = 0, . . . , 20. (Bottom) evolution of g3 (µ) for t = 0, . . . , 20.
The evolution has been initialized with the following settings:
$$\mu_{t=0} = \gamma_1, \qquad T_i = B_{\gamma_i} - B_{\gamma_1}, \qquad R_i = \begin{pmatrix} \cos\theta_i & -\sin\theta_i \\ \sin\theta_i & \cos\theta_i \end{pmatrix}, \quad \text{with } \theta_i = \angle\, E_{\gamma_i} O E_{\gamma_1}, \qquad (30)$$
where $B_{\gamma_i}$ is the barycenter of $\gamma_i$ and $E_{\gamma_i}$ is the principal axis of inertia of the region $\bar\gamma_i$. The value of α has been set between 0 and 1. $R_i$, $T_i$ are the rotational and translational components of $g_i = (R_i, T_i) \in SE(2)$. Some results are illustrated in Figures 2 and 4: the $\gamma_j$ are the starting curves and µ is the complete shape in an absolute reference system after the computation. The figures also show the computed $g_j(\mu)$, i.e. the estimated rigid motions. Figure 6 shows the method applied to grayscale images of a face, where different portions have been purposefully erased. Figure 8 shows the results of matching a collection of images of the corpus callosum of a patient. One way to further improve this technique is to use richer finite-dimensional groups G that can account for more than simple rotations and translations. Simple examples include the affine and projective groups, for which the derivation is essentially the same, except for a few changes of measure due to the fact that they are not isometric. The experiments show that this method works very well even when the missing part in each image is pretty significant, up to about 20% of the area.
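The initialization (30) only needs first and second moments of each region. The sketch below (ours) computes the barycenter and the orientation of the principal axis of inertia from a binary mask; the eigenvector-based angle is defined only up to a sign, which is good enough for an initial guess.

```python
import numpy as np

def barycenter(mask):
    ys, xs = np.nonzero(mask)
    return np.array([ys.mean(), xs.mean()])

def principal_axis_angle(mask):
    ys, xs = np.nonzero(mask)
    pts = np.stack([ys - ys.mean(), xs - xs.mean()])
    evals, evecs = np.linalg.eigh(np.cov(pts))
    v = evecs[:, np.argmax(evals)]           # axis of largest second moment
    return np.arctan2(v[0], v[1])            # angle in (row, col) coordinates, up to pi

def initial_pose(gamma_i, gamma_1):
    T = barycenter(gamma_i) - barycenter(gamma_1)                    # T_i in (30)
    theta = principal_axis_angle(gamma_i) - principal_axis_angle(gamma_1)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])                  # R_i in (30)
    return R, T
```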
Fig. 6. Faces (Top) a collection of images of the same face in different poses with different missing parts. The support of the missing parts is unknown. (Middle) similarity group, visualized as a “registered” image. (Bottom) estimated template corresponding to the similarity group (“complete image”).
Fig. 7. Face evolution. (Top) evolution of the complete image for t = 0, . . . , 189. (Bottom) evolution of g5 (µ) for t = 0, . . . , 189.
Fig. 8. Corpus Callosum. (Top) a collection of images of the same corpus callosum in different poses with different missing parts. The support of the missing parts is unknown. (Middle) similarity group, visualized as a “registered” image. (Bottom) estimated template corresponding to the similarity group (“complete image”).
Fig. 9. Corpus Callosum evolution. (Top) evolution of the complete image for t = 0, . . . , 199. (Bottom) evolution of g2 (µ) for t = 0, . . . , 199.
References 1. R. Azencott, F. Coldefy, and L. Younes. A distance for elastic matching in object recognition. Proc. 13th Intl. Conf. on Patt. Recog, 1:687–691, 1996. 2. S. Belongie, J. Malik, and J. Puzicha. Matching shapes. In Proc. of the IEEE Intl. Conf. on Computer Vision, 2001. 3. D. Bereziat, I. Herlin, and L. Younes. Motion detection in meteorological images sequences: Two methods and their comparison. In Proc. of the SPIE, 1997. 4. M. Berger and G. Gerig. Deformable area-based template matching with application to low contrast imagery, 1998. 5. M. Burl, T. Leung, and P. Perona. Face localization via shape statistics. In Proc. Intl. Workshop on automatic face and gesture recognition, pages 154–159, Zurich, June 1995. IEEE Computer Soc. 6. T. K. Carne. The geometry of shape spaces. Proc. of the London Math. Soc. (3) 61, 3(61):407–432, 1990. 7. H. Chui and A. Rangarajan. A new algorithm for non-rigid point matching. In Proc. of the IEEE Intl. Conf. on Comp. Vis. and Patt. Recog., pages 44–51, 2000. 8. M. Fischler and R. Elschlager. The representation and matching of pictorial structures. IEEE Transactions on Computers, 22(1):67–92, 1973. 9. U. Grenander. General Pattern Theory. Oxford University Press, 1993. 10. U. Grenander and M. I. Miller. Representation of knowledge in complex systems. J. Roy. Statist. Soc. Ser. B, 56:549–603, 1994. 11. S. H. Kang, T. F. Chan, and S. Soatto. Multiple image inpainting. In Proc. of the 3DPVT, Padova, IT, June 2002. 12. D. G. Kendall. Shape manifolds, procrustean metrics and complex projective spaces. Bull. London Math. Soc., 16, 1984. 13. B. Kimia, A. Tannebaum, and S. Zucker. Shapes, shocks, and deformations i: the components of two-dimensional shape and the reaction-diffusion space. Int’l J. Computer Vision, 15:189–224, 1995. 14. R. Kimmel and A. Bruckstein. Tracking level sets by level sets: a method for solving the shape from shading problem. Computer Vision, Graphics and Image Understanding, (62)1:47–58, 1995. 15. R. Kimmel, N. Kiryati, and A. M. Bruckstein. Multivalued distance maps for motion planning on surfaces with moving obstacles. IEEE Trans. Robot. & Autom., 14(3):427–435, 1998. 16. M. Lades, C. Borbruggen, J. Buhmann, J. Lange, C. von der Malsburg, R. Wurtz, and W. Konen. Distortion invariatn object rcognition in the dynamic link architecture. IEEE Trans. on Computers, 42(3):300–311, 1993. 17. H. Le and D. G. Kendall. The riemannian structure of euclidean shape spaces: a novel environment for statistics. The Annals of Statistics, 21(3):1225–1271, 1993. 18. R. Malladi, R. Kimmel, D. Adalsteinsson, V. Caselles G. Sapiro, and J. A. Sethian. A geometric approach to segmentation and analysis of 3d medical images. In Proc. Mathematical Methods in Biomedical Image Analysis Workshop, pages 21–22, 1996. 19. R. Malladi, J. A. Sethian, and B. C. Vemuri. Shape modeling with front propagation: A level set approach. IEEE Trans. on Pattern Analysis and Machine Intelligence, 17(2):158–175, 1995. 20. K. V. Mardia and I. L. Dryden. Shape distributions for landmark data. Adv. appl. prob., 21(4):742–755, 1989. 21. M. I. Miller and L. Younes. Group action, diffeomorphism and matching: a general framework. In Proc. of SCTV, 1999.
22. C. Nastar, B. Moghaddam, and A. Pentland. Generalized image matching: Statistical learning of physically-based deformations. In Proceedings of the Fourth European Conference on Computer Vision (ECCV’96), Cambridge, UK, April 1996. 23. S. Osher and J. Sethian. Fronts propagating with curvature-dependent speed: algorithms based on hamilton-jacobi equations. J. of Comp. Physics, 79:12–49, 1988. 24. X. Pennec. Multiple Registration and Mean Rigid Shapes - Application to the 3D case. In I.L. Dryden K.V. Mardia, C.A. Gill, editor, Image Fusion and Shape Variability Techniques (16th Leeds Annual Statistical (LASR) Workshop), pages 178–185, july 1996. 25. Anand Rangarajan, Haili Chui, and Eric Mjolsness. A new distance measure for non-rigid image matching. In Energy Minimization Methods in Computer Vision and Pattern Recognition, pages 237–252, 1999. 26. C. Samson, L. Blanc-Feraud, G. Aubert, and J. Zerubia. A level set model for image classification. In in International Conference on Scale-Space Theories in Computer Vision, pages 306–317, 1999. 27. K. Siddiqi, A. Shokoufandeh, S. Dickinson, and S. Zucker. Shock graphs and shape matching, 1998. 28. P. Thompson and A. W. Toga. A surface-based technique for warping threedimensional images of the brain. IEEE Trans. Med. Imaging, 15(4):402–417, 1996. 29. R. C. Veltkamp and M. Hagedoorn. State of the art in shape matching. Technical Report UU-CS-1999-27, University of Utrecht, 1999. 30. A. Yezzi and S. Soatto. Stereoscopic segmentation. In Proc. of the Intl. Conf. on Computer Vision, pages 59–66, 2001. 31. L. Younes. Computable elastic distances between shapes. SIAM J. of Appl. Math., 1998. 32. A. Yuille. Deformable templates for face recognition. J. of Cognitive Neurosci., 3(1):59–70, 1991. 33. S. Zhu, T. Lee, and A. Yuille. Region competition: Unifying snakes, region growing, energy /bayes/mdl for multi-band image segmentation. In Int. Conf. on Computer Vision, pages 416–423, 1995.
A Group Action Derivation Details

$$\begin{aligned}
\frac{\partial W}{\partial \xi_i}
&= \int_{\bar\mu} \Big\langle \nabla f(g(x)),\ \frac{\partial}{\partial \xi_i} g(x) \Big\rangle\,dx
 = \int_{g(\bar\mu)} \Big\langle \nabla f(y),\ \frac{\partial}{\partial \xi_i} g(g^{-1}(y)) \Big\rangle\,dy \\
&= \int_{\partial g(\bar\mu)} f(y)\,\Big\langle \frac{\partial}{\partial \xi_i} g(g^{-1}(y)),\ N \Big\rangle\,ds
 - \int_{g(\bar\mu)} f(y)\,\nabla_y \cdot \Big(\frac{\partial}{\partial \xi_i} g(g^{-1}(y))\Big)\,dy \\
&= \int_{\mu} f(g(x))\,\Big\langle \frac{\partial}{\partial \xi_i} g(x),\ g_* N \Big\rangle\,ds
 + \int_{g(\bar\mu)\cap\bar\gamma_i} \nabla_y \cdot \Big(\frac{\partial}{\partial \xi_i} g(g^{-1}(y))\Big)\,dy \\
&= -\int_{\mu \cap g^{-1}(\bar\gamma_i)} \Big\langle \frac{\partial}{\partial \xi_i} g(x),\ g_* N \Big\rangle\,ds
 + \int_{\partial(\bar\mu \cap g^{-1}(\bar\gamma_i))} \Big\langle \frac{\partial}{\partial \xi_i} g(x),\ g_* N \Big\rangle\,ds \\
&= \int_{\bar\mu \cap g^{-1}(\gamma_i)} \Big\langle \frac{\partial}{\partial \xi_i} g(x),\ g_* N \Big\rangle\,ds
\end{aligned}$$
What Energy Functions Can Be Minimized via Graph Cuts?

Vladimir Kolmogorov and Ramin Zabih
Computer Science Department, Cornell University, Ithaca, NY 14853
[email protected], [email protected]
Abstract. In the last few years, several new algorithms based on graph cuts have been developed to solve energy minimization problems in computer vision. Each of these techniques constructs a graph such that the minimum cut on the graph also minimizes the energy. Yet because these graph constructions are complex and highly specific to a particular energy function, graph cuts have seen limited application to date. In this paper we characterize the energy functions that can be minimized by graph cuts. Our results are restricted to energy functions with binary variables. However, our work generalizes many previous constructions, and is easily applicable to vision problems that involve large numbers of labels, such as stereo, motion, image restoration and scene reconstruction. We present three main results: a necessary condition for any energy function that can be minimized by graph cuts; a sufficient condition for energy functions that can be written as a sum of functions of up to three variables at a time; and a general-purpose construction to minimize such an energy function. Researchers who are considering the use of graph cuts to optimize a particular energy function can use our results to determine if this is possible, and then follow our construction to create the appropriate graph.
1 Introduction and Summary of Results
Many of the problems that arise in early vision can be naturally expressed in terms of energy minimization. The computational task of minimizing the energy is usually quite difficult, as it generally requires minimizing a non-convex function in a space with thousands of dimensions. If the functions have a restricted form they can be solved efficiently using dynamic programming [2]. However, researchers typically have needed to rely on general purpose optimization techniques such as simulated annealing [3,10], which is extremely slow in practice. In the last few years, however, a new approach has been developed based on graph cuts. The basic technique is to construct a specialized graph for the energy function to be minimized, such that the minimum cut on the graph also minimizes the energy (either globally or locally). The minimum cut in turn can be computed very efficiently by max flow algorithms. These methods have been successfully used for a wide variety of vision problems including image restoration [7,8,12,14], stereo and motion [4,7,8,13,16,20,21], voxel occupancy [23], multicamera scene reconstruction [18] and medical imaging [5,6,15]. The output of
these algorithms is generally a solution with some interesting theoretical quality guarantee. In some cases [7,12,13,14,20] it is the global minimum, in other cases a local minimum in a strong sense [8] that is within a known factor of the global minimum. The experimental results produced by these algorithms are also quite good, as documented in two recent evaluations of stereo algorithms using real imagery with dense ground truth [22,24]. Minimizing an energy function via graph cuts, however, remains a technically difficult problem. Each paper constructs its own graph specifically for its individual energy function, and in some of these cases (especially [8,16,18]) the construction is fairly complex. The goal of this paper is to precisely characterize the class of energy functions that can be minimized via graph cuts, and to give a general-purpose graph construction that minimizes any energy function in this class. Our results play a key role in [18], provide a significant generalization of the energy minimization methods used in [4,5,6,8,12,15,23], and show how to minimize an interesting new class of energy functions. In this paper we only consider energy functions involving binary-valued variables. At first glance this restriction seems severe, since most work with graph cuts considers energy functions that involve variables with more than two possible values. For example, the algorithms presented in [8] for stereo, motion and image restoration use graph cuts to address the standard pixel labeling problem that arises in early vision. In a pixel labeling problem the variables represent individual pixels, and the possible values for an individual variable represent, e.g., its possible displacements or intensities. However, many of the graph cut methods that handle multiple possible values actually consider a pair of labels at a time. Even though we only address binary-valued variables, our results therefore generalize the algorithms given in [4,5,6,8,12,15,23]. As an example, we will show in section 4.1 how to use our results to solve the pixel-labeling problem, even though the pixels have many possible labels. An additional argument in favor of binary-valued variables is that any cut effectively assigns one of two possible values to each node of the graph. So in a certain sense any energy minimization construction based on graph cuts relies on intermediate binary variables.

1.1 Summary of Our Results
In this paper we consider two classes of energy functions. Let $\{x_1, \ldots, x_n\}$, $x_i \in \{0, 1\}$, be a set of binary-valued variables. We define the class F 2 to be functions that can be written as a sum of functions of up to 2 variables at a time,
$$E(x_1, \ldots, x_n) = \sum_i E^i(x_i) + \sum_{i<j} E^{i,j}(x_i, x_j). \qquad (1)$$
We define the class F 3 to be functions that can be written as a sum of functions of up to 3 variables at a time,
$$E(x_1, \ldots, x_n) = \sum_i E^i(x_i) + \sum_{i<j} E^{i,j}(x_i, x_j) + \sum_{i<j<k} E^{i,j,k}(x_i, x_j, x_k). \qquad (2)$$
Obviously, the class F 2 is a strict subset of the class F 3. The main result in this paper is a precise characterization of the functions in F 3 that can be minimized using graph cuts, together with a graph construction for minimizing such functions. Moreover, we give a necessary condition for all other classes which must be met for a function to be minimized via graph cuts. Our results also identify an interesting class of energy functions that have not yet been minimized using graph cuts. All of the previous work with graph cuts involves a neighborhood system that is defined on pairs of pixels. In the language of Markov Random Fields [10,19], these methods consider first-order MRF's. The associated energy functions lie in F 2. Our results allow for the minimization of energy functions in the larger class F 3, and thus for neighborhood systems that involve triples of pixels.

1.2 Organization
The rest of the paper is organized as follows. In section 2 we give an overview of graph cuts. In section 3 we formalize the problem that we want to solve. Section 4 contains our main theorem for the class of functions F 2 and shows how it can be used. Section 5 contains our main theorems for other classes. Proofs of our theorems, together with the graph constructions, are deferred to section 6, with some additional details deferred to a technical report [17]. A summary of the actual graph constructions is given in the appendix.
2 Overview of Graph Cuts
Suppose G = (V, E) is a directed graph with two special vertices (terminals), namely the source s and the sink t. An s-t-cut (or just a cut, as we will refer to it later) C = S, T is a partition of the vertices in V into two disjoint sets S and T, such that s ∈ S and t ∈ T. The cost of the cut is the sum of the costs of all edges that go from S to T:
$$c(S, T) = \sum_{u \in S,\ v \in T,\ (u,v) \in E} c(u, v).$$
The minimum s-t-cut problem is to find a cut C with the smallest cost. Due to the theorem of Ford and Fulkerson [9] this is equivalent to computing the maximum flow from the source to sink. There are many algorithms which solve this problem in polynomial time with small constants [1,11]. It is convenient to denote a cut C = S, T by a labeling f mapping from the set of the nodes V − {s, t} to {0, 1} where f (v) = 0 means that v ∈ S, and f (v) = 1 means that v ∈ T . We will use this notation later.
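As a tiny illustration of this notation, the snippet below (ours) builds a toy graph with an off-the-shelf max-flow/min-cut solver and reads the minimum cut back as the labeling f on V − {s, t}; the edge capacities are arbitrary.

```python
import networkx as nx

G = nx.DiGraph()
G.add_edge('s', 'v1', capacity=3.0)
G.add_edge('s', 'v2', capacity=1.0)
G.add_edge('v1', 'v2', capacity=2.0)
G.add_edge('v1', 't', capacity=1.0)
G.add_edge('v2', 't', capacity=4.0)

# minimum_cut returns the cut cost and the partition (S, T) with s in S, t in T
cut_cost, (S, T) = nx.minimum_cut(G, 's', 't')
f = {v: (0 if v in S else 1) for v in G.nodes if v not in ('s', 't')}
print(cut_cost, f)   # cost of the cut and the induced binary labeling
```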
3 Defining Graph Representability
Let us consider a graph G = (V, E) with terminals s and t, thus V = {v1 , . . . , vn , s, t}. Each cut on G has some cost; therefore, G represents the energy
function mapping from all cuts on G to the set of nonnegative real numbers. Any cut can be described by n binary variables x1, . . . , xn corresponding to nodes in G (excluding the source and the sink): xi = 0 when vi ∈ S, and xi = 1 when vi ∈ T. Therefore, the energy E that G represents can be viewed as a function of n binary variables: E(x1, . . . , xn) is equal to the cost of the cut defined by the configuration x1, . . . , xn (xi ∈ {0, 1}). Note that the configuration that minimizes E will not change if we add a constant to E. We can efficiently minimize E by computing the minimum s-t-cut on G. This naturally leads to the question: what is the class of energy functions E for which we can construct a graph that represents E?

We can also generalize our construction. Above we used each node (except the source and the sink) for encoding one binary variable. Instead we can specify a subset V0 = {v1, . . . , vk} ⊂ V − {s, t} and introduce variables only for the nodes in this set. Then there may be several cuts corresponding to a configuration x1, . . . , xk. If we define the energy E(x1, . . . , xk) as the minimum among the costs of all such cuts, then the minimum s-t-cut on G will again yield the configuration which minimizes E. We will summarize the graph constructions that we allow in the following definition.

Definition 1. A function E of n binary variables is called graph-representable if there exists a graph G = (V, E) with terminals s and t and a subset of nodes V0 = {v1, . . . , vn} ⊂ V − {s, t} such that for any configuration x1, . . . , xn the value of the energy E(x1, . . . , xn) is equal to a constant plus the cost of the minimum s-t-cut among all cuts C = S, T in which vi ∈ S, if xi = 0, and vi ∈ T, if xi = 1 (1 ≤ i ≤ n). We say that E is exactly represented by G, V0 if this constant is zero.

The following lemma is an obvious consequence of this definition.

Lemma 2. Suppose the energy function E is graph-representable by a graph G and a subset V0. Then it is possible to find the exact minimum of E in polynomial time by computing the minimum s-t-cut on G.

In this paper we will give a complete characterization of the classes F 2 and F 3 in terms of graph representability, and show how to construct graphs for minimizing graph-representable energies within these classes. Moreover, we will give a necessary condition for all other classes which must be met for a function to be graph-representable. Note that it would suffice to consider only the class F 3 since F 2 ⊂ F 3. However, the condition for F 2 is simpler, so we will consider it separately.
4 The Class F 2
Our main result for the class F 2 is the following theorem.
Theorem 3. Let E be a function of n binary variables from the class F 2, i.e. it can be written as the sum
$$E(x_1, \ldots, x_n) = \sum_i E^i(x_i) + \sum_{i<j} E^{i,j}(x_i, x_j).$$
Then E is graph-representable if and only if each term $E^{i,j}$ satisfies the inequality
$$E^{i,j}(0, 0) + E^{i,j}(1, 1) \le E^{i,j}(0, 1) + E^{i,j}(1, 0).$$
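To make the statement concrete, the sketch below shows one standard way (not necessarily the construction summarized in the appendix of this paper) to represent a single pairwise term satisfying the inequality as a graph and minimize it with a max-flow/min-cut solver. The particular decomposition of E into unary and pairwise pieces, the node names, and the example table are our own.

```python
import networkx as nx

def build_graph(E):
    """E is a dict {(x1, x2): value}; it must satisfy E(0,0)+E(1,1) <= E(0,1)+E(1,0)."""
    A, B, C, D = E[(0, 0)], E[(0, 1)], E[(1, 0)], E[(1, 1)]
    assert A + D <= B + C, "term violates the inequality of Theorem 3"
    G = nx.DiGraph()
    constant = A
    # one possible decomposition: E = A + (C-A) x1 + (D-C) x2 + (B+C-A-D)(1-x1) x2
    def add_unary(node, w):
        # term w*x_node: s->node edge if w > 0, node->t edge plus a constant if w < 0
        nonlocal constant
        if w > 0:
            G.add_edge('s', node, capacity=w)
        elif w < 0:
            constant += w
            G.add_edge(node, 't', capacity=-w)
    add_unary(1, C - A)
    add_unary(2, D - C)
    G.add_edge(1, 2, capacity=B + C - A - D)   # pays exactly when x1 = 0 and x2 = 1
    return G, constant

E = {(0, 0): 4.0, (0, 1): 7.0, (1, 0): 6.0, (1, 1): 2.0}   # regular: 4 + 2 <= 7 + 6
G, constant = build_graph(E)
cut_value, (S, T) = nx.minimum_cut(G, 's', 't')
x = {i: (0 if i in S else 1) for i in (1, 2)}
print("min-cut labeling:", x, "energy:", constant + cut_value)
print("brute force     :", min(E.items(), key=lambda kv: kv[1]))
```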
4.1 Example: Pixel-Labeling via Expansion Moves
In this section we show how to apply this theorem to solve the pixel-labeling problem. In this problem, we are given the set of pixels P and the set of labels L. The goal is to find a labeling l (i.e. a mapping from the set of pixels to the set of labels) which minimizes the energy
E(l) = Σ_{p∈P} Dp (lp) + Σ_{(p,q)∈N} Vp,q (lp, lq),
where N ⊂ P × P is a neighborhood system on pixels. Without loss of generality we can assume that N contains only ordered pairs (p, q) for which p < q (since we can combine two terms Vp,q and Vq,p into one term).
We will show how our method can be used to derive the expansion move algorithm developed in [8]. This problem is shown in [8] to be NP-hard if |L| > 2, and [8] gives an approximation algorithm for minimizing this energy. A single step of this algorithm is an operation called an α-expansion. Suppose that we have some current configuration l0, and we are considering a label α ∈ L. During the α-expansion operation a pixel p is allowed either to keep its old label lp0 or to switch to a new label α: lp = lp0 or lp = α. The key step in the approximation algorithm presented in [8] is to find the optimal expansion operation, i.e. the one that leads to the largest reduction in the energy E. This step is repeated until there is no choice of α for which the optimal expansion operation reduces the energy.
[8] constructs a graph which contains nodes corresponding to pixels in P. The following encoding is used: if f(p) = 0 (i.e., the node p is in the source set) then lp = lp0; if f(p) = 1 (i.e., the node p is in the sink set) then lp = α. Note that the key technical step in this algorithm can be naturally expressed as minimizing an energy function involving binary variables. The binary variables correspond to pixels, and the energy we wish to minimize can be written formally as
E(xp1, . . . , xpn) = Σ_{p∈P} Dp (lp(xp)) + Σ_{(p,q)∈N} Vp,q (lp(xp), lq(xq)),   (3)
where for every p ∈ P
lp(xp) = lp0 if xp = 0, and lp(xp) = α if xp = 1.
We can demonstrate the power of our results by deriving an important restriction on this algorithm. In order for the graph cut construction of [8] to work, the function Vp,q is required to be a metric. In their paper, it is not clear whether this is an accidental property of the construction (i.e., they leave open the possibility that a more clever graph cut construction may overcome this restriction). Using our results, we can easily show this is not the case. Specifically, by Theorem 3, the energy function given in equation (3) is graph-representable if and only if each term Vp,q satisfies the inequality
Vp,q (lp(0), lq(0)) + Vp,q (lp(1), lq(1)) ≤ Vp,q (lp(0), lq(1)) + Vp,q (lp(1), lq(0)),
or
Vp,q (β, γ) + Vp,q (α, α) ≤ Vp,q (β, α) + Vp,q (α, γ),
where β = lp0, γ = lq0. If Vp,q (α, α) = 0, then this is the triangle inequality:
Vp,q (β, γ) ≤ Vp,q (β, α) + Vp,q (α, γ).
This is exactly the constraint on Vp,q that was given in [8].
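As a quick numerical sanity check of this constraint (our own illustration; the truncated-L1 metric, the truncation constant, and the label range are assumptions chosen for the example), the following sketch verifies the inequality above for every choice of β, γ, and α:

```python
import itertools

K = 2
labels = range(5)

def V(a, b):
    # truncated L1 distance: a metric, so V(alpha, alpha) = 0
    return min(abs(a - b), K)

# Regularity of the binary expansion-move terms reduces to the
# triangle inequality V(beta, gamma) <= V(beta, alpha) + V(alpha, gamma).
for beta, gamma, alpha in itertools.product(labels, repeat=3):
    assert V(beta, gamma) + V(alpha, alpha) <= V(beta, alpha) + V(alpha, gamma)
```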
5 More General Classes of Energy Functions
We begin with several definitions. Suppose we have a function E of n binary variables. If we fix m of these variables then we get a new function E′ of n − m binary variables; we will call this function a projection of E. The notation for projections is as follows.
Definition 4. Let E(x1, . . . , xn) be a function of n binary variables, and let I, J be a disjoint partition of the set of indices {1, . . . , n}: I = {i(1), . . . , i(m)}, J = {j(1), . . . , j(n − m)}. Let αi(1), . . . , αi(m) be some binary constants. A projection E′ = E[xi(1) = αi(1), . . . , xi(m) = αi(m)] is a function of n − m variables defined by
E′(xj(1), . . . , xj(n−m)) = E(x1, . . . , xn), where xi = αi for i ∈ I.
We say that we fix the variables xi(1), . . . , xi(m).
Now we give a definition of regular functions.
Definition 5.
• All functions of one variable are regular.
• A function E of two variables is called regular if E(0, 0) + E(1, 1) ≤ E(0, 1) + E(1, 0).
• A function E of more than two variables is called regular if all projections of E of two variables are regular.
Now we are ready to formulate our main theorem for F 3.
Theorem 6. Let E be a function of n binary variables from F 3, i.e. it can be written as the sum
E(x1, . . . , xn) = Σ_i E i (xi) + Σ_{i<j} E i,j (xi, xj) + Σ_{i<j<k} E i,j,k (xi, xj, xk).
Then E is graph-representable if and only if E is regular. Finally, we give a necessary condition for all other classes. Theorem 7. Let E be a function of binary variables. If E is not regular then E is not graph-representable.
6 Proofs
6.1 Basic Lemmas
Definition 8. The functional π will be a mapping from the set of all functions (of binary variables) to the set of real numbers which is defined as follows. For a function E(x1, . . . , xn),
π(E) = Σ_{x1∈{0,1}, ..., xn∈{0,1}} ( Π_{i=1}^{n} (−1)^{xi} ) E(x1, . . . , xn).
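This definition can be transcribed directly (a sketch; a function of n binary variables is assumed to be given as a Python callable):

```python
import itertools

def pi(E, n):
    """The functional pi of Definition 8 for a function E of n binary variables."""
    total = 0
    for xs in itertools.product((0, 1), repeat=n):
        sign = (-1) ** sum(xs)          # product of (-1)^{x_i}
        total += sign * E(*xs)
    return total

# Matches the two-variable expansion E(0,0) - E(0,1) - E(1,0) + E(1,1).
E = lambda x1, x2: {(0, 0): 3, (0, 1): 5, (1, 0): 4, (1, 1): 1}[(x1, x2)]
assert pi(E, 2) == 3 - 5 - 4 + 1
```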
For example, for a function E of two variables, π(E) = E(0, 0) − E(0, 1) − E(1, 0) + E(1, 1). Note that a function E of two variables is regular if and only if π(E) ≤ 0.
It is trivial to check the following properties of π.
Lemma 9.
• π is linear, i.e. for a scalar c and two functions E′, E′′ of n variables, π(E′ + E′′) = π(E′) + π(E′′) and π(c · E′) = c · π(E′).
• If E is a function of n variables that does not depend on at least one of the variables then π(E) = 0.
The next two lemmas provide “building blocks” for constructing graphs for complex functions.
Lemma 10. Let I = {1, . . . , n}, I′ = {i′(1), . . . , i′(n′)} ⊂ I, I′′ = {i′′(1), . . . , i′′(n′′)} ⊂ I be sets of indices. If the functions E′(xi′(1), . . . , xi′(n′)) and E′′(xi′′(1), . . . , xi′′(n′′)) are graph-representable, then so is the function
E(x1, . . . , xn) = E′(xi′(1), . . . , xi′(n′)) + E′′(xi′′(1), . . . , xi′′(n′′)).
Proof. Let us assume for simplicity of notation that E′ and E′′ are functions of all n variables: E′ = E′(x1, . . . , xn), E′′ = E′′(x1, . . . , xn). By the definition of graph representability, there exist constants K′, K′′, graphs G′ = (V′, E′), G′′ = (V′′, E′′) and the set V0 = {v1, . . . , vn}, V0 ⊂ V′ − {s, t}, V0 ⊂ V′′ − {s, t}, such that E′ + K′ is exactly represented by G′, V0 and E′′ + K′′ is exactly represented by G′′, V0. We can assume that the only common nodes of G′ and G′′ are V0 ∪ {s, t}. Let us construct the graph G = (V, E) as the combined graph of G′ and G′′: V = V′ ∪ V′′, E = E′ ∪ E′′. It is rather straightforward to show that G, V0 exactly represent E + (K′ + K′′) (and, therefore, E is graph-representable). Due to space limitations, the proof is omitted but is contained in a technical report [17].
Lemma 11. Suppose E and E′ are two functions of n variables such that, for all x1, . . . , xn,
E′(x1, . . . , xn) = E(x1, . . . , xn) if xk = 0, and E′(x1, . . . , xn) = E(x1, . . . , xn) + C if xk = 1,
for some constants k and C (1 ≤ k ≤ n). Then
• E is graph-representable if and only if E′ is graph-representable;
• E is regular if and only if E′ is regular.
Proof. Let us introduce the following function E^C: for all x1, . . . , xn,
E^C(x1, . . . , xn) = 0 if xk = 0, and E^C(x1, . . . , xn) = C if xk = 1.
We need to show that E^C is graph-representable for any C; the first part of the lemma will then follow from Lemma 10, since E′ ≡ E + E^C and E ≡ E′ + E^{−C}. It is easy to construct a graph which represents E^C. The set of nodes in this graph will be {v1, . . . , vn, s, t} and the set of edges will include the only edge (s, vk) with the capacity C (if C ≥ 0) or the edge (vk, t) with the capacity −C (if C < 0). It is trivial to check that this graph exactly represents E^C (in the former case) or E^C + C (in the latter case).
Now let us assume that one of the functions E and E′ is regular, for example, E. Consider a projection of E′ of two variables: E′[xi(1) = αi(1), . . . , xi(m) = αi(m)], where m = n − 2 and {i(1), . . . , i(m)} ⊂ {1, . . . , n}. We need to show that this function is regular, i.e. that the functional π of this function is nonpositive. Due to the linearity of π we can write
π(E′[xi(1) = αi(1), . . . , xi(m) = αi(m)]) = π(E[xi(1) = αi(1), . . . , xi(m) = αi(m)]) + π(E^C[xi(1) = αi(1), . . . , xi(m) = αi(m)]).
The first term is nonpositive by assumption, and the second term is 0 by Lemma 9.
6.2 Proof of Theorems 3 and 6: The Constructive Part
In this section we will give the constructive part of the proof: given a regular energy function from the class F 3 we will show how to construct a graph which represents it. We will do it in three steps. First we will consider regular functions of two variables, then regular functions of three variables, and finally regular functions of the form given in Theorem 6.
This will also prove the constructive part of Theorem 3. Indeed, suppose a function is from the class F 2 and each term in the sum satisfies the condition given in Theorem 3 (i.e. is regular). Then each term is graph-representable (as we will show in this section) and, hence, the function is graph-representable as well according to Lemma 10. The other direction of Theorems 3 and 6, as well as Theorem 7, will be proven in Section 6.3.
Functions of two variables. Let E(x1, x2) be a function of two variables represented by the table
E = | E(0, 0)  E(0, 1) |
    | E(1, 0)  E(1, 1) |
Lemma 11 tells us that we can add a constant to any column or row without affecting graph representability or regularity. Thus, without loss of generality we can consider only functions E of the form
E = | 0  A |
    | 0  0 |
(we subtracted a constant from the first row to make the upper left element zero, then we subtracted a constant from the second row to make the bottom left element zero, and finally we subtracted a constant from the second column to make the bottom right element zero). π(E) = −A ≤ 0 since we assumed that E is regular; hence, A is non-negative.
Now we can easily construct a graph G which represents this function. It will have four vertices V = {v1, v2, s, t} and one edge E = {(v1, v2)} with the cost c(v1, v2) = A. It is easy to see that G, V0 = {v1, v2} represent E since the only case when the edge (v1, v2) is cut (yielding a cost A) is when v1 ∈ S, v2 ∈ T, i.e. when x1 = 0, x2 = 1.
Note that we did not introduce any additional nodes for representing binary interactions of binary variables. This is in contrast to the construction in [8] which added auxiliary nodes for representing energies that we just considered. Our construction yields a smaller graph and, thus, the minimum cut can potentially be computed faster.
Functions of three variables. Now let us consider a regular function E of three variables. Let us represent it as a table:
E = | E(0, 0, 0)  E(0, 0, 1)  E(0, 1, 0)  E(0, 1, 1) |
    | E(1, 0, 0)  E(1, 0, 1)  E(1, 1, 0)  E(1, 1, 1) |
Two cases are possible.
Case 1. π(E) ≥ 0. We can apply the transformations described in Lemma 11 and get the following function:
E = | 0  0   0   A1 |
    | 0  A2  A3  A0 |
It is easy to check that these transformations preserve the functional π. Hence, A = A0 − (A1 + A2 + A3) = −π(E) ≤ 0. By applying the regularity constraint to the projections E[x1 = 0], E[x2 = 0], E[x3 = 0] we also get A1 ≤ 0, A2 ≤ 0, A3 ≤ 0. We can represent E as the sum of functions
E = | 0  0  0  A1 |   | 0  0   0  0  |   | 0  0  0   0  |   | 0  0  0  0 |
    | 0  0  0  A1 | + | 0  A2  0  A2 | + | 0  0  A3  A3 | + | 0  0  0  A |
We need to show that all terms here are graph-representable; Lemma 10 will then imply that E is graph-representable as well. The first three terms are regular functions depending only on two variables and thus are graph-representable, as was shown in the previous section. Let us consider the last term. The graph G that represents this term can be constructed as follows. The set of nodes will contain one auxiliary node u: V = {v1, v2, v3, u, s, t}. The set of edges will consist of directed edges E = {(v1, u), (v2, u), (v3, u), (u, t)} with capacities A′ = −A. It is easy to check that G, V0 = {v1, v2, v3} exactly represent the function E′(x1, x2, x3) equal to the last term plus the constant A′; the proof is contained in [17].
Case 2. π(E) < 0. This case is similar to Case 1. We can transform the energy to
E = | A0  A3  A2  0 |
    | A1  0   0   0 |
E = | A1  0  0  0 |   | A2  0  A2  0 |   | A3  A3  0  0 |   | A  0  0  0 |
    | A1  0  0  0 | + | 0   0  0   0 | + | 0   0   0  0 | + | 0  0  0  0 |
where A = A0 − (A1 + A2 + A3 ) = π(E) < 0 and A1 ≤ 0, A2 ≤ 0, A3 ≤ 0 since E is regular. The first three terms are regular functions of two variables and the last term can be represented by the graph G = (V, E) where V = {v1 , v2 , v3 , u, s, t} and E = {(u, v1 ), (u, v2 ), (u, v3 ), (s, u)}; capacities of all edges are −A.
Functions of many variables. Finally, let us consider a regular function E which can be written as
E(x1, . . . , xn) = Σ_{i<j<k} E i,j,k (xi, xj, xk),
where i, j, k are indices from the set {1, . . . , n} (we omitted terms involving functions of one and two variables since they can be viewed as functions of three variables). Although E is regular, each term in the sum need not necessarily be regular. However, we can “regroup” terms in the sum so that each term will be regular (and, thus, graph-representable). This can be done using the following lemma and a trivial induction argument.
Definition 12. Let E i,j,k be a function of three variables. The functional N(E i,j,k) is defined as the number of projections of two variables of E i,j,k with a positive value of the functional π. Note that N(E i,j,k) = 0 exactly when E i,j,k is regular.
Lemma 13. Suppose the function E of n variables can be written as
E(x1, . . . , xn) = Σ_{i<j<k} E i,j,k (xi, xj, xk),
where some of the terms are not regular. Then it can be written as
E(x1, . . . , xn) = Σ_{i<j<k} Ẽ i,j,k (xi, xj, xk),
where
Σ_{i<j<k} N(Ẽ i,j,k) < Σ_{i<j<k} N(E i,j,k).
Proof. For simplicity of notation let us assume that the term E 1,2,3 is not regular and π(E 1,2,3[x3 = 0]) > 0 or π(E 1,2,3[x3 = 1]) > 0 (we can ensure this by renaming indices). Let
Ck = max_{αk∈{0,1}} π(E 1,2,k [xk = αk]),   k ∈ {4, . . . , n},
C3 = − Σ_{k=4}^{n} Ck.
Now we will modify the terms E 1,2,3, . . . , E 1,2,n as follows:
Ẽ 1,2,k ≡ E 1,2,k − R[Ck],   k ∈ {3, . . . , n},
where R[C] is the function of two variables x1 and x2 defined by the table
R[C] = | 0  0 |
       | 0  C |
(other terms are unchanged: Ẽ i,j,k ≡ E i,j,k for (i, j) ≠ (1, 2)). We have
E(x1, . . . , xn) = Σ_{i<j<k} Ẽ i,j,k (xi, xj, xk)
since Σ_{k=3}^{n} Ck = 0 and Σ_{k=3}^{n} R[Ck] ≡ 0.
If we consider R[C] as a function of n variables and take a projection of two variables where the two variables that are not fixed are xi and xj (i < j), then the functional π will be C, if (i, j) = (1, 2), and 0 otherwise, since in the latter case the projection actually depends on at most one variable. Hence, the only projections of two variables that could have changed their value of the functional π are Ẽ 1,2,k [x3 = α3, . . . , xn = αn], k ∈ {3, . . . , n}, if we treat Ẽ 1,2,k as functions of n variables, or Ẽ 1,2,k [xk = αk], if we treat Ẽ 1,2,k as functions of three variables.
First let us consider the terms with k ∈ {4, . . . , n}. We have π(E 1,2,k [xk = αk]) ≤ Ck, thus
π(Ẽ 1,2,k [xk = αk]) = π(E 1,2,k [xk = αk]) − π(R[Ck][xk = αk]) ≤ Ck − Ck = 0.
Therefore we did not introduce any nonregular projections for these terms.
Now let us consider the term π(Ẽ 1,2,3 [x3 = α3]). We can write
π(Ẽ 1,2,3 [x3 = α3]) = π(E 1,2,3 [x3 = α3]) − π(R[C3][x3 = α3]) = π(E 1,2,3 [x3 = α3]) − (− Σ_{k=4}^{n} Ck) = Σ_{k=3}^{n} π(E 1,2,k [xk = αk]),
where αk = arg max_{α∈{0,1}} π(E 1,2,k [xk = α]), k ∈ {4, . . . , n}. The last expression is just π(E[x3 = α3, . . . , xn = αn]) and is nonpositive since E is regular by assumption. Hence, the values π(Ẽ 1,2,3 [x3 = 0]) and π(Ẽ 1,2,3 [x3 = 1]) are both nonpositive and, therefore, the number of nonregular projections has decreased.
6.3 Proof of Theorem 7
In this section we will prove a necessary condition for graph representability: if a function of binary variables is graph-representable then it is regular. This will also imply the corresponding directions of Theorems 3 and 6. Note that Theorem 3 needs a little bit of extra reasoning, as follows. Let us consider a graph-representable function E from the class F 2:
E(x1, . . . , xn) = Σ_i E i (xi) + Σ_{i<j} E i,j (xi, xj).
E is regular, as we will prove in this section. This means that the functional π of any projection of E of two variables is nonpositive. Let us consider a projection where the two variables that are not fixed are xi and xj. By Lemma 9 the value of the functional π of this projection is equal to π(E i,j) (all other terms yield zero). Hence, all terms E i,j are regular, i.e. they satisfy the condition in Theorem 3.
Definition 14. Let G = (V, E) be a graph, let v1, . . . , vk be a subset of the nodes V, and let α1, . . . , αk be binary constants whose values are from {0, 1}. We define the graph G[x1 = α1, . . . , xk = αk] as follows. Its nodes are the same as in G, and its edges are all edges of G plus additional edges corresponding to the nodes v1, . . . , vk: for a node vi, we add the edge (s, vi), if αi = 0, or (vi, t), if αi = 1, with an infinite capacity.
It should be obvious that these edges enforce the constraints f(v1) = α1, . . . , f(vk) = αk in the minimum cut on G[x1 = α1, . . . , xk = αk], i.e. if αi = 0 then vi ∈ S, and if αi = 1 then vi ∈ T. (If, for example, αi = 0 and vi ∈ T, then the edge (s, vi) must be cut, yielding an infinite cost, so it would not be the minimum cut.)
Now we can give a definition of graph representability which is equivalent to Definition 1. This new definition will be more convenient for the proof.
Definition 15. We say that the function E of n binary variables is exactly represented by the graph G = (V, E) and the set V0 ⊂ V if for any configuration α1, . . . , αn the cost of the minimum cut on G[x1 = α1, . . . , xn = αn] is E(α1, . . . , αn).
Lemma 16. Any projection of a graph-representable function is graph-representable.
Proof. Let E be a graph-representable function of n variables, and suppose the graph G = (V, E) and the set V0 represent E. Suppose that we fix the variables xi(1), . . . , xi(m). It is straightforward to check that the graph G[xi(1) = αi(1), . . . , xi(m) = αi(m)] and the set V0′ = V0 − {vi(1), . . . , vi(m)} represent the function E′ = E[xi(1) = αi(1), . . . , xi(m) = αi(m)].
This lemma implies that it suffices to prove Theorem 7 only for energies of two variables. Let E(x1, x2) be a graph-representable function of two variables. Let us represent this function as a table:
E = | E(0, 0)  E(0, 1) |
    | E(1, 0)  E(1, 1) |
Lemma 11 tells us that we can add a constant to any column or row without affecting graph representability or regularity. Thus, without loss of generality we can consider only functions E of the form
E = | 0  0 |
    | 0  A |
(we subtracted a constant from the first row to make the upper left element zero, then we subtracted a constant from the second row to make the bottom left element zero, and finally we subtracted a constant from the second column to make the upper right element zero).
We need to show that E is regular, i.e. that π(E) = A ≤ 0. Suppose this is not true: A > 0. Suppose the graph G and the set V0 = {v1, v2} represent E. This means that there is a constant K such that G, V0 exactly represent E′(x1, x2) = E(x1, x2) + K:
E′ = | K  K     |
     | K  K + A |
The cost of the minimum s-t-cut on G is K (since this cost is just the minimum entry in the table for E′); hence, K ≥ 0. Thus the value of the maximum flow from s to t in G is K. Let G0 be the residual graph obtained from G after pushing the flow K. Let E0(x1, x2) be the function exactly represented by G0, V0.
By the definition of graph representability, E′(α1, α2) is equal to the value of the minimum cut (or maximum flow) on the graph G[x1 = α1, x2 = α2]. The following sequence of operations shows one possible way to push the maximum flow through this graph.
• First we take the original graph G and push the flow K; then we get the residual graph G0. (This is equivalent to pushing flow through G[x1 = α1, x2 = α2] where we do not use the edges corresponding to the constraints x1 = α1 and x2 = α2.)
• Then we add the edges corresponding to these constraints; we then get the graph G0[x1 = α1, x2 = α2].
• Finally we push the maximum flow possible through the graph G0[x1 = α1, x2 = α2]; the amount of this flow is E0(α1, α2) according to the definition of graph representability.
The total amount of flow pushed during all steps is K + E0(α1, α2); thus, E′(α1, α2) = K + E0(α1, α2), or
E(α1, α2) = E0(α1, α2).
We have proved that E is exactly represented by G0, V0. The value of the minimum cut/maximum flow on G0 is 0 (it is the minimum entry in the table for E); thus, there is no augmenting path from s to t in G0. However, if we add the edges (v1, t) and (v2, t), then there will be an augmenting path from s to t in G0[x1 = 1, x2 = 1] since E(1, 1) = A > 0. Hence, this augmenting path will contain at least one of these edges and, therefore, either v1 or v2 will be in the path. Let P be the part of this path going from the source until v1 or v2 is first encountered. Without loss of generality we can assume that it is v1. Thus, P is an augmenting path from s to v1 which does not contain the edges that we added, namely (v1, t) and (v2, t).
Finally, let us consider the graph G0[x1 = 1, x2 = 0], which is obtained from G0 by adding edges (v1, t) and (s, v2) with infinite capacities. There is an augmenting path {P, (v1, t)} from the source to the sink in this graph; hence, the minimum cut/maximum flow on it is greater than zero, or E(1, 0) > 0. We get a contradiction.
Acknowledgements. We thank Olga Veksler and Yuri Boykov for their careful reading of this paper, and for valuable comments which greatly improved its readability. We also thank Ian Jermyn for helping us clarify the paper’s motivation. This research was supported by NSF grants IIS-9900115 and CCR-0113371, and by a grant from Microsoft Research.
Appendix: Summary of Graph Constructions
We now summarize the graph constructions used for regular functions. The notation D(v, c) means that we add an edge (s, v) with the weight c if c > 0, or an edge (v, t) with the weight −c if c < 0.
Regular Functions of One Binary Variable. Recall that all functions of one variable are regular. For a function E(x1), we construct a graph G with three vertices V = {v1, s, t}. There is a single edge D(v1, E(1) − E(0)).
Regular Functions of Two Binary Variables. We now show how to construct a graph G for a regular function E(x1, x2) of two variables. It will contain four vertices: V = {v1, v2, s, t}. The edges E are given below.
• D(v1, E(1, 0) − E(0, 0));
• D(v2, E(1, 1) − E(1, 0));
• (v1, v2) with the weight −π(E).
Regular Functions of Three Binary Variables. We next show how to construct a graph G for a regular function E(x1, x2, x3) of three variables. It will contain six vertices: V = {v1, v2, v3, u, s, t}. If π(E) ≥ 0 then the edges are
• D(v1, E(1, 0, 1) − E(0, 0, 1));
• D(v2, E(1, 1, 0) − E(1, 0, 0));
• D(v3, E(0, 1, 1) − E(0, 1, 0));
• (v2, v3) with the weight −π(E[x1 = 0]);
• (v3, v1) with the weight −π(E[x2 = 0]);
• (v1, v2) with the weight −π(E[x3 = 0]);
• (v1, u), (v2, u), (v3, u), (u, t) with the weight π(E).
If π(E) < 0 then the edges are
• D(v1, E(1, 1, 0) − E(0, 1, 0));
• D(v2, E(0, 1, 1) − E(0, 0, 1));
• D(v3, E(1, 0, 1) − E(1, 0, 0));
• (v3, v2) with the weight −π(E[x1 = 1]);
• (v1, v3) with the weight −π(E[x2 = 1]);
• (v2, v1) with the weight −π(E[x3 = 1]);
• (u, v1), (u, v2), (u, v3), (s, u) with the weight −π(E).
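The recipe above can be transcribed directly into code. The sketch below is our own illustration, not the authors' implementation: the graph is returned simply as a list of weighted directed edges, with terminals named "s" and "t" and the auxiliary node named "u".

```python
def pi2(E):
    # the functional pi for a function of two variables given as a 2x2 table
    return E[0][0] - E[0][1] - E[1][0] + E[1][1]

def add(edges, a, b, w):
    # append a directed edge, skipping useless zero-weight edges
    if w != 0:
        edges.append((a, b, w))

def D(edges, v, c):
    # D(v, c): edge (s, v) with weight c if c > 0, or (v, t) with weight -c if c < 0
    if c > 0:
        add(edges, "s", v, c)
    elif c < 0:
        add(edges, v, "t", -c)

def graph_for_one_var(E0, E1):
    edges = []
    D(edges, "v1", E1 - E0)
    return edges

def graph_for_two_vars(E):
    """E: 2x2 table of a regular function E(x1, x2)."""
    edges = []
    D(edges, "v1", E[1][0] - E[0][0])
    D(edges, "v2", E[1][1] - E[1][0])
    add(edges, "v1", "v2", -pi2(E))
    return edges

def graph_for_three_vars(E):
    """E: 2x2x2 table of a regular function E(x1, x2, x3)."""
    proj = {  # two-variable projections E[x1 = a], E[x2 = a], E[x3 = a]
        1: lambda a: [[E[a][i][j] for j in (0, 1)] for i in (0, 1)],
        2: lambda a: [[E[i][a][j] for j in (0, 1)] for i in (0, 1)],
        3: lambda a: [[E[i][j][a] for j in (0, 1)] for i in (0, 1)],
    }
    p = sum((-1) ** (i + j + k) * E[i][j][k]
            for i in (0, 1) for j in (0, 1) for k in (0, 1))  # pi(E)
    edges = []
    if p >= 0:
        D(edges, "v1", E[1][0][1] - E[0][0][1])
        D(edges, "v2", E[1][1][0] - E[1][0][0])
        D(edges, "v3", E[0][1][1] - E[0][1][0])
        add(edges, "v2", "v3", -pi2(proj[1](0)))
        add(edges, "v3", "v1", -pi2(proj[2](0)))
        add(edges, "v1", "v2", -pi2(proj[3](0)))
        for v in ("v1", "v2", "v3"):
            add(edges, v, "u", p)
        add(edges, "u", "t", p)
    else:
        D(edges, "v1", E[1][1][0] - E[0][1][0])
        D(edges, "v2", E[0][1][1] - E[0][0][1])
        D(edges, "v3", E[1][0][1] - E[1][0][0])
        add(edges, "v3", "v2", -pi2(proj[1](1)))
        add(edges, "v1", "v3", -pi2(proj[2](1)))
        add(edges, "v2", "v1", -pi2(proj[3](1)))
        for v in ("v1", "v2", "v3"):
            add(edges, "u", v, -p)
        add(edges, "s", "u", -p)
    return edges
```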
References 1. Ravindra K. Ahuja, Thomas L. Magnanti, and James B. Orlin. Network Flows: Theory, Algorithms, and Applications. Prentice Hall, 1993. 2. Amir Amini, Terry Weymouth, and Ramesh Jain. Using dynamic programming for solving variational problems in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(9):855–867, September 1990. 3. Stephen Barnard. Stochastic stereo matching over scale. International Journal of Computer Vision, 3(1):17–32, 1989. 4. S. Birchfield and C. Tomasi. Multiway cut for stereo and motion with slanted surfaces. In International Conference on Computer Vision, pages 489–495, 1999. 5. Yuri Boykov and Marie-Pierre Jolly. Interactive organ segmentation using graph cuts. In Medical Image Computing and Computer-Assisted Intervention, pages 276–286, 2000. 6. Yuri Boykov and Marie-Pierre Jolly. Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In International Conference on Computer Vision, pages I: 105–112, 2001. 7. Yuri Boykov, Olga Veksler, and Ramin Zabih. Markov Random Fields with efficient approximations. In IEEE Conference on Computer Vision and Pattern Recognition, pages 648–655, 1998. 8. Yuri Boykov, Olga Veksler, and Ramin Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11):1222–1239, November 2001. 9. L. Ford and D. Fulkerson. Flows in Networks. Princeton University Press, 1962. 10. S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721–741, 1984. 11. A. Goldberg and R. Tarjan. A new approach to the maximum flow problem. Journal of the Association for Computing Machinery, 35(4):921–940, October 1988. 12. D. Greig, B. Porteous, and A. Seheult. Exact maximum a posteriori estimation for binary images. Journal of the Royal Statistical Society, Series B, 51(2):271–279, 1989. 13. H. Ishikawa and D. Geiger. Occlusions, discontinuities, and epipolar lines in stereo. In European Conference on Computer Vision, pages 232–248, 1998. 14. H. Ishikawa and D. Geiger. Segmentation by grouping junctions. In IEEE Conference on Computer Vision and Pattern Recognition, pages 125–131, 1998. 15. Junmo Kim, John Fish, Andy Tsai, Cindy Wible, Ala Willsky, and William Wells. Incorporating spatial priors into an information theoretic approach for fMRI data analysis. In Medical Image Computing and Computer-Assisted Intervention, pages 62–71, 2000. 16. Vladimir Kolmogorov and Ramin Zabih. Visual correspondence with occlusions using graph cuts. In International Conference on Computer Vision, pages 508–515, 2001. 17. Vladimir Kolmogorov and Ramin Zabih. What energy functions can be minimized via graph cuts? Technical report CUCS-TR2001-1857, Cornell Computer Science Department, November 2001. 18. Vladimir Kolmogorov and Ramin Zabih. Multi-camera scene reconstruction via graph cuts. In European Conference on Computer Vision, 2002. 19. S. Li. Markov Random Field Modeling in Computer Vision. Springer-Verlag, 1995.
20. S. Roy. Stereo without epipolar lines: A maximum flow formulation. International Journal of Computer Vision, 1(2):1–15, 1999. 21. S. Roy and I. Cox. A maximum-flow formulation of the n-camera stereo correspondence problem. In International Conference on Computer Vision, 1998. 22. Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense twoframe stereo correspondence algorithms. Technical Report 81, Microsoft Research, 2001. To appear in IJCV. An earlier version appears in CVPR 2001 Workshop on Stereo Vision. 23. Dan Snow, Paul Viola, and Ramin Zabih. Exact voxel occupancy with graph cuts. In IEEE Conference on Computer Vision and Pattern Recognition, pages 345–352, 2000. 24. Richard Szeliski and Ramin Zabih. An experimental comparison of stereo algorithms. In B. Triggs, A. Zisserman, and R. Szeliski, editors, Vision Algorithms: Theory and Practice, number 1883 in LNCS, pages 1–19, Corfu, Greece, September 1999. Springer-Verlag.
Multi-camera Scene Reconstruction via Graph Cuts
Vladimir Kolmogorov and Ramin Zabih
Computer Science Department, Cornell University, Ithaca, NY 14853
[email protected],
[email protected]
Abstract. We address the problem of computing the 3-dimensional shape of an arbitrary scene from a set of images taken at known viewpoints. Multi-camera scene reconstruction is a natural generalization of the stereo matching problem. However, it is much more difficult than stereo, primarily due to the difficulty of reasoning about visibility. In this paper, we take an approach that has yielded excellent results for stereo, namely energy minimization via graph cuts. We first give an energy minimization formulation of the multi-camera scene reconstruction problem. The energy that we minimize treats the input images symmetrically, handles visibility properly, and imposes spatial smoothness while preserving discontinuities. As the energy function is NP-hard to minimize exactly, we give a graph cut algorithm that computes a local minimum in a strong sense. We handle all camera configurations where voxel coloring can be used, which is a large and natural class. Experimental data demonstrates the effectiveness of our approach.
1 Introduction
Reconstructing an object’s 3-dimensional shape from a set of cameras is a classic vision problem. In the last few years, it has attracted a great deal of interest, partly due to a number of new applications both in vision and in graphics that require good reconstructions. While the problem can be viewed as a natural generalization of stereo, it is considerably harder. The major reason for this is the difficulty of reasoning about visibility. In stereo matching, most scene elements are visible from both cameras, and it is possible to obtain good results without addressing visibility constraints. In the more general scene reconstruction problem, however, very few scene elements are visible from every camera, so the issue of visibility cannot be ignored.
In this paper, we approach the scene reconstruction problem from the point of view of energy minimization. Energy minimization has several theoretical advantages, but has generally been viewed as too slow for early vision to be practical. Our approach is motivated by some recent work in early vision, where fast energy minimization algorithms have been developed based on graph cuts [6,7,12,14,20,21]. These methods give strong experimental results in practice, as documented in two recent evaluations of stereo algorithms using real imagery with dense ground truth [22,27].
The energy that we minimize has three important properties:
– it treats the input images symmetrically,
– it handles visibility properly, and
– it imposes spatial smoothness while also preserving discontinuities.
We begin with a review of related work. In section 3 we give a precise definition of the problem that we wish to solve, and define the energy that we will minimize. Section 4 describes how we use graph cuts to compute a strong local minimum of this energy. Experimental data is presented in section 5. Some details of the graph cut construction are deferred to the appendix, along with a proof that computing the global minimum of the energy is NP-hard.
2 Related Work
The problem of reconstructing a scene from multiple cameras has received a great deal of attention in the last few years. One extensively-explored approach to this problem is voxel occupancy. In voxel occupancy [18,25] the scene is represented as a set of 3-dimensional voxels, and the task is to label the individual voxels as filled or empty. Voxel occupancy is typically solved using silhouette intersection, usually from multiple cameras but sometimes from a single camera with the object placed on a turntable [8]. It is known that the output of silhouette intersection even without noise is not the actual 3-dimensional shape, but rather an approximation called the visual hull [17].
2.1 Voxel Coloring and Space Carving
Voxel occupancy, however, fails to exploit the consistent appearance of a scene element between different cameras. This constraint, called photo-consistency, is obviously quite powerful. Two well-known recent algorithms that have used photo-consistency are voxel coloring [23] and space carving [16]. Voxel coloring makes a single pass through voxel space, first computing the visibility of each voxel and then its color. There is a constraint on the camera geometry, namely that no scene point is allowed to be within the convex hull of the camera centers. As we will see in section 3, our approach handles all the camera configurations where voxel coloring can be used. Space carving is another voxel-oriented approach that uses the photo-consistency constraint to prune away empty voxels from the volume. Space carving has the advantage of allowing arbitrary camera geometry. One major limitation of voxel coloring and space carving is that they lack a way of imposing spatial coherence. This is particularly problematic because the image data is almost always ambiguous. Another (related) limitation comes from the fact that these methods traverse the volume making “hard” decisions concerning the occupancy of each voxel they analyze. Because the data is ambiguous, such a decision can easily be incorrect, and there is no easy way to undo such a decision later on.
2.2 Energy Minimization Approaches
Our approach to the scene reconstruction problem is to generalize some recently developed techniques that give strong results for stereo matching. It is well known that stereo, like many problems in early vision, can be elegantly stated in terms of energy minimization [19]. The energy minimization problem has traditionally been solved via simulated annealing [2,11], which is extremely slow in practice. In the last few years powerful energy minimization algorithms have been developed based on graph cuts [6,7,12,14,21]. These methods are fast enough to be practical, and yield quite promising experimental results for stereo [22, 27]. Unlike simulated annealing, graph cut methods cannot be applied to an arbitrary energy function; instead, for each energy function to be minimized, a careful graph construction must be developed. In this paper, instead of building a special purpose graph we will use some recent results [15] that give graph constructions for a quite general class of energy functions. While energy minimization has been widely used for stereo, only a few papers [13,20,24] have used it for scene reconstruction. The energy minimization formalism has several advantages. It allows a clean specification of the problem to be solved, as opposed to the algorithm used to solve it. In addition, energy minimization naturally allows the use of soft constraints, such as spatial coherence. In an energy minimization framework, it is possible to cause ambiguities to be resolved in a manner that leads to a spatially smooth answer. Finally, energy minimization avoids being trapped by early hard decisions. Although [20] and [24] use energy minimization via graph cuts, their focus is quite different from ours. [20] uses an energy function whose global minimum can be computed efficiently via graph cuts; however, the spatial smoothness term is not discontinuity preserving, and so the results tend to be oversmoothed. Visibility constraints are not used. [24] computes the global minimum of a different energy function as an alternative to silhouette intersection (i.e., to determine voxel occupancy). Their approach does not deal with photoconsistency at all, nor do they reason about visibility. Our method is perhaps closest to the work of [13], which also relies on graph cuts. They extend the work of [7], which focused on traditional stereo matching, to allow an explicit label for occluded pixels. While the energy function that they use is of a similar general form to ours, they do not treat the input images symmetrically. While we effectively compute a disparity map with respect to each camera, they compute a disparity map only with respect to a single camera. [26] also proposed the use of disparity maps for each camera, in an energy minimization framework. Their optimization scheme did not rely on graph cuts, and their emphasis was on the two-camera stereo problem. They also focused on handling transparency, which is an issue that we do not address. Another approach is to formulate the problem in terms of level sets, and to solve a system of partial differential equations using numerical techniques [9]. The major distinction between this approach and our work is that their problem formulation is continuous while ours is discrete. The discrete energy
minimization approach is known to lead to strong results for two-camera stereo [22,27], and is therefore worth investigating for multiple cameras. In [14] we proposed a (two camera) stereo algorithm that shares a number of properties with the present work. That method treats the input images symmetrically, and is based on graph cuts for energy minimization. It properly handles the limited form of visibility constraint that arises with two cameras (namely, occlusions). The major difference is that [14] is limited to the case of two cameras, while the multi-camera problem is much more difficult.
3 Problem Formulation
Suppose we are given n calibrated images of the same scene taken from different viewpoints (or at different moments of time). Let Pi be the set of pixels in the camera i, and let P = P1 ∪ . . . ∪ Pn be the set of all pixels. A pixel p ∈ P corresponds to a ray in 3D-space. Consider the point of the first intersection of this ray with an object in the scene. Our goal is to find the depth of this point for all pixels in all images. Thus, we want to find a labeling f : P → L where L is a discrete set of labels corresponding to different depths. In the current implementation of our method, labels correspond to increasing depth from a fixed camera.
A pair ⟨p, l⟩ where p ∈ P, l ∈ L corresponds to some point in 3D-space. We will refer to such pairs as 3D-points.
The limitation of our method is that there must exist a function D : R3 → R such that for all scene points P and Q, P occludes Q in a camera i only if D(P) < D(Q). This is exactly the constraint used in voxel coloring [23]. If such a function exists then labels correspond to level sets of this function. In our current implementation, we make a slightly more specific assumption, which is that the cameras must lie in one semiplane looking at the other semiplane. The interpretation of labels will be as follows: each label corresponds to a plane in 3D-space, and a 3D-point ⟨p, l⟩ is the intersection of the ray corresponding to the pixel p and the plane l.
Let us introduce the set of interactions I consisting of (unordered) pairs of 3D-points ⟨p1, l1⟩, ⟨p2, l2⟩ “close” to each other in 3D-space. We will discuss several criteria for “closeness” later. In general, I can be an arbitrary set of pairs of 3D-points satisfying the following constraint:
– Only 3D-points at the same depth can interact, i.e. if {⟨p1, l1⟩, ⟨p2, l2⟩} ∈ I then l1 = l2.
Now we will define the energy function that we minimize. It will consist of three terms:
E(f) = Edata(f) + Esmoothness(f) + Evisibility(f)   (1)
The data term will impose photo-consistency. It is
Edata(f) = Σ_{⟨p,f(p)⟩,⟨q,f(q)⟩∈I} D(p, q)
where D(p, q) is a non-positive value depending on the intensities of pixels p and q. It can be, for example, D(p, q) = min{0, (Intensity(p) − Intensity(q))² − K} for some constant K > 0. Since we minimize the energy, the terms D(p, q) that we sum will be small. These terms are required to be non-positive for technical reasons that will be described in section 4.2. Thus, pairs of pixels p, q which come from the same scene point according to the configuration f will have similar intensities, which causes photo-consistency.
The smoothness term involves a notion of neighborhood; we assume that there is a neighborhood system on pixels
N ⊂ {{p, q} | p, q ∈ P}.
This can be the usual 4-neighborhood system: pixels p = (px, py) and q = (qx, qy) are neighbors if they are in the same image and |px − qx| + |py − qy| = 1. We will write the smoothness term as
Esmoothness(f) = Σ_{{p,q}∈N} V{p,q}(f(p), f(q)).
We will require the term V{p,q} to be a metric. This imposes smoothness while preserving discontinuities, as long as we pick an appropriate robust metric. For example, we can use the robustified L1 distance V(l1, l2) = min(|l1 − l2|, K) for constant K.
The last term will encode the visibility constraint: it will be zero if this constraint is satisfied, and infinity otherwise. We can write this using another set of interactions Ivis which contains pairs of 3D-points violating the visibility constraint:
Evisibility(f) = Σ_{⟨p,f(p)⟩,⟨q,f(q)⟩∈Ivis} ∞.
We require the set Ivis to meet the following condition:
– Only 3D-points at different depths can interact, i.e. if {⟨p1, l1⟩, ⟨p2, l2⟩} ∈ Ivis then l1 ≠ l2.
The visibility constraint says that if a 3D-point ⟨p, l⟩ is present in a configuration f (i.e. l = f(p)) then it “blocks” views from other cameras: if a ray corresponding to a pixel q goes through (or close to) ⟨p, l⟩ then its depth is at most l. Again, we need a definition of “closeness”. We will use the set I for this part of the construction of Ivis. The set Ivis can then be defined as follows: it will contain all pairs of 3D-points ⟨p, l⟩, ⟨q, l′⟩ such that ⟨p, l⟩ and ⟨q, l⟩ interact (i.e. they are in I) and l′ > l.
An important issue is the choice of the set of interactions I. This basically defines a discretization. It can consist, for example, of pairs of 3D-points
Fig. 1. Example of pixel interactions. There is a photo-consistency constraint between the red round point and the blue round point, both of which are at the same depth (l = 2). The red round point blocks camera C2’s view of the green square point at depth l = 3. Color figures are available in the electronic version of this paper.
{⟨p, l⟩, ⟨q, l⟩} such that the distance in 3D-space between these points is less than some threshold. However, with this definition of closeness the number of interactions per pixel may vary greatly from pixel to pixel. Instead, we chose the following criterion: if p is a pixel in the image i, q is a pixel in the image j and i < j, then the 3D-points ⟨p, l⟩, ⟨q, l⟩ interact if the closest pixel to the projection of ⟨p, l⟩ onto the image j is q. Of course, we can include only interactions between certain images, for example, taken by neighboring cameras. Note that it makes sense to include interactions only between different cameras.
An example of our problem formulation in action is shown in figure 1. There are two cameras C1 and C2. There are 5 labels shown as black vertical lines. As in our current implementation, labels are distributed by increasing distance from a fixed camera. Two pixels, p from C1 and q from C2, are shown, along with the red round 3D-point ⟨q, 2⟩ and the blue round 3D-point ⟨p, 2⟩. These points share the same label, and interact (i.e., {⟨p, 2⟩, ⟨q, 2⟩} ∈ I). So there is a photo-consistency term between them. The green square point ⟨q, 3⟩ is at a different label (greater depth), but is behind the red round point. The pair of 3D-points {⟨p, 2⟩, ⟨q, 3⟩} is in Ivis. So if the ray p from camera C1 sees the red round point ⟨p, 2⟩, the ray q from C2 cannot see the green square point ⟨q, 3⟩.
We show in the appendix that minimizing our energy is an NP-hard problem. Our approach, therefore, is to construct an approximation algorithm based on graph cuts that finds a strong local minimum.
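To make the energy function concrete, here is a minimal sketch of how E(f) could be evaluated for a candidate labeling. It is our own illustration under simplifying assumptions: the interaction sets I and Ivis are given explicitly as pairs of 3D-points, the data term follows the example D(p, q) = min{0, (Intensity(p) − Intensity(q))² − K} with K = 30 as in section 5, and the smoothness metric is the robustified L1 distance.

```python
INF = float("inf")

def photo_cost(intensity_p, intensity_q, K=30.0):
    # example data term from section 3: non-positive, small when intensities agree
    return min(0.0, (intensity_p - intensity_q) ** 2 - K)

def smooth_cost(l1, l2, K=2):
    # robustified L1 distance, a metric
    return min(abs(l1 - l2), K)

def energy(f, intensity, I, I_vis, N):
    """f: dict pixel -> label.  intensity: dict pixel -> float.
    I, I_vis: sets of interaction pairs ((p, lp), (q, lq)).  N: set of pixel pairs."""
    e_data = sum(photo_cost(intensity[p], intensity[q])
                 for (p, lp), (q, lq) in I
                 if f[p] == lp and f[q] == lq)
    e_smooth = sum(smooth_cost(f[p], f[q]) for (p, q) in N)
    e_vis = sum(INF for (p, lp), (q, lq) in I_vis
                if f[p] == lp and f[q] == lq)
    return e_data + e_smooth + e_vis
```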
4 Graph Construction
We now show how to efficiently minimize E among all configurations using graph cuts. The output of our method will be a local minimum in a strong sense. In particular, consider an input configuration f and a disparity α. Another configuration f′ is defined to be within a single α-expansion of f when for all pixels p ∈ P either f′(p) = f(p) or f′(p) = α. This notion of an expansion was proposed by [7], and forms the basis for several very effective stereo algorithms [3,7,13,14].
Our algorithm is very straightforward; we simply select (in a fixed order or at random) a disparity α, and we find the lowest-energy configuration within a single α-expansion of f (our local improvement step). If this decreases the energy, then we go there; if there is no α that decreases the energy, we are done. Except for the problem formulation and the choice of energy function, this algorithm is identical to the methods of [7,14].
One restriction on the algorithm is that the initial configuration must satisfy the visibility constraint. This will guarantee that all subsequent configurations will satisfy this constraint as well, since we minimize the energy, and configurations that do not satisfy the visibility constraint have infinite energy.
The critical step in our method is to efficiently compute the α-expansion with the smallest energy. In this section, we show how to use graph cuts to solve this problem.
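The overall loop just described can be sketched as follows (our own sketch; best_expansion stands for the graph-cut step developed in section 4.2 and is assumed here, not implemented):

```python
import random

def expansion_move_algorithm(f, labels, energy, best_expansion):
    """f: initial labeling (must satisfy the visibility constraint, so that
    energy(f) is finite).  best_expansion(f, alpha) is assumed to return the
    lowest-energy labeling within a single alpha-expansion of f; in the paper
    this step is computed with a single minimum cut."""
    current = energy(f)
    improved = True
    while improved:
        improved = False
        order = list(labels)
        random.shuffle(order)          # "in a fixed order or at random"
        for alpha in order:
            f_new = best_expansion(f, alpha)
            e_new = energy(f_new)
            if e_new < current:
                f, current = f_new, e_new
                improved = True
    return f
```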
4.1 Graph Cuts
Let G = ⟨V, E⟩ be a weighted graph with two distinguished terminal vertices {s, t} called the source and sink. A cut C = ⟨V s, V t⟩ is a partition of the vertices into two sets such that s ∈ V s and t ∈ V t. (Note that a cut can also be equivalently defined as the set of edges between the two sets.) The cost of the cut, denoted |C|, equals the sum of the weights of the edges between a vertex in V s and a vertex in V t. The minimum cut problem is to find the cut with the smallest cost. This problem can be solved very efficiently by computing the maximum flow between the terminals, according to a theorem due to Ford and Fulkerson [10]. There are a large number of fast algorithms for this problem (see [1], for example). The worst case complexity is low-order polynomial; however, in practice the running time is nearly linear for graphs with many short paths between the source and the sink, such as the one we will construct.
We will use a result from [15] which says that for energy functions of binary variables of the form
E(x1, . . . , xn) = Σ_{i<j} E i,j (xi, xj)   (2)
it is possible to construct a graph for minimizing it if and only if each term E i,j satisfies the following condition:
E i,j (0, 0) + E i,j (1, 1) ≤ E i,j (0, 1) + E i,j (1, 0).   (3)
If these conditions are satisfied then the graph G is constructed as follows. We add a node vi for each variable xi. For each term E i,j (xi, xj) we add edges which are given in the appendix. Every cut on such a graph corresponds to some configuration x = (x1, . . . , xn), and vice versa: if vi ∈ V s then xi = 0, otherwise xi = 1. The edges of the graph are added in such a way that the cost of any cut is equal to the energy of the corresponding configuration plus a constant. Thus, the minimum cut on G yields the configuration that minimizes the energy.
4.2 α-Expansion
In this section we will show how to convert our energy function into the form of equation (2). Note that it is not necessary to use only terms E i,j for which i < j, since we can swap the variables if necessary without affecting condition (3).
Although pixels can have multiple labels and in general cannot be represented by binary variables, we can do it for the α-expansion operation. Indeed, any configuration f′ within a single α-expansion of the initial configuration f can be encoded by a binary vector x = {xp | p ∈ P} where f′(p) = f(p) if xp = 0, and f′(p) = α if xp = 1. Let us denote the configuration defined by a vector x as f^x. Thus, we have an energy of binary variables:
Ẽ(x) = Ẽdata(x) + Ẽsmoothness(x) + Ẽvisibility(x)
where
Ẽdata(x) = Edata(f^x),  Ẽsmoothness(x) = Esmoothness(f^x),  Ẽvisibility(x) = Evisibility(f^x).
Let's consider each term separately, and show that each satisfies condition (3).
1. Data term.
Ẽdata(x) = Σ_{⟨p,f^x(p)⟩,⟨q,f^x(q)⟩∈I} D(p, q) = Σ_{⟨p,l⟩,⟨q,l⟩∈I} T(f^x(p) = f^x(q) = l) · D(p, q),
where T(·) is 1 if its argument is true and 0 otherwise. Let's consider a single term E p,q (xp, xq) = T(f^x(p) = f^x(q) = l) · D(p, q). Two cases are possible.
1A. l ≠ α. If at least one of the labels f(p), f(q) is not l then E p,q ≡ 0. Suppose that f(p) = f(q) = l. Then E p,q (xp, xq) is equal to D(p, q) if xp = xq = 0, and is zero otherwise. We assumed that D(p, q) is non-positive; thus, condition (3) holds:
E p,q (0, 0) + E p,q (1, 1) = D(p, q) ≤ 0 = E p,q (0, 1) + E p,q (1, 0).
1B. l = α. If at least one of the labels f(p), f(q) is α then E p,q depends on at most one variable, so condition (3) holds. (Suppose, for example, that f(q) = α; then f^x(q) = α for any value of xq and, thus, E p,q (xp, 0) = E p,q (xp, 1) and E p,q (0, 0) + E p,q (1, 1) = E p,q (0, 1) + E p,q (1, 0).)
Now suppose that both labels f(p), f(q) are not α. Then E p,q (xp, xq) is equal to D(p, q) if xp = xq = 1, and is zero otherwise. We assumed that D(p, q) is non-positive; thus, condition (3) holds:
E p,q (0, 0) + E p,q (1, 1) = D(p, q) ≤ 0 = E p,q (0, 1) + E p,q (1, 0).
2. Smoothness term.
Ẽsmoothness(x) = Σ_{{p,q}∈N} V{p,q}(f^x(p), f^x(q)).
Let's consider a single term E p,q (xp, xq) = V{p,q}(f^x(p), f^x(q)). We assumed that V{p,q} is a metric; thus, V{p,q}(α, α) = 0 and V{p,q}(f(p), f(q)) ≤ V{p,q}(f(p), α) + V{p,q}(α, f(q)), or E p,q (1, 1) = 0 and E p,q (0, 0) ≤ E p,q (0, 1) + E p,q (1, 0). Therefore, condition (3) holds.
3. Visibility term.
Ẽvisibility(x) = Σ_{⟨p,f^x(p)⟩,⟨q,f^x(q)⟩∈Ivis} ∞ = Σ_{⟨p,lp⟩,⟨q,lq⟩∈Ivis} T(f^x(p) = lp ∧ f^x(q) = lq) · ∞.
Let’s consider a single term E p,q (xp , xq ) = T (f x (p) = lp ∧ f x (q) = lq ) · ∞. E p,q (0, 0) must be zero since it corresponds to the visibility cost of the initial configuration and we assumed that the initial configuration satisfies the visibility constraint. Also E p,q (1, 1) is zero (if xp = xq = 1, then f x (p) = f x (q) = α and, thus, the conditions f x (p) = lp and f x (q) = lq cannot both be true since Ivis includes only pairs of 3D-points at different depths). Therefore, condition (3) holds since E p,q (0, 1) and E p,q (1, 0) are non-negative.
5 Experimental Results
Some details of our implementation are below. The planes for labels were chosen to be orthogonal to the main axis of the middle camera, and to increase with depth. The labels are distributed so that disparities depend on labels approximately linearly. The depth difference between planes was chosen in such a way that a label change by one results in disparity change by at most one both in x and y directions. We selected the labels α in random order, and kept this order for all iterations. We performed three iterations. (The number of iterations until convergence was at most five but the result was practically the same). We started with an initial solution in which all pixels are assigned to the label having the largest depth. For our data term D we made use of the method of Birchfield and Tomasi [4] to handle sampling artifacts, with a slight variation: we compute intensity intervals for each band (R,G,B) using four neighbors, and then take the average
data penalty. (We used color images; results for grayscale images are slightly worse.) The constant K for the data term was chosen to be 30.
We used a simple Potts model for the smoothness term:
V{p,q}(lp, lq) = U{p,q} if lp ≠ lq, and 0 otherwise.   (4)
The choice of U{p,q} was designed to make it more likely that a pair of adjacent pixels in one image with similar intensities would end up with similar disparities. For pixels {p, q} ∈ N the term U{p,q} was implemented as an empirically selected decreasing function of ∆I(p, q) as follows:
U{p,q} = 3λ if ∆I(p, q) < 5, and λ otherwise,   (5)
where ∆I(p, q) is the average of the values |Intensity(p) − Intensity(q)| over all three bands (R, G, B). Thus, the energy depends only on one parameter λ. For different images we picked λ empirically.
We performed experiments on three datasets: the ‘head and lamp’ image from Tsukuba University, the flower garden sequence, and the Dayton sequence. We used 5 images for the Tsukuba dataset (center, left, right, top, bottom), 8 images for the flower garden sequence and 5 images for the Dayton sequence. For the Tsukuba dataset we performed two experiments: in the first one we used four pairs of interacting cameras (center-left, center-right, center-top, center-bottom), and in the second one we used all 10 possible pairs. For the flower garden and Dayton sequences we used 7 and 4 interacting pairs, respectively (only adjacent cameras interact).
The table below shows image dimensions and running times obtained on a 450MHz UltraSPARC II processor. We used the max flow algorithm of [5], which is specifically designed for the kinds of graphs that arise in vision. Source code for this algorithm is available on the web from http://www.cs.cornell.edu/People/vnk.
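Returning to the smoothness model, equations (4) and (5) amount to the following computation (a sketch; the per-pixel RGB lookup is an assumption of this illustration):

```python
def delta_I(rgb_p, rgb_q):
    # average absolute intensity difference over the three bands (R, G, B)
    return sum(abs(a - b) for a, b in zip(rgb_p, rgb_q)) / 3.0

def U(rgb_p, rgb_q, lam):
    # equation (5): larger penalty for similar-looking neighboring pixels
    return 3 * lam if delta_I(rgb_p, rgb_q) < 5 else lam

def V_potts(l_p, l_q, rgb_p, rgb_q, lam):
    # equation (4): Potts model weighted by U
    return U(rgb_p, rgb_q, lam) if l_p != l_q else 0.0
```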
dataset        | number of images | number of interactions | image size | number of labels | running time
Tsukuba        | 5                | 4                      | 384 x 288  | 16               | 369 secs
Tsukuba        | 5                | 10                     | 384 x 288  | 16               | 837 secs
Flower garden  | 8                | 7                      | 352 x 240  | 16               | 693 secs
Dayton         | 5                | 4                      | 384 x 256  | 16               | 702 secs
We have computed the error statistics for the Tsukuba dataset, which are shown in the table below. We determined the percentage of the pixels where the algorithm did not compute the correct disparity (the “Errors” column), or a disparity within ±1 of the correct disparity (“Gross errors”). For comparison, we have included the results from the best known algorithm for stereo reported in [27], which is the method of [7]. The images are shown in figure 2, along with the areas where we differ from ground truth (black is no difference, gray is a difference of ±1, and white is a larger difference). Inspecting the image shows that we in general achieve greater accuracy at discontinuities; for example, the camera in the background and the lamp
are more accurate. The major weakness of our output is in the top right corner, which is an area of low texture. The behavior of our method in the presence of low texture needs further investigation.
                           | Errors | Gross errors
4 interactions             | 6.13%  | 2.75%
10 interactions            | 4.53%  | 2.30%
Boykov-Veksler-Zabih [7]   | 9.76%  | 3.99%
We have also experimented with the parameter sensitivity of our method for the Tsukuba dataset using 10 interactions. Since there is only one parameter, namely λ in equation (5), it is easy to experimentally determine the algorithm's sensitivity. The table below shows that our method is relatively insensitive to the exact choice of λ.
λ             | 2      | 5      | 10     | 20     | 40
Error         | 4.67%  | 4.53%  | 5.28%  | 5.78%  | 8.27%
Gross errors  | 2.43%  | 2.30%  | 2.30%  | 2.23%  | 2.86%
6 Extensions
It would be interesting to extend our approach to more general camera geometries, for example the case when cameras look at the object in the center from all directions. Labels could then correspond to spheres. More precisely, there would be two labels per sphere corresponding to the closest and farthest intersection points.
Acknowledgements. This research was supported by NSF grants IIS-9900115 and CCR-0113371, and by a grant from Microsoft Research. We also thank Rick Szeliski for providing us with some of the imagery used in section 5.
Appendix: Edges for Terms E i,j
In this section we show how to add edges for a term E i,j in equation (2), assuming that condition (3) holds. This is a special case of a more general construction given in [15].
– if E(1, 0) > E(0, 0) then we add an edge (s, vi) with the weight E(1, 0) − E(0, 0), otherwise we add an edge (vi, t) with the weight E(0, 0) − E(1, 0);
– if E(1, 0) > E(1, 1) then we add an edge (vj, t) with the weight E(1, 0) − E(1, 1), otherwise we add an edge (s, vj) with the weight E(1, 1) − E(1, 0);
– the last edge that we add is (vi, vj) with the weight E(0, 1) + E(1, 0) − E(0, 0) − E(1, 1).
We have omitted the indices i, j in E i,j for simplicity of notation. Of course it is not necessary to add edges with zero weights. Also, when several edges are added from one node to another, it is possible to replace them with one edge whose weight is equal to the sum of the weights of the individual edges.
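These three rules can be transcribed directly (a sketch of our own; the graph is represented simply as a list of weighted directed edges, and zero-weight edges are skipped as suggested above):

```python
def add_term_edges(edges, vi, vj, E):
    """Add the edges for a single term E^{i,j}, given as a 2x2 table E[x_i][x_j]
    that satisfies condition (3); edges is a list of (from, to, weight) triples."""
    def add(a, b, w):
        if w != 0:              # zero-weight edges need not be added
            edges.append((a, b, w))
    if E[1][0] > E[0][0]:
        add("s", vi, E[1][0] - E[0][0])
    else:
        add(vi, "t", E[0][0] - E[1][0])
    if E[1][0] > E[1][1]:
        add(vj, "t", E[1][0] - E[1][1])
    else:
        add("s", vj, E[1][1] - E[1][0])
    add(vi, vj, E[0][1] + E[1][0] - E[0][0] - E[1][1])

# Example usage with a term satisfying condition (3):
edges = []
add_term_edges(edges, "v1", "v2", [[0.0, 2.0], [1.0, 0.0]])
```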
Fig. 2. Results on Tsukuba dataset. (Panels: center image; ground truth; our results - 4 interactions; our results - 10 interactions; Boykov-Veksler-Zabih results [7]; comparison of our results with ground truth.)
Appendix: Minimizing Our Energy Function Is NP-Hard
It is shown in [7] that the following problem, referred to as Potts energy minimization, is NP-hard. We are given as input a set of pixels S with a neighborhood system N ⊂ S × S, a set of label values L, and a non-negative function C : S × L → R+. We seek the labeling f : S → L that minimizes
EP(f) = Σ_{p∈S} C(p, f(p)) + Σ_{{p,q}∈N} T(f(p) ≠ f(q)).   (6)
Fig. 3. Results on the flower garden sequence. (Panels: middle image; our results.)
Fig. 4. Results on the Dayton sequence. (Panels: middle image; our results.)
We now sketch a proof that an arbitrary instance of the Potts energy minimization problem can be encoded as a problem of minimizing the energy E defined in equation (1). This shows that the problem of minimizing E is also NP-hard.
We start with a Potts energy minimization problem consisting of S, L, N and C. We will create a new instance of our energy minimization problem as follows. There will be |L| + 1 images. The first one will be the original image S, and for each label l ∈ L there will be an image Pl. For each pixel p ∈ S there will be a pixel in Pl which we will denote as pl; thus, |Pl| = |S| for all labels l ∈ L. The set of labels L, the neighborhood system N and the smoothness function V{p,q} will be the same as in the Potts energy minimization problem:
V{p,q}(lp, lq) = T(lp ≠ lq),   {p, q} ∈ N.
The set of interactions I will include all pairs {⟨p, l⟩, ⟨p_l, l⟩}, where p ∈ S, l ∈ L. The corresponding cost will be D(p, p_l) = C(p, l) − K_p, where K_p is a constant which ensures that all costs D(p, p_l) are non-positive (for example, K_p = max_{l∈L} C(p, l)). There will be no visibility term (i.e. the set of
interactions I_vis will be empty). The instance of our minimization problem is now completely specified. Any labeling f : S → L can be extended to the labeling f̄ : P → L if we set f̄(p_l) = l for p_l ∈ P_l. It is easy to check that E(f̄) = E_P(f) − K, where K = Σ_{p∈S} K_p is a constant. Moreover, there is no labeling f̄′ : P → L with a smaller value of the energy E such that f̄′(p) = f̄(p) for all pixels p ∈ S. Thus, the global minimum f̄ of the energy E will yield the global minimum of the Potts energy E_P (we need to take the restriction of f̄ to S).
References 1. Ravindra K. Ahuja, Thomas L. Magnanti, and James B. Orlin. Network Flows: Theory, Algorithms, and Applications. Prentice Hall, 1993. 2. Stephen Barnard. Stochastic stereo matching over scale. International Journal of Computer Vision, 3(1):17–32, 1989. 3. S. Birchfield and C. Tomasi. Multiway cut for stereo and motion with slanted surfaces. In International Conference on Computer Vision, pages 489–495, 1999. 4. Stan Birchfield and Carlo Tomasi. A pixel dissimilarity measure that is insensitive to image sampling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(4):401–406, April 1998. 5. Yuri Boykov and Vladimir Kolmogorov. An experimental comparison of mincut/max-flow algorithms for energy minimization in computer vision. In International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, volume 2134 of LNCS, pages 359–374. Springer-Verlag, September 2001. 6. Yuri Boykov, Olga Veksler, and Ramin Zabih. Markov Random Fields with efficient approximations. In IEEE Conference on Computer Vision and Pattern Recognition, pages 648–655, 1998. 7. Yuri Boykov, Olga Veksler, and Ramin Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11):1222–1239, November 2001. 8. R. Cipolla and A. Blake. Surface shape from the deformation of apparent contours. International Journal of Computer Vision, 9(2):83–112, November 1992. 9. O.D. Faugeras and R. Keriven. Complete dense stereovision using level set methods. In European Conference on Computer Vision, 1998. 10. L. Ford and D. Fulkerson. Flows in Networks. Princeton University Press, 1962. 11. S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721–741, 1984. 12. H. Ishikawa and D. Geiger. Occlusions, discontinuities, and epipolar lines in stereo. In European Conference on Computer Vision, pages 232–248, 1998. 13. S.B. Kang, R. Szeliski, and J. Chai. Handling occlusions in dense multi-view stereo. In IEEE Conference on Computer Vision and Pattern Recognition, 2001. Expanded version available as MSR-TR-2001-80.
14. Vladimir Kolmogorov and Ramin Zabih. Visual correspondence with occlusions using graph cuts. In International Conference on Computer Vision, pages 508–515, 2001. 15. Vladimir Kolmogorov and Ramin Zabih. What energy functions can be minimized via graph cuts? In European Conference on Computer Vision, 2002. Also available as Cornell CS technical report CUCS-TR2001-1857. 16. K.N. Kutulakos and S.M. Seitz. A theory of shape by space carving. International Journal of Computer Vision, 38(3):197–216, July 2000. 17. A. Laurentini. The visual hull concept for silhouette-based image understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(2):150–162, February 1994. 18. W.N. Martin and J.K. Aggarwal. Volumetric descriptions of objects from multiple views. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5(2):150– 158, March 1983. 19. Tomaso Poggio, Vincent Torre, and Christof Koch. Computational vision and regularization theory. Nature, 317:314–319, 1985. 20. S. Roy. Stereo without epipolar lines: A maximum flow formulation. International Journal of Computer Vision, 1(2):1–15, 1999. 21. S. Roy and I. Cox. A maximum-flow formulation of the n-camera stereo correspondence problem. In International Conference on Computer Vision, 1998. 22. Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense twoframe stereo correspondence algorithms. Technical Report 81, Microsoft Research, 2001. To appear in IJCV. An earlier version appears in CVPR 2001 Workshop on Stereo Vision. 23. S.M. Seitz and C.R. Dyer. Photorealistic scene reconstruction by voxel coloring. International Journal of Computer Vision, 35(2):1–23, November 1999. 24. Dan Snow, Paul Viola, and Ramin Zabih. Exact voxel occupancy with graph cuts. In IEEE Conference on Computer Vision and Pattern Recognition, pages 345–352, 2000. 25. R. Szeliski. Rapid octree construction from image sequences. Computer Vision, Graphics and Image Processing, 58(1):23–32, July 1993. 26. R. Szeliski and P. Golland. Stereo matching with transparency and matting. In International Conference on Computer Vision, pages 517–523, 1998. 27. Richard Szeliski and Ramin Zabih. An experimental comparison of stereo algorithms. In B. Triggs, A. Zisserman, and R. Szeliski, editors, Vision Algorithms: Theory and Practice, number 1883 in LNCS, pages 1–19, Corfu, Greece, September 1999. Springer-Verlag.
A Markov Chain Monte Carlo Approach to Stereovision
Julien Sénégas 1,2
1 Centre de Géostatistique, Ecole des Mines de Paris, 35 rue Saint-Honoré, 77305 Fontainebleau Cedex, France
2 Istar, 2600 route des Crêtes, BP282, 06905 Sophia-Antipolis Cedex, France
[email protected]
Abstract. We propose Markov chain Monte Carlo sampling methods to address uncertainty estimation in disparity computation. We consider this problem at a postprocessing stage, i.e. once the disparity map has been computed, and suppose that the only information available is the stereoscopic pair. The method, which consists of sampling from the posterior distribution given the stereoscopic pair, allows the prediction of large errors which occur with low probability, and accounts for spatial correlations. The model we use is oriented towards an application to mid-resolution stereo systems, but we give insights on how it can be extended. Moreover, we propose a new sampling algorithm relying on Markov chain theory and the use of importance sampling to speed up the computation. The efficiency of the algorithm is demonstrated, and we illustrate our method with the computation of confidence intervals and probability maps of large errors, which may be applied to optimize a trajectory in a three dimensional environment. Keywords: stereoscopic vision, digital terrain models, disparity, uncertainty estimation, sampling algorithms, Bayesian computation, inverse problems
1
Introduction
In this paper, we address the problem of assessing the uncertainty of depth estimates computed from a stereo system. We exclude the possibility of using any external source of information on the observed scene, i.e. only the radiometric information provided by the stereoscopic pair will be used. The application we have in mind is the certification of digital terrain models for the aircraft industry, but the methods we propose here are quite general and can be applied to any stereo system with some adaptations of the model used. 1.1
Errors in Stereovision
A stereo system consists of two cameras observing the same scene from two different points. With this set-up, a left and right image are obtained. A physical
point M projects onto the image plane of each camera. These projections form a pair of homologous points, and their coordinates in each image plane can be expressed as a function of the stereo system parameters (mainly the baseline b and the focal length f) and the coordinates of M. The principle of depth estimation from stereovision relies on the inversion of this relation. An extensive introduction to three-dimensional computer vision and depth from stereo can be found in [10]. Many different techniques are used for depth computation, depending on the type of sensors and the nature of the scene. However, they generally consist of the following steps. First, the stereo system parameters are computed, using calibration methods (see [8] for an example). It is then possible to reproject the images in the epipolar geometry, in which two homologous points have the same vertical coordinate. The difference between their horizontal coordinates is called the disparity. Hence, the rectification step reduces the search for homologous points to a one-dimensional one (along the rows of the images). The stereo problem itself then amounts to the computation of the disparity map, which is done by matching pixels (or features) in the left image with the corresponding pixels in the right image [9]. Lastly, the computation of the real world coordinates, the triangulation, follows from the disparity map and the geometry of the stereo system. Uncertainty in depth estimates arises from errors or approximations in each step described above; however, it is important to distinguish between these errors. Calibration, rectification and triangulation rely on geometric relations verified by the stereo system. Even if these are only approximations of the true geometric constraints, the errors associated with them can be controlled and their range computed. For instance, quantization errors, which come from the knowledge of the image point coordinates at fixed positions only, with a precision equal to half the sampling interval δ [20], depend only on the stereo system parameters. The precision can possibly be improved by modifying the stereo system: for example, an increase of the product bf makes it possible to obtain more accurate depth estimates [20]. In contrast, matching errors are strongly dependent on the nature of the radiometric information, and cannot be entirely controlled by the design of the stereo system. The reason for this lies in the implicit matching assumption that pairs of homologous points can be identified. In many situations, violations of this assumption are encountered: the geometric deformations induced by the projections may be strong, the images may be corrupted by sensor noise, or the radiometric information itself may be ambiguous (existence of repetitive features, lack of texture in the images, ...). As a result, matching errors are likely to be dependent on the specific features in the images and to vary strongly over the disparity map. 1.2
A Stochastic Approach
A convenient framework to handle uncertainty in disparity computation is to consider the disparity D and the stereoscopic pair Y = (I1 , I2 ), as random
functions. This approach has been successfully applied in a Bayesian context to compute disparity estimates [1]. However, to our knowledge, the problem of uncertainty assessment has been only partly handled, mostly through the computation of standard deviation [23,16]. Standard deviation estimates give only a rough idea of the range of possible errors, and are not appropriate to describe large errors, which appear with very low probability and are spatially correlated. Therefore, specific methods have to be proposed. For example, if we consider an object in motion in a 3D environment, one must find the path to a given target which minimizes the risk of collisions. Such problems require the computation of the whole conditional probability distribution π_D(d|Y = y) of the disparity D given the stereoscopic pair Y. Note that once π_D(d|Y = y) is known, we can easily compute the corresponding probability distributions for the real world coordinates, using the geometric transformation of the stereo system. This motivates the choice of the disparity as the variable of interest. The main contribution of this paper is to propose a theoretically founded approach to the problem of uncertainty assessment in disparity computation. Especially, we show in Sect. 2 that this problem amounts to sampling from the posterior distribution π_D and using Monte Carlo integrations. We also specify the model π_D used for computation, but point out that it is highly dependent on the application. We give insight on how this model can be adapted to specific applications. By contrast, the sampling algorithms we propose in Sect. 3 are quite general and can be used for different forms of probability models π_D. The only restriction is a gaussian assumption for the prior model. Lastly, in Sect. 4, we illustrate our methods with an application to the study of a stereoscopic pair of SPOT images. In the sequel, we will assume that an estimate d̂ of the disparity has already been computed, for example with a correlation-based algorithm. This means that the algorithms we propose in the sequel are not used for disparity computation, but only for uncertainty assessment. One reason is that Monte Carlo algorithms are computationally cumbersome. Another reason is that we are interested in the uncertainty which is intrinsic to the radiometric information. Indeed, we consider the uncertainty estimation mostly at a postprocessing stage, and would like to propose a method that is largely independent of the disparity computation itself.
2
Stochastic Framework: Stereovision as an Inverse Problem
Let Y = (I1 , I2 ) be a stereoscopic pair in the epipolar geometry. We consider a physical point M , and its projections m1 with coordinates (u1 , v1 ) in I1 and m2 , with coordinates (u2 , v2 ) in I2 . The epipolar constraint imposes v1 = v2 and we can define the disparity in the reference system of I1 as the difference: d(u1 , v1 ) = u2 − u1
(1)
Therefore, a pair of homologous points is fully defined by the coordinates (u, v) of m1 in I1 and the disparity d(u, v). Note that a symmetric definition of disparity in the cyclopean image is possible, see [1]. Although we proceed with definition 1, the approach we propose is readily applicable to this alternative definition. This purely geometric presentation however corresponds to the direct problem, where the coordinates of the homologous points are known. In practice, these are to be found, and generally a loss function C_{u,v}(d), which expresses the similarity of the images in the neighborhood of I1(u, v) and I2(u + d, v), is used [9]. The estimate d̂ of the "true" disparity is then taken to be the value which maximizes C_{u,v}(d), possibly under some constraints, i.e.:

d̂(u, v) = arg max_d C_{u,v}(d)   (2)
Therefore, when computing the disparity, and whatever method is used, one must move from a purely geometric definition 1 to an optimization criterion 2. In other words, disparity computation in stereovision is an inverse problem, and therefore, the result is only an estimate of the underlying disparity. This estimate is sensitive to the amount of noise in the stereoscopic pair, but, even if a perfect signal is assumed, uncertainty remains, because of the ambiguity of the information, due to occlusions, repetitive features, lack of texture in the images, ... The Bayesian framework [3] is well adapted to inverse problems. Bayesian inference has been widely used in image analysis [2], and efficient algorithms have been proposed to solve the restoration problem, as in [13]. More recently, this formalism has been applied to the stereo-reconstitution problem [23,1]. Let π_D(d|Y = y) be the posterior (or conditional) probability of the disparity D given the stereoscopic pair Y = y. Using Bayes' relation, it can be expressed by means of the conditional distribution of Y given D = d, π_Y(y|D = d), and the marginal distributions π_D(d) and π_Y(y). Since here Y is known and remains equal to y, π_Y(y) is a constant, and we can introduce the likelihood L(y|D = d) = π_Y(y|D = d)/π_Y(y), which can be computed up to a multiplicative constant; for clarity, we denote by g(d) = π_D(d) the prior distribution. With these notations, the posterior distribution reads:

π_D(d|Y = y) = L(y|D = d)g(d)   (3)
In stereovision, this formalism has been mostly applied to compute an estimate of the disparity, through maximum a posteriori (MAP) estimation [1], where the maximization of π_D(d|Y = y) is used as criterion. However, here, we are interested in the computation of the whole posterior distribution π_D(d|Y = y). Especially, for any π_D-measurable function h, we would like to compute the integral:

E_{π_D}(h) = ∫ h(z) π_D(z|Y = y) dz   (4)
With an appropriate choice of h, every probability concerning D can be computed. For example, with h(d) = 1_{|d−d̂|≥s}, where d̂ is an estimate of the disparity,
we obtain the probability that the estimation error D − d̂ is greater than a threshold s:

P(|D − d̂| ≥ s) = ∫_{|z−d̂|≥s} π_D(z|Y = y) dz   (5)
In the following, we make the computation of 4 explicit and derive an expression for π_D. 2.1
MCMC Methods
In practice, the numerical computation of 4 is not possible since π_D is a function of n variables, where n is the size of the image. Another reason is that π_D is known only up to a multiplicative factor. A solution is provided by Monte Carlo integration, which evaluates the integral in 4 by drawing samples {z_i, i = 1, ..., m} from π_D and then approximating

E_{π_D}(h) ≈ (1/m) Σ_{i=1}^{m} h(z_i)   (6)
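As a simple illustration of the estimator in (6), combined with the indicator function of (5), the sketch below computes a per-pixel error probability map; it assumes that samples z_i from π_D are already available as arrays, and the sample values and threshold are made up for the example.

```python
import numpy as np

def error_probability(samples, d_hat, s):
    """Monte Carlo estimate of P(|D - d_hat| >= s), cf. equations (5)-(6).

    samples: array of shape (m, H, W), disparity samples drawn from the posterior
    d_hat:   array of shape (H, W), the disparity estimate
    s:       error threshold in pixels
    Returns an (H, W) map of estimated probabilities.
    """
    indicator = np.abs(samples - d_hat[None, :, :]) >= s  # h(z_i) at each pixel
    return indicator.mean(axis=0)                          # average over the m samples

# Made-up example: m = 2000 samples on a 4x4 grid.
rng = np.random.default_rng(0)
d_hat = np.zeros((4, 4))
samples = d_hat + rng.normal(scale=1.5, size=(2000, 4, 4))
print(error_probability(samples, d_hat, s=3.0))
```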
The laws of large numbers ensure that the approximation can be made as accurate as desired by increasing the sample size m. Monte Carlo methods were used in similar problems [11,7], but mostly to compute MAP estimates. We show in this paper that more information can be extracted from the posterior distribution. A method to draw samples from π_D consists of building a Markov chain having π_D as its stationary distribution. In practice, the problem amounts to determining a transition probability kernel which ensures the convergence to π_D. This is the principle of the well-known Metropolis-Hastings algorithm [15] or the popular Gibbs sampler [13]. For a detailed introduction to Markov chain theory, the interested reader is referred to [18] and [24]. Practical applications of Markov chain Monte Carlo can be found in [14]. As a conclusion, assessment of disparity uncertainty and disparity estimation are two very different problems, and the former requires the development of specific methods. Uncertainty can be assessed under a Bayesian framework by drawing samples from the posterior distribution of the disparity given the stereoscopic pair. Two points remain to be solved: the specification of the model, which is treated in Sect. 2.2, and the choice of a sampling algorithm, which is the subject of Sect. 3. 2.2
Model Derivation
We need to specify the likelihood term L(y|D = d), which describes the relation between the images of the pair at fixed disparity, and the prior model g, which reflects the spatial structure of the disparity. Let us point out that the choice of the probability model πD is specific to the application, i.e. to the type of images recorded and the nature of the scene observed. In our case, we use
stereoscopic pairs of SPOT images for the computation of mid-resolution digital terrain models (with a ground resolution approximately equal to 10 m). We first derive an expression for the image model (likelihood) and then for the disparity model (prior). A common modeling for the likelihood [1] consists of assuming that the recorded intensity I for a specific feature is the same for both images I1 and I2 and that sensor noise is additive. Therefore, for fixed disparity d, the relation between I1 and I2 reads:
(7)
where η(u, v) can be interpreted as the residual between the intensities after a correction on the coordinates with the disparity. The choice of the image model amounts hence to the modeling of η, and especially its spatial correlation. If we assume the residual η to be uncorrelated gaussian noise with mean µ and standard deviation κ, as in [4,6,1], we end up with the following expression of the likelihood:

L(y|D = d) ∝ exp( − Σ_{i,j} (η_{ij} − µ)² / (2κ²) )   (8)

The assumption of uncorrelated noise η may be questioned. Especially, one may argue that if the deformations induced by the stereo system are relatively important, the residual η will show strong patterns. Therefore, this point is very dependent on the setting of the stereo system and the nature of the scene. For satellite images at mid-resolution, these deformations may be neglected in a first approach, and the assumption of uncorrelated gaussian noise is reasonable. However, this is definitely a challenging aspect of the problem. Note that, as we have an estimate d̂ of the disparity, we can compute from I1 and I2 an estimate of the residual η̂(u, v) = I1(u, v) − I2(u + d̂(u, v), v). Hence, the model parameters µ and κ can be estimated by maximum likelihood. For the disparity model, we adopt a gaussian random function framework, which is relatively common [23,4,1]. Under the gaussian assumption, the prior distribution is entirely specified by the mean m and the spatial covariance C. These parameters can be estimated from the disparity map d̂, for example under an assumption of spatial stationarity [5]. Finally, the prior distribution g takes the following form:

g(d) ∝ exp( −(1/2) (d − m)ᵀ C⁻¹ (d − m) )   (9)

Here again, let us point out that the choice of the prior model is dependent on the application. Especially, the behavior of the covariance at the origin will reflect the smoothness of the disparity: for example, a model with a linear behavior (such as Brownian motion [1]) is liable to reproduce discontinuities. Our experience is that covariances which are differentiable at the origin (such as the
cubic covariance) are well adapted in the case of mid-resolution stereo systems. However, for high-resolution applications, such as urban or indoor scenes, other models may be chosen. For example, in [1], Belhumeur proposes to model directly the discontinuities at the edges of objects through Poisson processes. More generally, one may argue that for such applications, stochastic geometric models, which directly take into account the form of objects, may be better adapted. Such models were applied to building detection [12] and road extraction [22]. Let us point out that, although we made the gaussian assumption for the prior distribution g, the posterior distribution is not necessarily gaussian. The nature of the posterior distribution depends largely on the likelihood term L, which is not a quadratic form of d. For example, the presence of repetitive features in the stereoscopic pair would result in a multimodal posterior distribution.
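As a concrete reading of equations (7)-(9), the sketch below computes the residual η̂, fits µ and κ by maximum likelihood (for i.i.d. Gaussian noise these are simply the sample mean and standard deviation), and evaluates the corresponding log-likelihood and Gaussian log-prior. It is only an illustration under the stated assumptions; the arrays (images, disparity estimate, inverse covariance) are placeholders supplied by the caller, and the border handling and rounding of the disparity are choices made here, not taken from the paper.

```python
import numpy as np

def residual(I1, I2, d_hat):
    """eta_hat(u, v) = I1(u, v) - I2(u + d_hat(u, v), v), cf. equation (7)."""
    H, W = I1.shape
    u = np.arange(W)[None, :] + np.rint(d_hat).astype(int)
    u = np.clip(u, 0, W - 1)                     # crude handling of image borders
    v = np.arange(H)[:, None].repeat(W, axis=1)
    return I1 - I2[v, u]

def fit_noise_model(eta):
    """Maximum-likelihood estimates of mu and kappa for i.i.d. Gaussian noise."""
    return eta.mean(), eta.std()

def log_likelihood(eta, mu, kappa):
    """Log of equation (8), up to an additive constant."""
    return -np.sum((eta - mu) ** 2) / (2 * kappa ** 2)

def log_prior(d, m, C_inv):
    """Log of the Gaussian prior (9); d and m are flattened, C_inv is the inverse covariance."""
    r = (d - m).ravel()
    return -0.5 * r @ C_inv @ r
```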
3
Sampling Algorithm
In this section we present the details of the sampling algorithm. We recall that the probability distribution to be sampled from is of the form: π(z|Y = y) = L(y|Z = z)g(z)
(10)
where L(y|Z = z) is the likelihood function, which may be known only up to a multiplicative factor, and g(z) is the prior distribution. 3.1
Markov Chain Sampling Algorithms
As stated previously, iterative Markov chain algorithms have to be used. In our case, the difficulty lies in the form of the likelihood term L(Y|D = d), which cannot be approximated efficiently and is expected to vary strongly with the coordinates (u, v). The Gibbs sampler, which in combination with Markov random field models is very popular in image analysis, relies on the use of conditional densities of the form π(z_S|z_{−S}), where S is a subspace of the grid to be simulated and −S its complementary part [3]. Unfortunately, because of the expression of the distribution π_D, it is not possible to sample directly from the conditional distributions. Other alternatives are Metropolis-Hastings algorithms [14] and variants such as discretized Langevin diffusions [19]. The principle of the Metropolis-Hastings algorithm is to generate a transition from the current state z to a state z′ from a proposal transition kernel q(z, z′), and to accept this transition with probability a(z, z′) = min{1, π(z′)q(z′, z) / (π(z)q(z, z′))}. The efficiency of this algorithm relies mainly on the proposal transition kernel q, but when the target distribution π cannot be easily approximated, optimal choices of q are not possible, and very often an ad hoc transition kernel is used, as in the independence sampler or the random walk [14].
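For reference, a generic Metropolis-Hastings step with the acceptance probability above could be sketched as follows; the target and proposal densities are placeholders supplied by the caller, and this is not the sampler used in the paper (that one is introduced in Sect. 3.2).

```python
import numpy as np

def metropolis_hastings_step(z, log_target, propose, log_q, rng):
    """One Metropolis-Hastings transition.

    log_target(z): log pi(z | Y = y), known up to an additive constant
    propose(z):    draws a candidate z' from q(z, .)
    log_q(a, b):   log q(a, b), the proposal density of moving from a to b
    """
    z_new = propose(z)
    log_a = (log_target(z_new) + log_q(z_new, z)) - (log_target(z) + log_q(z, z_new))
    if np.log(rng.uniform()) < min(0.0, log_a):
        return z_new          # transition accepted
    return z                  # transition rejected, stay at z
```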
3.2
A New Sampling Algorithm
These difficulties have led us to propose a new sampling algorithm which makes use of the target distribution π to generate the transitions. Let us assume that the prior distribution g(z) now denotes a gaussian distribution. The algorithm is based on the following remark. Let W1 and W2 be two independent gaussian processes with zero mean and covariance C (which we denote W1, W2 ∼ g). Then, for any θ ∈ [0, 2π[, the linear combination W(θ) = W1 cos θ + W2 sin θ is also a gaussian process with zero mean and covariance C. We propose to use this relation, together with an appropriate choice of θ, to build the transitions of a Markov chain Z = (Z_k, k ≥ 0). The algorithm is the following:
1. sample w ∼ g,
2. compute the proposal transitions

z_i = z_k cos(i 2π/m) + w sin(i 2π/m),   ∀i = 0, ..., m − 1   (11)

and the corresponding transition probabilities

a(z, z_i) = L(y|Z = z_i) / Σ_{j=0}^{m−1} L(y|Z = z_j),   ∀i = 0, ..., m − 1   (12)

3. choose z_i ∼ a(z, z_i) and set z_{k+1} = z_i.
We prove in [21] that the Markov chain built according to this algorithm has π as its stationary distribution. A comparison study of this algorithm with the independence sampler and the Langevin diffusion can be found in [21]. The interesting feature of this algorithm is that at each iteration a whole path of possible transitions is proposed, as opposed to a single one in the Metropolis-Hastings algorithm, and this is done at a low computational cost since it only requires the computation of linear combinations of the form 11. Therefore, transitions are likely to occur more often, which improves mixing within the chain and speeds up the convergence. An illustration of this feature is displayed in Fig. 1.
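A compact sketch of one iteration of this sampler, as we read steps 1-3 and equations (11)-(12), is given below. It is illustrative rather than the author's code: the prior sampler and the likelihood are supplied by the caller, and in practice one would work with log-likelihoods for numerical stability.

```python
import numpy as np

def tcp_step(z_k, sample_prior, likelihood, m, rng):
    """One transition of the sampler along a continuous path (Sect. 3.2).

    z_k:          current state (numpy array), assumed centered under the prior
    sample_prior: function returning an independent sample w ~ g
    likelihood:   function z -> L(y | Z = z), up to a constant factor
    m:            number of proposals along the path
    """
    w = sample_prior()                                                # step 1
    thetas = 2 * np.pi * np.arange(m) / m
    proposals = [z_k * np.cos(t) + w * np.sin(t) for t in thetas]     # equation (11)
    weights = np.array([likelihood(z) for z in proposals])
    probs = weights / weights.sum()                                   # equation (12)
    i = rng.choice(m, p=probs)                                        # step 3
    return proposals[i]
```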
3.3
Size Reduction and Importance Sampling
For large grids, the computational burden of Markov chain algorithms can become very heavy. Gaussian samples can be generated very quickly (with algorithms such as those relying on the Fast Fourier Transform [5]), but the transition rate of the Markov chain tends to decrease drastically with the grid size, so that the sampling algorithm must be run longer to achieve convergence. To cope with this difficulty, we propose to split the sampling into two consecutive phases. Consider a subgrid S of the total sampling grid G and its complementary −S in
Fig. 1. Path of possible transitions generated from the current point and an independent simulation (black dots). The target distribution, displayed in the background with level lines, is a bigaussian distribution. The circles along the path, which each correspond to a different value of θ, are proportional to the transition probability a(z, z_i).
G. z_S denotes the restriction of the vector z to S. The posterior distribution π can be decomposed as follows:

π(z|Y = y) = [ L(y|Z = z) / L(y|Z_S = z_S) ] g(z_{−S}|Z_S = z_S) L(y|Z_S = z_S) g(z_S)   (13)

The term L(y|Z_S = z_S)g(z_S) is the posterior distribution π(z_S|Y = y) on S, whereas g(z_{−S}|Z_S = z_S) is the conditional distribution of the gaussian vector z_{−S} given z_S (this is the conditional distribution which is used in the Gibbs sampler). Sampling from this conditional distribution can be done directly and very efficiently using an algorithm such as conditioning by kriging [5]. The first term, w = L(y|Z = z) / L(y|Z_S = z_S), is a correction factor which shall be used to weight the samples in the Monte Carlo integration. We end up with the following algorithm:
1. run the Markov chain algorithm on the subgrid S,
2. for each generated sample z_S^i, generate z_{−S}^i using the conditional gaussian distribution g(z_{−S}|Z_S = z_S), and construct z_i = (z_S^i, z_{−S}^i), the sample on the total grid G,
3. compute the weight w_i = L(y|Z = z_i) / L(y|Z_S = z_S^i),
4. compute the integral

E_π(h) ≈ ( Σ_{i=1}^{m} w_i h(z_i) ) / ( Σ_{i=1}^{m} w_i )   (14)
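The weighted estimate (14) itself is a one-liner once the samples and weights are available; the following sketch assumes they are given as arrays and is purely illustrative.

```python
import numpy as np

def weighted_estimate(samples, weights, h):
    """Self-normalized importance sampling estimate of E_pi[h], cf. equation (14).

    samples: sequence of full-grid samples z_i
    weights: array of correction factors w_i = L(y|Z=z_i) / L(y|Z_S=z_S^i)
    h:       function of a sample (e.g. an error indicator)
    """
    weights = np.asarray(weights, dtype=float)
    values = np.array([h(z) for z in samples])
    return np.sum(weights * values) / np.sum(weights)
```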
The last step is an application of importance sampling theory [14]. One may run the risk that all weights w_i but a few become equal to 0, so that only a few samples carry the total weight. This could happen if the distribution used for sampling, π′(z|Y = y) = π(z_S|Y = y) g(z_{−S}|Z_S = z_S), and the target distribution
π(z|Y = y) are very different. In our case, this should be avoided due to the strong spatial correlations exhibited by the disparity D. The idea is that if we know the disparity on a subgrid S, then we are able to compute accurate estimates of D on the total grid G, and correct fluctuations can be reproduced by sampling from the conditional distribution g(z_{−S}|Z_S = z_S). An overview of the sampling algorithm is shown on Fig. 2.
Fig. 2. Overview of the sampling algorithm. First, the Markov chain sampling algorithm is run on a sub-grid, conditional to the stereoscopic pair. Then, for each generated sample, the whole grid is generated. Lastly, the Monte Carlo integration is performed.
4
Experimental Results
We have applied the model of Sect. 2.2 and the sampling algorithms introduced in Sect. 3 to the study of a stereoscopic pair of rectified SPOT images of the Marseille area, of size 512 × 512 pixels (Fig. 3). A disparity map (Fig. 3) has been computed using a correlation-based algorithm. 2000 samples have been generated. Using the importance sampling algorithm, the size of the sampling grid has been reduced to 64 × 64, thus yielding a size reduction factor equal to 64. To illustrate the efficiency of the sampling algorithms, we ran a Metropolis-Hastings algorithm (denoted as MH) on the total grid, then the sampling algorithm of Sect. 3 (denoted TCP for transitions along continuous paths), first on the total grid, and then on the reduced grid with the importance sampling method (denoted further TCPIS). We compared the computation time required to generate 100 samples with at least 10 state transitions between two samples, assuming that these conditions ensure similar convergence properties. Computations were performed on a SUN Ultra 60 computer with a 400 MHz ULTRA SPARC II processor. The results are shown in Tab. 1. The TCP algorithm reduces the computation time by a factor of 10 in comparison to the independence Metropolis-Hastings sampler. The size reduction coupled with
Fig. 3. One rectified image of the stereoscopic pair (Marseille area, France) and the corresponding disparity (right).

Table 1. Comparison of computation time for MH, TCP and TCPIS.

                   MH       TCP     TCPIS
 CPU time (sec.)   642571   55454   1326
the importance sampling algorithm (TCPIS) reduces the computation time by a further factor of 40. The overall computation time is still relatively large (a few hours for our case study), but the TCPIS algorithm allows a tremendous gain in comparison with crude sampling algorithms. The application we aim at is the detection of possible large errors; that is, given an error threshold s, which are the pixels for which disparity errors larger than s are likely to occur? This is typically the type of question one may face in applications in the aircraft industry. This problem can be formulated as the computation of the probability P(D − d̂ ≥ s) and, for a given risk α, the location of the pixels (u, v) for which P(D(u, v) − d̂(u, v) ≥ s) ≤ α, which correspond to the safe area. Figure 4 displays the probability maps obtained for s = 2 and s = 3. The probability of errors varies strongly with the location (u, v), which is why usual methods estimating error ranges over a large domain are not appropriate for such applications. It is then possible to threshold these maps at a given risk value, say 0.1%, to identify the safe areas. In real applications however, it is not realistic to consider that such information is required for every pixel. For example, if one thinks of an object in motion over a domain, then one must guarantee that no collision occurs along
Fig. 4. Probability map of positive errors larger than 2 (left) and 3 (right) pixels. One scale unit on the vertical axis equals 10%.
its trajectory. Hence, we need to consider a whole domain, and compute the probability that no error greater than a given threshold s occurs in this domain, with a probability 1 − α. Since errors are spatially correlated, this is not equal to the product of the corresponding probabilities over the domain. We point out that this type of computation definitely requires the knowledge of the entire multivariate distribution π_D and not only the marginal distribution of a given pixel. For this purpose, we divided the disparity map into 64 sub-domains of equal surface, and computed the probability that errors larger than 3 (respectively 4) pixels occur anywhere in the sub-domain (Fig. 5). Thus, as previously, it is possible to determine the safe areas as the sub-domains for which this probability is lower than α = 0.1%. The results obtained are rather pessimistic: the 3 pixel threshold map has only 2 safe domains; this contrasts with the situation we would have expected from the right map displayed on Fig. 4. This surprising result emphasizes the importance of spatial correlations and justifies the use of Monte Carlo methods, although they are computationally demanding. We consider now the opposite situation, which is often encountered in practice: a risk value α is given, and one needs to compute a confidence interval [Z_inf, Z_sup] for the variable Z such that P(Z ∈ [Z_inf, Z_sup]) ≥ 1 − α. An example of the use of confidence intervals in a different context of computer vision can be found in [17]. To derive a confidence interval, we propose to make use of the min and max of the disparity samples. Indeed, for n + 1 independent identically distributed random variables Z_i, we have the following result:

P( Z_{n+1} ∈ [ min_{i≤n} Z_i, max_{i≤n} Z_i ] ) = (n − 1)/(n + 1)   (15)
Fig. 5. Probability map of positive errors larger than 3 (left) and 4 (right) pixels in a subdomain. One scale unit on the vertical axis equals 10%.
Therefore, for any risk value α, we can choose the smallest integer n such that (n − 1)/(n + 1) ≥ 1 − α and compute for each pixel the min and the max of the n samples. The confidence interval then has the form of two embedding surfaces. In order to validate our results, we have used a reference disparity map obtained from high resolution images, which has been degraded to the SPOT resolution. Using this reference map, it is possible to compute the percentage of points lying in the confidence interval computed previously and to compare this value to the theoretical value given by 15. The risk value was set equal to 0.1%, and therefore we used n = 2000 samples to construct the confidence interval. From the reference map, we found that 0.11% of the total points fell outside the confidence interval, which is in very good agreement with the theoretical value 0.1%. Moreover, the mean size over the grid of the computed disparity interval is 4.4 pixels: although the risk value is very small, the precision given by the confidence interval is reasonable. Therefore, the method proposed in this paper makes it possible to compute accurate statistics.
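A small sketch of this construction, under the same assumptions: choose the smallest n with (n − 1)/(n + 1) ≥ 1 − α, then take the per-pixel min and max over n posterior samples. The sample array is a placeholder.

```python
import math
import numpy as np

def confidence_surfaces(samples, alpha):
    """Per-pixel confidence interval from posterior samples, cf. equation (15).

    samples: array of shape (N, H, W) of disparity samples (N must be >= n)
    alpha:   risk value, e.g. 0.001
    Returns (lower, upper) surfaces such that a new draw lies between them
    with probability at least 1 - alpha.
    """
    n = math.ceil((2 - alpha) / alpha)   # smallest n with (n - 1) / (n + 1) >= 1 - alpha
    if samples.shape[0] < n:
        raise ValueError(f"need at least {n} samples for alpha={alpha}")
    z = samples[:n]
    return z.min(axis=0), z.max(axis=0)
```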
5
Conclusion
In this paper, we have considered the problem of assessing the uncertainty of a disparity map with only the use of the stereoscopic pair as information. Since the usual standard deviation approach cannot handle the case of large errors, which occur with very low probability and are spatially correlated, specific methods have to be proposed. A solution consists of computing the posterior distribution of the disparity given the stereoscopic pair. Hence, uncertainty assessment amounts to sampling from this posterior distribution and computing the desired probabilities using a Monte Carlo approach. From a
practical point of view, a model for the posterior distribution must be chosen, and an efficient sampling algorithm must be used to keep the computation time reasonable. This method is very general and can be applied to different types of stereovision systems, but the choice of the stochastic model is specific to the application. The model presented in this paper considers an application to the computation of mid-resolution digital terrain models, for which the disparity map can be assumed to vary relatively smoothly and the gaussian assumption is reasonable. For more complex applications, models which take explicitly into account the geometry of the scene may be more appropriate. An important problem is the choice of the sampling algorithm. The algorithm we propose considers the case of gaussian priors, which is quite general. It is based on the construction of a continuous path of possible transitions, and compared to standard sampling algorithms such as the independence sampler it proves very competitive. Moreover, we have shown that the computation can be sped up tremendously by first running the Markov chain on a subgrid and then generating the samples on the whole grid conditionally. This Markov chain Monte Carlo approach has been applied to the study of a stereoscopic pair of SPOT images. The results show that the disparity statistics are highly non-stationary. This is due to the nature of the radiometric information itself. These statistics can be used to determine the safe areas, for which large errors occur with probability lower than a given risk value, or to compute precise confidence intervals. The range of applications is however broader: for example, the statistics may be used to optimize a functional over the domain under study. We think especially of the motion of an object in a three-dimensional environment, for which the path to a given target has to be optimized and collisions avoided.
References [1] P.N. Belhumeur. A Bayesian approach to binocular stereopsis. International Journal of Computer Vision, 19(3):237–262, 1996. [2] J. Besag. On the statistical analysis of dirty pictures (with discussion). J. Roy. Statist. Soc. Ser. B, 48:259–302, 1986. [3] J. Besag, P. Green, D. Higdon, and K. Mengersen. Bayesian computation and stochastic systems. Statistical Science, 10(1):3–66, 1995. [4] C. Chang and S. Chatterjee. Multiresolution stereo - A Bayesian approach. IEEE International Conference on Pattern Recognition, 1:908–912, 1990. [5] J.P. Chil`es and P. Delfiner. Geostatistics: Modeling Spatial Uncertainty. Wiley and Sons, 1999. [6] I.J. Cox. A maximum likelihood n-camera stereo algorithm. International Conference on Computer Vision and Pattern Recognition, pages 733–739, 1994. [7] F. Dellaert, S. Seitz, S. Thrun, and C. Thorpe. Feature correspondance: A Markov chain Monte Carlo approach. In Advances in Neural Information Processing Systems 13, pages 852–858, 2000.
[8] F. Devernay and O. Faugeras. Automatic calibration and removal of distorsions from scenes of structured environments. In Leonid I. Rudin and Simon K. Bramble, editors, Investigate and Trial Image Processing, Proc. SPIE, volume 2567. SPIE, San Diego, CA, 1995. [9] U.R. Dhond and J.K. Aggarwal. Structure from stereo - A review. IEEE Transactions on systems, man and cybernetics, 19(6):1489–1510, 1989. [10] O. Faugeras. Three-Dimensional Computer Vision. Massachussets Institute of Technology, 1993. [11] D. Forsyth, S. Ioffe, and J. Haddon. Bayesian structure from motion. In International Conference on Computer Vision, volume 1, pages 660–665, 1999. [12] L. Garcin, X. Descombes, H. Le Men, and J. Zerubia. Building detection by markov object processes. In Proceedings of ICIP’01, Thessalonik, Greece, 2001. [13] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741, 1984. [14] W.R. Gilks, S. Richardson, and D.J. Spiegelhalter. Markov Chain Monte Carlo in Practice. Chapman and Hall, 1996. [15] W.K. Hastings. Monte carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109, 1970. [16] T. Kanade and M. Okutomi. A stereo matching algorithm with an adaptive window: Theory and experiment. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(9), september 1994. [17] R. Mandelbaum, G. Kamberova, and M. Mintz. Stereo depth estimation: a confidence interval approach. In International Conference on Computer Vision, pages 503–509, 1998. [18] S.P. Meyn and R.L. Tweedie. Markov Chains and Stochastic Stability. SpringerVerlag London, 1993. [19] G.O. Roberts and J.S. Rosenthal. Optimal scaling of discrete approximations to Langevin diffusions. J. Roy. Statist. Soc. Ser. B, 60:255–268, 1998. [20] J.J. Rodriguez and J.K. Aggarwal. Stochastic analysis of stereo quantization error. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(5):467–470, 1990. [21] J. S´en´egas, M. Schmitt, and P. Nonin. Conditional simulations applied to uncertainty assessment in DTMs. In G. Foody and P. Atkinson, editors, Uncertainty in Remote Sensing and GIS. Wiley and Sons, 2002. to appear. [22] R. Stoica, X. Descombes, and J. Zerubia. Road extraction in remote sensed images using a stochastic geometry framework. In Proceedings of MaxEnt’00, Gif Sur Yvette, July 8-13, 2000. [23] R. Szeliski. Bayesian Modeling of Uncertainty in Low-level Vision. Kluwer Academic Publishers, 1989. [24] R.L. Tweedie. Markov chains: Structure and applications. In B. N. Shanbhag and C.R. Rao, editors, Handbook of Statistics 19, pages 817–851. Elsevier Amsterdam, 1998.
A Probabilistic Theory of Occupancy and Emptiness
Rahul Bhotika 1, David J. Fleet 2, and Kiriakos N. Kutulakos 3
1 Computer Science Department, University of Rochester, Rochester, NY 14627, USA
[email protected]
2 Palo Alto Research Center, Palo Alto, CA 94304, USA
[email protected]
3 Computer Science Department, University of Toronto, Toronto, ON M5S 3G4, Canada
[email protected]
Abstract. This paper studies the inference of 3D shape from a set of n noisy photos. We derive a probabilistic framework to specify what one can infer about 3D shape for arbitrarily-shaped, Lambertian scenes and arbitrary viewpoint configurations. Based on formal definitions of visibility, occupancy, emptiness, and photo-consistency, the theoretical development yields a formulation of the Photo Hull Distribution, the tightest probabilistic bound on the scene’s true shape that can be inferred from the photos. We show how to (1) express this distribution in terms of image measurements, (2) represent it compactly by assigning an occupancy probability to each point in space, and (3) design a stochastic reconstruction algorithm that draws fair samples (i.e., 3D photo hulls) from it. We also present experimental results for complex 3D scenes.
1
Introduction
Many applications, from robotics [1] to computer graphics [2] and virtual reality, rely on probabilistic representations of occupancy to express uncertainty about the physical layout of the 3D world. While occupancy has an intuitive and well-defined physical interpretation for every 3D point (i.e., a point is occupied when the scene's volume contains it), little is known about what probabilistic information regarding occupancy is theoretically computable from a set of n noisy images. In particular, what information about the occupancy and emptiness of space is intrinsic to a set of n stochastically-generated photos taken at arbitrary viewpoints? Importantly, how can we answer this question for general scenes whose shape is arbitrary and unknown? This paper outlines a general probabilistic theory of occupancy and emptiness for Lambertian scenes that is intended to handle arbitrary 3D shapes, arbitrary viewpoint configurations and noisy images. Based on formal definitions of several key concepts, namely, visibility, occupancy, emptiness and photo-consistency, our theoretical development yields a formulation of the Photo Hull Distribution. The Photo Hull Distribution is the tightest probabilistic bound on the scene's true shape that can be inferred from the n input photos. Importantly, we show that the theory leads directly to a stochastic algorithm that (1) draws fair samples from the Photo Hull Distribution and (2) converges
Fig. 1. (a) Noiseless images of four Lambertian points with identical albedos afford seven consistent scene interpretations: {2,3}, {1,4}, {1,2,3}, {1,2,4}, {1,3,4}, {2,3,4}, and {1,2,3,4}. Maximizing image consistency in the presence of even small amounts of noise will cause one of these interpretations to be arbitrarily preferred. The resulting shape (e.g., {2,3}), however, cannot conclusively rule out the possibility that other points may also be occupied (e.g., {1}). (b) The probabilistic event "5 is visible from L" implies that points 6 and 7 are in empty space, thereby inducing a cycle of probabilistic dependencies (arrows) that make the events "4 is visible from L" and "5 is visible from L" dependent on each other. Such cycles characterize the visibility and occupancy dependencies between every pair of points in ℝ³ within the cameras' view.
to an optimal conservative estimate of the occupancy probability, i.e., to the lowest upper bound on that probability that is theoretically computable from the photos. Experimental results illustrating the applicability of our framework are given for real scenes. Our work is motivated by four objectives that we believe are key to designing a general and useful probabilistic theory of occupancy and emptiness:
– Understanding stereo ambiguities: Since stereo reconstruction can be ambiguous even in the noiseless case, a clear understanding of the mathematical relationship between shape ambiguity and shape uncertainty, and their effect on occupancy computation, becomes crucial in the design of stable algorithms (Figure 1a).
– Probabilistic treatment of visibility: Unconstrained scenes induce arbitrarily complex visibility relationships between cameras and surface points, and arbitrarily long probabilistic dependency cycles between any two 3D points (Figure 1b). One must provide a physically-valid and self-consistent probabilistic interpretation of occupancy (which refers to individual points) and visibility (which depends on a scene's global shape).
– Algorithm-independent analysis of occupancy: Without a clear understanding of what is theoretically computable from a set of input images, it is difficult to assess the performance of individual algorithms and the effect of heuristics or approximations they employ.
– Handling sensor and model errors: Since computational models are inherently limited in their ability to describe the physical world and the true image formation process, it is important to account for such errors.
A key aspect of our analysis is the explicit distinction we make between the notions of ambiguity and uncertainty. Intuitively, we interpret shape ambiguity as the multiplicity of solutions to the n-view reconstruction problem given noiseless images. Shape
uncertainty, on the other hand, exists in the presence of noise and modeling errors and is modeled as a probability distribution that assigns a probability to every 3D shape. While ambiguities have long been recognized as an important problem in stereo (e.g., [3]), the question of how these ambiguities affect the estimation of occupancy and emptiness is not well understood. Existing stereo algorithms make one or more assumptions that allow one to estimate a unique shape. Most methods compute the shape that maximizes some form of image consistency likelihood (i.e., that a reconstructed shape is consistent with the image measurements), assuming constraints or prior beliefs about shape and image disparities. Examples include the use of SSD-based likelihoods [4], window-based correlation [5], ordering constraints on correspondence [6,7], shape priors [8], as well as MRF [9,10] and parametric models [11,12]. But these approaches do not estimate the probability that a region is empty or occupied. This is because image consistency is not equivalent to occupancy, nor is it the opposite of emptiness (Figure 1a). In general, ascertaining the emptiness of a region requires that one consider the likelihood that no consistent scene interpretation intersects that region, a problem for which no formal analysis currently exists. Even though recent formulations of the n-view stereo problem have demonstrated an ability to resolve global visibility relationships among views [13,14,15,16,17,18,19], most of these ignore the existence of noise and the resulting model uncertainty. Two notable exceptions are the Roxel Algorithm [20] and the Probabilistic Space Carving Algorithm [21], which attempt to capture shape uncertainty by assigning a “soft” occupancy value between zero and one to every voxel in space. Working under the assumption that an accurate treatment of long-range probabilistic dependencies is intractable, these approaches rely on heuristic approximations to assess a voxel’s visibility. For instance, the Roxel Algorithm assesses a voxel’s visibility by assuming that voxels in front of it are “semi-transparent,” an assumption that is not consistent with image formation models for opaque scenes. Analogously, the Probabilistic Space Carving Algorithm assumes that visibilities of distant voxels are independent from each other and that visibility information can be propagated locally from voxel to voxel without regard to any specific global 3D shape. Neither algorithm takes into account the global nature of visibility, leading to occupancy calculations that do not afford a probabilistic interpretation in a strict mathematical sense. Similar problems would arise in any method that relies on a local model for propagating probabilities since long-distance dependency cycles, like those depicted in Figure 1b, must be taken into account. To our knowledge, no attempts have been made to define the notions of “occupancy probability” and “emptiness probability” in a mathematically rigorous manner when noisy images are given as input. Instead, existing approaches define their goal procedurally, by giving an algorithm that assigns soft occupancies to 3D points. The question of what probabilistic information is theoretically computable from a set of n noisy images, without regard to specific algorithms or implementations, is therefore still open. 
In practice, this makes it difficult to characterize the behavior of individual algorithms, to define precise notions of optimality and to compare results from different algorithms. In this paper, we argue that a probabilistic treatment of occupancy and emptiness that accounts for stereo ambiguities, shape uncertainty and all long-range probabilistic dependencies caused by visibility can be formulated. Moreover, it leads to a fairly
straightforward reconstruction algorithm. At the heart of our approach lies the realization that we can overcome technical difficulties via a probabilistic modeling of shape space, i.e., the space of all 3D reconstructions. Using the photo-consistency theory in [19] as a starting point, we show that shape-space modeling allows one to derive a complete theory of occupancy and emptiness. This theory arises from a small set of “axioms” (a generative image formation model and a radiance independence assumption) and does not rely on heuristic assumptions about scene shape or the input photos.
2
Shape-Space Modeling: Photo Hulls, Occupancy, and Emptiness
In the formulation of shape ambiguity and uncertainty that follows, we represent shapes using a discrete volumetric approximation and we further define a bounding volume V ⊆ ℝ³ that is assumed to contain the scene. The shape space of V is defined simply as its powerset, 2^V. With these definitions, we use the photo-consistency theory of Kutulakos and Seitz [19] to model shape ambiguities and to introduce the concept of the Photo Hull Distribution to model shape uncertainty.
2.1 The Photo Hull
Photo-consistency is based on the idea that a valid solution to the reconstruction problem should reproduce the scene's input photographs exactly when no image noise is present. When one or more such photographs are available, each taken from a known viewpoint, they define an equivalence class of solutions called photo-consistent shapes. For Lambertian shapes, a voxel v ∈ S is photo-consistent if (1) v is not visible to any camera, or (2) we can assign a radiance to v that is equal to the color at v's projection in all photos where v is visible. Accordingly, a shape S ∈ 2^V is photo-consistent if all its voxels are photo-consistent. The equivalence class of photo-consistent shapes captures all the ambiguities present in a set of n noiseless images and is closed under set union (Figure 2a):
Theorem 1 (Photo Hull Theorem [19]) For every volume V that contains the true scene and
every noiseless image set I, there exists a unique shape called the Photo Hull, H(V, I), that is the union of all shapes in 2V that are photo-consistent with I. This shape is a superset of the true scene and is itself photo-consistent.
The photo hull is the tightest possible bound on the true scene that can be recovered from n photographs in the absence of a priori scene information. The photo hull is especially important in our analysis for two reasons. First, it is algorithm-independent and is applicable to scenes of arbitrary shape. Second, it suggests that even though photoconsistency is an ambiguous property (i.e., many distinct 3D shapes can possess it), emptiness of space is not; the volume V − H(V, I) cannot intersect any photo-consistent 3D shape. Hence, by studying how noise affects the information we can recover about the photo hull and about the emptiness of space (rather than about individual photoconsistent shapes) we can establish a clear mathematical distinction between the analyses of shape ambiguity and shape uncertainty. For simplicity and without loss of generality we assume in this paper that the true scene S ∗ is equal to the photo hull.
Fig. 2. (a) The photo hull H(V, I) contains all photo-consistent shapes (e.g., S ∗ and a shape S). (b) When noise is present, the photo hull cannot be established with certainty. The Photo Hull Distribution assigns a probability to every possible hull (e.g., H1 and H2 ). (c) Foreground and background modeling: the foreground and background models are satisfied for voxels v and v , respectively.
2.2 The Photo Hull Distribution
When the set I of input images is generated stochastically, the emptiness or occupancy of voxels cannot be determined with certainty any more (Figure 2b). To model this uncertainty, we introduce the Photo Hull Distribution:
Definition 1 (Photo Hull Probability) The Photo Hull Probability of S ∈ 2^V is the conditional probability
p[Hull(S, V, I) | V, I] : 2^V → [0, 1], where Hull(S, V, I) is the event that S is the photo hull H(V, I).
The Photo Hull Distribution captures all the information that we can possibly extract from noisy photographs of an arbitrarily-shaped scene in the absence of prior information. This is because (1) scene radiances are not observable and hence shape information cannot be extracted with certainty, and (2) the inherent ambiguities of the (noiseless) n-view stereo problem preclude bounding the scene with anything but its photo hull. While useful as an analytical tool, the Photo Hull Distribution is impractical as a shape representation because it is a function over 2^V, an impossibly large set. To overcome this limitation, we also study how to reconstruct 3D shapes that have a high probability of being the photo hull and how to compute scene reconstructions that represent shape uncertainty in a compact way. To achieve the latter objective, we introduce a probabilistic definition of occupancy:
Definition 2 (Occupancy Probability) The occupancy probability of a voxel v is given by the marginal probability

p[Occ(v, V, I) | V, I] = Σ_{{S | S ∈ 2^V and v ∈ S}} p[Hull(S, V, I) | V, I].
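As a toy illustration of Definition 2 (not part of the paper), the sketch below enumerates the shape space 2^V for a tiny bounding volume, assumes some photo hull probabilities are already given, and marginalizes them to obtain per-voxel occupancy probabilities.

```python
from itertools import combinations

V = ("v1", "v2", "v3")                      # tiny bounding volume of three voxels

# All shapes S in 2^V (the powerset of V).
shapes = [frozenset(c) for r in range(len(V) + 1) for c in combinations(V, r)]

# Hypothetical photo hull probabilities p[Hull(S, V, I) | V, I]; they must sum to 1.
hull_prob = {s: 0.0 for s in shapes}
hull_prob[frozenset({"v1"})] = 0.2
hull_prob[frozenset({"v1", "v2"})] = 0.5
hull_prob[frozenset({"v1", "v2", "v3"})] = 0.3

# Occupancy probability of v: sum of hull probabilities over shapes containing v.
occupancy = {v: sum(p for s, p in hull_prob.items() if v in s) for v in V}
print(occupancy)   # {'v1': 1.0, 'v2': 0.8, 'v3': 0.3}
```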
Intuitively, the occupancy probability defines a scalar field over V whose value at a voxel v ∈ V is equal to the probability that v is contained in the photo hull. Our goal in the remainder of this paper is to characterize the information that n noisy photos provide about the shape of an arbitrary 3D scene. We achieve this by deriving an image-based expression for the photo hull distribution and a method for drawing fair samples from it. Toward this end, we first factor the photo hull probability of a shape S into two components, namely, the probability that S is photo-consistent conditioned on the emptiness of its exterior V − S (so that the shape is visible), and the probability that its exterior is empty:

    p[Hull(S, V, I) | V, I] = p[PhotoC(S, I) | Empty(V − S, I), V, I] p[Empty(V − S, I) | V, I].   (1)

Here, PhotoC(S, I) denotes the event that S is photo-consistent with image set I. Similarly, Empty(V − S, I) denotes the event that the exterior volume V − S does not intersect any shape in 2^V that is photo-consistent with I. The following sections derive expressions for these two fundamental distributions in terms of image measurements.
3 Probabilistic Modeling of Photo-Consistency
When confronted with the problem of formulating the photo-consistency probability p[PhotoC(S, I) | Empty(V − S, I), V, I] in terms of image measurements, two questions arise: (1) how do we express the photo-consistency probability of a 3D shape S in terms of individual pixels (which are projections of individual voxels on the shape), and (2) how do we ensure that all global dependencies between 3D points are correctly accounted for? To answer these questions, we use two key ideas. First, we assume radiance independence, i.e., that the radiances of any two surface points on the true scene are independent. This simplifies the probabilistic modeling of the photo-consistency of individual voxels. The second idea, which we call the visibility-conditioning principle, states that, under the radiance independence assumption, the photo-consistencies of individual voxels are conditionally independent when conditioned on the voxels’ visibility in the true scene, S*. This conditional independence is derived from our image formation model outlined below, and leads to the following factorization of the photo-consistency probability of shape S:¹

Lemma 1 (Photo-Consistency Probability Lemma)

    p[PhotoC(S, I) | Empty(V − S, I), V, I] = ∏_{v∈S} p[PhotoC(v, I, β_S) | I, β_S(v)],   (2)
where PhotoC(v, I, β_S) denotes the event that voxel v is photo-consistent, conditioned on the set of cameras from which it is visible, β_S(v). The following sections give our simple model of image formation and derive the form of the voxel photo-consistency probability that it entails.

¹ See the Appendix for proof sketches of some of the paper’s main results. Complete proofs can be found in [22].
3.1 Image Formation Model
Let S* be an opaque, Lambertian 3D scene that occupies an arbitrary volume in ℝ³. Also assume that the scene is viewed under perspective projection from n distinct and known viewpoints, c_1, ..., c_n, in ℝ³ − S*. Then, the measured RGB value of a pixel depends on three factors: (1) visibility; (2) surface radiance; and (3) pixel noise. We use the term visibility state to denote the set of input viewpoints from which x is visible. It can be represented as an n-bit vector β_{S*}(x) whose i-th bit, [β_{S*}(x)]_i, is one if and only if x is visible from c_i. Each 3D scene produces a unique visibility function, β_{S*} : ℝ³ → {0, 1}^n, mapping points to their visibility states. The radiance, r(·), is a function that assigns RGB values to surface points on S*. Let r(·) be a random variable with a known distribution, R, over [0, 1]; i.e., ∀x : r(x) ∼ R³[0, 1]. Ignoring noise, if a surface point x is visible from viewpoint c_i, then the RGB values observed from x in the camera at c_i are equal to x’s radiance [23]; more formally,

    (x ∈ S*) ∧ ([β_{S*}(x)]_i = 1)  =⇒  proj_i(x) = r(x),
where proj_i(x) denotes the image RGB value at x’s projection along c_i. In practice, image measurements are corrupted by noise. Here, we model pixel noise, n, as mean-zero Gaussian additive noise with variance σ_n². The RGB value proj_i(x) is therefore a random variable given by proj_i(x) = r(x) + n. Assuming that radiances and noise are independent, it follows that (1) the mean of proj_i(x) is E[r(x)], (2) its variance is Var[r(x)] + σ_n², and (3) conditioned on radiance, the measured pixel color is distributed according to the noise distribution: proj_i(x) | r(x) ∼ N(r(x), σ_n²). In what follows we use proj(x) to denote the vector of random variables {proj_i(x) : [β_{S*}(x)]_i = 1} of a point x with visibility state β_{S*}(x). We use I to denote the set of images obtained from the n cameras, and I(x, β_{S*}) to denote a sample of the random vector proj(x), i.e., the vector of image RGB values observed at x’s projections.
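To make the observation model concrete, the following minimal sketch (ours, not part of the paper) simulates proj_i(x) = r(x) + n for a single surface point seen by k cameras; the radiance distribution and noise level are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    sigma_n = 0.02          # assumed pixel-noise standard deviation (illustrative)
    k = 8                   # number of cameras that see the point x

    # Radiance r(x): one RGB value drawn from an assumed distribution R over [0, 1].
    r_x = rng.uniform(0.0, 1.0, size=3)

    # Noisy observations proj_i(x) = r(x) + n, with n ~ N(0, sigma_n^2) per channel.
    proj = r_x + rng.normal(0.0, sigma_n, size=(k, 3))

    # For a true surface point, the sample variance of the k observations
    # approaches sigma_n^2 (the foreground case of Section 3.2).
    print(proj.mean(axis=0), proj.var(axis=0, ddof=1))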
3.2 Photo-Consistency
Given only β_{S*}(v) and I(v, β_{S*}), what can we say about whether or not v belongs to S*? To answer this question we formulate two generative models that characterize the distribution of the set of observations, I(v, β_{S*}), that corresponds to v. The foreground model captures the fact that if v lies on the surface and β_{S*}(v) is v’s correct visibility state, then all radiances in proj(v) are identical since they arise from the same location on the surface:

Observation 1 (Foreground Model) If v ∈ S* and k = |I(v, β_{S*})| is the number of cameras for which v is visible, then (1) the distribution of proj(v) conditioned on the actual radiance, r(v), is a multi-variate Gaussian

    proj(v) | r(v) ∼ N^k(r(v), σ_n²),   (3)

and (2) the sample variance of the RGB measurements, Var[I(v, β_{S*})], will converge to the noise variance σ_n² as k tends to infinity.
In practice, k is finite. It follows from Eq. (3) that the observation likelihood, p[Var[I(v, β_{S*})] | v ∈ S*], is a χ² density with 3(k − 1) degrees of freedom. If v is not a voxel on S*, the color at each of its k projections is due to a distinct scene voxel v_j (Figure 2c), with proj_j(v) = r(v_j) + n for j = 1, ..., k. This, along with the radiance independence assumption, that radiances at different voxels are independent, leads to our simple background model:

Observation 2 (Background Model) If v ∉ S*, the distribution of proj(v) conditioned on the actual radiances is given by

    proj(v) | {r(v_j)}_{j=1}^k ∼ ∏_{j=1}^k N(r(v_j), σ_n²),   (4)

the sample variance of which, Var[I(v, β_{S*})], converges to Var[r] + σ_n² as k tends to infinity.
In practice, with a finite number of cameras and an unknown background distribution, we model the likelihood p[Var[I(v, β_{S*})] | v ∉ S*] using the input photos themselves: we generate 10^6 sets of pixels, each of size |I(v, β_{S*})|, whose pixels are drawn randomly from distinct photos. This collection of variances is used to model the distribution of sample variances for voxels that do not lie on the surface. We can now use Bayes’ rule to derive the probability that a voxel is on the scene’s surface given a set of image measurements associated with its visibility state β:

    p[v ∈ S* | Var[I(v, β)]] = p[Var[I(v, β)] | v ∈ S*] / ( p[Var[I(v, β)] | v ∈ S*] + p[Var[I(v, β)] | v ∉ S*] ).   (5)
Note that this particular formulation of the foreground and background likelihoods was chosen for simplicity; other models (e.g., see [21]) could be substituted without altering the theoretical development below. Finally, by defining the voxel photo-consistency probability to be the probability that v is consistent with the scene surface conditioned on visibility, i.e.,

    p[PhotoC(v, I, β_S) | I, β_S(v)]  :=  p[v ∈ S* | Var[I(v, β_S(v))]],
we obtain the factored form for the shape photo-consistency probability in Lemma 1.
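A minimal sketch (ours) of how Eq. (5) might be evaluated in practice. The χ² foreground likelihood and the empirically sampled background likelihood follow the description above, but the helper name, the histogram-based background density, and the small stabilizing constant are our own illustrative choices.

    import numpy as np
    from scipy.stats import chi2

    def photo_consistency_prob(pixel_colors, sigma_n, bg_variances):
        # p[v in S* | Var[I(v, beta)]] via Eq. (5).
        # pixel_colors : (k, 3) RGB values observed at v's projections (its visibility state beta).
        # sigma_n      : assumed image-noise standard deviation.
        # bg_variances : sample variances of many random k-pixel sets drawn from distinct photos,
        #                used as an empirical background model (Section 3.2).
        pixel_colors = np.asarray(pixel_colors, float)
        k = len(pixel_colors)
        s2 = pixel_colors.var(axis=0, ddof=1).mean()            # observed sample variance

        # Foreground likelihood: chi^2 with 3(k-1) d.o.f. on the scaled variance,
        # converted back to a density in variance units (Jacobian dof / sigma_n^2).
        dof = 3 * (k - 1)
        p_fg = chi2.pdf(dof * s2 / sigma_n**2, dof) * dof / sigma_n**2

        # Background likelihood: histogram estimate from the sampled variances.
        hist, edges = np.histogram(bg_variances, bins=64, density=True)
        idx = np.clip(np.searchsorted(edges, s2) - 1, 0, len(hist) - 1)
        p_bg = hist[idx]

        return p_fg / (p_fg + p_bg + 1e-12)                     # Bayes' rule, Eq. (5)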
4 Probabilistic Modeling of Emptiness
Unlike photo-consistency, the definition of emptiness of region V − S does not depend only on the shape S; it requires that no photo-consistent shape in 2^V intersects that region. In order to express this probability in terms of individual pixel measurements, we follow an approach analogous to the previous section: we apply the visibility-conditioning principle to evaluate p[Empty(V − S, I) | V, I], the emptiness probability of a volume V − S, as a product of individual voxel-specific probabilities. But what voxel visibilities do we condition upon? The difficulty here is that we cannot assign a unique visibility state to a voxel in V − S because its visibility depends on which shape in 2^V we are
Fig. 3. Visualizing voxel visibility strings and carving sequences. The carving sequence Σ = v̄_1 ⋯ v̄_m defines a sequence of m shapes, where the (i − 1)-th shape determines the visibility of the i-th voxel and S(Σ, i − 1) = V − {v_1, v_2, ..., v_{i−1}}. Voxels in V − S and S are shown as light and dark gray, respectively. Also shown is the shape-space path defined by Σ.
considering. To overcome this difficulty, we introduce the concept of voxel visibility strings. These strings provide a convenient way to express the space of all possible visibility assignments for voxels in V − S. Importantly, they allow us to reduce the estimation of emptiness and photo hull probabilities to a string manipulation problem:

Definition 3 (Voxel Visibility String) (1) A voxel visibility string, Σ = q_1 ⋯ q_{|Σ|}, is a string of symbols from the set {v, v̄ | v ∈ V}. (2) The shape S(Σ, i) induced by the first i elements of the visibility string Σ is obtained by removing from V all voxels in the sub-string q_1 ⋯ q_i that occur in their v̄-form. (3) The visibility state of the voxel v corresponding to string element q_i is determined by the shape S(Σ, i − 1), i.e., it is the bit vector β_{S(Σ,i−1)}(v). (4) The value function, F(Σ), of string Σ is the product

    F(Σ) = ∏_{i=1}^{|Σ|} f(q_i),   where   f(q_i) = p[PhotoC(v, I, β_{S(Σ,i−1)}) | I, β_{S(Σ,i−1)}(v)]         if q_i = v,
                                           f(q_i) = 1 − p[PhotoC(v, I, β_{S(Σ,i−1)}) | I, β_{S(Σ,i−1)}(v)]     if q_i = v̄.

(5) A voxel visibility string that is a permutation of the voxels in V − S, each of which appears in its v̄-form, is called a carving sequence.
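To make Definition 3 concrete, here is a small sketch (ours) that evaluates the value function F(Σ) of a voxel visibility string. The helper photoc_prob is an assumption: it is expected to return p[PhotoC(v, I, β_S) | I, β_S(v)] for a voxel under the visibility induced by the current shape.

    def value_function(sigma, V, photoc_prob):
        # F(Sigma) for a voxel visibility string.
        # sigma       : list of (voxel, barred) pairs; barred=True means the symbol is in its v-bar form.
        # V           : iterable of voxels of the initial volume.
        # photoc_prob : assumed callable (voxel, shape) -> p[PhotoC(v, I, beta_shape) | I, beta_shape(v)].
        shape = set(V)                                  # S(Sigma, 0) = V
        F = 1.0
        for voxel, barred in sigma:
            p = photoc_prob(voxel, frozenset(shape))    # visibility induced by S(Sigma, i-1)
            F *= (1.0 - p) if barred else p             # f(q_i) from Definition 3(4)
            if barred:
                shape.discard(voxel)                    # carving produces S(Sigma, i)
        return F

For a carving sequence (all symbols barred), F(Σ) is exactly the product of the voxels’ non-photo-consistency probabilities, as used below.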
Note that the set of voxel visibility strings is infinite since voxels can appear more than once in a string. Carving sequences, on the other hand, are always of length |V − S| and have an intuitive interpretation (Figure 3): a carving sequence Σ = v̄_1 ⋯ v̄_m can be thought of as a path in 2^V that “joins” V to S. That is, beginning with the initial volume, V, every time we encounter a string element with a bar we remove the corresponding voxel and thereby generate a new shape. When we reach the end of the string we will have generated the desired shape S. Thus, the set of all carving sequences for a shape S gives us all the possible ways of obtaining S by carving voxels from V. Carving sequences provide an avenue for relating the emptiness of space to the non-photo-consistency of individual voxels in V − S. This is because the order of voxels in a carving sequence assigns a unique visibility state to every voxel it contains, allowing us to apply the visibility-conditioning principle through the following definition:
Definition 4 (Σ-Non-Photo-Consistency Event) We use NPhotoC(V − S, I, Σ) to denote the event that V − S is non-photo-consistent with respect to carving sequence Σ = v̄_1 ⋯ v̄_m. This is the event that no voxel v in Σ is photo-consistent when its visibility state is given by Σ:

    NPhotoC(V − S, I, Σ)  ⟺  ⋀_{i=1}^{m} ¬PhotoC(v_i, I, β_{S(Σ,i−1)}).
Using this definition, it is easy to show that the non-photo-consistency of the volume V − S, when conditioned on the visibilities induced by a carving sequence, is simply the product of the non-photo-consistency probabilities of its constituent voxels: p[NPhotoC(V − S, I, Σ) | V, I, Σ] = F(Σ). To fully specify the emptiness probability of the volume V − S it therefore suffices to establish a relation between the events NPhotoC(V − S, I, Σ) and Empty(V − S, I). The following proposition gives us a partial answer to this problem:

Lemma 2 (Empty-Space Event) If a non-photo-consistent carving sequence exists, V − S is empty:

    ⋁_{Σ ∈ C(V,S)} NPhotoC(V − S, I, Σ)  =⇒  Empty(V − S, I),
where C(V, S) is the set of all carving sequences that map V to S.

Lemma 2 suggests a rule to establish a volume’s emptiness without exhaustive enumeration of all photo-consistent shapes. It tells us that it suffices to find just one carving sequence whose voxels are all non-photo-consistent. While this rule is sufficient, it is not a necessary one: it is possible for V − S to be empty without any of the events NPhotoC(V − S, I, Σ) being true. This seemingly counter-intuitive fact can be proven by contradiction because it leads to an expression for the Photo Hull Probability that is not a distribution,² i.e., Σ_{S∈2^V} p[Hull(S, V, I) | V, I] < 1. From an informal standpoint, this occurs because the set of voxel visibility strings that are carving sequences does not adequately cover the space of all possible visibility assignments for voxels in V − S, resulting in an under-estimation of the “probability mass” for the event Empty(V − S, I). Fortunately, the following proposition tells us that this mass can be correctly accounted for by a slightly expanded set of voxel visibility strings, all of which are mutually disjoint, which consider all sets of visibility assignments for voxels in V − S and contain every voxel in V − S at most |V − S| times:

Proposition 1 (Emptiness Probability)

    p[Empty(V − S, I) | V, I] = (1 / |V − S|!) Σ_{Σ∈C(V,S)} F(Σ) + Σ_{Σ∈C(V,S)} Σ_{P∈P(Σ)−C(V,S)} G(P) R(P),   (6)

where P(Σ) is the set of strings P = p_1 ⋯ p_l with the following properties: (1) P is created by adding voxels in their v-form to a carving sequence Σ, (2) each voxel v_i occurs at most once
² A simple way to verify this fact is to expand the sum in Eq. (7) for a 2-voxel initial volume V. Details are provided in Appendix A.3.
between any two successive voxels appearing in their v̄-form, and (3) no voxel in P can occur in its v-form after it has occurred in its v̄-form; and G(P) and R(P) are given by

    G(P) = ∏_{i=1}^{|P|} g(p_i),   where
        g(p_i) = f(p_i)                              (1st occurrence of the voxel),
        g(p_i) = f(p_i) / f(p_j)                     (voxel occurs in its v-form, previous occurrence at p_j),
        g(p_i) = [f(p_i) + f(p_j) − 1] / f(p_j)      (voxel occurs in its v̄-form, previous occurrence at p_j),

    R(P) = [(m − l_1)! / m!] · [(m − 1 − l_2)! / (m − 1)!] ⋯ [(2 − l_{m−1})! / 2!],

where m = |V − S|; and l_j is the number of voxels, v̄_j inclusive, that exist in P between the two voxels v̄_{j−1} and v̄_j that are consecutive in Σ.
Proposition 1 tells us that the “extra” mass needed to establish the emptiness of a region consists of the G(P)-terms in Eq. (6). This result is somewhat surprising, since it suggests that we can capture all relevant visibility configurations of voxels in V − S by only considering a very constrained set of voxel visibility strings, whose length is at most (|V − S|² + |V − S|)/2. Moreover, it not only leads to a probability distribution for the photo hull but, as shown in Section 6, it leads to a simple stochastic algorithm that produces fair samples from the photo hull distribution. We consider the implications of this proposition below.
5 The Photo Hull Distribution Theorem

One of the key consequences of our probabilistic analysis is that despite the many intricate dependencies between global shape, visibility, emptiness, and photo-consistency, the expression of a shape’s photo hull probability in terms of image measurements has a surprisingly simple form. This expression is given to us by the following theorem that re-states Eq. (1) in terms of image measurements by bringing together the expressions in Lemma 1 and Proposition 1. It tells us that for a shape S, this probability is a sum of products of individual voxel photo-consistency or non-photo-consistency probabilities; each voxel in S contributes exactly one voxel photo-consistency factor to the product, while each voxel in V − S contributes multiple non-photo-consistency factors.

Theorem 2 (Photo Hull Distribution Theorem) (1) If every pixel observation in I takes RGB values over a continuous domain, the probability that shape S ⊆ V is the photo hull H(V, I) is

    p[Hull(S, V, I) | V, I] = Σ_{Σ∈C(V,S)} Σ_{P∈P(Σ)} G(P || Π) R(P),   (7)
where Π is a string corresponding to an arbitrary permutation of the voxels in S, all of which appear in their v-form; C(V, S), P(Σ) and G(·), R(·) are defined in Proposition 1; and “||” denotes string concatenation. (2) Eq. (7) defines a probability distribution function, i.e., Σ_{S∈2^V} p[Hull(S, V, I) | V, I] = 1.
6 Probabilistic Occupancy by Fair Stochastic Carving
Even though the photo hull probability of a shape S has a relatively simple form, it is impractical to compute because of the combinatorial number of terms in Eq. (7). Fortunately, while the Photo Hull Distribution cannot be computed directly, we can draw fair samples from it (i.e., complete 3D photo hulls) using a stochastic variant of the space carving algorithm [19]. Moreover, by computing a set of 3D photo hulls through repeated applications of this algorithm, we can approximate the occupancy probability of every voxel in V according to Definition 2.

Occupancy Field Reconstruction Algorithm
– Set trial = 1 and run the following algorithm until trial = K:
  Stochastic Space Carving Algorithm
  Initialization: Set S_trial = V and repeatedly perform the following two steps.
  Voxel Selection: Randomly choose a voxel v ∈ S_trial that either has not been selected or whose visibility state changed from β to β′ since its last selection.
  Stochastic Carving: If no such voxel exists, set trial = trial + 1 and terminate. Otherwise, carve v from S_trial with probability

      ( p[PhotoC(v, I, β) | I, β] − p[PhotoC(v, I, β′) | I, β′] ) / p[PhotoC(v, I, β) | I, β].

– For every v ∈ V, set p[Occ(v, V, I) | V, I] = L/K, where L is the number of carved shapes in {S_1, ..., S_K} containing v.
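A schematic sketch (ours, heavily simplified) of one trial of the carving loop above. The voxel grid, visibility computation and photo-consistency probabilities are assumed to be supplied by the helpers visibility_state and photoc_prob, and the carving probability used at a voxel’s first selection is our own reading of the rule (1 − p[PhotoC]), consistent with the fairness argument in Appendix A.4 but not spelled out in the algorithm box.

    import numpy as np

    def stochastic_carving_trial(V, visibility_state, photoc_prob, rng):
        # One sample (a candidate photo hull) from the stochastic carving loop.
        # V                : set of voxels of the initial volume.
        # visibility_state : assumed callable (voxel, shape) -> hashable visibility state beta.
        # photoc_prob      : assumed callable (voxel, beta)  -> p[PhotoC(v, I, beta) | I, beta].
        S = set(V)
        last_beta = {}                                    # visibility state at each voxel's last selection
        while True:
            eligible = [v for v in S
                        if v not in last_beta
                        or visibility_state(v, frozenset(S)) != last_beta[v]]
            if not eligible:
                return S                                  # this trial's carved shape
            v = eligible[rng.integers(len(eligible))]
            beta_new = visibility_state(v, frozenset(S))
            p_new = photoc_prob(v, beta_new)
            if v in last_beta:
                p_old = photoc_prob(v, last_beta[v])
                carve_prob = (p_old - p_new) / p_old      # carving rule from the algorithm above
            else:
                carve_prob = 1.0 - p_new                  # first selection (our assumption)
            last_beta[v] = beta_new
            if rng.random() < carve_prob:
                S.discard(v)

    # Occupancy field: run K trials and set, for every voxel v,
    # occ[v] = (number of carved shapes containing v) / K, as in Definition 2.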
From a superficial point of view, the Stochastic Space Carving Algorithm in the above inner loop is almost a trivial extension of standard space carving—the only essential difference is the stochastic nature of the carving operation, which does not depend on a pure threshold any more. This carving operation and the precise definition of the carving probability are extremely important, however. This is because if carving is performed according to the above rule, the algorithm is guaranteed to draw fair samples from the Photo Hull Distribution of Theorem 2. In the limit, this guarantees that the occupancy probabilities computed by the algorithm will converge to the theoretically-established values for this probability: Proposition 2 (Algorithm Fairness) When the number of trials goes to infinity, the frequency by which the Stochastic Carving Algorithm generates shape S will tend to p [Hull(S, V, I) | V, I].
A second feature of the Stochastic Space Carving Algorithm, which becomes evident from the proof of Proposition 2 in the Appendix, is that it provides us with an approximate means for comparing the likelihoods of two photo hulls it generates: since each run of the algorithm corresponds to a single term in the sum of Eq. (7), i.e., a particular combination of the Σ, P, Π strings, and since we can compute this term from the algorithm’s trace as it examines, re-examines and removes voxels, we can compare two photo hulls by simply comparing the value of these terms. This effectively allows us to develop an approximate method for selecting a maximum-likelihood photo hull from a set of generated candidate shapes.³
³ Note that this method is approximate because we can only compute a single term in the sum in Eq. (7) that defines p[Hull(S, V, I) | V, I], not the probability itself.
Fig. 4. Cactus: Raw views of the occupancy field were created by computing, at each pixel, the average over all voxels that project to it and have non-zero occupancy probability. Each image in the last row compares the recovered occupancy field to volumes obtained from standard space carving, shown directly above it. A light (dark) gray pixel indicates that at least one voxel in the occupancy field with non-zero probability projects to it, and that no (at least one) voxel in the carved volume projects to it. Intuitively, light gray pixels mark regions that the standard algorithm “over-carves”, i.e., regions that were carved even though they have a non-negligible probability of being in the photo hull. Here, P_F and P_B correspond to the foreground and background likelihoods in Section 3.2.
7 Experimental Results
To demonstrate the practical utility of our formulation, we applied the algorithm of Section 6 to different scenes of varying complexity and structure.⁴ The cactus dataset (from [18]) is especially challenging because of the drastic changes in visibility between photos, the complex scene geometry, and because the images contain
⁴ The complete input sequences are available at http://www.cs.toronto.edu/˜kyros/research/occupancy/.
many “mixed” pixels, quantization errors, and 1-4 pixel calibration errors that make pixel matching difficult. The carving probabilities in the Stochastic Carving Step of the algorithm require foreground and background likelihoods (denoted P_F and P_B in Figure 4). The foreground likelihood was given by a χ² density with σ = 20. The background likelihood was modeled from the input photos as described in Section 3.2. The resulting plots in Figure 4 (bottom left) show these likelihoods. When the number of input views is small, the carving probability changes slowly as a function of the variance of a voxel’s projections. For many views, where a coincidental pixel match across views is unlikely, it approaches a step function and carving becomes almost deterministic. We ran the algorithm for K = 400 iterations on a cubical, 128³-voxel initial volume V, generating 400 distinct photo hulls that were used to compute the occupancy field shown in Figure 4 (top right). Each iteration took about 17 minutes on an R12000 SGI O2. Views of two of the 400 photo hulls are shown in Figure 4 (top middle). Despite the large number of stochastic carving decisions made by the algorithm, the generated photo hulls are similar in both shape and appearance, with volumes differing by less than 3% from hull to hull. The number of voxels with non-zero probability in the final occupancy field was only 7% more than the voxels in a single sample, suggesting that despite the high dimensionality of shape-space, the Photo Hull Distribution is “highly peaked” for this dataset. In Figure 4 (bottom right) we compare the occupancy field to photo hulls obtained from the standard space carving algorithm [19] for different variance thresholds. Note that the standard space carving algorithm fails catastrophically for lower values of σ. Even at σ = 35, standard space carving results in a volume that appears over-carved and has artifacts. This behavior of standard space carving over a wide range of variance thresholds shows that it fails to model the uncertainty in the data set. The occupancy field represents this uncertainty quite well: the volume corresponding to the pot base has voxel occupancies close to 1, while near the top of the tallest cactus, where detail is highest, voxels have lower occupancy probabilities. Results with K = 400 for a second dataset (the gargoyle from [19]) are summarized in Figure 5. Each iteration took about 6 minutes. The gargoyle statuette has more structure than the earlier example and the resulting occupancy field is less diffuse. However, as each voxel is visible in at most 7 of the 16 input images, the voxel probability distribution has a gentler slope, and the occupancy field shows a smoother gradation of voxel occupancy probabilities moving inwards of the object, most noticeably in regions with discontinuities and irregularities on the surface, and in the cavity between the gargoyle’s hands and body (Figure 5, bottom row).
8 Concluding Remarks
This paper introduced a novel, probabilistic theory for analyzing the 3D occupancy computation problem from multiple images. We have shown that the theory enables a complete analysis of the complex probabilistic dependencies inherent in occupancy calculations, provides an expression for the tightest occupancy bound recoverable from noisy images, and leads to an algorithm with provable properties and good performance on complex scenes. Despite these features, however, our framework relies on a shape representation that is discrete and on a radiance independence assumption that does not
Fig. 5. Gargoyle dataset (input images not shown): (Top) 3 of 400 sample hulls generated and the occupancy field obtained by combining them. (Bottom) Vertical slices of the occupancy field. The bar at the bottom shows the mapping from voxel occupancy probabilities to gray levels.
hold for scenes with smoothly-varying textures. Overcoming these limitations will be a major focus of future work. Additional topics will include (1) providing a mechanism for introducing shape priors, (2) developing efficient strategies for sampling the Photo Hull Distribution and for computing likely Photo Hulls, and (3) developing fast, sampling-free approximation algorithms for occupancy field computation. Acknowledgements. We would like to thank Allan Jepson for many helpful discussions. The support of the National Science Foundation under Grant No. IIS-9875628 and of the Alfred P. Sloan Foundation for Research Fellowships to Fleet and Kutulakos is gratefully acknowledged.
References

1. A. Elfes, “Sonar-based real-world mapping and navigation,” IEEE J. Robotics Automat., vol. 3, no. 3, pp. 249–266, 1987.
2. B. Curless and M. Levoy, “A volumetric method for building complex models from range images,” in Proc. SIGGRAPH’96, pp. 303–312, 1996.
3. D. Marr and T. Poggio, “A computational theory of human stereo vision,” Proceedings of the Royal Society of London B, vol. 207, pp. 187–217, 1979.
4. L. Matthies, “Stereo vision for planetary rovers: Stochastic modeling to near real-time implementation,” Int. J. Computer Vision, vol. 8, no. 1, pp. 71–91, 1992.
5. T. Kanade and M. Okutomi, “A stereo matching algorithm with an adaptive window: Theory and experiment,” IEEE Trans. PAMI, vol. 16, no. 9, pp. 920–932, 1994.
6. I. Cox, S. Hingorani, S. Rao, and B. Maggs, “A maximum likelihood stereo algorithm,” CVIU: Image Understanding, vol. 63, no. 3, pp. 542–567, 1996.
7. H. Ishikawa and D. Geiger, “Occlusions, discontinuities and epipolar line in stereo,” in Proc. 5th European Conf. on Computer Vision, pp. 232–248, 1998.
8. P. N. Belhumeur, “A Bayesian approach to binocular stereopsis,” Int. J. on Computer Vision, vol. 19, no. 3, pp. 237–260, 1996.
9. R. Szeliski, “Bayesian modeling of uncertainty in low-level vision,” Int. J. Computer Vision, vol. 5, no. 3, pp. 271–301, 1990.
10. D. Geiger and F. Girosi, “Parallel and deterministic algorithms from MRF’s: surface reconstruction,” IEEE Trans. PAMI, vol. 13, no. 5, pp. 401–412, 1991.
11. K. Kanatani and N. Ohta, “Accuracy bounds and optimal computation of homography for image mosaicing applications,” in Proc. 7th Int. Conf. on Computer Vision, vol. 1, pp. 73–78, 1999.
12. P. Torr, R. Szeliski, and P. Anandan, “An integrated Bayesian approach to layer extraction from image sequences,” IEEE Trans. PAMI, vol. 23, no. 3, pp. 297–303, 2001.
13. P. Fua and Y. G. Leclerc, “Object-centered surface reconstruction: Combining multi-image stereo and shading,” Int. J. Computer Vision, vol. 16, pp. 35–56, 1995.
14. R. T. Collins, “A space-sweep approach to true multi-image matching,” in Proc. Computer Vision and Pattern Recognition Conf., pp. 358–363, 1996.
15. S. M. Seitz and C. R. Dyer, “Photorealistic scene reconstruction by voxel coloring,” in Proc. Computer Vision and Pattern Recognition Conf., pp. 1067–1073, 1997.
16. S. Roy and I. J. Cox, “A maximum-flow formulation of the N-camera stereo correspondence problem,” in Proc. 6th Int. Conf. on Computer Vision, pp. 492–499, 1998.
17. O. Faugeras and R. Keriven, “Complete dense stereovision using level set methods,” in Proc. 5th European Conf. on Computer Vision, pp. 379–393, 1998.
18. K. N. Kutulakos, “Approximate N-View stereo,” in Proc. European Conf. on Computer Vision, pp. 67–83, 2000.
19. K. N. Kutulakos and S. M. Seitz, “A theory of shape by space carving,” Int. J. Computer Vision, vol. 38, no. 3, pp. 199–218, 2000. Marr Prize Special Issue.
20. J. D. Bonet and P. Viola, “Roxels: Responsibility-weighted 3D volume reconstruction,” in Proc. 7th Int. Conf. on Computer Vision, pp. 418–425, 1999.
21. A. Broadhurst, T. Drummond, and R. Cipolla, “A probabilistic framework for space carving,” in Proc. 8th Int. Conf. on Computer Vision, pp. 388–393, 2001.
22. R. Bhotika, D. J. Fleet, and K. N. Kutulakos, “A probabilistic theory of occupancy and emptiness,” Tech. Rep. 753, University of Rochester Computer Science Department, Dec. 2001.
23. B. K. P. Horn, Robot Vision. MIT Press, 1986.
A Proof Sketches

A.1 Lemma 1: Photo-Consistency Probability
We rely on the following result, which is stated here without proof (see [22] for details):

Proposition 3 (Conditional Independence of Photo-consistency and Emptiness) If every pixel observation in I takes RGB values over a continuous domain, the photo-consistency probability of S, conditioned on the visibility function β_S, is independent of the emptiness of V − S:

    p[PhotoC(S, I) | Empty(V − S, I), V, I] = p[PhotoC(S, I) | I, β_S].   (8)
To prove the lemma, it therefore suffices to derive an expression for the right-hand side of Eq. (8). A shape is photo-consistent if and only if every voxel v ∈ S is photo-consistent. Considering an arbitrary voxel v_1 ∈ S and applying the chain rule gives

    p[PhotoC(S, I) | I, β_S] = p[ ⋀_{v∈S} PhotoC(v, I, β_S) | I, β_S ]
                             = p[ PhotoC(v_1, I, β_S) | ⋀_{v∈S−{v_1}} PhotoC(v, I, β_S), I, β_S ] · p[ ⋀_{v∈S−{v_1}} PhotoC(v, I, β_S) | I, β_S ].

The visibility function β_S defines the visibility state for all voxels in S, including v_1. Now, the photo-consistency probability of v_1, when conditioned on v_1’s visibility state in S, is independent of the photo-consistency probability of all other voxels in S:

    p[ PhotoC(v_1, I, β_S) | ⋀_{v∈S−{v_1}} PhotoC(v, I, β_S), I, β_S ] = p[PhotoC(v_1, I, β_S) | I, β_S(v_1)].

This equation is a precise statement of the visibility conditioning principle. When applied to all voxels in S and combined with Eq. (8), it yields the desired factorization:

    p[PhotoC(S, I) | Empty(V − S, I), V, I] = ∏_{v∈S} p[PhotoC(v, I, β_S) | I, β_S(v)].   (9)
A.2 Lemma 2: Empty Space Events
We prove the lemma by contradiction. In particular, we show that if V − S intersects a photoconsistent shape, i.e., if Empty(V − S) is false, no carving sequence Σ ∈ C (V, S) exists for which NPhotoC(V − S, I, Σ) is true. Suppose V − S intersects a photo-consistent shape and let S1 , S2 be the shape’s intersections with V − S and S, respectively. We partition the set of carving sequences, C (V, S), into two subsets C1 , C2 as follows. Let C1 be the set of those carving sequences in C (V, S) where every voxel in S1 appears after the entire set of voxels in V − (S ∪ S1 ), and let C2 = C (V, S) − C1 . Now consider any carving sequence Σ ∗ ∈ C1 . If voxel vk is the first voxel in S1 to appear in Σ ∗ , this voxel will be photo-consistent with respect to the visibility state βSk , where Sk = S1 ∪S. This is because the cameras that view vk according to βSk are a subset of the cameras that view it when vk lies on the shape S1 ∪S2 , which is both a subset of Sk and is photo-consistent. Therefore, NPhotoC(V − S, I, Σ ∗ ) is not true. Moreover, voxel vk will be photo-consistent in every carving sequence in C2 in which it appears before every other voxel in S1 . This is because in that case vk will be visible from even fewer cameras than those that view vk in Σ ∗ . It follows that the above argument is true for every carving sequence in C1 and can be extended to cover every sequence in C2 as well. It follows that the event NPhotoC(V − S, I, Σ) is not satisfied for any carving sequence in C (V, S).
A.3 Emptiness and Inadequacy of Carving Sequences
If it were possible to define the emptiness of a volume V − S simply by the event that at least one carving sequence is not photo-consistent, i.e., Empty(V − S, I) ⟺ ⋁_{Σ∈C(V,S)} NPhotoC(V − S, I, Σ), the expression for the photo-hull probability of S would reduce to
Fig. 6. A two voxel initial volume, V = {v1 , v2 }, viewed by cameras L and R.
    p[Hull(S, V, I) | V, I] = Σ_{Σ∈C(V,S)} G(Σ || Π) R(Σ),   (10)

where Π is a voxel visibility string corresponding to an arbitrary permutation of voxels in S, all of which appear in their v-form. Eq. (10) is obtained by setting P(Σ) = {Σ} in Eq. (7). Despite the fact that Eq. (10) may seem intuitively correct, this equation does not hold in general. This is because the emptiness of V − S does not imply the existence of a non-photo-consistent carving sequence. We prove this fact by contradiction, by showing that the photo-hull probability defined by Eq. (10) is not a distribution in general, i.e., Σ_{S∈2^V} p[Hull(S, V, I) | V, I] < 1.

Consider the volume and viewing geometry illustrated in Figure 6. Let V = {v_1, v_2}, viewed by two cameras L and R. Voxel v_1 is always visible from L and is occluded from R by v_2, while v_2 is always visible from R and is occluded from L by v_1. The shape space consists of S_1 = V, S_2 = {v_1}, S_3 = {v_2}, and S_4 = ∅. From Eq. (10), we get the following expressions for the photo-hull probability of these shapes:

    p[Hull(S_1, V, I) | V, I] = G(∅ || v_1 v_2) R(∅),
    p[Hull(S_2, V, I) | V, I] = G(v̄_2 || v_1) R(v̄_2),
    p[Hull(S_3, V, I) | V, I] = G(v̄_1 || v_2) R(v̄_1), and
    p[Hull(S_4, V, I) | V, I] = G(v̄_1 v̄_2 || ∅) R(v̄_1 v̄_2) + G(v̄_2 v̄_1 || ∅) R(v̄_2 v̄_1).

Using Definition 3, the equations above, and the notational shorthand p_v^{β_S} = p[PhotoC(v, I, β_S) | I, β_S(v)], the sum of photo-hull probabilities of all shapes in the shape space is given by

    Σ_{S∈2^V} p[Hull(S, V, I) | V, I]
      = p_{v_1}^{β_{S_1}} p_{v_2}^{β_{S_1}} + (1 − p_{v_2}^{β_{S_1}}) p_{v_1}^{β_{S_2}} + (1 − p_{v_1}^{β_{S_1}}) p_{v_2}^{β_{S_3}}
        + ½ (1 − p_{v_1}^{β_{S_1}})(1 − p_{v_2}^{β_{S_3}}) + ½ (1 − p_{v_2}^{β_{S_1}})(1 − p_{v_1}^{β_{S_2}})
      = 1 − ½ (1 − p_{v_1}^{β_{S_1}})(p_{v_2}^{β_{S_1}} − p_{v_2}^{β_{S_3}}) − ½ (1 − p_{v_2}^{β_{S_1}})(p_{v_1}^{β_{S_1}} − p_{v_1}^{β_{S_2}}).

The second and third “residual” terms in the final expression of the above sum will be nonzero in general. This means that Σ_{S∈2^V} p[Hull(S, V, I) | V, I] < 1, implying that the photo-hull probability of shapes S ⊆ V in Eq. (10) is underestimated. Hence, the emptiness probability of V − S is not completely explained by the non-photo-consistency of the carving sequences in C(V, S). The two residual terms correspond to the voxel visibility strings v_2 v̄_1 v̄_2 and v_1 v̄_2 v̄_1, respectively. These strings belong to ∪_{Σ∈C(V,S_4)} P(Σ), but were omitted when Eq. (7) was reduced to Eq. (10).
A.4 Proposition 2: Algorithm Fairness
Proof (Sketch). The proof involves four steps: (1) showing that each run of the algorithm is described by a voxel visibility string T of bounded length; (2) defining an onto mapping µ between
voxel visibility strings that characterize a run of the algorithm and voxel visibility strings P || Π in Theorem 2; (3) showing that G(T) = G(µ(T)); and (4) using the above steps to prove fairness.

Sketch of Step 1: Define string T to be the exact trace of voxels selected by the algorithm’s Voxel Selection Step, where a voxel appears in its v-form if it was selected and not carved in the Stochastic Carving Step immediately following it, or in its v̄-form if it was carved in that step. The algorithm terminates when all surface voxels have been examined and no visibility transitions have occurred for these voxels. Since carving can only increase the number of cameras viewing a voxel, the total number of visibility transitions is n|V|. This gives an upper bound on |T|.

Sketch of Step 2: Let S be the shape computed by the algorithm and let T be its trace. To construct the string µ(T) we change T as follows: (i) delete from T all but the last occurrence of every voxel in S, creating string T_1; (ii) move the voxels in S to the end of the string, creating T_2; and (iii) define µ(T) = T_2. String µ(T) will be of the form P || Π, with P ∈ P(Σ) for some Σ ∈ C(V, S). This mapping is onto since every string in P(Σ) is a valid trace of the algorithm.

Sketch of Step 3: (i) The visibility state of points in Π that is induced by the string µ(T) = P || Π is identical to the one assigned by the visibility function β_S. It follows that G(Π) = F(Π) = p[PhotoC(S, I) | Empty(V − S, I), V, I]. (ii) Since the voxels in T that are contained in Π always occur in their v-form, their position in T does not affect the visibilities induced by T on those voxels in T that belong to P. Hence, G(T_2) = G(T_1). (iii) If a voxel in V − S appears multiple times in T, this means that it was examined and left uncarved for all but its last occurrence in T. It now follows from the definition of G(·) that G(T) = G(T_2).

Sketch of Step 4: The frequency with which the algorithm arrives at shape S is equal to the probability that the algorithm produces a trace in ∪_{Σ∈C(V,S)} P(Σ). It can be shown that Steps 1–3 now imply that the probability the algorithm produces a trace T is equal to G(µ(T)) R(µ(T)), where R(·) is defined in Proposition 1. Intuitively, the R(µ(T)) factor accounts for the portion of all traces T that map to the same µ(T). Since each trial of the algorithm is independent of all previous ones, it follows that the probability that the algorithm generates shape S is equal to the shape’s photo hull probability, as defined in Theorem 2.
Texture Similarity Measure Using Kullback-Leibler Divergence between Gamma Distributions

John Reidar Mathiassen¹, Amund Skavhaug², and Ketil Bø³

¹ SINTEF Fisheries and Aquaculture, N-7465 Trondheim, Norway
[email protected]
² Department of Engineering Cybernetics (ITK), NTNU, N-7491 Trondheim, Norway
[email protected]
³ Department of Computer and Information Science (IDI), NTNU, N-7491 Trondheim, Norway
[email protected]
Abstract. We propose a texture similarity measure based on the Kullback-Leibler divergence between gamma distributions (KLGamma). We conjecture that the spatially smoothed Gabor filter magnitude responses of some classes of visually homogeneous stochastic textures are gamma distributed. Classification experiments with disjoint test and training images, show that the KLGamma measure performs better than other parametric measures. It approaches, and under some conditions exceeds, the classification performance of the best non-parametric measures based on binned marginal histograms, although it has a computational cost at least an order of magnitude less. Thus, the KLGamma measure is well suited for use in real-time image segmentation algorithms and time-critical texture classification and retrieval from large databases.
1 Introduction
Measuring the similarity between textures is an important task in low-level computer vision. A similarity measure for textures is essential for texture classification, supervised- and unsupervised image segmentation. The motivation of our research has been to find a similarity measure that is both low in computational cost and is statistically justified for classification purposes, and is therefore well suited for use in both texture classification/retrieval and real-time texture segmentation. Many similarity measures have been developed, based on various feature extraction schemes ([20],[21],[22],[23],[24]). In this paper we will focus on texture similarity between features extracted using Gabor filters. Observations of many textures lead us to conjecture that some classes of visually homogeneous
The authors would like to thank Henrik Schumann-Olsen for his useful comments and discussions. We also thank the anonymous reviewers for their feedback. This research was supported by The Research Council of Norway.
stochastic textures have gamma distributed spatially smoothed Gabor magnitude responses. We introduce the KLGamma texture similarity measure, which is based on the Kullback-Leibler divergence between gamma distributions. We perform classification experiments on 35 texture images. These experiments are done to evaluate classification performance on image blocks of various sizes. Testing is done with several parametric and non-parametric similarity measures, with both disjoint and non-disjoint test and training sets. All the tested measures assume independence between features to avoid the computational cost of estimating and using the feature covariance matrix in the similarity measure. In addition to the proposed method, the following parametric methods are tested: weighted mean variance (WMV) [14], weighted nearest neighbour (WNN) and the Kullback-Leibler divergence between Gaussians (KLGauss). We also test several non-parametric measures applied to binned marginal probability distributions: Kullback-Leibler divergence (KL), χ² statistic (Chi) and Jeffrey divergence (JD) [15],[16]. In classification experiments with disjoint test and training sets, KLGamma performs better than the other tested parametric measures and KL. For block sizes smaller than 32x32 pixels KLGamma outperforms all other measures, and for larger block sizes it is slightly worse than JD and Chi. We analyze the computational cost of all the tested similarity measures and find that KLGamma is somewhat slower than WNN, and faster than the other parametric measures. Compared to non-parametric measures, which have similar classification performance, KLGamma is at least an order of magnitude less computationally intensive. In tests with disjoint test and training sets, KLGamma achieves slightly better results than the second best parametric measure, KLGauss, while being computationally simpler. Research by Puzicha et al. ([15] and [16]) has shown that the non-parametric Chi and JD measures outperform other non-parametric measures and the parametric WMV measure in texture retrieval tests. In their experiments, the extracted texture features were Gabor filter magnitude responses. It has been shown [2] that spatially smoothing the Gabor filter magnitude responses can improve classification performance, and we will therefore do this kind of smoothing in our tests. Our main contributions are as follows:

– We introduce the KLGamma texture similarity measure based on the Kullback-Leibler divergence between gamma distributions.
– We conjecture that the spatially smoothed Gabor filter magnitude responses of some classes of visually homogeneous stochastic textures are gamma distributed.
– We compare the classification performance of several non-parametric and parametric similarity measures on spatially smoothed Gabor features.

In addition to being justified from a statistical point of view, the combination of speed and classification performance implies that the KLGamma measure is well suited to real-time texture segmentation and texture classification/retrieval. In the next section we present texture feature extraction using spatially smoothed Gabor filter magnitude responses. We then introduce all the tested
similarity measures in the context of maximum-likelihood classification. Then we present results that show how well the gamma distribution fits the probability density function of the extracted features, and then present results from classification tests that compare the KLGamma similarity measure with other parametric and non-parametric measures. Moreover, we present a comparison of the computational cost of the various similarity measures. Finally we discuss the results of this paper and offer some conclusions and suggestions for future work.
2 Texture Feature Extraction

2.1 Gabor Filtering
When extracting image features for use in texture analysis, it seems wise to look to the way texture is analyzed in the human visual cortex. Marcelja [4] and Daugman [5], [6] showed that the responses of simple cells in the visual cortex can be modeled by Gabor filter responses. The 2D Gabor functions proposed by Daugman also achieve the theoretical resolution limit in the conjoint spatial-spatial frequency domain. Thus it is understandable that Gabor filters are commonly used in the feature extraction step of texture analysis. Based on the Gabor functions proposed by Daugman, Lee [1] derived a family of Gabor wavelets that satisfy the neurophysiological constraints for simple cells:

    ψ(x, y, ω₀, θ) = (ω₀ / (√(2π) κ)) e^{−(ω₀²/(8κ²))(x′² + y′²)} [ e^{i ω₀ x′} − e^{−κ²/2} ]   (1)

Here x′ = x cos θ + y sin θ and y′ = −x sin θ + y cos θ. ω₀ is the radial frequency in radians per unit length and θ is the wavelet orientation in radians. The wavelet is centered at (x = 0, y = 0) and κ ≈ π for a frequency bandwidth of one octave. The term e^{−κ²/2} in (1) is the d.c. response of the wavelet, and is subtracted to achieve invariance with respect to image brightness. For simplicity, let ψ_{ω₀,θ} denote the Gabor wavelet with parameters ω₀ and θ. Let ψ^r_{ω₀,θ} be the real part of ψ_{ω₀,θ} and ψ^i_{ω₀,θ} be the imaginary part, and let I_{xy} be the image centered at (x, y) to be filtered by ψ_{ω₀,θ}. Then the magnitude response of I_{xy} filtered by ψ_{ω₀,θ} is

    |ψ_{ω₀,θ} ∗ I_{xy}| = √( (ψ^r_{ω₀,θ} ∗ I_{xy})² + (ψ^i_{ω₀,θ} ∗ I_{xy})² )   (2)
where ∗ is the convolution operator. Experiments have shown [2] that using this magnitude response achieves better texture separability than using full-wave rectification or the real component only. This is most likely due to the fact that optimal conjoint resolution in the spatial-spatial frequency domain is achieved with the complex form (quadrature-phase pair) of the Gabor filter [1]. Pollen and Ronner’s [7] discovery, that the simple cells in the visual cortex exist in quadrature-phase pairs, suggests that the design of these cells might indeed be optimal with respect to the conjoint resolution in the spatial-spatial frequency domains.
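As an illustration of Eqs. (1)-(2), a minimal numpy sketch (ours, not taken from the paper) that samples the Gabor wavelet and computes the magnitude response of an image; the filter support and the FFT-based convolution are implementation choices of this sketch.

    import numpy as np
    from scipy.signal import fftconvolve

    def gabor_wavelet(omega0, theta, kappa=np.pi, size=33):
        # Complex Gabor wavelet of Eq. (1), sampled on a size x size grid.
        half = size // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
        xp = x * np.cos(theta) + y * np.sin(theta)       # x'
        yp = -x * np.sin(theta) + y * np.cos(theta)      # y'
        gauss = (omega0 / (np.sqrt(2.0 * np.pi) * kappa)) * \
                np.exp(-(omega0**2 / (8.0 * kappa**2)) * (xp**2 + yp**2))
        return gauss * (np.exp(1j * omega0 * xp) - np.exp(-kappa**2 / 2.0))

    def magnitude_response(image, omega0, theta):
        # |psi * I| of Eq. (2), via separate real/imaginary convolutions.
        psi = gabor_wavelet(omega0, theta)
        re = fftconvolve(image, psi.real, mode='same')
        im = fftconvolve(image, psi.imag, mode='same')
        return np.sqrt(re**2 + im**2)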
2.2 Spatial Smoothing
The performance of Gabor filters for texture analysis is known to improve by performing Gaussian smoothing of the Gabor filter magnitude response images [2]. If g(x, y, ω₀, θ) = (ω₀ / (√(2π) κ)) e^{−(ω₀²/(8κ²))(x′² + y′²)} is the Gaussian term in (1), the magnitude response images are smoothed with the oriented Gaussian filter g(γx, γy, ω₀, θ). We denote the value at (x, y) of a smoothed magnitude image by

    c_{xy,ω₀,θ,γ} = g(γx, γy, ω₀, θ) ∗ |ψ_{ω₀,θ} ∗ I_{xy}|.   (3)
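Continuing the sketch above (it reuses magnitude_response, fftconvolve and numpy from the previous code), the smoothed feature of Eq. (3) can be computed by convolving the magnitude image with the Gaussian term evaluated at (γx, γy); the kernel normalization and support are our own illustrative choices.

    def smoothed_gabor_feature(image, omega0, theta, gamma=2.0 / 3.0, kappa=np.pi, size=33):
        # c_{xy, omega0, theta, gamma} of Eq. (3): Gaussian-smoothed Gabor magnitude response.
        half = size // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
        xp = x * np.cos(theta) + y * np.sin(theta)
        yp = -x * np.sin(theta) + y * np.cos(theta)
        g = (omega0 / (np.sqrt(2.0 * np.pi) * kappa)) * \
            np.exp(-(omega0**2 / (8.0 * kappa**2)) * ((gamma * xp)**2 + (gamma * yp)**2))
        mag = magnitude_response(image, omega0, theta)       # from the previous sketch
        return fftconvolve(mag, g / g.sum(), mode='same')    # normalized kernel (our choice)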
3 Similarity Measures
In this section we will present the similarity measures. To understand the reason for choosing a given similarity measure, one must look at the underlying assumptions that are made in deducing the measure. Therefore we will describe the maximum-likelihood classification principle, and the similarity measures that result from making assumptions on the type of feature probability distributions. We then define the asymptotically (for large sample sizes) optimal similarity measures for Gaussian distributions, gamma distributions and binned marginal histogram distributions. For binned histograms we will also present some further measures used in [15] and [16]. Finally we will describe the parametric WMV similarity measure proposed by Manjunath and Ma [14], which was compared with several non-parametric methods in [15] and [16].

3.1 Maximum-Likelihood Texture Classification
Assume that a texture is described by a feature space F. Given this feature space we define a mapping g : F → C = {1, ..., K} from the feature space F to the set C of possible texture class labels. The goal we set for a texture classification system is to minimize the probability of classification error. This can be formalized as the minimization of p(g(x) ≠ y), with x being a set of feature observations from a texture sample of class y. The optimal mapping g for this problem formulation is [13] the Bayes classifier

    g*(x) = arg max_i p(x | y = i) P(y = i)   (4)

where p(x | y = i) is the likelihood function for the i-th class, and P(y = i) is its prior probability. When all texture classes have the same prior probability P(y = i) = 1/K, (4) reduces to the maximum-likelihood (ML) classifier [12]

    g(x) = arg max_i (1/M) Σ_{j=1}^M log p(x_j | y = i).   (5)

It is in general computationally expensive to use (5) for large sample sizes M.
Kullback-Leibler Divergence. In [12] it is shown that for large values of M we can apply the law of large numbers to (5), resulting in

    g(x)  −→(M→∞)  arg max_i E_q[ log p(x | y = i) ] = arg min_i KL(q; p^(i))   (6)

where KL(q; p^(i)) is the Kullback-Leibler divergence (KL) between the query density q and the density p^(i) associated with texture class i. This shows that KL is the asymptotic limit of the ML classification criteria. If F has N dimensions, and these dimensions are independent of each other, the Kullback-Leibler divergence is KL(q; p^(i)) = Σ_{j=1}^N KL(q_j; p_j^(i)). This follows from (6) assuming p(x) = ∏_{j=1}^N p_j(x_j). Here j denotes the feature dimension.
3.2 Gaussian Distribution
The univariate Gaussian distribution is given by p_Gauss(x; µ, σ) = (1/(√(2π) σ)) e^{−0.5[(x−µ)/σ]²}. If the N dimensions of F are independent and Gaussian distributed we have from [3] the following Kullback-Leibler divergence

    KLGauss(q; p) = ½ Σ_{j=1}^N [ log( (σ_j^(p))² / (σ_j^(q))² ) + ( σ_j^(q) / σ_j^(p) )² + ( (µ_j^(q) − µ_j^(p)) / σ_j^(p) )² − 1 ]   (7)

A simplification of (7) gives us the weighted nearest neighbour classifier (WNN)

    D_WNN(q; p) = Σ_{j=1}^N ( (µ_j^(q) − µ_j^(p)) / σ_j^(p) )²   (8)

The condition under which this simplification is justified is σ_j^(q) ≈ σ_j^(p). We observe that the WNN measure is equivalent to the Mahalanobis measure [12] if the feature covariance matrix is diagonal.
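For reference, the two Gaussian-based measures of Eqs. (7)-(8) translate directly into code; the sketch below (ours) takes per-dimension means and standard deviations of the query and class models as arrays.

    import numpy as np

    def kl_gauss(mu_q, sigma_q, mu_p, sigma_p):
        # KLGauss(q; p) of Eq. (7) for independent per-dimension Gaussians.
        mu_q, sigma_q, mu_p, sigma_p = map(np.asarray, (mu_q, sigma_q, mu_p, sigma_p))
        return 0.5 * np.sum(np.log(sigma_p**2 / sigma_q**2)
                            + (sigma_q / sigma_p)**2
                            + ((mu_q - mu_p) / sigma_p)**2
                            - 1.0)

    def wnn(mu_q, mu_p, sigma_p):
        # Weighted nearest neighbour distance of Eq. (8).
        mu_q, mu_p, sigma_p = map(np.asarray, (mu_q, mu_p, sigma_p))
        return np.sum(((mu_q - mu_p) / sigma_p)**2)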
Gamma Distribution
The gamma distribution is given by pGamma (x; α, β) = ∞ Γ (z) = 0 e−t tz−1 dt, z > 0.
1 α−1 −x/β e β α Γ (α) x
where
Parameter Estimation. There are two main methods of calculating estimates of the parameters α and β of the gamma distribution: the moment method and maximum-likelihood (ML) method. Using the moment method [9] we obtain the µ σ2 parameters by β = and α = , where µ and σ denote the estimated mean βˆ µ and standard deviation of the distribution. Using the ML estimation method [10] we obtain parameter estimates using n
1 Xi α β = X = n i=1
(9)
138
J.R. Mathiassen, A. Skavhaug, and K. Bø n
log α − Ψ ( α) = log X − log X =
1 log Xk − log X n
(10)
k=1
where X and log X are the mean and mean logarithm of the distribution given by a set of independent observations X = {X1 , . . . , Xn }, and Ψ (·) is the digamma function, i.e. Ψ (z) = Γ (z) /Γ (z). Kullback-Leibler Divergence between Gamma Distributions. A closedform expression for the KL-divergence between two Gamma distributions is given by [3] and results in the following expression KLGamma (q; p) =
N j=1 (p) +αj
(q) (q) (q) (q) (q) (p) αj − 1 Ψ αj − log βj −αj − log Γ αj + log Γ αj (p) log βj −
(p) αj
α(q) β (q) j j (q) (q) − 1 Ψ αj + log βj + (p) βj
(11)
Without any loss of classification performance we can simplify (11) for use in (6), leading to the following similarity measure

    D_KLGamma(q; p) = Σ_{j=1}^N [ log Γ(α_j^(p)) + α_j^(p) log β_j^(p) − α_j^(p) ( Ψ(α_j^(q)) + log β_j^(q) ) + α_j^(q) β_j^(q) / β_j^(p) ]   (12)

which we introduce in this paper for use as a texture similarity measure.
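The KLGamma measure of Eq. (12), together with the moment-method parameter estimates, is straightforward to implement. The sketch below (ours) assumes each texture is summarized by per-dimension gamma parameters estimated from its smoothed Gabor features.

    import numpy as np
    from scipy.special import gammaln, digamma

    def gamma_params_moment(X):
        # Moment-method estimates (alpha, beta) per feature dimension; X has shape (samples, N).
        mu = X.mean(axis=0)
        var = X.var(axis=0, ddof=1)
        beta = var / mu
        alpha = mu / beta
        return alpha, beta

    def d_kl_gamma(alpha_q, beta_q, alpha_p, beta_p):
        # D_KLGamma(q; p) of Eq. (12); terms depending only on q are dropped.
        return np.sum(gammaln(alpha_p)
                      + alpha_p * np.log(beta_p)
                      - alpha_p * (digamma(alpha_q) + np.log(beta_q))
                      + alpha_q * beta_q / beta_p)

    # Classification then follows Eq. (6): assign the query to arg min_i d_kl_gamma(q, p^(i)).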
3.4 Binned Histograms
Marginal probability density functions can be modeled as binned histograms. For a given feature dimension j we then have the empirical distribution p_j(x_j) = p_j(⌊x_j/∆l_j⌋), where ⌊·⌋ denotes the floor operator, and p_j(z) = 0 if z ∉ {0, ..., L − 1}. L is the number of bins and ∆l_j is the bin size.

Kullback-Leibler Divergence. The Kullback-Leibler divergence between two empirical distributions is

    D_KL(q, p) = Σ_{j=1}^N Σ_{l=0}^{L−1} q_j(l) log( q_j(l) / p_j(l) ).   (13)
Appendix A shows that when the feature probabilities are modelled with binned histograms, g(x) = arg min_i D_KL(q, p^(i)) is exactly the maximum-likelihood classifier.
Fig. 1. 128x128 pixel patches of the textures used in the experiments. The textures are displayed in order of goodness-of-fit with the gamma distribution model, with the fit decreasing from left to right, and top to bottom
Jeffrey Divergence. A symmetrized version of the Kullback-Leibler divergence is the Jeffrey divergence [15], which is more numerically stable for empirical distributions. For binned histograms it is

    D_JD(q, p) = Σ_{j=1}^N Σ_{l=0}^{L−1} [ q_j(l) log( q_j(l) / (q_j(l) + p_j(l)) ) + p_j(l) log( p_j(l) / (q_j(l) + p_j(l)) ) ].   (14)
χ² Statistic. An approximation to the Kullback-Leibler divergence is the χ² statistic [12]. A symmetrized version of this similarity measure is used in [15],[16] and is defined for binned histograms as follows

    D_χ²(q, p) = Σ_{j=1}^N Σ_{l=0}^{L−1} [q_j(l) − p_j(l)]² / (q_j(l) + p_j(l)).   (15)
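The three non-parametric measures of Eqs. (13)-(15) operate on the per-dimension binned histograms q_j(l) and p_j(l); a small sketch (ours) follows, in which the constant delta plays the role of the stabilizing bin initialization mentioned in the experimental setup (Section 4.2).

    import numpy as np

    def kl_binned(q, p, delta=1e-5):
        # D_KL(q, p) of Eq. (13); q and p have shape (N, L), rows (approximately) summing to one.
        q = np.asarray(q, float) + delta
        p = np.asarray(p, float) + delta
        return np.sum(q * np.log(q / p))

    def jeffrey(q, p):
        # Jeffrey divergence of Eq. (14); zero bins contribute zero.
        q, p = np.asarray(q, float), np.asarray(p, float)
        m = q + p
        with np.errstate(divide='ignore', invalid='ignore'):
            terms = np.where(q > 0, q * np.log(q / m), 0.0) + np.where(p > 0, p * np.log(p / m), 0.0)
        return np.sum(terms)

    def chi2_stat(q, p):
        # Symmetrized chi^2 statistic of Eq. (15).
        q, p = np.asarray(q, float), np.asarray(p, float)
        denom = q + p
        mask = denom > 0
        return np.sum((q[mask] - p[mask])**2 / denom[mask])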
3.5 Weighted Mean Variance (WMV)
A parametric similarity measure used for comparing textures is the weighted mean variance (WMV) measure proposed by Manjunath and Ma [14]. Let µ_j^(i) and σ_j^(i) be the mean and standard deviation of feature dimension j. The superscript (i) denotes the texture class. The WMV measure between two textures q and p is then defined as
Fig. 2. Goodness-of-fit with the gamma distribution, plotted as log(C/n) for each texture (A1–G5)
    D_WMV(q; p) = Σ_{j=1}^N [ |µ_j^(q) − µ_j^(p)| / α(µ_j) + |σ_j^(q) − σ_j^(p)| / α(σ_j) ]   (16)

where α(µ_j) and α(σ_j) are the standard deviations of the respective features over the entire database, and are used to normalize the individual feature components.
4 Results

4.1 Validity of Gamma Distribution Model
In this section we illustrate how well the gamma distribution model fits the distribution of the extracted texture features. As a goodness-of-fit measure we use the χ² statistic [18]. With N feature dimensions it is

    C = n Σ_{j=1}^N Σ_{l=1}^L (p_j(l) − π_j(l))² / π_j(l),

where n is the number of samples, L is the number of bins, p_j(l) the observed probability of bin l and π_j(l) the expected probability of bin l assuming the feature is gamma distributed. π_j(l) = F_j(l∆l_j) − F_j((l − 1)∆l_j), where ∆l_j is the bin size and F_j(z) is the cumulative gamma distribution. Given a gamma distribution with parameters α_j and β_j, F_j(z) = Γ_{z/β_j}(α_j)/Γ(α_j), where Γ_x(α) = ∫₀^x e^{−t} t^{α−1} dt is the (lower) incomplete gamma function [19]. The goodness-of-fit with the gamma distribution for a given texture is calculated for feature distributions extracted from the entire 512x512 texture image. From figures 1 and 2 we see that the textures that best fit the gamma distribution model are visually homogeneous and appear to be stochastic in nature. We also see that the textures which are either structured, such as G5, or non-homogeneous, such as C5 and F5, do not fit the said model very well. We would like to point out that the texture patches in figure 1 are 128x128 blocks taken from larger 512x512 images.
Fig. 3. An illustration of how well the gamma distribution model fits the extracted features. We show the histogram (jagged line) and fitted gamma model (smooth line) for a representative set of textures (columns: Best (A1), 25% (B2), Median (D3), 75% (F4), Worst (G5); rows: ω₀ = π, π/2, π/4). In these figures we have θ = π/4
A consequence of this is that some 128x128 patches may look homogeneous, while the entire 512x512 image is not. This is for example the case for textures D4 and E4, and explains why they look very homogeneous in figure 1, but are not among the textures with the best goodness-of-fit with respect to the gamma distribution. The degree to which the feature histograms are gamma distributed can be observed in figure 3. It is also worth noting that in preliminary experiments, we find that adding a small amount of Gaussian noise to the texture images increased the goodness-of-fit with the gamma model. From these observations we make the following conjecture:

Conjecture 1. Spatially smoothed Gabor filter magnitude responses of some classes of visually homogeneous stochastic textures are gamma distributed.
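As a concrete illustration of the goodness-of-fit computation described above, here is a minimal sketch (ours); the binning choices mirror the text, and the moment-method fit is an assumption of this sketch.

    import numpy as np
    from scipy.stats import gamma as gamma_dist

    def goodness_of_fit(X, L=64):
        # C = n * sum_j sum_l (p_j(l) - pi_j(l))^2 / pi_j(l) for features X of shape (n, N).
        n, N = X.shape
        C = 0.0
        for j in range(N):
            x = X[:, j]
            beta = x.var(ddof=1) / x.mean()                 # moment-method estimates
            alpha = x.mean() / beta
            dl = x.max() / L                                # bin size
            edges = np.arange(L + 1) * dl
            p_obs, _ = np.histogram(x, bins=edges)
            p_obs = p_obs / n                               # observed bin probabilities
            cdf = gamma_dist.cdf(edges, a=alpha, scale=beta)
            pi = np.diff(cdf)                               # expected gamma bin probabilities
            mask = pi > 0
            C += n * np.sum((p_obs[mask] - pi[mask])**2 / pi[mask])
        return C

    # Figure 2 plots log(C / n) per texture.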
4.2 Classification Performance
We perform two different sets of experiments to validate the classification performance when using the different similarity measures:
– Experiment A: Training and testing on the same data
– Experiment B: Training and testing on disjoint data sets
The first set of experiments will indicate how well the feature set and similarity measure represent the textures. The second set of experiments will shed light on such issues as overtraining and generalization.
Gabor Filter Parameters. For our classification experiments we filter the images with Gabor filters of 4 different orientations, θ ∈ {0, π/4, π/2, 3π/4}, and 5 different scales, ω0 ∈ {π, π/2, π/4, π/8, π/16}, for a total of 20 filters. In the experiments with spatially smoothed magnitude images, a smoothing factor of γ = 2/3 was used for smoothing all Gabor filter magnitude response images. This is just a small portion of the possible parameter range of {θ, ω0, γ}, and we see it as being similar to what is commonly used in the texture analysis community [11],[2]. A vector x in the feature space F is defined as x = [x1, ..., xN] where N is the number of dimensions in the feature space. The parameters used to extract the feature xj using (3) are {θj, ω0j}.
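To make the filter-bank choice concrete, a minimal sketch of the 4 x 5 bank and the spatially smoothed magnitude responses (the spatial-domain kernel construction and the relation between ω0, γ and the Gaussian widths are assumptions of the sketch, not the definitions used in (3)):

import numpy as np
from scipy.ndimage import convolve, gaussian_filter

def gabor_kernel(omega0, theta, gamma=2.0 / 3.0, size=31):
    # complex Gabor kernel with centre frequency omega0 (radians/pixel)
    # and orientation theta; sigma is an assumed function of omega0 and gamma
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    sigma = 2.0 * np.pi * gamma / omega0
    envelope = np.exp(-(xr ** 2 + yr ** 2) / (2.0 * sigma ** 2))
    return envelope * np.exp(1j * omega0 * xr)

def smoothed_gabor_features(image, gamma=2.0 / 3.0):
    image = np.asarray(image, dtype=float)
    thetas = [0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]
    omegas = [np.pi, np.pi / 2, np.pi / 4, np.pi / 8, np.pi / 16]
    feats = []
    for theta in thetas:
        for omega0 in omegas:
            k = gabor_kernel(omega0, theta, gamma)
            resp = convolve(image, k.real) + 1j * convolve(image, k.imag)
            feats.append(gaussian_filter(np.abs(resp), sigma=2.0 / omega0))
    return np.stack(feats)      # 20 feature maps, one per (theta, omega0) pair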
Table 1. Misclassified blocks in experiment A (in percent)

Measure     128x128   64x64   32x32   16x16   8x8    4x4
DKLGamma    0.9       4.8     10.6    15.9    19.2   21.1
DKL         0.4       4.3     7.9     13.5    17.4   19.4
DJD         0.9       4.2     10.3    16.8    22.3   26.5
Dχ2         0.9       3.9     10.0    17.3    23.5   28.6
DKLGauss    0.7       4.4     10.8    16.8    20.5   22.8
DWNN        1.6       6.0     14.0    20.5    24.8   27.3
DWMV        6.1       16.5    34.6    53.4    62.5   65.5
Experimental Setup. In our experiments we used a 35 image subset of the 40 images used in [16]. The image size is 512x512 pixels. For experiment A, training was performed on the entire images, and testing was performed on all non-overlapping blocks of size 128x128, 64x64, 32x32, ..., 4x4. One might question whether blocks as small as 4x4 pixels can represent the variance of the texture they are extracted from. Recall that the features we are analyzing are spatially smoothed Gabor filter responses. Thus the Gabor features in a 4x4 block contain information extracted from a 'receptive field' slightly larger than the (largest) Gabor filter. In experiment B we look at the ability to generalize. Training is performed on one quadrant of the image and testing on all non-overlapping blocks of size 128x128, 64x64, 32x32, ..., 4x4 in the remaining three quadrants. This procedure is repeated for all four quadrants, and the results presented are the average performance over all combinations of training and test sets. In both experiment A and B, the classification performance was measured by the classification precision, i.e. the percentage of correctly classified image blocks. The histograms were binned by setting \Delta l_j = \frac{1}{L} \max_{x,y,i} c^{(i)}_{x,y,\omega_{0j},\theta_j,\gamma}, where (i) denotes the texture and j the feature dimension. To avoid numerical instability when using the non-parametric methods we initialized the histogram bins to a small value δ = 10^{-5} before estimating the histograms. We experimented with values of L between 32 and 256, but found that classification errors did not decrease noticeably when increasing L from 64 to 256. Decreasing L from 64 to 32 gave marginally better classification results for small block sizes, and worse for large block sizes. Tables 1 and 2 therefore display the results of the non-parametric measures with 64 bins. Our experiments showed that there was no noticeable difference in classification error between KLGamma with gamma distribution parameters estimated using the moment method and KLGamma with parameters estimated using the ML method. We therefore display results from the former, due to its computational simplicity.

Table 2. Misclassified blocks in experiment B (in percent)

Measure     128x128   64x64   32x32   16x16   8x8    4x4
DKLGamma    8.0       12.7    18.7    23.7    26.9   28.8
DKL         13.9      17.7    22.8    27.1    30.2   31.2
DJD         7.6       11.4    18.2    24.5    29.6   33.6
Dχ2         7.4       11.3    17.8    24.7    30.7   35.5
DKLGauss    9.7       13.8    20.2    25.4    28.8   30.8
DWNN        9.9       15.1    22.9    28.9    32.9   35.0
DWMV        11.1      22.7    39.2    57.3    68.3   73.3

Interpretation. In experiment A we see that the best measure is KL. This is expected since it is equivalent to using the maximum-likelihood classifier, and the training data are representative of the test data. We also observe that KLGamma is the best parametric method for block sizes smaller than 64x64 pixels, which is also understandable since the assumption that the data are gamma distributed is a valid one for most of the textures. For block sizes of 128x128 and 64x64 pixels, the KLGauss measure is slightly better than KLGamma. The difference between KL and KLGamma can be attributed to the fact that Conjecture 1 does not hold for all of the textures in the dataset. JD and Chi are marginally better than KLGamma for block sizes larger than 16x16, and worse for smaller block sizes. Experiment B reveals that KL does not do too well when the test and training sets are disjoint. This is perhaps due to problems in estimating accurate histograms from the training data, either because the sample size is too small (256x256 pixels) or the training data are not representative of the test data. Chi is the best method for block sizes larger than 16x16, with KLGamma being only slightly worse. KLGamma is the best parametric measure, and for smaller block sizes it is the overall best similarity measure. We attribute the difference in classification error between experiment A and B to the fact that quite a few of the test images had internal textural variations, and were not entirely homogeneous across the whole image.

4.3 Computational Cost
In this section we will look at how computationally intensive the similarity measures tested in the previous section are. We will analyze them in terms of how many operations are required to classify a query texture as being one of M textures. Parametric Measures. To give a fair speed comparison we will for each similarity measure define a set of parameters Q that must be calculated for each feature dimension of the query texture, and parameters P that must be calculated for each feature dimension of the database textures, in order for the similarity measure calculation to be as simple as possible. The equations in the second column of table 3 show the simplified calculations of each similarity measure for one feature dimension.
Table 3. Computational cost of parametric similarity measures

Measure     Expression                          Q                          P                               No. of ops
DWNN        (Q1 - P1)^2 / P2                    [µ]                        [µ, σ^2]                        4NM
DWMV        |Q1 - P1| + |Q2 - P2|               [µ/α(µ), σ/α(σ)]           [µ/α(µ), σ/α(σ)]                6NM + 4N
DKLGauss    P3 - Q3 + ((Q1 - P1)^2 + Q2)/P2     [µ, σ^2, log σ^2]          [µ, σ^2, log σ^2]               7NM + 2N
DKLGamma    P3 - P1 Q1 + Q2/P2                  [Ψ(α) + log β, αβ]         [α, β, log Γ(α) + α log β]      5NM + 7N
Table 4. Computational cost of non-parametric similarity measures

Measure   Expression                              Q      P             No. of ops
DKL       Q1 P1                                   [-q]   [log p]       2LNM
DJD       (-Q1 - P1) log(Q1 + P1) + P2            [q]    [p, p log p]  8LNM
Dχ2       (Q1 - P1)^2 / (Q1 + P1)                 [q]    [p]           5LNM
We assume that multiplications/divisions are as fast as additions/subtractions, one table lookup takes two operations and the |·| operator takes one operation. A further assumption is that the quantities µ and σ2 are already calculated and the moment method is used to calculate the parameters of the gamma distribution. Also assume that the texture feature space has N dimensions and the database texture parameters P are calculated in advance. In the rightmost column of table 3 we display the number of operations required to classify a query texture as being one of M possible database textures using the different similarity measures. From these results we see that WNN is the fastest method. If M > 3 the proposed KLGamma measure is the second fastest.
Non-parametric Measures. We will do a similar analysis for the non-parametric measures. Simplifications will be done where possible. Here Q and P will be calculated for each of the L histogram bins in all the N feature dimensions. q and p denote the bin values for the query texture and database texture respectively. These results show that the KL measure is the easiest to compute, and the Chi measure is significantly faster than the JD measure. If we compare the computational cost of the parametric measures with the non-parametric measures we see that the parametric measures are approximately L times faster.
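To illustrate the kind of bookkeeping summarized in Table 3, the sketch below scores a query against database textures with the gamma-based measure; the score is the query-independent part of the Kullback-Leibler divergence between gamma densities in the shape/scale parametrization, so the exact Q/P split may differ in detail from the table (this is an illustrative sketch, not the implementation used in the experiments):

import numpy as np
from scipy.special import digamma, gammaln

def gamma_params_moment(mu, var):
    # moment-method estimates of the gamma shape/scale parameters
    return mu ** 2 / var, var / mu

def klgamma_score(alpha_q, beta_q, alpha_p, beta_p):
    # -E_q[log p(x)] summed over the N feature dimensions; minimizing this
    # over the database textures is equivalent to minimizing KL(q || p)
    Q1 = digamma(alpha_q) + np.log(beta_q)     # E_q[log x]
    Q2 = alpha_q * beta_q                      # E_q[x]
    P1 = alpha_p - 1.0
    P2 = beta_p
    P3 = gammaln(alpha_p) + alpha_p * np.log(beta_p)
    return np.sum(P3 - P1 * Q1 + Q2 / P2)

def classify(query_params, database_params):
    aq, bq = query_params
    scores = [klgamma_score(aq, bq, ap, bp) for ap, bp in database_params]
    return int(np.argmin(scores))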
5 Discussion and Future Work
A question that naturally arises is whether the classification results we have obtained represent what would typically be found in texture classification applications. The first issue that should be addressed in order to answer this question is whether the test data are representative of "typical" textures. To say they are would be jumping to conclusions without much evidence. However, one cannot avoid noticing how well the gamma distribution model fits the features
extracted from the visually homogeneous and stochastic textures. We suggest that further research should be done to determine for which classes of homogeneous stochastic textures our conjecture holds, and if it holds for homogeneous deterministic (e.g. polka-dot) textures when there is noise present. Preliminary experiments we have done suggest that this might be the case. In a later paper we will investigate more thoroughly the validity of our conjecture, give a rationale for why the extracted Gabor features appear to be gamma distributed, and determine how to extend the use of the KLGamma similarity measure to scale and rotation invariant texture classification/retrieval. The second issue we wish to touch upon is whether our computational cost comparison gives results that are valid in realistic implementation scenarios. If the implementation scenario follows the assumptions we have made, the results should be valid. The plethora of different specialized hardware and software platforms in existence makes it almost impossible to give a computational cost comparison that is valid in general. However, we can determine that the relative difference in computational cost between our proposed KLGamma measure and the non-parametric measures is approximately an order of magnitude.
6 Conclusion
Based on observations of 35 texture images, we have conjectured that the spatially smoothed Gabor filter magnitude responses of some classes of visually homogeneous stochastic textures are gamma distributed. For the class of textures under which this conjecture holds, we have derived a computationally simple parametric similarity measure, the KLGamma measure, which is asymptotically optimal when used for texture classification/retrieval. This texture similarity measure is based on the Kullback-Leibler divergence between gamma distributions. We compared the texture classification performance of KLGamma with several parametric and non-parametric measures. Classification experiments confirmed that KLGamma is close to optimal. When training and testing was performed on the same dataset, the only method that noticeably outperformed our method was the Kullback-Leibler divergence between binned histograms. We attribute this to the fact that some of the textures in our test suite were non-homogeneous. Our conjecture does not hold for these textures. In tests with disjoint test and training sets, KLGamma was only slightly outperformed by the non-parametric Jeffrey divergence and symmetric χ2 statistic for block sizes larger than 16x16 pixels. For smaller blocks, KLGamma achieved the best classification results of all the similarity measures we have tested. An analysis of the computational cost of all the tested measures indicates that KLGamma is at least an order of magnitude faster than non-parametric measures, while having comparable, and under some conditions better, classification performance. Only the simplest parametric method, WNN, is faster. Judging from the results we have obtained, we can conclude that, in scenarios under which our conjecture holds, our proposed KLGamma measure is the
method of choice in e.g. real-time block-based segmentation algorithms such as that proposed by Hofmann et al. in [17]. KLGamma is also well suited for texture classification applications where speed is of the essence and the number of texture classes is large.
Appendix A

Let X be a sample of S feature vectors from a texture q. X_{jk} denotes sample k from feature dimension j. Let N be the number of dimensions in the feature vectors. If a texture i is described by a set of binned histograms p_j^{(i)}, the likelihood of the samples X being from texture i is

\lambda_i = \prod_{j=1}^{N} \prod_{k=1}^{S} p_j^{(i)}\!\left( \frac{X_{jk}}{\Delta l_j} \right) .

The maximum-likelihood classifier is g(X) = \arg\max_i \lambda_i. If we let b_{jk} = X_{jk} / \Delta l_j and take the log of \lambda_i we have \log \lambda_i = \sum_{j=1}^{N} \sum_{k=1}^{S} \log p_j^{(i)}(b_{jk}). Now let n_{jl} denote the number of samples X_{jk} for which b_{jk} = l. With this notation we obtain \log \lambda_i = \sum_{j=1}^{N} \sum_{l=0}^{L-1} n_{jl} \log p_j^{(i)}(l). Furthermore, dividing by S on each side implies that

g(X) = \arg\max_i \lambda_i = \arg\max_i \frac{1}{S} \log \lambda_i = \arg\max_i \sum_{j=1}^{N} \sum_{l=0}^{L-1} p_j^{(q)}(l) \log p_j^{(i)}(l) ,   (17)

where p_j^{(q)} are the binned histograms of the query texture. Hence

g(X) = \arg\max_i \lambda_i = \arg\min_i D_{KL}(q, p^{(i)}) = \arg\min_i \sum_{j=1}^{N} \sum_{l=0}^{L-1} p_j^{(q)}(l) \log \frac{p_j^{(q)}(l)}{p_j^{(i)}(l)} ,   (18)

and thus minimizing the KL divergence is identical to the maximum-likelihood classifier.
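For completeness, a small sketch of the resulting classifier, which selects the database texture whose binned histograms minimize the KL divergence of (18); the δ regularization mirrors the initialization described in Sect. 4.2 (an illustrative sketch, not the experimental code):

import numpy as np

def kl_classify(q_hist, db_hists, delta=1e-5):
    # q_hist: (N, L) binned histograms of the query texture
    # db_hists: list of (N, L) histograms, one per database texture
    q = q_hist + delta
    q = q / q.sum(axis=1, keepdims=True)
    scores = []
    for p in db_hists:
        p = p + delta
        p = p / p.sum(axis=1, keepdims=True)
        scores.append(np.sum(q * (np.log(q) - np.log(p))))
    return int(np.argmin(scores))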
References 1. Tai Sing Lee, Image Representation Using 2D Gabor Wavelets, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol 18, No. 10, October 1996 2. David A. Clausi, M. Ed Jernigan, Designing Gabor filters for optimal texture separability, Pattern Recognition 33 (2000) 1835-1849 3. W.D. Penny, KL-Divergences of Normal, Gamma, Direchlet and Wishart densities, Wellcome Department of Cognitive Neurology, University College London, March 30, 2001 4. S. Marcelja, Mathematical Description of the Responses of Simple Cortical Cells, J. Optical Soc. Am., vol. 70, pp. 1297-1300, 1980 5. J.G. Daugman, Two-dimensional Analysis of Cortical Receptive Field Profile, Vision Research, vol. 20, pp. 847-856, 1980
6. J.G. Daugman, Uncertainty Relation for Resolution in Space, Spatial Frequency, and Orientation Optimized by Two-Dimensional Visual Cortical Filters, J. Optical Soc. Am., vol. 2, no. 7, pp. 1160-1169, 1985 7. D.A. Pollen and S.F. Ronner, Phase Relationship Between Adjacent Simple Cells in the Visual Cortex, Science, vol. 212, pp. 1409-1411, 1981 8. J.H. van Hateren and A. van der Schaf, Independent Component Filters of Natural Images Compared with Simple Cells in Primary Visual Cortex, Proc.R.Soc.Lond.B 265:359-366, 1998 9. Ronald E. Walpole and Raymond H. Myers, Probability and Statistics for Scientists and Engineers 5th Edition, Prentice Hall, 1993 10. C.T.J. Dodson and J. Scharcanski, Information Geometry For Stochastic Refinement of Database Search and Retrieval Algorithms, 2001 11. A.C. Bovik, M. Clark, W.S. Geisler, Multichannel Texture Analysis Using Localized Spatial Filters, IEEE Trans. Pattern Anal. Machine Intell. 12 (1) (1990) 55-73 12. Nuno Vasconcelos and Andrew Lippman, A Unifying View of Image Similarity, MIT Media Lab 13. K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, 1990 14. B.S. Manjunath and W.Y. Ma, Texture Features for Browsing and Retrieval of Image Data, IEEE Trans. Pattern Anal. Machine Intell. 18 (8) (1996) 837-842 15. Jan Puzicha, Joachim M. Buhmann, Yossi Rubner and Carlo Tomasi, Empirical Evaluation of Dissimilarity Measures for Color and Texture, Proceedings of the Internation Conference on Computer Vision (ICCV’99), 1165-1173, 1999 16. Jan Puzicha, Thomas Hofmann and Joachim M. Buhmann, Non-parametric Similarity Measures for Unsupervised Texture Segmentation and Image Retrieval, Proc. of the IEEE Int. Conf. on Computer Vision and Pattern Recognition, San Juan, 1997 17. Thomas Hofmann, Jan Puzicha and Joachim M. Buhmann, Deterministic Annealing for Unsupervised Texture Segmentation, Proc. of the International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition (EMMCVPR’97) 18. Richard J. Larsen and Morris L. Marx, An Introduction to Mathematical Statistics and Its Applications 3’rd Edition, Prentice Hall, 2001 19. W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.V.P. Flannery, Numerical Recipes in C, Cambridge, 1992 20. G. R. Cross and A. K. Jain. Markov Random Field Texture Models. IEEE Transactions on Pattern Analysis and Machine Intelligence 5/1, 1983 21. Chun-Shien Lu and Pau-Choo Chung. Wold Features for Unsupervised Texture Segmentation. Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan, ROC, 1998 22. K. I. Laws. Rapid texture identification. In Proc. of the SPIE Conference on Image Processing for Missile Guidance, pages 376-380, 1980 23. Timo Ojala and Matti Pietik¨ ainen, Unsupervised texture segmentation using feature distributions. Pattern Recognition, Vol. 32, pp. 477-486, 1999 24. Trygve Randen and John H˚ akon Husøy. Filtering for Texture Classification: A Comparative Study. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 4, Apr 1999. Web: http://www.ux.his.no/˜tranden/
All the Images of an Outdoor Scene
Srinivasa G. Narasimhan, Chi Wang, and Shree K. Nayar
Columbia University, Computer Science, 500 West 120th Street, Rm 450, New York, NY 10027, USA
{srinivas, chi, nayar}@cs.columbia.edu
Abstract. The appearance of an outdoor scene depends on a variety of factors such as viewing geometry, scene structure and reflectance (BRDF or BTF), illumination (sun, moon, stars, street lamps), atmospheric condition (clear air, fog, rain) and weathering (or aging) of materials. Over time, these factors change, altering the way a scene appears. A large set of images is required to study the entire variability in scene appearance. In this paper, we present a database of high quality registered and calibrated images of a fixed outdoor scene captured every hour for over 5 months. The dataset covers a wide range of daylight and night illumination conditions, weather conditions and seasons. We describe in detail the image acquisition and sensor calibration procedures. The images are tagged with a variety of ground truth data such as weather and illumination conditions and actual scene depths. This database has potential implications for vision, graphics, image processing and atmospheric sciences and can be a testbed for many algorithms. We describe an example application - image analysis in bad weather - and show how this method can be evaluated using the images in the database. The database is available online at http://www.cs.columbia.edu/CAVE/. The data collection is ongoing and we plan to acquire images for one year.
1 Variability in Scene Appearance
The appearance of a fixed scene depends on several factors - the viewing geometry, illumination geometry and spectrum, scene structure and reflectance (BRDF or BTF) and the medium (say, atmosphere) in which the scene is immersed. The estimation of one or more of these appearance parameters from one or more images of the scene has been an important part of research in computer vision. Several researchers have focused on solving this inverse problem under specific conditions of illumination (constant or smoothly varying), scene structure (no discontinuities), BRDF (Lambertian) and transparent media (pure air). Images captured to evaluate their methods adhere to the specific conditions. While understanding each of these specific cases is important, modeling scene appearance in the most general setting is ultimately the goal of a vision system. To model, develop and evaluate such a general vision system, it is critical to collect a comprehensive set of images that describes the complete variability in the appearance of a scene. Several research groups have collected images of a scene (for example, faces, textures, objects) under varying lighting conditions
and/or viewpoints in controlled lab environments. The CMU-PIE database [1] has 40000 facial images under different poses, illumination directions and facial expressions. The FERET [2] database consists of 1196 images of faces with varying facial expressions. Similarly, the Yale Face database [3] has around 165 images taken under different lighting, pose and occlusion configurations. The SLAM database [4] provides a set of 1500 images of toy objects under different poses. The color constancy dataset collected by Funt et al. [5] provides a large set of images of objects (boxes, books and so on) acquired under different poses and with different illuminants (fluorescent, halogen, etc). The CURET database [6] provides a set of 12000 images of real world textures under 200 illumination and viewing configurations. It also provides an additional set of 14000 Bi-directional Texture Function (BTF) measurements of 61 real world surfaces. Several databases of images of outdoor scenes have also been collected. The “natural stimuli collection” [7] has around 4000 images of natural scenes taken on clear, foggy and hazy days. Parraga et al. [8] provide a hyperspectral dataset of 29 natural scenes. The MIT city scanning project [9] provides a set of 10000 geo-referenced calibrated images acquired over a wide area of the MIT campus. These databases, however, do not cover the complete appearance variability (due to all outdoor illumination and weather conditions) in any one particular scene. Finally, web-cams [10] used for surveillance capture images regularly over long periods of time. However, they are usually low quality, non-calibrated, not tagged with ground truth data and focus only on activity in the scene. Note that the references we have provided for various databases are by no means complete. We refer the reader to [11] for a more comprehensive listing. In this paper, we present a set of very high quality registered images of an outdoor scene, captured regularly for a period of 5 months. The viewpoint (or sensor) and the scene are fixed over time. Such a dataset is a comprehensive collection of images under a wide variety of seasons, weather and illumination conditions. This database serves a dual purpose; it provides an extensive testbed for the evaluation of existing appearance models, and at the same time can provide insight needed to develop new appearance models. To our knowledge, this is the first effort to collect such data in a principled manner, for an extended time period. The data collection is ongoing and we plan to acquire images for one year. We begin by describing the image acquisition method, the sensor calibration procedures, and the ground truth data collected with each image. Next, we illustrate the various factors that effect scene appearance using images from our database captured over 5 months. We demonstrate thorough evaluation of an existing model for outdoor weather analysis, using the image database.
2 Data Acquisition

2.1 Scene and Sensor
The scene we image is an urban scene with buildings, trees and sky. The distances of these buildings range from about 20 meters to about 5 kilometers. The large distance range facilitates the observation of weather effects on scene appearance. See figure 5 for the entire field of view. The digital camera we use for image capture is a single CCD KODAK Professional DCS 315 (see figure 1). As usual, irradiance is measured using 3 broadband R, G, and B color filters. An AF Nikkor 24 mm - 70 mm zoom lens is attached to the camera.

Fig. 1. Acquisition setup (Kodak DCS315 10-bit camera with 24-70 mm zoom lens, mounted on a pan/tilt stage inside a weather-proof enclosure with an anti-reflection glass plate; images are transferred over FireWire / IEEE 1394).

2.2 Acquisition Setup
The setup for acquiring images is shown in figure 1. The camera is rigidly mounted over a pan-tilt head which is fixed rigidly to a weather-proof box (see black box in figure 1). The weather-proof box is coated on the inside with two coats of black paint to prevent inter-reflections within the box. An anti-reflection glass plate is attached to the front of this box through which the camera views the scene. Between the camera and the anti-reflection plate is a filter holder (for, say, narrow band spectral filters). The entire box with the camera and the anti-reflection glass plate is mounted on a panel rigidly attached to a window.

2.3 Image Quality and Quantity
Images are captured automatically every hour for 20 hours each day (on an average). The spatial resolution of each image is 1520 × 1008 pixels and the intensity resolution is 10 bits per pixel per color channel. Currently, we have acquired images for over 150 days. In total, the database has around 3000 images. Due to maintenance issues that arise from prolonged camera usage (camera power failures and mechanical problems in controlling the camera shutter), we have had to remove the camera twice from the enclosure. We believe the resulting loss of a few days in the database can be tolerated since the dataset has enormous redundancy. The new image sets are registered with existing ones using the MATLAB image registration utility. To capture both subtle and large changes in illumination and weather, high dynamic range images are required. So, we acquire images with multiple exposures (by changing the camera shutter speed while keeping the aperture constant) and apply prior techniques to compute a high dynamic range image (≈ 12 bits per pixel) of the scene. Since the illumination intensity is expected to vary with time, the set of exposures is chosen adaptively. First, an auto-exposure image is taken and its shutter speed is noted. Then 4 more images are captured with exposures around this auto-exposure value. This type of adaptive exposure selection is commonly used by photographers and is called exposure bracketing.
Fig. 2. Geometric calibration using planar checkerboard patterns. Table shows estimated intrinsic parameters (focal length: 2845 mm; principal point offset ∆u0, ∆v0: 7.2, 9.5 pixels; radial distortion C1: 0.07; reprojection error: 0.18 pixels). The distortion parameters not shown are set to zero.
3 Sensor Calibration

3.1 Geometric Calibration
Geometric calibration constitutes the estimation of the geometric mapping between 3D scene points and their image projections. Since the calibration was done in a location different from that used for image acquisition, we estimate only the intrinsic parameters of the camera. Intrinsic parameters include the effective focal length, f, skew, s, center of projection (u0, v0) and distortion parameters, C1 ... Cn (radial) and P1, P2 (tangential). Then, the relation between observed image coordinates and the 3D scene coordinates of a scene point is:

u_i = D_u s \left( \frac{f}{z_i} x_i + \delta u_i^{(r)} + \delta u_i^{(t)} \right) + u_0
v_i = D_v \left( \frac{f}{z_i} y_i + \delta v_i^{(r)} + \delta v_i^{(t)} \right) + v_0 ,   (1)

where (\delta u_i^{(r)}, \delta v_i^{(r)}) are functions of the radial distortion parameters, and (\delta u_i^{(t)}, \delta v_i^{(t)}) are functions of the tangential distortion parameters. D_u, D_v are the conversion factors from pixels to millimeters. See [12] for more details. We captured the images of a planar checkerboard pattern under various orientations (see figure 2). The corresponding corners of the checkerboard patterns in these images were marked. These corresponding corner points were input to a calibration routine [13] to obtain the intrinsic parameters. Figure 2 shows the estimated intrinsic parameters. The CCD pixels are square and hence skew is assumed to be 1. The deviation of the principal point from the image center is given by ∆u0, ∆v0. Only the first radial distortion parameter, C1, is shown. The remaining distortion parameters are set to zero.
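As a concrete reading of (1), a minimal sketch of the projection with one radial and two tangential terms; the distortion expansion below is the common first-order Brown-Conrady form and may differ in detail from the one used by the calibration routine [13]:

def project(point, f, u0, v0, Du=1.0, Dv=1.0, s=1.0, C1=0.0, P1=0.0, P2=0.0):
    # project a 3D point (x, y, z) in the camera frame to pixel coordinates
    x, y, z = point
    xn, yn = f * x / z, f * y / z                       # ideal image coordinates
    r2 = xn ** 2 + yn ** 2
    du_r, dv_r = C1 * r2 * xn, C1 * r2 * yn             # first radial term
    du_t = 2 * P1 * xn * yn + P2 * (r2 + 2 * xn ** 2)   # tangential terms
    dv_t = P1 * (r2 + 2 * yn ** 2) + 2 * P2 * xn * yn
    u = Du * s * (xn + du_r + du_t) + u0
    v = Dv * (yn + dv_r + dv_t) + v0
    return u, v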
3.2 Radiometric Calibration
Analysis of image irradiance using measured pixel brightness requires the radiometric response of the sensor. The radiometric response of a sensor is the mapping, g, from image irradiance, I, to measured pixel brightness, M: M = g(I). Then, the process of obtaining I from M, I = g^{-1}(M), up to a global scale factor, is termed radiometric calibration.
Fig. 3. Radiometric Self-Calibration. (a)-(c) Three images (10 bits per pixel per RGB channel) captured with different camera exposures (high, auto and low exposure). (d) Computed radiometric response functions of the 3 RGB channels (measured intensity M versus image irradiance I). The response functions are linear with slopes 1.5923, 1.005 and 0.982 for R, G, and B respectively. The colors can be balanced by normalizing the slope of each response function. (e) Histogram equalized high dynamic range image irradiance map obtained by combining images taken with multiple exposures. Insets indicate that the dynamic range in this image is much higher than the dynamic range in any image captured using a single exposure.
The response functions of CCD cameras (without considering the gamma or color corrections applied to the CCD readouts) are close to linear. We computed the response functions of the 3 RGB channels separately using Mitsunaga and Nayar's [14] radiometric self-calibration method. In this method, images captured with multiple exposures and their relative exposure values are used to estimate the inverse response function in polynomial form. The results of the calibration are shown in the plots of figure 3(d). Notice that the response functions of R, G, and B are linear and they have different slopes - 1.5923, 1.005 and 0.982 respectively. To balance the colors, we normalize the response functions by the respective slopes. The images taken with different exposures are linearized using the computed response function. A high dynamic range image irradiance map (see figure 3) is obtained by using a weighted combination of the linearized images. This image has significantly more dynamic range than any of the original images taken with single exposures [15]. The high dynamic range can prove very useful when analyzing both subtle and large changes in weather and illumination.
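A minimal sketch of this combination step; the per-channel slopes are taken from figure 3(d), while the hat-shaped weighting and the 10-bit saturation level are assumptions of the sketch, not the implementation of [14,15]:

import numpy as np

def hdr_irradiance(images, exposures, slopes=(1.5923, 1.005, 0.982), max_val=1023.0):
    # images: list of (H, W, 3) raw 10-bit images; exposures: relative exposure times
    acc = np.zeros_like(np.asarray(images[0], dtype=float))
    wacc = np.zeros_like(acc)
    for img, t in zip(images, exposures):
        img = np.asarray(img, dtype=float)
        lin = img / np.asarray(slopes)                 # linearize / balance the channels
        w = 1.0 - 2.0 * np.abs(img / max_val - 0.5)    # downweight dark and saturated pixels
        w = np.clip(w, 0.01, 1.0)
        acc += w * lin / t
        wacc += w
    return acc / wacc                                  # irradiance up to a global scale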
4 Ground Truth Data
Any database is incomplete without the accompanying ground truth. We have tagged our images with a variety of ground truth information. Most important categories of the ground truth we collected are scene depth and weather information. The depths of scene points are mainly obtained using satellite digital orthophotos supplied by the United States Geological Survey [16]. Arcview [17] is a mapping software that is used to measure the orthographic distance between two scene points (visible in the orthophotos) up to an accuracy of 1 meter. See figure 4 for an example of a satellite orthophoto. Note that accurate depth is not available at all pixels. However, since the field of view consists of mainly vertical buildings, rough planar models can be used. The position (longitude, latitude and altitude) of the sensor is included in the database. This information, along with the date and time of day, can be used to accurately compute sun and moon orientation relative to the sensor. For exact equations, see [19,20]. Every hour we automatically collect standard weather information from the National Weather Service web sites [18]. This includes information about sky condition (sunny, cloudy), weather condition (clear, fog, haze, rain), visibility, temperature, pressure, humidity and wind (see figure 4). Such information can be used to estimate the scattering coefficient of the atmosphere [21].

Fig. 4. Sample ground truth data. [Left] A satellite digital orthophoto of a portion of the scene. The red spot indicates the position of the sensor and the bright region indicates the field of view. Arcview [17] is used to measure the orthographic distance between any two scene points (seen in top view) with an accuracy of 1 meter. [Right] The weather data obtained from National Weather Service websites [18]; example for 2001.03.06 11:51am: Wind NNW (340°) 10 MPH; Visibility 1 1/4 mile(s); Sky conditions Overcast; Weather Light snow, Mist; Precipitation last hour: a trace; Temperature 32.0 F (0.0 C); Dew Point 32.0 F (0.0 C); Relative Humidity 100%.
5 WILD: Weather and ILlumination Database
We illustrate the variability in scene appearance due to weather, illumination, season changes, and surface weathering using images from our dataset captured over five months.

5.1 Variation in Illumination
The distribution of environmental illumination on a scene produces a wide variety of scene appearances. Commonly noticed effects include shadows, colors of sunrise and sunset, and illumination from stars and moon at night. The human visual system relies on illumination in the scene to perceive scene reflectance (retinex and color constancy [22]) and shape [23] correctly. As a result, rendering a scene with consistent illumination is critical for realism in graphics. Considerable amount of effort has been put into modeling outdoor illumination.
The book “Daylight and its Spectrum” [24] provides a compendium of color and intensity distributions of skylight for many years of the 20th century. Daylight spectrum has also been represented using a set of linear bases [25]. Works that model clouds and their effect on the ambient illumination also exist in the literature [26,27]. In graphics, scenes have been rendered under different daylight [28] and night illuminations [29]. Shadows are a powerful cue for shape and illumination perception. Rendering shadows and extracting shape information from shadows [30,31] are also important problems.

Let us consider the various sources of illumination in any outdoor scene. The primary sources (self-luminous) include the sun during the day, and the stars and lamps during night. There are numerous other secondary illumination sources such as skylight, ground light, moonlight, airlight [32] (due to scattering of light by the atmosphere), and scene points themselves (inter-reflections [33]). Our goal is to include the effects of all these sources in one comprehensive database. Figure 5 shows 6 images from our database illustrating the various shadow configurations on a sunny day. Figure 6 shows different illumination colors and intensities at sunrise, noon, sunset and night. Figure 7 depicts the variations in ambient lighting due to varying cloud covers. When viewed up-close, rough surfaces appear to have 3D textures (due to surface height variations) rather than 2D textures. The appearance of 3D textures has been modeled in [6]. Figure 8 shows the appearance of a rooftop with ridges at different times of the day. Notice the change in appearance of cast shadows due to surface height variations.

The above variability in a single database allows research groups to study the illumination effects individually as well as simultaneously. For instance, one may use just sunny day images at one particular time of day, when the sun position remains constant and shadows do not change. In another instance, one can consider images captured only on cloudy days to model scenes under ambient lighting.

5.2 Variation in Weather Conditions
Most vision algorithms assume that light travels from a scene point to an observer unaltered. This is true only on clear days. In bad weather, however, the radiance from a scene point is severely scattered and thus, the appearance of a scene changes dramatically. The exact nature of scattering depends on a variety of factors such as shapes, orientations and densities of atmospheric particles and the colors, polarizations, and intensities of incident light [34]. Recently, there has been work on computing scene depths from bad weather images [35] as well as to remove weather effects from images. Some works extract structure and clear day scene colors using images of a scene taken under different weather conditions [36] or through different polarizer orientations [37]. Other works use pre-computed or measured distances to restore scene contrast [38,39]. In graphics, scenes (including skies) have been rendered considering multiple scattering of light in the atmosphere [28,40].
Fig. 5. Images illustrating different shadow configurations on a clear and sunny day (09/07/2001, 10 AM to 3 PM). Shadows provide cues for illumination direction and the scene structure. Notice the positions of the sharp shadows on the buildings.

Fig. 6. Images illustrating the various colors and intensities of illumination at sunrise, noon, sunset and night (09/05/2001, 6 AM, 12 noon, 7 PM and 10 PM). Notice the significant change in the colors of the sky.

Fig. 7. Images showing various levels of cloud cover at 7 AM. The image on the left shows the appearance of the scene with a few scattered clouds. The two images on the right were taken under mostly cloudy and completely overcast conditions. Notice the soft shadows due to predominant ambient lighting.

Fig. 8. When viewed at close proximity (fine scale), the appearances of surfaces should be modeled using the bi-directional texture function (BTF) instead of the BRDF. Shown is a rooftop at 9 AM, 12 noon and 3 PM. Notice the change in cast shadows due to the ridges on the rooftop. All images are histogram equalized to aid visualization.

Fig. 9. Images taken at the same time of day (1 PM) but under different weather conditions (clear and sunny, light mist and rain, hazy, foggy). Notice the degradation in visibility, especially, of far away scene points in bad weather.

Fig. 10. Night images (11 PM) showing light sources under clear and misty conditions. The sources appear like specks of light on a clear night. Notice the light spreading due to multiple scatterings on a misty night.
How can the above models for weather analysis be evaluated? Under what conditions do these models fail or perform well? To satisfactorily answer such questions, we need to evaluate the performance of such models and algorithms on images of a scene captured under a wide variety of illumination and weather
conditions. Our database includes images of the scene under many atmospheric conditions including clear and sunny, fog, haze, mist, rain and snow. Figure 9 shows 4 images of the same scene captured under different weather conditions. Notice the significant reduction in contrast (and increase in blurring) in far away buildings. Broadening of light beams due to multiple scatterings in the atmosphere is clearly illustrated by the lamps imaged at night (see figure 10). Consider the event of mild fog setting in before sunrise, becoming dense as time progresses and finally clearing by noon. We believe that such lengthy, time varying processes can be studied better using our database. The study of such processes has potential implications for image based rendering.

5.3 Example Evaluation: Weather Analysis
Consider images taken under different weather conditions. The observed color, E, of a scene point in bad weather is linearly related to its clear day color direction, \hat{D}, and the color of the weather condition (say, fog or haze), \hat{A}, by the dichromatic model of atmospheric scattering [35]:

E = m \hat{D} + n \hat{A} .   (2)

So, E, \hat{A}, and \hat{D} lie on the same "dichromatic plane" (see figure 11). Here, m = E_\infty \rho e^{-\beta d}, n = E_\infty (1 - e^{-\beta d}), d is the depth of the scene point from the observer, \beta is called the scattering coefficient of the atmosphere [41], E_\infty is the brightness at the horizon, and \rho is a function of the scene point BRDF [36]. Under what weather conditions is the dichromatic model valid? How well does the dichromatic model describe the colors of scene points under a particular weather condition (say, mist)? Figure 11 shows the results of evaluating the model for fog, haze, mist and rain using multiple (5 in this case) images taken under each weather condition.

Fig. 11. Dichromatic plane geometry and its evaluation. (a) Dichromatic Model [35]. (b) The observed color vectors E_i of a scene point under different (two in this case) weather conditions (say, mild fog and dense fog) lie on a plane called the dichromatic plane. (c) Experimental verification of the dichromatic model with the scene imaged 5 times under each of the different foggy, misty, rainy and hazy conditions (columns: Weather, Sky, Avg Err, Err < 3°; rows: Fog, Overcast, 0.58°, 95%; Mist, Overcast, 1.25°, 88%; Rain, Overcast, 1.13°, 91%; Dense Haze, Overcast, 2.27°, 76%; Mild Haze, Sunny, 3.61°, 44%). The third column is the mean angular deviation (in degrees) of the observed scene color vectors from the estimated dichromatic planes, over 1.5 megapixels in the images. The fourth column provides the percentage of pixels whose color vectors were within 3 degrees of the estimated dichromatic plane. Note that the dichromatic model works well for fog, mist, rain and dense haze under overcast skies. For mild haze conditions under sunny skies, the model does not perform well. Such evaluation is possible only since our database has several images under each weather condition.

Based on the dichromatic model, Narasimhan and Nayar [36] developed constraints on changes in scene colors under different atmospheric conditions. Using these constraints, they developed algorithms to compute 3D structure and clear day colors of a scene from two or more images taken under different but unknown weather conditions. How do we evaluate the performance of such an algorithm? Figure 12 shows the comparison of the defogged image with the actual clear day image of the scene under similar illumination spectra (overcast skies). The accuracy of the computed scaled depth, \beta d, is compared against the ground truth relative depth values obtained from the satellite orthophotos. This demonstrates that models and algorithms pertaining to weather can be evaluated more thoroughly using images from this database.

Narasimhan and Nayar's algorithm described above shows that clear day images can be obtained using two images taken under different weather (say, foggy) conditions. Given two foggy images, can we generate novel foggy images? We show this is possible using the defogged color \rho \hat{D}, fog color \hat{A}, and optical depth \beta d, at each pixel. Note that these quantities can be computed using the above algorithm. The scattering coefficient \beta is a measure of the density of fog. Thus, the density of fog can either be increased or decreased by appropriately scaling the optical depth \beta d. Substituting the scaled optical depth k \beta d, clear day color \rho \hat{D}, and fog color \hat{A} into equation 2, we compute the colors of scene points under novel fog conditions. The horizon brightness E_\infty is kept constant since it is just a global scale factor. Figure 13 shows 4 novel images generated with increasing fog densities.
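The fog-density manipulation described above can be sketched as follows (an illustrative sketch; it assumes the per-pixel defogged color ρD̂, the fog color Â and the optical depth βd have already been estimated with the algorithm of [36]):

import numpy as np

def synthesize_fog(rho_D, A_hat, beta_d, k, E_inf=1.0):
    # apply equation (2) with the optical depth scaled by k
    # rho_D: (H, W, 3) defogged color, A_hat: (3,) unit fog color, beta_d: (H, W)
    t = np.exp(-k * beta_d)[..., None]     # transmission e^{-k beta d}
    return E_inf * t * rho_D + E_inf * (1.0 - t) * np.asarray(A_hat)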
5.4 Seasonal Variations
The types of outdoor illumination and weather conditions change with seasons. For instance, the intensity distribution of sunlight and skylight differs from summer to winter [24]. Similarly, the atmospheric conditions that manifest in fall are significantly different from those that occur in winter. Models of the atmosphere in different seasons can be found in [21] and other related papers [42]. Since we acquire the images for over a year, changes in scene appearance due to changes in seasons can be studied. For instance, one might easily compare the images taken on a clear day in spring with images taken on a clear day in winter under identical sensor settings. Figure 14 shows 2 images taken at the same time of day in summer and fall.

5.5 Surface Weathering
Over substantial periods of time, we commonly see oxidation of materials (rusting), deposition of dirt on materials and materials becoming wet or dry. These effects are important for realistic scene rendering and have been modeled by Dorsey and Hanrahan [43]. Since our images have high spatial resolution, portions of the image corresponding to small regions in the scene (say, a portion of a wall) can be analyzed. Figure 15 shows a small patch in the scene when it is dry and wet.

Fig. 12. Computing structure and defogging from two foggy images (5 PM and 6 PM, fog): computed depth map (brightened), computed defogged image, and actual clear day image (under mostly cloudy skies). Table comparing the computed relative depths with ground truth relative depths (obtained using satellite orthophotos) of 5 different regions, d1-d5, in the scene (ground truth vs. computed: d2/d1 6.98 vs. 7.72; d3/d1 9.55 vs. 10.457; d4/d1 6.41 vs. 6.49; d5/d1 8.82 vs. 9.30). The relative depths are averaged over small neighborhoods. The window regions do not remain constant and thus produce erroneous depth values. All the images are contrast stretched for display purposes.

Fig. 13. Generation of novel images with increasing fog densities (or scattering coefficients). The relative scattering coefficients used in this case are β, 2β, 3β and 4β respectively.

Fig. 14. Images taken at the same time of day (11 AM) but on days in summer (06/15/2001) and fall (09/15/2001). Both the images were taken on clear and sunny days. Notice the subtle differences in colors and the positions of shadows.

Fig. 15. Portions of a rooftop in the scene when it is dry, partly and completely wet.
6 Summary
The general appearance of a scene depends on a variety of factors such as illumination, scene reflectance and structure, and the medium in which the scene is immersed. Several research groups have collected and analyzed images of scenes under different configurations of illuminations (both spectrum and direction), and viewpoints, in controlled lab environments. However, the processes that affect outdoor scene appearance such as climate, weather and illumination are very different from indoor situations. Ultimately, vision algorithms are expected to work robustly in outdoor environments. This necessitates a principled collection and study of images of an outdoor scene under all illumination and weather conditions. We have collected a large set of high quality registered images of an outdoor urban scene captured periodically for five months. We described the acquisition process, calibration processes and ground truth data collected. The utility of the database was demonstrated by evaluating an existing model and algorithm for image analysis in bad weather. This dataset has potential uses in vision, graphics and atmospheric sciences.

Acknowledgements. This work was supported in parts by a DARPA HumanID Contract (N00014-00-1-0916) and an NSF Award (IIS-99-87979). The authors thank E. Rodas for building the weather-proof camera enclosure.
References 1. Sim, T., Baker, S., Bsat, M.: The CMU pose, illumination and expression (PIE) database of faces. Tech Report CMU-RI-TR-01-02 (2001) 2. Phillips, P., Wechsler, H., Huang, J., Rauss, P.: The feret database and evaluation procedure for face-recognition algorithms. Image and Vision Computing 16 (1998) 3. Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. PAMI 19 (1997) 711–720 4. Murase, H., Nayar, S., Nene, S.: Software library for appearance matching (slam). In: ARPA94. (1994) I:733–737 5. Funt, B., Barnard, K., Martin, L.: Is machine colour constancy good enough? In: Proc ECCV. (1998) 6. Dana, K., Nayar, S., van Ginneken, B., Koenderink, J.: Reflectance and texture of real-world surfaces. In: Proc CVPR. (1997) 151–157 7. van Hateren, J.H., van der Schaaf, A.: Independent component filters of natural images compared with simple cells in primary visual cortex. In Proc. Royal Society of London B 265 (1998) 359 – 366 8. Parraga, C.A., Brelstaff, G.J., Troscianko, T., Moorhead, I.: Color and illumination information in natural scenes. JOSA A 15 (1998) 563–569 9. Teller, S., Antone, M., Bodnar, Z., Bosse, M., Coorg, S., Jethwa, M., Master, N.: Calibrated registered images of an extended urban area. In: Proc. CVPR. (2001) 10. Hazecam: A live webcam. (In: http://www.hazecam.net) 11. CMUPage: The computer vision home page. (In: http://www.cs.cmu.edu/ cil/vision.html) 12. Heikkila, J., Silven, O.: A four-step camera calibration procedure with implicit image correction. In: Proc. CVPR. (1997) 1106–1112
13. Bouguet, J.Y.: Camera calibration toolbox. (In: http://www.vision.caltech.edu/ bouguetj/calib doc) 14. Mitsunaga, T., Nayar, S.: Radiometric self calibration. In: CVPR. (1999) I:374–380 15. Nayar, S., Mitsunaga, T.: High dynamic range imaging: Spatially varying pixel exposures. In: Proc. CVPR. (2000) I:472–479 16. USGS: U.S. Geological Survey Mapping home page. (In: http://mapping.usgs.gov) 17. ESRI: The ESRI home page. (In: http://www.esri.com) 18. NWS: The national weather service home page. (In: http://www.nws.noaa.gov) 19. Aurora: Sun angle basics. (In: http://aurora.crest.org/basics/solar/angle/) 20. Naval-Observatory: The Astronomical Alamanac for the Year 2001. US R. G. O Government Printing Office (2001) 21. Acharya, P.K., Berk, A., Anderson, G.P., Larsen, N.F., Tsay, S.C., Stamnes, K.H.: Modtran4: Multiple scattering and BRDF upgrades to modtran. SPIE Proc. Optical Spectroscopic Techniques and Instrumentation for Atmospheric and Space Research III 3756 (1999) 22. Land, E., McCann, J.: Lightness and retinex theory. JOSA 61 (1971) 23. Ramachandran, V.: Perceiving shape from shading. SciAmer 259 (1988) 76–83 24. Henderson, S.T.: Daylight and its Spectrum. New York : Wiley (1977) 25. Slater, D., Healey, G.: What is the spectral dimensionality of illumination functions in outdoor scenes? In: Proc. CVPR. (1998) 105–110 26. Moon, P., Spencer, D.: Illumination from a non-uniform sky. Illum Engg 37 (1942) 27. Gordon, J., Church, P.: Overcast sky luminances and directional luminous reflectances of objects and backgrounds under overcast skies. App. Optics 5 (1966) 28. Tadamura, K., Nakamae, E., Kaneda, K., Baba, M., Yamashita, H., Nishita, T.: Modeling of skylight and rendering of outdoor scenes. In: Eurographics. (1993) 29. Jensen, H.W., Durand, F., Stark, M.M., Premoze, S., Dorsey, J., Shirley, P.: A physically-based night sky model. In Proc. SIGGRAPH (2001) 30. Kender, J., Smith, E.: Shape from darkness. In: Proc. ICCV. (1987) 539–546 31. Kriegman, D., Belhumeur, P.: What shadows reveal about object structure. JOSAA 18 (2001) 1804–1813 32. Koschmieder, H.: Theorie der horizontalen sichtweite. Beitr. Phys. freien Atm., 12 (1924) 33–53,171–181 33. Nayar, S., Ikeuchi, K., Kanade, T.: Shape from interreflections. IJCV 6 (1991) 34. Hulst, V.D.: Light Scattering by small Particles. John Wiley and Sons (1957) 35. Nayar, S., Narasimhan, S.: Vision in bad weather. In Proc. ICCV (1999) 36. Narasimhan, S., Nayar, S.: Chromatic framework for vision in bad weather. In Proc. CVPR (2000) 37. Schechner, Y., Narasimhan, S., Nayar, S.: Instant dehazing of images using polarization. In Proc. CVPR (2001) 38. Oakley, J., Satherley, B.: Improving image quality in poor visibility conditions using a physical model for degradation. IEEE Trans. on Image Processing 7 (1998) 39. Yitzhaky, Y., Dror, I., Kopeika, N.: Restoration of altmospherically blurred images according to weather-predicted atmospheric modulation transfer function. Optical Engineering 36 (1998) 40. T. Nishita, Y. Dobashi, E.N.: Display of clouds taking into account multiple anisotropic scattering and sky light. SIGGRAPH (1996) 41. McCartney, E.: Optics of the Atmosphere: Scattering by molecules and particles. John Wiley and Sons (1975) 42. AFRL/VSBM: Modtran. (In: http://www.vsbm.plh.af.mil/soft/modtran.html) 43. Dorsey, J., Hanrahan, P.: Digital materials and virtual weathering. Scientific American 282 (2000)
Recovery of Reflectances and Varying Illuminants from Multiple Views
Q.-Tuan Luong1, Pascal Fua2, and Yvan Leclerc1
1 AI Center, SRI International, CA-94025 Menlo Park, USA, {luong,leclerc}@ai.sri.com
2 Computer Graphics Lab, EPFL, CH-1015 Lausanne, Switzerland, [email protected]
Abstract. We introduce a new methodology for radiometric reconstruction from multiple images. It opens new possibilities because it allows simultaneous recovery of varying unknown illuminants (one per image), surface albedoes, and cameras’ radiometric responses. Designed to complement geometric reconstruction techniques, it only requires as input the geometry of the scene and of the cameras. Unlike photometric stereo approaches, it is not restricted to images taken from a single viewpoint. Linear and non-linear implementations in the Lambertian case are proposed; simulation results are discussed and compared to related work to demonstrate the gain in stability; and results on real images are shown.
1 Introduction
Consider having several magazines with pictures of a mountain villa taken at different times of day and from different vantage points. Using modern bundle adjustment and geometric reconstruction techniques, it is possible to recover the geometry of the scene from the pictures. However, there is currently no way to automatically recover the unknown radiometric responses of the cameras/digitizers and the unknown illumination for each picture. Without these, it is not possible to create a photo-realistic rendering of the scene. Given the scene geometry and the camera's geometric parameters, we call the recovery of these unknowns the radiometric recovery problem. Here, we focus on an important special case of the general problem in which the unknown camera/digitizer responses are modelled by an affine transformation (scale and offset), the unknown illuminants in each picture are modelled by a single point light source at infinity plus ambient illumination, and the scene BRDF is textured Lambertian. This is reasonably close to the mountain villa scenario above.

This work was sponsored in part by the Defense Advanced Research Projects Agency under contract F33615-97-C-1023 monitored by Wright Laboratory and in part by the Swiss Federal Office for Education and Science. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency, the United States Government, SRI International or EPFL.

The main insight of this work is that these unknowns can be recovered from multiple images in much the same way as the geometry of a rigid scene can be recovered from multiple images with unknown but varying viewpoints. Compared to the photometric stereo approaches, there is no need to use a constant viewpoint, thus allowing us to take advantage of the reliable and flexible existing solutions for the geometric reconstruction problem. The decoupling of geometry and radiometry that we advocate makes sense for many applications. In some cases, for example in an industrial setting, 3-D models of target objects may be available. In many other cases, they can be built using well understood vision-based techniques: When performing change detection using images of the same sites taken at different times of day, digital terrain maps are available or can be computed with geometric reconstruction techniques, such as stereo based on normalized cross-correlation, which can handle moderate lighting variations. It is then natural to combine maps and actual images to estimate the illumination parameters, recover surface albedoes and detect their changes over time. When creating 3-D models of objects or people from video sequences for rendering purposes, the illumination may vary from image to image because the subject moves with respect to fixed illuminants. After using structure-from-motion techniques that are relatively insensitive to illumination changes to recover geometry, one can attenuate the lighting effects in the original images to create high-quality texture maps.

Table 1. Radiometric reconstruction in relation with previous general frameworks

Framework               Camera geom.    Scene geom.      Reflectance          Illumination
Geom. reconstruction    recovered       recovered        some variation       some variation
Shape from shading      single          recovered        constant             recovered
Photometric stereo      constant        recovered        recovered            known
General. phot. stereo   constant        partly recov'd   partly recov'd       partly recov'd
Inverse lighting        single          known            known or constant    recovered
Reflectance modeling    known           known            recovered            known
Our approach            known           known            recovered            recovered
Previous work is summarized in Table 1 and discussed in detail in Section 2. While the fact that pixel intensities are in a bilinear relation with surface albedo and illumination radiances is well known, it has not been used before for radiometric recovery from multiple views and varying illuminants. Our goal is to introduce a framework for such a task and associated algorithms. As a first step towards the general problem of radiometric reconstruction, we investigate the case of Lambertian surfaces. The relevant lighting model consists of a single light source for each input image. In Section 3, we describe the linear radiometry of multiple illuminants, give a characterization of the minimal data required, and then propose in Section 4 a linear algorithm for the recovery of the radiometry of multiple views. A more
detailed account of this theory, as well as a parallel between the radiometric reconstruction problem and the geometric reconstruction problem, can be found in [12]. Simulations show this linear method to be quite sensitive to measurement noise. To obtain a more robust solution, we turn in Section 5 to a non-linear method. The results of this approach compare very favorably with those of the SVD method [8,3,26]. Section 6 discusses limitations of the current system and areas for future developments.
2 Previous Work
Shape from shading [10] assumes that the reflectance is Lambertian and uniform. The surface shape and the direction of the illuminant are recovered from a single image [27] or multiple images [17]. Photometric stereo, by using multiple images, each taken from the same viewpoint but with a different illuminant, explicitly known [23,24,9] or implicitly determined [1], recovers a non-uniform albedo as well as the surface orientation. Subsequently, it was shown [8] that even with unknown illuminants, albedo and shape can be recovered using SVD, which generalized photometric stereo. However, both variants are subject to a Generalized Bas-Relief ambiguity [3,26]. This is intrinsic to the problem and can be entirely resolved only by sufficient prior knowledge about the object’s geometry. For novel view generation, knowing that the object is an instance of a class [7] provides an adequate approximation. These methods are powerful: given a sufficient number of 3D measurements on the object, they can recover both the illuminant and the reflectance. However, their precision has not been documented, nor has their applicability to unconstrained or large environments. In Sec. 5.2 we compare experimentally the SVD approach with our radiometric reconstruction approach. The primary purpose of all the previous approaches is to recover the shape based on the Lambertian model. A second family of methods recovers radiometric parameters assuming that the camera and object geometry is available. Inverse lighting [13,14] assumes that the reflectance is Lambertian, and known or constant. The illuminant is modeled as a linear combination of a basis of predefined lights, whose coefficients are recovered from a single image using a linear relationship. The dual approach, reflectance modeling, consists in assuming that the point light source is known. It recovers sophisticated reflectance properties with specular and diffuse parameters [11,2,21,13]. This has been extended to multiple known point light sources, as well as indirect illumination effects [25]. These approaches recover rich models of either the illuminant or the reflectance, but unlike the method that we propose, they cannot recover both. By contrast, the approach of [20] can. It is, however, not generic because it depends on a specific situation where the shadow of an object of known shape is projected onto an object also of known shape. Recent work [18,19], done simultaneously to ours [12], demonstrated that a complex lighting model, including multiple light sources and specular properties, can be recovered given several images of an
object for which a 3–D model is available but, unlike in our approach, requires assuming that the lighting is the same in all images.
3 Linear Model for Multiple-Illuminant Radiometry
3.1 The Lambertian Model
We model the effect of illumination as the sum of a Lambertian term from a distant point light source, and an ambient term. For a surface element of normal n and albedo α that projects into an image i at coordinates (u, v), the radiance at the image plane is given by:

L_i(u, v) = α (l_i^T n + μ_i)    (1)

where l_i is the light source vector and μ_i is the ambient light. Each surface element is characterized by 3 parameters: the direction of n and the albedo α. While n is a unit vector, l_i is not: its magnitude encodes the light source’s “intensity.” The complete illumination model for each image is therefore given in terms of 4 parameters described by the illumination vector L_i = [l_i1, l_i2, l_i3, μ_i]^T. Using the illumination vector and the notation N = [n, 1]^T, Eq. (1) can be written more compactly as L_i(u, v) = α L_i^T N.
3.2 Radiometric Calibration
The measures that we can access are pixel gray-level values. However, the quantity that has a physical meaning is the radiance at the image plane. A camera is said to be radiometrically calibrated when the relationship between these quantities is known. This (non-linear) relationship can be recovered using multiple different known [4] or unknown [15] exposures of a fixed scene. In this paper, we illustrate in a simple case how radiometric (self) calibration can be attacked, rather than investigating it thoroughly. We approximate the relationship by an affine transformation specified by a radiometric scale factor a_i and a radiometric offset b_i:

I_i(u, v) = a_i L_i(u, v) + b_i    (2)

We will see that, within this limitation, there is enough information to recover linearly one of the parameters, the radiometric offset. When using our non-linear approach, a sufficiently large amount of data could allow the recovery of non-linear parametric terms.
3.3 Multiple Images
The key idea of our approach is to use multiple images of the same surface, taken with different illuminations. We assume that the geometry of the surfaces and cameras has been previously recovered, which makes it possible to express
all the 3-D spatial relationships in a common coordinate system. In this paper, we restrict ourselves to the case where only one point light source is used at a time. Each image i is therefore characterized by its illumination vector L_i. In this context, the only relevant geometric information is the set of normals n_j of the surface elements j, which we assume to have been determined. The only radiometric parameter for each surface element is its albedo α_j. We have p surface elements, with n different illuminants, which give rise to the p × n radiance values L_ij and pixel gray-level values I_ij, according to Eq. (1) and Eq. (2):

I_ij = a_i L_ij + b_i = a_i α_j L_i^T N_j + b_i    (3)

Since there are a total number of p + 4 × n + 2 × n unknowns:
– albedoes α_j, j = 1 ... p,
– illumination vectors L_i, i = 1 ... n,
– radiometric calibration parameters a_i, b_i, i = 1 ... n,
and a total number of p × n equations, with a sufficient number of different surface elements and illuminations we can expect to solve the radiometric reconstruction problem.
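For concreteness, a minimal numpy sketch of this forward model, Eqs. (1)–(3): it synthesizes gray levels I_ij from randomly drawn albedoes, normals, illumination vectors, and affine camera parameters. All names and the random draws are illustrative, not part of the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 200, 3                                # surface elements, images

alpha = rng.uniform(0.0, 1.0, size=p)        # albedoes, one per surface element
normals = rng.normal(size=(p, 3))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
N = np.hstack([normals, np.ones((p, 1))])    # N_j = [n_j, 1]^T, shape (p, 4)

L = rng.normal(size=(n, 4))                  # illumination vectors [l_i, mu_i]
a = rng.uniform(0.8, 1.2, size=n)            # radiometric scale factors a_i
b = rng.uniform(-5.0, 5.0, size=n)           # radiometric offsets b_i

# Eq. (3): I_ij = a_i * alpha_j * L_i^T N_j + b_i, stored here as an (n, p) array
I = a[:, None] * (alpha[None, :] * (L @ N.T)) + b[:, None]
```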
3.4 Ambiguities and Minimal Data
Going back to the radiometric reconstruction problem, we notice from Eq. (3) that if (a_i, b_i, α_j, L_i) is a solution, then another acceptable solution is obtained for any scalars s > 0 and k_i, i = 1 ... n:

( a_i / k_i,  b_i,  α_j / s,  s k_i L_i )

The ambiguity due to the multiplier s is a uniform scale ambiguity, which means that we cannot distinguish between brighter surfaces lit by a dimmer illuminant and darker surfaces lit by a more intense illuminant. Only positive values of the global scale s are acceptable because of the physical constraint that the albedoes be positive. In the case of radiometrically calibrated cameras, the scale ambiguity is the only generic ambiguity of the problem. The ambiguity due to the multipliers k_i means that for photometrically uncalibrated cameras, the intensity of the illuminants and the radiometric scale factor of the camera cannot be distinguished without further assumptions. In the remainder of this paper, we will therefore take a_i = 1. Taking these ambiguities into account, the total number of unknowns in the problem is 5 × n + p − 1 in the uncalibrated case, and 4 × n + p − 1 in the calibrated case. From this count, we can calculate the minimal amount of data needed to have a unique solution. In the calibrated case, we have a unique solution when np ≥ 4n + p − 1: 7 surface elements for 2 views, 6 surface elements for 3 views, 5 surface elements for 4 views and more.
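A quick numeric confirmation of this ambiguity, continuing the hypothetical variables (rng, n, alpha, L, a, b, N, I) of the sketch in Sec. 3.3: rescaling the unknowns as indicated leaves the predicted gray levels unchanged.

```python
s = 2.5                                   # global scale, s > 0
k = rng.uniform(0.5, 2.0, size=n)         # per-image multipliers k_i

alpha2 = alpha / s                        # alpha_j / s
L2 = (s * k)[:, None] * L                 # s * k_i * L_i
a2 = a / k                                # a_i / k_i

I_check = a2[:, None] * (alpha2[None, :] * (L2 @ N.T)) + b[:, None]
assert np.allclose(I, I_check)            # identical gray levels, Eq. (3)
```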
4 A Linear Solution
Given a surface element of normal n that projects into two images, each with a different light source, its albedo can be estimated using Eq. (3) as:

α_k = (I_k − b_k) / (L_k^T N)   or   α_l = (I_l − b_l) / (L_l^T N)    (4)

where index k (resp. l) denotes gray levels, light source vector parameters, and radiometric calibration parameters in image k (resp. l). Since the value of the albedo is presumed constant, by expanding α_k = α_l we obtain:

(I_k − b_k) L_l^T N − (I_l − b_l) L_k^T N = 0    (5)

which is linear in the 12 variables L_ki, L_li, L_ki b_l − L_li b_k, 1 ≤ i ≤ 4.
4.1 The Calibrated Case
Intensities can be transformed so that we can assume without loss of generality that a_i = 1 and b_i = 0. If n^T = [n_1, n_2, n_3], Eq. (5) can be rewritten as:

U^T f = 0
U = [I_l n_1, I_l n_2, I_l n_3, I_l, −I_k n_1, −I_k n_2, −I_k n_3, −I_k]^T    (6)
f = [l_k1, l_k2, l_k3, μ_k, l_l1, l_l2, l_l3, μ_l]^T

Combining the rows U for each surface element provides a linear system of the form Ũ f = 0. In practice, this is solved in the least-squares sense, taking into account the global scale factor on the illuminant f:

min_f ||Ũ f||   subject to   ||f|| = 1    (7)

With more than two views, we solve simultaneously for all the illuminants, by concatenating the equalities of Eq. (5).
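A minimal sketch of this calibrated two-view step, assuming gray levels and unit normals per surface element are given (all names hypothetical): each surface element contributes one row U of Eq. (6), and the constrained problem (7) is solved by taking the right singular vector of Ũ associated with the smallest singular value.

```python
import numpy as np

def linear_two_view_illuminants(Ik, Il, normals):
    """Estimate f = [l_k, mu_k, l_l, mu_l] from Eqs. (6)-(7).

    Ik, Il  : (p,) gray levels of the p surface elements in images k and l
              (calibrated case: a_i = 1, b_i = 0).
    normals : (p, 3) unit surface normals.
    Returns the 8-vector f, defined up to the global scale ||f|| = 1;
    f[:4] and f[4:] are the two illumination vectors.
    """
    N = np.hstack([normals, np.ones((len(Ik), 1))])       # rows [n1, n2, n3, 1]
    U = np.hstack([Il[:, None] * N, -Ik[:, None] * N])    # rows of Eq. (6), shape (p, 8)
    # min_f ||U f|| s.t. ||f|| = 1: right singular vector of the smallest singular value
    _, _, Vt = np.linalg.svd(U)
    return Vt[-1]
```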
4.2 Minimal Data and Degeneracies
A surface visible in n views yields n − 1 independent equations. With p surfaces, the number of rows is p(n − 1), and the number of columns is 4n. Because of the scale factor, the number of independent equations is at most min(p(n − 1), 4n − 1). To prove that this data is sufficient when the illuminants and surface normals are fully general, we built a specific system of linear equations for each value of p = 1..7 and n = 1..5. The Maple computer algebra system checked the rank of the system, and confirmed that there is no rank loss.
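The symbolic check can be mirrored numerically; the sketch below is not the authors' Maple code, and the stacking of Eq. (5) over consecutive view pairs is one possible way (an assumption) to build Ũ. For generic illuminants and non-degenerate normals, the printed rank can be compared against the bound 4n − 1.

```python
import numpy as np

def build_U(I, normals):
    """Stack Eq. (5) rows over consecutive view pairs; the unknown is the
    concatenation of the n illumination 4-vectors (4n columns)."""
    n, p = I.shape
    N = np.hstack([normals, np.ones((p, 1))])
    rows = []
    for j in range(p):
        for i in range(n - 1):
            row = np.zeros(4 * n)
            row[4 * i:4 * i + 4] = I[i + 1, j] * N[j]        # coefficients of L_i
            row[4 * i + 4:4 * i + 8] = -I[i, j] * N[j]       # coefficients of L_{i+1}
            rows.append(row)
    return np.asarray(rows)

rng = np.random.default_rng(1)
for n in range(2, 6):
    for p in (7, 10, 20):
        normals = rng.normal(size=(p, 3))
        normals /= np.linalg.norm(normals, axis=1, keepdims=True)
        N = np.hstack([normals, np.ones((p, 1))])
        L = rng.normal(size=(n, 4))
        alpha = rng.uniform(0.2, 1.0, size=p)
        I = alpha[None, :] * (L @ N.T)                       # calibrated, noise-free images
        U = build_U(I, normals)
        print(n, p, np.linalg.matrix_rank(U), 4 * n - 1)     # generically: rank == 4n - 1
```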
There are two types of degeneracies, one caused by the illuminants, the other by the object’s shape. When all n illumination vectors are identical, the matrix Ũ has rank 4 × (n − 1). The larger the distance between illuminants, the more stable the solution, as will be confirmed in Sec. 4.4. Besides this obviously expected degeneracy, we found no case of degeneracy caused by illuminants with generic surface normals. In particular for n = 5, the system does not lose rank in the generic case, despite the fact [16,22] that the space of images (in the common image plane, or equivalently, taken from the same viewpoint) is of dimension four. For the matrix Ũ to have full rank, it is necessary to have at least seven surface elements with different orientations, and that these orientation vectors not all be coplanar. In particular, planes and cylinders constitute degenerate surfaces.
4.3 Recovering the Offsets
According to Sec. 3.4, in the uncalibrated case, the only radiometric calibration parameters which can be recovered are the offsets b_i. Eq. (5) is rewritten as:

V^T g = 0
V = [I_l n_1, I_l n_2, I_l n_3, I_l, −I_k n_1, −I_k n_2, −I_k n_3, −I_k, n_1, n_2, n_3, 1]^T
g = [L_k1, L_k2, L_k3, L_k4, L_l1, L_l2, L_l3, L_l4, m_kl1, m_kl2, m_kl3, m_kl4]^T

where we have used the notations:

L_ji = l_ji, i = 1 ... 3,   L_j4 = μ_j

and

m_jli = b_l l_ji − b_j l_li, i = 1 ... 3,   m_jl4 = b_l μ_j − b_j μ_l

After solving these linear equations, to recover the n offsets, we notice that:

(L_pj L_qi − L_pi L_qj) b_k = m_pqi L_kj − m_pqj L_ki    (8)

After the L_ij and m_ijk are solved for, this equation can be used in a redundant way to recover the b_k.
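A sketch of this redundant recovery, assuming the illumination 4-vectors L_i and the 4-vectors m_pq have already been estimated from the linear system (all names are illustrative): for each fixed k, every choice of images p, q and component indices i, j gives one scalar instance of Eq. (8), and b_k is the one-dimensional least-squares solution over all of them.

```python
import numpy as np
from itertools import combinations

def recover_offsets(L, m):
    """L : (n, 4) recovered illumination vectors L_i.
       m : dict mapping an image pair (p, q), p < q, to the 4-vector m_pq.
       Returns the n offsets b_k, each from a scalar least squares over Eq. (8)."""
    n = L.shape[0]
    b = np.zeros(n)
    for k in range(n):
        coeffs, rhs = [], []
        for p, q in combinations(range(n), 2):
            for i in range(4):
                for j in range(4):
                    if i == j:
                        continue
                    coeffs.append(L[p, j] * L[q, i] - L[p, i] * L[q, j])
                    rhs.append(m[(p, q)][i] * L[k, j] - m[(p, q)][j] * L[k, i])
        coeffs, rhs = np.asarray(coeffs), np.asarray(rhs)
        b[k] = coeffs @ rhs / (coeffs @ coeffs)      # least squares for a single scalar
    return b
```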
4.4 Synthetic Experiments
To test our approach, we randomly draw values for the light source vectors, values for the surface normals that are such that the surface elements are frontlit, and values between 0 and 1 for the albedoes. We then compute the expected gray level values and add Gaussian noise. Finally, we use the normals and noisy gray-levels to recover estimated illuminants and compare them to the real values. To this end, we form “concatenated illuminant vectors” using the components in all images. The “distance” between two concatenated illuminant vectors V_1 and V_2 is:

D_vect(V_1, V_2) = 1 − (V_1 · V_2) / (||V_1|| ||V_2||)    (9)
Fig. 1. Sensitivity of the linear algorithm to gray level noise. Each curve corresponds to a different amount of noise. (a) Using 2 images. (b) Using 3 images.
It is zero when the two vectors are proportional to each other and one when they are perpendicular. Figure 1(a,b) shows the results of such experiments using 2 and 3 illuminants respectively. Each curve corresponds to a different value of the gray-level noise. For each noise value, we ran the algorithm 10000 times and computed both the true distance between illuminants¹ and the recovery error of Eq. (9). We then divided the range of distances between illuminants, that is the x-axis of the graphs, into 20 bins. Within each bin, for all runs whose randomly chosen illuminants have a distance that falls into that bin, we compute and plot the 90 percentile value of all the recovery errors, that is, the smallest value which is larger than 90% of these recovery errors. All these experiments were performed using 200 surface elements. We have verified experimentally that, as soon as the number of surface elements becomes larger than 50, the results become statistically independent of the actual number being used. As expected, the recovery error increases with the amount of noise and also when the distance between the individual illuminants decreases. Note that increasing the number of illuminants from two to three, and beyond, does not bring any meaningful improvement.

¹ In the two-illuminant case, the distance is the D_vect of Eq. (9) between the two individual illuminant vectors. In the three-illuminant case, it is the average distance of the three possible pairs of illuminants.
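For reference, a direct transcription of the distance (9) used in these experiments (a hypothetical helper, not the authors' code):

```python
import numpy as np

def d_vect(v1, v2):
    """Distance of Eq. (9) between two concatenated illuminant vectors."""
    return 1.0 - np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
```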
5 Non-linear Solution: Radiometric Bundle Adjustment
The radiometric bundle adjustment method is the counterpart of the geometric bundle adjustment used in photogrammetry. It is promising because the latter has proved to be one of the most robust methods for geometric reconstruction. In this approach, we seek the values of the 5 × n + p − 1 unknowns α_j, j = 1 ... p, and L_i, b_i, i = 1 ... n, which minimize the error function:

Σ_i Σ_j ( I_ij − (α_j L_i^T N_j + b_i) )²    (10)
where I_ij are the measured gray levels in the images, and the sums are taken over all the images and over each projection of a surface element. The main drawback is that non-linear optimization is required to minimize the criterion of Eq. (10). However, as will be shown below, the increase in reliability is well worth the additional computational burden.
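The paper does not spell out the optimizer; a minimal sketch of one way to minimize criterion (10), using scipy.optimize.least_squares (names and initialization illustrative). The parameter vector stacks the p albedoes, the n illumination 4-vectors and the n offsets; the global scale ambiguity is left to a post-hoc normalization here.

```python
import numpy as np
from scipy.optimize import least_squares

def radiometric_bundle_adjust(I, normals, alpha0, L0, b0):
    """I: (n, p) gray levels, normals: (p, 3) unit normals.
       alpha0, L0, b0: initial guesses, e.g. from the linear algorithm of Sec. 4."""
    n, p = I.shape
    N = np.hstack([normals, np.ones((p, 1))])            # (p, 4)

    def residuals(x):
        alpha = x[:p]
        L = x[p:p + 4 * n].reshape(n, 4)
        b = x[p + 4 * n:]
        pred = alpha[None, :] * (L @ N.T) + b[:, None]   # model term of Eq. (10)
        return (I - pred).ravel()

    x0 = np.concatenate([alpha0, L0.ravel(), b0])
    res = least_squares(residuals, x0)
    alpha = res.x[:p]
    L = res.x[p:p + 4 * n].reshape(n, 4)
    b = res.x[p + 4 * n:]
    return alpha, L, b
```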
5.1 Synthetic Experiments
We ran experiments similar to those of Section 4.4, but using this non-linear approach and much higher noise levels. The results are depicted by Figure 2(a,b). As in Figure 1, the individual curves correspond to different levels of gray-level noise and are the 90 percentile of recovery errors shown as functions of the distance between illuminants. In the case of two images, the results are excellent at the 0.5% and 1.0% noise levels, levels for which the linear algorithm’s results are already becoming incorrect. The precision then degrades gracefully as the noise level increases. Also, unlike in the linear case, using three illuminants instead of two yields a marked improvement. The method is still sensitive to noise but, because a 1.0% noise level corresponds to about 3 gray-levels in a typical 8-bit image, it becomes much more applicable.
Fig. 2. Sensitivity of the non-linear algorithm to gray level noise. Each curve corresponds to a different amount of noise. (a,b) 90 percentiles of recovery errors of illuminant direction using 2 and 3 images. (c,d) Root-mean square errors between true and recovered albedoes using 2 and 3 images.
The non-linear algorithm also recovers an estimated albedo value for each facet. In Figure 2(c,d), we divide the x-axis of the graphs into the same 20 bins as before. We compute and plot the 90 percentile of the root mean-square difference between recovered albedoes and real ones, instead of the 90 percentile of the distance between true and recovered illuminants as in Figure 2(a,b). Note that the graphs of both figures have exactly the same shapes, indicating that our distance measure between illuminants is also a good indicator of the quality of albedo recovery. In practice, normals will only be known up to some precision. We have therefore verified in a similar manner that the results are relatively insensitive to fairly large imprecisions in the normals. Another source of errors that must be taken into account is gray-levels that are not merely imprecise but that do not at all conform to the Lambertian model, for example because some surface elements are specular or in shadow. To handle these, we adopt a RANSAC-style approach [5] to robustly minimize the criterion of Eq. (10). We have checked that it tolerates up to 15% of outliers [12].
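The paper's outlier handling is a RANSAC-style scheme [5]; as a simpler, hedged stand-in with the same intent (down-weighting specular or shadowed pixels), the bundle-adjustment sketch of Sec. 5 can be re-run with a robust loss, which scipy's least_squares supports directly:

```python
# Robust variant of the radiometric_bundle_adjust sketch above (not the authors'
# RANSAC procedure): a soft-L1 loss reduces the influence of measurements that
# violate the Lambertian model; `residuals` and `x0` are as defined there.
res = least_squares(residuals, x0, loss='soft_l1', f_scale=3.0)  # f_scale ~ 1% of an 8-bit range
```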
5.2 Comparison with the SVD Approach
As discussed earlier, when enough images are available, albedoes and light source directions can be recovered, up to a Generalized Bas-Relief (GBR) ambiguity, using an SVD-based approach that does not require the use of normals [3]. Following this approach, we form the p × n matrix X whose rows are the average gray levels of the p facets in each of the n images. We then use SVD to decompose it as

X = BS    (11)

where B is a p × 5 matrix in which the first three elements of all rows are proportional to the normals multiplied by the facet albedoes, and S is a 5 × n matrix whose columns are the parameters of the light sources associated to each image. In an ideal no-noise case, for our model with ambient light source and additive camera offset, the matrix B can be taken to be the matrix with rows

[α_j n_xj, α_j n_yj, α_j n_zj, α_j, 1]    (12)

where α_j is the albedo of facet j and {n_xj, n_yj, n_zj} is its normal vector.² The GBR ambiguity comes from the fact that, for any invertible 5 × 5 matrix T, BT and T⁻¹S is also a solution of Equation (11). In our specific case, we do not know the albedoes but we do know the normals. We can therefore use them to resolve the ambiguity by computing a matrix T and the α_j that jointly minimize:

||BT − A||²    (13)

² In the original SVD work [3], B is a p × 3 or a p × 4 matrix. Here we use a p × 5 matrix to account for the additive offset associated to each illuminant. This approach thus requires at least 5 images.
Fig. 3. Sensitivity of the SVD algorithm to gray level noise in terms of the 90 percentile of the root-mean square error between true and recovered albedoes. Each curve corresponds to a different amount of gray-level noise. (a) Using 6 images. (b) Using 7 images. (c) Using 10 images. (d) Using 20 images.
where A is the matrix whose rows are those defined by Equation (12). Because this system is homogeneous, this involves fixing one of the albedoes and finding the 25 unknown coefficients of T and the p − 1 unknown albedoes. We then take the α_j to be the recovered albedoes and T⁻¹S to be the matrix whose columns are the recovered illuminants L_i. Figure 3 depicts the results of experiments similar to those of Figure 2(c,d), but using the SVD approach instead of our proposed method. Note that, even using 20 illuminants instead of the 2 or 3 of Figure 2(c,d), the recovery errors are higher for identical values of the distance between illuminants. Furthermore, the precision quickly degrades as the number of illuminants decreases. This is not surprising since the SVD method, unlike ours, does not use the normal information early in the computation.
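A sketch of this SVD route, Eqs. (11)–(13). For brevity the joint minimization over T and the α_j is replaced here by a simple alternating least-squares loop with one albedo fixed; this illustrates the structure, not the authors' exact optimization, and all names are hypothetical.

```python
import numpy as np

def svd_with_known_normals(X, normals, n_iter=50):
    """X: (p, n) average gray levels of p facets in n >= 5 images.
       Rank-5 factorization X = B S (Eq. 11), then resolve the ambiguity
       B -> B T, S -> T^{-1} S with the known normals by alternating
       least squares on ||B T - A||^2 (Eq. 13), A built from Eq. (12)."""
    p, n = X.shape
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    B = U[:, :5] * s[:5]                        # (p, 5)
    S = Vt[:5]                                  # (5, n)

    alpha = np.ones(p)                          # initial albedo guesses
    for _ in range(n_iter):
        A = np.hstack([alpha[:, None] * normals, alpha[:, None], np.ones((p, 1))])
        T, *_ = np.linalg.lstsq(B, A, rcond=None)       # T given the current albedoes
        M = B @ T                                        # current estimate of A
        # per-facet albedo: scale of [n_x, n_y, n_z, 1] best matching M's first 4 columns
        D = np.hstack([normals, np.ones((p, 1))])
        alpha = np.einsum('ij,ij->i', M[:, :4], D) / np.einsum('ij,ij->i', D, D)
        alpha[0] = 1.0                                   # fix one albedo (global scale)
    lights = np.linalg.inv(T) @ S               # columns stack light, ambient and offset terms
    return alpha, lights
```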
5.3 Using Real Images
In Figure 4, we show one of two aerial images of the same site, but taken at different times of day and from slightly different viewpoints. We treated them as a stereo pair. We reconstructed the terrain’s shape using a deformable surface technique, resulting in a triangulated mesh. We computed the average gray-level of each mesh facet’s projection in each image. We then fed these gray-levels along with the facet normals to our algorithm. Figure 4(b) shows the mesh shaded using the illuminant recovered for the first image. In Figure 4(c), we show the albedo at each pixel location that would produce the observed image gray-levels, assuming that the illuminant we computed is correct. The resulting
Fig. 4. Cartographic Modeling: (a) One of two images of a site taken at two different times of day. (b) The corresponding terrain, shaded using the illuminant recovered by our system. (c) The corresponding de-lighted image, stretched so that it has the same dynamic range as the original one. (d,e) Thresholded versions of the original and delighted image: The pixels whose gray level is inferior to 50 appear in black, those whose gray level is greater than 150 in white, and the rest in gray. In (e), unlike in (d), most of the grassy areas appear in solid gray, which is correct since the physical properties are roughly similar. This would make the task of a region-based segmentation algorithm much easier.
de-lighted image exhibits much less structure than the original one, even though both images have the same dynamic range: Shading effects have been greatly attenuated because they are accounted for by changes in shape. To illustrate this point, in Figure 4(d,e), we show versions of the original and de-lighted images coded using three gray-levels only: The pixels whose gray level is inferior to 50 are painted in black, those whose gray level is greater than 150 in white, and the rest in gray. In (e) unlike in (d), most of the grassy areas appear in gray, the shadows in black and the road pixels in white. In other words, a gray-level based segmentation algorithm would have a much easier time of the de-lighted image than of the original one. In Figure 5, we show results in a case where we can use a CAD/CAM 3–D model of a target object but where there are strong specularities: We acquired four different images of an O2 computer by using a static camera and moving the computer before each exposure. We aligned the geometric model with each image by running a bundle-adjustment algorithm. This method yields the motion parameters from one view to the other and, thus, allows us to reduce the number of degrees of freedom: Only one illuminant is free and the others are taken to be rotated versions of it. Again, as shown in Figure 5(e), the recovered illumination estimate allows us to de-light the original image. And, as depicted by the graphs of Figure 5(f,g,h), the de-lighted gray-levels corresponding to object points having the same physical material, such as those on the vertical sides of the computer tower, are much more uniform than the original ones, except for specular pixels. As shown in Figure 5(i,j), it becomes much easier to segment meaningful regions from the de-lighted image than from the original one: In reality, the computer’s base and the rest of the tower have two different and roughly constant albedoes. As a result, we can pick out these two regions in the de-lighted image using simple gray-level thresholds, which is impossible in the original images.
In Figure 6, we present a similar behavior in the more difficult case where we do not have an a priori shape model: The three images to the left were acquired using a static camera watching a moving head. We computed the head motion and reconstructed its shape using a stereo-based facial reconstruction technique [6] that is relatively insensitive to illumination changes. Because our geometric reconstruction method also yields motion parameters, as in the case of the images of Figure 5, we can assume that illuminants are rotated versions of each other. To overcome the influence of specularities, we further reduce the number of degrees of freedom: We constrain the facets that are symmetric with respect to a vertical plane that goes through the middle of the head to have the same albedoes. In Figure 6(d), the resulting surface is shaded using the illuminant we recover while Figure 6(e) shows the de-lighted image corresponding to image (b). For comparison purposes, in Figure 6(f), we show the corresponding portion of the original image, stretched to have the same dynamic range and with two scan-lines overlaid. In Figure 6(g), we plot the gray levels along both scan-lines. Each graph corresponds to one scan-line and contains two curves: One for the original gray levels in (f) and one for the de-lighted gray-levels in (e). Note that the de-lighted gray levels are again much flatter than the original ones: Most of the gray-level variations are accounted for by changes in shape and we are left with the almost constant albedoes one would expect on a face. This is true for most of the skin surface, except on the right side of the nose and the cheeks: The recovered illuminant comes from the left side of the face, resulting in exceedingly bright pixels at places where cast shadows should be, but are not, modeled. As a result, the estimated albedoes are too large. Nevertheless, these de-lighted images can then be averaged to produce de-lighted texture-mapped models such as the one of Figure 6(h,i,j). These, in turn, could be re-lighted from a different direction and inserted in an animation.
6 Conclusion
We have introduced a new methodology for recovering the illuminants and surface albedoes from multiple views taken with varying illuminants in the Lambertian case. When coupled with a method for shape reconstruction, it allows us to recover both the geometric and radiometric attributes of a scene. We have also shown that in a linear setting, the only observable camera radiometric parameter, the offset, can be recovered. The new theory indicates the minimal data as well as degenerate configurations and sets the limits for radiometric reconstruction. We proposed two different algorithmic implementations:
1. A linear algorithm. It allows fast computation, but is sensitive to noise.
2. A radiometric bundle-adjustment algorithm. It involves non-linear optimization that makes it slower but also much more robust.
Using synthetic data, we have quantified the sensitivity of the algorithms to image noise, outliers, imprecision of surface normals, distance between illuminants,
Fig. 5. Modeling an object with known geometry and strong specularities: (a,b,c) Three images of an O2 computer acquired using a static camera. The computer was moved before each shot. (d) CAD model shaded using the recovered illuminant for image (b). (e) The corresponding de-lighted image, stretched so that it has the same dynamic range as the original one. (f,g,h) Top: Gray-levels along the three horizontal scan-lines overlaid in (b). Because of illumination effects, the gray levels vary a lot, even though, in reality, the albedo remains almost constant. Bottom: Gray-levels along the same scan-lines in the de-lighted image (e). They are almost constant, except for the specular pixels that produce a peak. (i,j) Thresholded versions of the original and de-lighted image: The background appears in black, the foreground pixels whose gray level is inferior to 80 appear in dark gray, those whose gray level is greater than 80 and smaller than 140 in light gray, and the rest in white.
and number of surface elements. Because reliable results can be obtained using relatively few surface elements, we have not found the extra computational expense involved by the non-linear method to be a severe drawback. The usefulness of this algorithm has been illustrated by two experiments that show that, even with a minimal number of real images, we can obtain 3–D models and plausible estimates of the light source direction and albedoes. The implementation represents a first step towards a solution of the radiometric reconstruction problem, rather than a complete solution, and could be combined with existing approaches. This work, like many, is based on the Lambertian reflectance model with ambient light and distant point sources. The surface elements that do not conform to it are treated as outliers and discarded as such by the bundle-adjustment algorithm. Note, however, that the recovery algorithm is implemented using a general least-squares solver in which the constraints derived using the Lambertian model could easily be replaced by ones computed using more complex illumination models, such as those that take specularities into account.
Fig. 6. Facial Reconstruction: (a,b,c) Three images of a moving head acquired using a static camera. (d) Close-up view of the reconstructed head model, shaded using the recovered illuminant. (e) The corresponding de-lighted image. (f) Corresponding part of image (b), stretched so that it has the same dynamic range as (e). (g) The gray-levels along the two scan-lines overlaid in (f). Each graph contains two curves: One for the original gray levels in (f) and one for the de-lighted gray-levels in (e). (h,i,j) De-lighted texture-mapped view of the 3-D model.
Similarly, we do not explicitly model shadows but treat them as outliers. There are, however, shadow extraction techniques that do not depend on geometry and could be used. Moreover, given a known surface geometry, the shadow boundaries provide clues as to the location of point light sources. In an iterative scheme, after a first estimation of the light sources has been obtained, it would be possible to use the geometry to predict potential shadowed areas and take that information into account in the next iteration. In the absence of shadows and specularities, any combination of light sources could be accounted for by a single point source and an ambient term. Understanding their roles is therefore crucial to extending our approach from the simple case where there is a single point light source per image to more realistic situations that involve multiple light sources per image. In future work, focus will be on introducing shadows, specularity models, and multiple light sources into the proposed framework.
References
1. E. Angelopoulou and J. Williams. Photometric surface analysis in a tri-luminal environment. In ICCV, pages 442–447, 1999.
2. R. Baribeau, M. Rioux, and G. Godin. Color reflectance modeling using a polychromatic laser range sensor. PAMI, 14(2):263–269, 1992.
3. P.N. Belhumeur, D.J. Kriegman, and A.L. Yuille. The bas-relief ambiguity. IJCV, 35(1):33–44, November 1999.
4. P. Debevec and J. Malik. Recovering high dynamic range radiance maps from photographs. SIGGRAPH, 31:369–378, 1997.
5. M.A. Fischler and R.C. Bolles. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Communications ACM, 24(6):381–395, 1981.
6. P. Fua. Regularized Bundle-Adjustment to Model Heads from Image Sequences without Calibration Data. International Journal of Computer Vision, 38(2):153–171, July 2000.
7. A.S. Georghiades, P.N. Belhumeur, and D.J. Kriegman. Illumination-based image synthesis: Creating novel images of human faces under differing pose and lighting. In IEEE Workshop on multiple-view modeling and analysis of visual scenes, pages 47–54, 1999.
8. H. Hayakawa. Photometric stereo under a light-source with arbitrary motion. JOSA-A, 11(11):3079–3089, November 1994.
9. B.K.P. Horn. Robot Vision. MIT Press, 1986.
10. B.K.P. Horn and M.J. Brooks. Shape from Shading. MIT Press, 1989.
11. K. Ikeuchi and K. Sato. Determining reflectance properties of an object using range and brightness images. PAMI, 13(11):1139–1153, 1991.
12. Q.T. Luong, P. Fua, and Y.G. Leclerc. The Radiometry of Multiple Images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(1):19–33, January 2002.
13. S. Marschner. Inverse rendering for computer graphics. PhD thesis, Cornell University, 1998.
14. S. Marschner and D. Greenberg. Inverse lighting for photography. In IST/SID Fifth Color Imaging Conference, pages 262–265, 1997.
15. T. Mitsunaga and S.K. Nayar. Radiometric self calibration. In CVPR, pages I:374–380, 1999.
16. Y. Moses. Face recognition: generalization to novel images. PhD thesis, The Weizmann Institute of Science, Israel, 1993.
17. N. Mukawa. Estimation of shape, reflection coefficients and illuminant direction from image sequences. In ICCV, pages 507–512, 1990.
18. K. Nishino, Z. Zhang, and K. Ikeuchi. Determining Reflectance Parameters and Illumination Distribution from a Sparse Set of Images for View-Dependent Image Synthesis. In International Conference on Computer Vision, pages 599–606, Vancouver, Canada, July 2001.
19. R. Ramamoorthi and P. Hanrahan. A signal processing framework for inverse rendering. In SIGGRAPH, pages 117–128, 2001.
20. I. Sato, Y. Sato, and K. Ikeuchi. Illumination distribution from brightness in shadows: Adaptive estimation of illumination distribution with unknown reflectance properties in shadow regions. In ICCV, pages 875–883, 1999.
21. Y. Sato, M. Wheeler, and K. Ikeuchi. Object shape and reflectance modeling from observation. In SIGGRAPH, pages 379–387, 1997.
22. A. Shashua. On photometric issues in 3d visual recognition from a single 2d image. IJCV, 21(1-2):99–122, 1997.
23. W. Silver. Determining shape and reflectance using multiple images. PhD thesis, MIT, Cambridge, MA, 1990.
24. R.J. Woodham. Photometric method for determining surface orientation from multiple images. OptEng, 19(1):139–144, 1980.
25. Y. Yu, P. Debevec, J. Malik, and T. Hawkins. Inverse global illumination: Recovering reflectance models of real scenes from photographs. SIGGRAPH, pages 215–224, August 1999.
26. A.L. Yuille, D. Snow, R. Epstein, and P.N. Belhumeur. Determining generative models of objects under varying illumination: Shape and albedo from multiple images using svd and integrability. IJCV, 35(3):1–20, 1999.
27. Q. Zheng and R. Chellappa. Estimation of illuminant direction, albedo, and shape from shading. PAMI, 13(7):680–702, 1991.
Composite Texture Descriptions

Alexey Zalesny¹, Vittorio Ferrari¹, Geert Caenen², Dominik Auf der Maur¹, and Luc Van Gool¹,²

¹ Swiss Federal Institute of Technology Zurich, Switzerland
{zalesny, ferrari, aufdermaur, vangool}@vision.ee.ethz.ch
http://www.vision.ee.ethz.ch/~zales
² Catholic University of Leuven, Belgium
{Geert.Caenen, Luc.VanGool}@esat.kuleuven.ac.be
Abstract. Textures can often more easily be described as a composition of subtextures than as a single texture. The paper proposes a way to model and synthesize such “composite textures”, where the layout of the different subtextures is itself modeled as a texture, which can be generated automatically. Examples are shown for building materials with an intricate structure and for the automatic creation of landscape textures. First, a model of the composite texture is generated. This procedure comprises manual or unsupervised texture segmentation to learn the spatial layout of the composite texture and the extraction of models for each of the subtextures. Synthesis of a composite texture includes the generation of a layout texture, which is subsequently filled in with the appropriate subtextures. This scheme is refined further by also including interactions between neighboring subtextures.
1 Composite Textures, Divide, and Conquer
Natural textures can be of a very high complexity, which renders them difficult to emulate by texture synthesis methods. Many such textures consist of patches, which in turn contain patterns of their own. Good examples are the textures of building materials like marbles or limestones, or the textures of different landscape types. A countryside texture could, e.g., be a mixture of meadows and forests. We will refer to such textures as composite textures. Whereas the patterns within the patches may already be homogeneous enough to be modeled and synthesized successfully with existing techniques, modeling and synthesizing the mixture directly tends to be less effective. It therefore stands to reason to divide the synthesis of composite textures into several steps:
1. Segment (by hand or otherwise) example image(s) of the composite texture to be modeled, where each class is given its own label.
2. Consider the resulting label map (with labels assigned to all pixels) as a texture and extract a model for it.
3. Also extract models for the textures corresponding to the different labels (i.e. the different classes/segments).
4. Generate a synthetic label map, based on the model of step 2.
5. Fill in the segments generated under step 4 with textures according to their labels, based on the models of step 3.
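A structural sketch of these five steps (all names hypothetical): `learn` and `synth` stand for the single-texture modeling and synthesis routines of Section 2, and masking a subtexture with zeros in step 3 as well as rounding the synthesized label map to the nearest label in step 4 are simplifications made only in this sketch.

```python
import numpy as np

def composite_synthesis(example, labels, learn, synth, out_shape):
    """Steps 1-5 above, given a segmentation `labels` of `example` (step 1) and
    callables learn(image) -> model and synth(model, shape) -> image."""
    label_values = np.unique(labels)
    label_model = learn(labels)                                     # step 2: layout as a texture
    sub_models = {v: learn(np.where(labels == v, example, 0))       # step 3: one model per label
                  for v in label_values}
    raw = synth(label_model, out_shape)                             # step 4: synthetic label map
    new_labels = label_values[np.argmin(np.abs(raw[..., None] - label_values), axis=-1)]
    out = np.zeros(out_shape, dtype=example.dtype)
    for v, model in sub_models.items():                             # step 5: fill in subtextures
        patch = synth(model, out_shape)
        out[new_labels == v] = patch[new_labels == v]
    return out
```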
This scheme tallies with ideas about “macrotextures” vs. “microtextures” that emerged as soon as texture research started in computer vision. The label map can be considered a macrotexture, whereas the textures within the segments of constant label values would be microtextures. In this paper, we will refer to the macrotextures as “label textures”, and to the microtextures as “subtextures”. Although such ideas have been around for quite a while, it seems that so far such a hierarchical approach has not really been implemented for textures of a strong stochastic nature. When ideas about hierarchical analysis are used, it is usually in the context of multi-resolution schemes, which have contributed greatly to the state-of-the-art in texture synthesis (see e.g. [2, 7] as seminal examples). Contributions that come closest to the work presented here have used label textures that were hand-drawn by the user; the corresponding segments were then filled in with texture either by “smart copying” from an example [8] or by synthesis techniques [13]. The composite texture approach as just described does not include mutual influences between the subtextures. It is not exceptional that the subtextures are not completely stationary within their domains. In particular, a kind of transition zone is regularly found near their boundaries. These changes may well depend on the texture on the other side of the boundary. Later in the paper we will modify the composite texture scheme to include such effects. The remainder of the paper is organized as follows. Section 2 describes the texture synthesis approach used to generate the label textures and the subtextures. Section 3 shows some results for the synthesis of composite textures. Section 4 describes the inclusion of subtexture interactions (nonstationary aspects). Section 5 concludes the paper.
2 Image-Based Texture Synthesis
This section describes the approach that we use to synthesize the textures, i.e. both the label textures and the subtextures. More information about this texture modeling and synthesis approach can be found in [13]. It follows the cooccurrence paradigm in that texture is synthesized so as to mimic the pairwise statistics of the example texture. This means that the joint probabilities of the intensities at pixel pairs with a fixed relative position are approximated. Such pairs will be referred to as cliques, and pairs of the same type (same relative position between the pixels) as clique types. This is illustrated in Fig. 1.
Fig. 1. Dots represent pixels. Pixels connected by lines represent cliques. Left: cliques of the same type, right: cliques of different types.
The texture model consists of statistics for a selected set of clique types. Clique type selection follows an iterative approach, where clique types are added one by one to the texture model, a texture synthesized based on the model is each time updated
accordingly, and the statistical difference between the example texture and the synthesized texture is analyzed to decide which further clique type addition to make. The set of selected clique types is called the neighborhood system. The complete texture model consists of this neighborhood system and the statistical parameter set. The latter contains the histograms of intensity differences between the two pixels of all cliques of the same type. A sketch of the texture model extraction algorithm is as follows:
step 1: Collect the complete 2nd-order statistics for the example texture, i.e. the statistics of all clique types. After this step the example texture is no longer needed. As a matter of fact, the current implementation focuses on clique types up to a maximal length.
step 2: Generate an image filled with independent noise and with values uniformly distributed in the range of the example texture. This noise image serves as the initial synthesized texture, to be refined in subsequent steps.
step 3: Collect the statistics for all clique types from the current synthesized image.
step 4: For each clique type, compare the intensity difference histograms of the example texture and the synthesized texture and calculate their Euclidean distance. In fact, the pure intensity histograms (singletons) are considered as well.
step 5: Select the clique type with the maximal distance (see step 4). If this distance is less than a threshold, then leave the algorithm with the current model. Otherwise add the clique type to the current (initially empty) neighborhood system and all its statistical characteristics to the current (initially empty) statistical parameter set.
step 6: Synthesize a new texture using the updated neighborhood system and texture parameter set.
step 7: Go to step 3.
After running this algorithm we have the final neighborhood system of the texture and its statistical parameter set. A more detailed description of this texture modeling approach is given elsewhere [13]. In that paper it is also explained how the synthesis step works and how we generalize this texture modeling to colored textures. The proposed algorithm produces texture models that are very small compared to the complete 2nd-order statistics extracted in step 1 and also compared to the example image. Typically only 10 to 40 clique types are included and the model amounts to a few hundred to a few thousand bytes. Another advantage is that the method avoids verbatim copying of parts of the example images. This is an advantage that it shares with similar approaches [5, 6, 9, 12]. Verbatim copying becomes particularly salient when large patches of texture need to be created, e.g. when covering a large wall with a marble texture. Then the repeated appearance of the same structures quickly becomes salient to the human eye. For the texture mapping on such large surfaces this problem is more serious than may transpire from state-of-the-art publications that suffer from this problem (e.g. [3, 11]), as the extent of the textures that can be shown in papers is rather small.
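A compact sketch of the clique-type selection loop (steps 1–7 above). The inner synthesis step — resampling the image so that it reproduces the selected statistics — is the involved part of [13] and is only approximated here by a deliberately naive pixel-resampling stand-in; histogram handling and the stopping rule follow the description above, and all names are hypothetical. Singleton (pure intensity) histograms are omitted for brevity.

```python
import numpy as np

def diff_hist(img, dx, dy, bins=64):
    """Normalized histogram of intensity differences over all cliques of type (dx, dy)."""
    h, w = img.shape
    a = img[max(0, dy):h + min(0, dy), max(0, dx):w + min(0, dx)]
    b = img[max(0, -dy):h + min(0, -dy), max(0, -dx):w + min(0, -dx)]
    hist, _ = np.histogram(a.astype(int) - b.astype(int), bins=bins, range=(-255, 255))
    return hist / hist.sum()

def resynthesize(img, model, n_proposals=500):
    """Crude stand-in for the sampler of [13]: accept random single-pixel changes
    that do not increase the summed histogram distance to the stored statistics."""
    rng = np.random.default_rng(0)
    cost = lambda im: sum(np.linalg.norm(model[c] - diff_hist(im, *c)) for c in model)
    img, current = img.copy(), cost(img)
    for _ in range(n_proposals):
        y, x = rng.integers(img.shape[0]), rng.integers(img.shape[1])
        old = img[y, x]
        img[y, x] = rng.integers(256)
        new = cost(img)
        if new <= current:
            current = new
        else:
            img[y, x] = old
    return img

def learn_model(example, candidates, threshold=0.02, max_cliques=40):
    """Greedily select clique types until the worst histogram mismatch drops
    below `threshold`. `candidates` lists (dx, dy) offsets up to a maximal length."""
    target = {c: diff_hist(example, *c) for c in candidates}               # step 1
    synth = np.random.default_rng(1).integers(0, 256, size=example.shape)  # step 2
    model = {}                                                             # neighborhood system + statistics
    while len(model) < max_cliques:
        dist = {c: np.linalg.norm(target[c] - diff_hist(synth, *c))        # steps 3-4
                for c in candidates}
        worst = max(dist, key=dist.get)
        if dist[worst] < threshold:                                        # step 5
            break
        model[worst] = target[worst]
        synth = resynthesize(synth, model)                                 # step 6, then loop (step 7)
    return model
```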
The basic texture synthesis approach described in this section can handle quite broad classes of textures. Nevertheless, it has problems with the composite type of textures considered here. Comparisons between single and composite texture synthesis results will be shown later.
3 The Segmentation of Composite Textures
The steps of the composite texture modeling scheme proposed in section 1 are illustrated on the basis of a few examples. Fig. 2 shows a modern “thorn-cushion steppe” type of landscape (left). It consists of several ground cover types, like “rock”, “green bush”, “sand”, etc., for which the corresponding segments are drawn in the figure on the right.
Fig. 2. Left: an example of “thorn-cushion steppe”, Right: manual segmentation into basic ground cover types (also see Fig. 4).
If one were to directly model this composite ground cover as a single texture, the basic texture analysis and synthesis algorithm proposed in the last section would not be able to capture all the complexity in such a scene. Fig. 3 shows the result of such synthesis.
Fig. 3. Attempt to directly model the scene in Fig. 2 as a single texture.
In keeping with the composite texture scheme of section 1, an example image like this is first decomposed into different subtextures, as shown in Fig. 2 (right). The
segments that correspond with different subtextures have been given the same intensity (i.e. the same label). This segmentation has been done manually. Fig. 4 shows the image patterns corresponding to the different segments. The textures within the different segments are simple enough to be handled by the basic algorithm. Hence, in this case 6 subtexture models are created, one for each of the ground cover types (see caption of Fig. 4). The map with segment labels (Fig. 2, right image) can itself again be considered a texture, describing a typical landscape layout in this case. This label texture is quite simple again, and can be modeled by the basic algorithm of section 2. Hence, similar label textures can be generated automatically as well. The synthesis of this composite texture first generates a landscape layout as a label texture. Subsequently, the different segments are filled in with the corresponding subtextures, based on their models. As an alternative, a graphical designer or artist can draw the layout, after which the computer fills in the subtextures in the segments that s/he has defined, according to their labels. Fig. 5 shows one example for both procedures.
Fig. 4. Manual segmentation of the terrain texture shown in Fig. 2. Left: segments corresponding to 1-green bush, 2-rock, 3-grass, 4-sand, 5-yellow bush. Right: left-over regions are grouped into an additional mixed class.
Fig. 5. Synthetically generated landscape textures. Left: based on a manually drawn label texture. Right: based on an automatically generated label texture.
Note that except for the hand segmentation of the label texture in Fig. 2, the right image has been created fully automatically and arbitrary amounts of such texture can be generated, enough to cover a terrain model with never-repeating, yet detailed texture. As mentioned before, the fact that this approach doesn’t use verbatim copying of parts of the example images has the advantage that no disturbing repetitions are created. A complete automation of the composite texture scheme would have to include an initial, unsupervised segmentation of the scene, so that no hand-segmented label texture is needed. Such unsupervised segmentation is of course not an easy matter, but in certain cases it may nevertheless be possible. As Paget [10] noticed, the optimal complexity of texture models for segmentation is typically lower than that required for synthesis. This can be key to avoiding a chicken-and-egg situation with such a fully automated process. If the features needed for the segmentation were of the same complexity as the models that we extract from segmented data, the whole procedure could no longer run automatically. Fig. 6 left shows two images that could be segmented automatically. As a matter of fact, truly textural features were not needed at all in this case. Segmentation based on Lab color data was sufficient. The data around each pixel were smoothed with a Gaussian filter of size 5x5. Each pixel is described by a three-dimensional color feature vector. For the more general case, an additional vector is built of simple texture features, obtained as the outputs of a bank of simple filters. A cosine metric between such feature vectors is used as similarity measure between pairs of pixels (Bhattacharyya distance). A sample pixel set is drawn from the image and a complete graph is constructed where vertices correspond to pixels and edges are weighted by the similarity measure. The graph is then partitioned into completely connected, disjoint subsets of vertices so as to maximize the total sum of the similarities over all remaining edges. These subsets are also referred to as “cliques”, but this time in the graph theoretical sense. The pixels in each subset now represent a subtexture. This robust segmentation procedure is efficiently obtained via the Clique Partitioning approximation algorithm of [4]. It is worth noting that the algorithm is completely unsupervised: it doesn’t need to know the number of subtexture classes, their size, or any information other than the image data itself. The segmentation results for the two images are shown in Fig. 6 right. A more detailed description of the unsupervised segmentation scheme is given elsewhere [1]. Fig. 7 shows the result of a fully automatic composite texture synthesis for the top texture of Fig. 6. No user input whatsoever was needed to generate this result. Fig. 7 right is the label texture that was generated. Automatic segmentation of a composite texture will not always be possible or even desirable. Fig. 8 shows one of the more complicated types of limestones that was used as building material at the ancient city of Sagalassos in Turkey, now a center of intensive archaeological excavations. If one wants to recreate the original appearance of the buildings, such textures need to be shown with all their complexities and subtleties, but without the effects of erosion.
Fig. 8 (a) shows an example image of this limestone (pink-gray breccia), which has a kind of patchy structure and where several darker cracks and pits are the result of century-long erosion. First, the original limestone image was manually segmented, whereupon texture models were generated for the different parts, making sure that those parts left out
erosion effects. An automatically generated, synthetic result is shown in (c). To simulate the visual effect of erosion, simple thresholding of image (a) yielded the erosion-related cracks and pits. These dark areas were then superimposed on texture (c), yielding (d). As can be seen, the visual appearance of this texture comes close to that of (a). It is texture (c), however, that is the desired output for this application. This is a case where manual segmentation and user-assisted subtexture modeling are hard to avoid.
Fig. 6. Left: landscape images. Right: unsupervised segmentation.
4 Interactions between Subtextures
The subtextures are not always sharply delineated and may not be stationary within their patches. It is quite usual that neighboring subtextures exhibit a mutual influence near their boundaries. The observed nonstationary behavior near the subtexture boundaries may therefore well depend on the specific subtexture on the other side. Such effects have been included in a refined version of the composite texture model. The “upgraded” model includes additional cliques, where head and tail pixels correspond to different subtextures. For all pairs of subtextures that come within a user-specified distance of each other, a separate neighborhood system and statistical parameter set are derived. These are characterized by the fact that the two pixels of a clique (we call them “head” and “tail”) fall within different subtextures. Note that the interactions impose a strict order on their cliques: the head lying in a first subtexture and the tail in a second has to be distinguished from the reverse situation. Fig. 9 illustrates the three types of cliques that are relevant as soon as subtexture interactions are taken into account.
Fig. 7. Fully automatic composite texture synthesis of the top texture in Fig. 6. Right: the automatically generated label map modeled as a label texture from the unsupervised segmentation in Fig. 6 top-right.
Currently, in the upgraded approach the subtextures are synthesized in order of “complexity”. So far, we have only used simple criteria like entropy of the color histogram as complexity measures. We start the synthesis with the simplest texture. The rationale is that in this way those textures that need the least information are synthesized first. By the time more complex textures are to be synthesized, the simpler ones are already in place and can assist in this synthesis. Suppose we are about to analyze the interactions of texture n in this sequence. We can include the real intensity difference histograms into the model for those textures that come earlier in the sequence, because they will already have been synthesized by the time texture n is. For the interactions with textures further down the line, the intensity differences
between the real intensities within texture n and 0 outside are taken, as at the time of synthesis the system will find the initialized values 0 there. This in fact corresponds to simply taking the intensity histogram for points of texture n within the relative positions to the neighboring texture that are prescribed by the clique type. Work to synthesize all subtextures simultaneously instead of sequentially as described here is under way. Hence, we are not too concerned with the generality of our texture complexity ranking procedure, as it will become obsolete once simultaneous synthesis is in place. Simultaneous synthesis should be superior as all subtexture interactions are considered.
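One way to realize the “complexity” ordering mentioned above is the entropy of the intensity (or color-channel) histogram; the helpers below are illustrative, since the paper leaves the exact measure open.

```python
import numpy as np

def histogram_entropy(pixels, bins=64):
    """Entropy (in bits) of the intensity histogram of one subtexture's pixels."""
    hist, _ = np.histogram(pixels, bins=bins)
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())

def synthesis_order(example, labels):
    """Return the subtexture labels sorted from simplest to most complex."""
    return sorted(np.unique(labels), key=lambda v: histogram_entropy(example[labels == v]))
```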
Fig. 8. Results for the synthesis of a limestone texture: (a) example image of a limestone; (b) synthesized as a single texture; (c) synthesized as a composite texture, omitting the effects of erosion; (d) synthesis like in (c) but with erosion added as the darker parts of (a).
Fig. 9. The cliques (1,2), (2,2), and (2,1) are of different types despite their equal relative position of head and tail. The arrows show the direction of the interaction for this specific sequence of subtexture modeling.
Fig. 10 shows an example image of a limestone texture. Figure (b) is the result of unsupervised segmentation. The result consists of only two subtextures in this case, the dark and the bright areas. The dark areas were considered “simpler” by the upgraded approach and it therefore dealt with them first. The difference between the composite texture approach of section 1 and the upgraded approach doesn’t lie in the label texture, as its generation in both cases is identical. In order to best compare the results of the original and the upgraded approach, we therefore take the example label texture of (b) as our label texture for subtexture synthesis. Fig. 10 (d) shows the result with the approach of section 1, (c) with the upgraded approach. As can be seen, the upgraded version looks more natural. In the example image the edges between the two subtextures are less sharp than in (d). Near either side of the border the intensities tend to get more similar. A good example is the white spot at the top. The spatial distribution of white and gray pixels in such a spot in image (c) is better than that of (d).

Fig. 11 (a) gives the result of the upgraded approach for the top texture of Fig. 6. The label texture of Fig. 7 right was reused. As can be seen (cf. the independent subtexture synthesis of Fig. 11 (b), repeated here for comparison), the quality of this texture is better. Fig. 12 (a) gives the result for the bottom texture in Fig. 6. In this case the quality of the texture synthesized without subtexture interactions (not shown) is more comparable.

Fig. 13 illustrates the use of the composite textures for virtual reality. Part (a) corresponds to the original model of an ancient building as graphic designers have produced it. Part (b) shows the same building with some of our textures mapped onto the pillars. Our goal is to map the original textures onto complete monuments and the surrounding landscape.

Finally, an example is given that shows that the composite texture approach can also be beneficial even for textures that have traditionally been treated as a single texture. Fig. 14 (a) shows one of the Brodatz textures (D20, French canvas). Image (b) is the result of a texture synthesis when this texture is modeled as a single texture. Image (d) is the result of a composite texture synthesis without subtexture interactions (the black and bright pixels formed the two classes). Image (c) is the result of composite texture synthesis with interactions between the two subtextures. This result is clearly the best.
Fig. 10. (a) original limestone; (b) unsupervised segmentation; (c) composite synthesis with interactions between subtextures; (d) composite synthesis with independent subtextures.
5 Conclusions and Future Work
Our texture method learns a model of a complicated texture pattern, like a landscape or geological structure, from example images. This model can be used, e.g., to generate more, similarly looking texture without verbatim copying of parts of the example textures. A first “upgrade” of this technique has also been presented. It takes into account non-stationary aspects of the different subtextures, due to their interactions. Several extensions are planned. The first is to do away with the rather clumsy, sequential generation of the subtextures in the case of the upgraded model. After an initial generation of the label texture, all subtextures and their interaction effects should preferably be generated together. In that way, mutual influences between all interacting subtextures are included.
Fig. 11. (a) composite synthesis with interactions between subtextures for the top texture of Fig. 6; (b) repeats for comparison the result without interdependency between textures.
Fig. 12. (a) fully automatic composite texture synthesis of the bottom texture in Fig. 6; (b) the automatically generated label map modeled as a label texture from the unsupervised segmentation in Fig. 6 bottom-right.
Fig. 13. (a) original model of an ancient building at Sagalassos as graphic designers have produced it; (b) the same building with some of our textures mapped onto the pillars.
Fig. 14. (a) Brodatz D20, French canvas; (b) synthesis as a single texture; (c) composite texture synthesis with interaction between the subtextures; (d) composite synthesis without interaction between subtextures.
A second extension is the use of the composite texture models to analyze and recognize complete scenes, e.g. the classification of land use and land cover in satellite images. Some of these classes can have intricate structures and could be better dealt with if described as composite textures. As Paget’s work [10] has shown, we can expect that even partial information from the extracted models can suffice for this analysis. A third extension could be the use of deeper hierarchies, with more levels than just a label texture and its subtextures. A fourth extension could be the enhancement of our unsupervised segmentation scheme.
Acknowledgments. The authors gratefully acknowledge support by the European IST projects "3D Murale" and "CogViSys".
References

1. G. Caenen, V. Ferrari, L. Van Gool, and A. Zalesny. Unsupervised Texture Segmentation through Clique Partitioning. Tech. Rep. KUL/ESAT/PSI, Catholic University of Leuven, Belgium, 2002.
2. J.S. De Bonet. Multiresolution Sampling Procedure for Analysis and Synthesis of Texture Images. Proc. Computer Graphics, ACM SIGGRAPH’97, 1997, pp. 361-368.
3. A. Efros and T. Leung. Texture Synthesis by Non-Parametric Sampling. Proc. Int. Conf. Computer Vision (ICCV’99), Vol. 2, 1999, pp. 1033-1038.
4. V. Ferrari, T. Tuytelaars, and L. Van Gool. Real-Time Affine Region Tracking and Coplanar Grouping. IEEE Computer Soc. Conf. on Computer Vision and Pattern Recognition (CVPR'01), December 2001, Vol. 2, pp. 226-233.
5. A. Gagalowicz and S.D. Ma. Sequential Synthesis of Natural Textures. Computer Vision, Graphics, and Image Processing, Vol. 30, 1985, pp. 289-315.
6. G. Gimel’farb. Image Textures and Gibbs Random Fields. Kluwer Academic Publishers: Dordrecht, 1999, 250 p.
7. D. Heeger and J. Bergen. Pyramid-Based Texture Analysis/Synthesis. ACM SIGGRAPH ’95, 1995, pp. 229-238.
8. A. Hertzmann, C. Jacobs, N. Oliver, B. Curless, and D. Salesin. Image Analogies. ACM SIGGRAPH 2001, 2001, pp. 327-340.
9. B. Julesz and R.A. Schumer. Early Visual Perception. Ann. Rev. Psychol., Vol. 32, 1981, pp. 575-627 (p. 594).
10. R. Paget. Nonparametric Markov Random Field Models for Natural Texture Images. PhD Thesis, University of Queensland, February 1999.
11. L.-Y. Wei and M. Levoy. Fast Texture Synthesis Using Tree-Structured Vector Quantization. SIGGRAPH 2000, pp. 479-488.
12. Y.N. Wu, S.C. Zhu, and X. Liu. Equivalence of Julesz and Gibbs Texture Ensembles. Proc. Int. Conf. Computer Vision (ICCV’99), 1999, pp. 1025-1032.
13. A. Zalesny and L. Van Gool. A Compact Model for Viewpoint Dependent Texture Synthesis. SMILE 2000, Workshop on 3D Structure from Images, Lecture Notes in Computer Science 2018, M. Pollefeys, L. Van Gool, A. Zisserman, and A. Fitzgibbon (Eds.), 2001, pp. 124-143.
Constructing Illumination Image Basis from Object Motion

Akiko Nakashima, Atsuto Maki, and Kazuhiro Fukui

Corporate Research and Development Center, TOSHIBA Corporation
1, Komukai-Toshiba-cho, Saiwai-ku, Kawasaki 212-8582, Japan
Abstract. We propose to construct a 3D linear image basis which spans an image space of arbitrary illumination conditions, from images of a moving object observed under a static lighting condition. The key advance is to utilize the object motion, which causes illumination variance on the object surface, rather than varying the lighting, and thereby to simplify the environment for acquiring the input images. Since we then need to re-align the pixels of the images so that the same view of the object can be seen, the correspondence between input images must be solved despite the illumination variance. In order to overcome the problem, we adapt the recently introduced geotensity constraint that accurately governs the relationship between four or more images of a moving object. Through experiments we demonstrate that an equivalent 3D image basis is indeed computable and available for recognition or image rendering.
1 Introduction
In appearance-based object recognition, lighting variation is one of the most significant issues. As Moses et al. [10] pointed out in the context of face recognition, “the variations between the images of the same face due to illumination and viewing direction are almost always larger than image variations due to change in face identity.” In order to deal with this problem several approaches using image bases have been proposed, and successful results were reported. See, for example, [2,5]. The advantage of these methods is due to the fact that an arbitrary image of an object under a distant light source has a low-dimensional representation. The observation concerning the low-dimensional representation originated in Shashua’s work [14,15]. He showed that images of an object under arbitrary lighting conditions are represented by linear combinations of three base images, given that the surface of the object follows the Lambertian model without shadows. Yuille and Snow [21] further considered ambient background illumination, and Belhumeur and Kriegman [3] extended the representation to the case with attached shadows. They proved that a set of images with attached shadows under arbitrary lighting conditions forms a convex polyhedral cone which is again based on three base images. As is obvious from these observations, the notion of base images provides a foundation for representing images under arbitrary illumination. In this paper, we call the set of the three base images the illumination image basis.
In previous approaches the illumination image basis has been obtained from images of an object in a fixed pose under various lighting conditions. To acquire such images, therefore, the object and the camera must be kept stationary, and the direction of illumination must be varied. As opposed to this situation, if we could equally construct the illumination image basis from images of an object that is in motion under a stationary lighting condition, the process of capturing input images would be substantially simplified. This is especially true for an object such as a human head, which we mainly deal with in this article, since the only required action is to move the object in 3D space. However, to our knowledge there has been no attempt to consider this possibility, and the purpose of this paper is to propose a technique for constructing the illumination image basis from object motion.

Given images of an object captured in different poses, the set of pixels at the same coordinate in the images does not correspond to an identical point on the 3D surface. For the construction of the illumination image basis, thus, we basically need to re-align the pixels in the images as if the object were seen from an identical direction, and the problem therefore becomes that of finding the corresponding image points. For the point correspondence, typically exploited is the constraint that the corresponding parts of the images have equivalent intensity, regarding the variation in illumination as noise. Unfortunately, however, the constraint is nearly always invalid for an object in motion, since relatively non-uniform lighting causes the intensity at a specific location on the surface of the object to change as the object moves [13]. Among the few efforts to address this issue, recently the geotensity constraint¹ [8] has been derived to overcome the problem with respect to camera geometry, and to replace the constant intensity constraint. Based on the notion of linear intensity subspaces [14], the geotensity constraint governs the relationship between four or more images of a moving object.

The algorithm for point correspondence using the geotensity constraint proceeds basically in two stages. The first stage is to derive the parameters of the geotensity constraint by analyzing the coordinates and image intensities of some sample points on the object in motion. That is, computing structure from motion gives the geometric parameters of the situation, whereas computing the linear image subspace gives the lighting parameters of the situation. By combining both sets of parameters we arrive at the geotensity constraint. Using the same set of input images, the second stage is to take each pixel in an arbitrary reference image in turn and search for the depth along the ray from the optical center of the camera passing through the pixel. The correspondence is evaluated by measuring the agreement of the entire set of projected intensities of a point on the object surface with the geotensity constraint. Although the geotensity constraint was proposed originally for the task of 3D surface reconstruction, we will show that the constraint can be applied directly for our purpose, that is, to obtain the illumination image basis from images of an object in motion under stationary illumination. Further, once we obtain the illumination image basis, images of the object under arbitrary lighting
¹ Geotensity stands for “geometrically corresponding pixel intensity.”
conditions can be generated simply by linear combinations of the basis, as stated earlier. Since the geotensity constraint can be applied to any choice of reference frame, different choices allow us to construct illumination bases corresponding to different viewing directions, and thereby we can synthesize variations of object images both in terms of lighting conditions and object poses. While such variations are obviously useful for recognition problems in general, it should be noted that they become available from a minimum of four input images. Although the resulting variations may look similar to what is obtained by the method in [5], we should point out that the entire scheme is different in that we explicitly solve for the point correspondence instead of estimating the surface geometry in the framework of photometric stereo [20].

This paper is organized as follows: First, we introduce the low-dimensional representation for images captured under arbitrary illumination in Section 2 and the geotensity constraint in Section 3. Then, in Section 4 we describe an algorithm to construct the illumination image basis by applying the geotensity constraint. Experiments are shown in Section 5. In Section 6, the final section, discussions and conclusions are presented.
2 Illumination Image Basis
In this section we briefly describe the low-dimensional representations for variations of illumination which have been successfully utilized in appearance-based object recognition. Consider the case of a convex object with a surface conforming to a Lambertian reflectance model with varying albedo (the ratio of outgoing to incoming light intensity), under a single point light source at infinity. Let us assume further that both the camera and the object are stationary, but that the light source can be moved. The key advantage of this assumption is that the correspondence problem between frames is solved in advance, since the same view of the object is always seen. If we have three pictures of the object I(j) (j = 1, 2, 3) from light source directions s(j) (j = 1, 2, 3), respectively, then the intensities in any fourth frame, taken from a novel setting of the light source, must simply be a linear combination of the intensities in the first three frames. Mathematically,
\[ I(4) = \sum_{j=1}^{3} a(j)\, I(j) \]
for some coefficients a(j) [14]. Epstein et al. [4] showed empirically that this relationship is generally quite good, but it was left to Belhumeur and Kriegman [3] to prove that the correct relationship assuming no self-shadowing is in fact
\[ I(4) = \max\!\Big( \sum_{j=1}^{3} a(j)\, I(j),\; 0 \Big), \qquad (1) \]
Fig. 1. Images of an object under variable lighting and an illumination basis.
Fig. 2. Images of an object in motion and an illumination basis.
where the images I(j) (j = 1, 2, 3) must be taken with the light source in the bright cell, which is the cell of light source directions that illuminate all points on the object, and the maximum operator is effected independently on each pixel and establishes whether or not that pixel was illuminated at all by the light source. These first three images form a 3D image basis from which all other images can be derived, and hence can be replaced by any equivalent basis of images. Hence Equation 1 can be replaced by
\[ I(j) = \max\!\big( [\,\breve I(1)\ \breve I(2)\ \breve I(3)\,]\, a(j),\; 0 \big), \qquad (2) \]
where the set of images ˘I(j) (j = 1, 2, 3) is an image basis, which we call the illumination image basis, and the 3D vector a(j) contains the linear coefficients that map the basis to the j-th image. An example is shown in Figure 1. The four images on the left are original images of a statue of Caesar in frontal view under variable lighting. Applying principal component analysis (PCA) to these images, we obtain an illumination image basis. It is shown in the three images on the right in Figure 1, arranged in the order of the eigenvalues.

In theory, the illumination image basis can be obtained directly as three images captured in a fixed pose under three different lighting conditions. In this case, however, the illumination image basis remains contaminated if the captured images contain any noise. In order to obtain an illumination image basis in which the noise is attenuated, PCA is often applied to a larger number of captured input images. Although we illustrate the method in Figure 1 using only four images, the minimum number of inputs, this is done to show the relevance of the method and for consistency with the remaining examples. The concept of the illumination image basis is quite general, and has been employed to develop further sophisticated algorithms for object recognition as well as image rendering, as seen for example in [16].
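As a concrete illustration of this step, the sketch below is a toy example (not the authors' code) that stacks a few registered images of a fixed view, extracts a rank-3 basis with an SVD standing in for PCA (mean subtraction is omitted for brevity), and re-synthesizes an image as a clamped linear combination in the spirit of Equations 1 and 2.

```python
import numpy as np

def illumination_basis(images, rank=3):
    """images: list of HxW arrays of the same view under different lighting.
    Returns `rank` basis images from the leading right singular vectors
    (SVD without mean subtraction, standing in for PCA)."""
    h, w = images[0].shape
    data = np.stack([im.ravel() for im in images])       # n_images x (H*W)
    u, s, vt = np.linalg.svd(data, full_matrices=False)
    return vt[:rank].reshape(rank, h, w)                 # basis images

def synthesize(basis, coeffs):
    """Clamped linear combination of basis images (cf. Eq. 2)."""
    img = np.tensordot(coeffs, basis, axes=1)
    return np.maximum(img, 0.0)

# toy data: 4 "images" of the same view
rng = np.random.default_rng(1)
imgs = [rng.random((32, 32)) for _ in range(4)]
B = illumination_basis(imgs)
new_img = synthesize(B, np.array([0.5, 0.2, -0.1]))
print(B.shape, new_img.shape)
```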
In this paper, we deal with images of an object in motion captured by a fixed camera under stationary illumination. In such a situation, the object is observed differently according to the motion. Figure 2 shows four images of the statue of Caesar in various poses, illuminated from its frontal direction. From these images we wish to construct an illumination image basis, as shown in the right-hand part of Figure 2, that is a counterpart to the illumination image basis obtained in the conventional way in Figure 1. Methods for object recognition which give successful results for arbitrary lighting conditions should then become analogously available without collecting images under various illuminations. Unlike the conventional situation, however, it is not so straightforward to construct an illumination image basis from images of an object in motion. While this is due to the correspondence problem between frames, which is not solved in advance, we overcome the problem by adapting the geotensity constraint, which was originally proposed for 3D surface reconstruction in [8]. We introduce the geotensity constraint in the context of constructing an illumination image basis in the next section, and describe the actual algorithm for constructing the illumination image basis in Section 4.
3 Geotensity Constraint
The geotensity constraint is a constraint on geometrically corresponding pixel intensities between four or more images of an object seen from different views under static lighting conditions. In this section, we introduce the geotensity constraint for a single light source from the viewpoint of constructing an illumination image basis; a detailed explanation of the geotensity constraint can be found in [8]. In general, the geotensity constraint consists of two parts, the geometry constraint and the intensity constraint, from which it is derived.
3.1 Intensity Constraint
As shown in Equation 2, images of an object in a fixed pose captured under arbitrary illumination can be represented by three base images, and the representation is also valid for the intensity at each pixel. That is, the key advantage of Equation 2 is that the same view of the object is always seen and the correspondence problem between frames is solved in advance. Now let us assume a static camera and light source and a moving object. Although we no longer have correspondence between images, we consider the intensity I_i(j) of the i-th point on the surface of the object projected into the j-th image. I_i(j) must then satisfy
\[ I_i(j) = \max\!\big( [\,\breve I_i(1)\ \breve I_i(2)\ \breve I_i(3)\,]\, a(j),\; 0 \big), \qquad (3) \]
where the set ˘I_i(j) (j = 1, 2, 3) represents the intensities at the corresponding point in the three base images. Equation 3 is the intensity constraint between images.
3.2 Geometry Constraint
Regarding the first image as the reference frame, for simplicity we discuss the affine and scaled-orthographic camera models [11] for projection. Consider the i-th world point X_i = (X_i, Y_i, Z_i)^T on the surface of an object projected to the image point x_i(j) = (x_i(j), y_i(j))^T in the j-th frame. The affine camera model defines this projection as
\[ \mathbf{x}_i(j) = M(j)\,\mathbf{X}_i + \mathbf{t}(j), \]
where M(j), an arbitrary 2×3 matrix, and t(j), an arbitrary 2-dimensional vector, encode the motion parameters of the object. The scaled orthographic camera model replaces M(j) with the first two rows of a scaled 3×3 rotation matrix R(j), so that
\[ \mathbf{x}_i(j) = \lambda \begin{pmatrix} \mathbf{R}_1(j) \\ \mathbf{R}_2(j) \end{pmatrix} \mathbf{X}_i + \mathbf{t}(j). \]
Note that R(j) represents the rotation of the object from the first frame to the j-th frame. The Euclidean structure and motion parameters fitting the weak perspective camera model can be recovered by the technique of structure from motion [18]. There is an arbitrary choice of affine or Euclidean frame, which can be partially fixed by choosing the first frame to be canonical [7] such that
\[ \mathbf{x}_i(1) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \mathbf{X}_i + \begin{pmatrix} 0 \\ 0 \end{pmatrix}. \qquad (4) \]
A result of choosing the canonical frame is that the structure vectors have the form X_i = (x_i(1)^T, Z_i)^T, and we can derive the relationship
\[ \mathbf{x}_i(j) = M(j) \begin{pmatrix} \mathbf{x}_i(1) \\ Z_i \end{pmatrix} + \mathbf{t}(j). \qquad (5) \]
This relationship effectively describes the epipolar constraint between two images. This constraint says that, given a point x_i(1) in one image, the corresponding point x_i(j) in another image will lie on a line defined by the object motion parameters, M(j) and t(j), between the two frames, and that the exact position along that line depends on the correct depth, Z_i, of the point. The corresponding point on an epipolar line is shown in Figure 3. Indeed, given a point in one image and the motion parameters for a number of other frames, the corresponding points in all the other images are uniquely determined by a particular choice of Z_i. We will use Equation 5 as a practical affine form of the geometric constraint later. A similar form of this geometric (epipolar) constraint can easily be derived for the more complex camera models.
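To make the role of Equation 5 concrete, here is a minimal sketch (hypothetical names; numpy assumed) that maps a reference-frame pixel with a hypothesized depth into another frame; sweeping the depth traces out the epipolar line.

```python
import numpy as np

def project_affine(x_ref, Z, M_j, t_j):
    """Eq. 5: map reference pixel x_ref (2-vector) with depth hypothesis Z
    into frame j using affine motion parameters M_j (2x3) and t_j (2-vector)."""
    X = np.array([x_ref[0], x_ref[1], Z])   # canonical structure (x, y, Z)
    return M_j @ X + t_j

# sweeping Z traces the epipolar line in frame j
M = np.array([[1.0, 0.0, 0.02],
              [0.0, 1.0, 0.05]])
t = np.array([3.0, -1.0])
for Z in (0.0, 10.0, 20.0):
    print(Z, project_affine(np.array([40.0, 25.0]), Z, M, t))
```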
3.3 Geotensity Constraint
Consider a set of n_j images, I(j) (j = 1, ..., n_j), and a point (x_i^T, Z_i)^T on the surface of an object with depth Z_i that projects to the point x_i(1) in the first image.
Fig. 3. Given a point in one image, according to the choice of depth Z, corresponding points in the other images are searched on the epipolar lines while investigating the intensity constraint.
Under object motion and affine imaging conditions we can recall Equation 5 for the geometric constraint imposed on a sequence of images: the corresponding pixel in the j-th image, x_i(j), is given by Equation 5. Hence, the corresponding pixel intensity, I_i(j), is given by
\[ I_i(j) = I(j)\big[\mathbf{x}_i(j)\big] = I(j)\!\left[ M(j)\begin{pmatrix}\mathbf{x}_i(1)\\ Z_i\end{pmatrix} + \mathbf{t}(j) \right]. \qquad (6) \]
If the motion parameters, M(j) and t(j), and the depth of the point on the object surface, Z_i, are given, I_i(j) is measured from the intensity at the corresponding pixel in the j-th image. On the other hand, when the corresponding pixel intensity I_i(j) is non-zero, it has another representation by the intensity constraint defined by Equation 3, which we can rewrite as
\[ I_i(j) = [\,\breve I_i(1)\ \breve I_i(2)\ \breve I_i(3)\,]\, a(j). \qquad (7) \]
For the n_j measured intensities I_i(j) which are non-zero, it follows that
\[ \mathbf{I}_{n_j} = [\,\breve I_i(1)\ \breve I_i(2)\ \breve I_i(3)\,]\, \mathbf{a}_{n_j}, \qquad (8) \]
where I_{n_j} and a_{n_j} are an n_j-dimensional vector consisting of the non-zero measured intensities and a 3 × n_j matrix containing the columns a(j), respectively. When the linear coefficients a(j) are known in advance, using the non-zero measured intensities I_{n_j}, the values ˘I_i(j) (j = 1, 2, 3) can be computed as the solution of Equation 8 in the form
\[ [\,\breve I_i(1)\ \breve I_i(2)\ \breve I_i(3)\,] = \mathbf{I}_{n_j}\,\mathbf{a}_{n_j}^{\top}\big(\mathbf{a}_{n_j}\mathbf{a}_{n_j}^{\top}\big)^{-1}. \qquad (9) \]
Then, the other representation of the corresponding pixel intensity, Î_i(j), is given by
\[ \hat I_i(j) = \max\!\big( [\,\breve I_i(1)\ \breve I_i(2)\ \breve I_i(3)\,]\, a(j),\; 0 \big). \qquad (10) \]
Now let us define an error between the measured and estimated intensities as
\[ E_i = \sum_{j} \big( I_i(j) - \hat I_i(j) \big)^2. \qquad (11) \]
If the estimated values Î_i(j) are correct, they agree with the measured intensity I_i(j). Hence, the constraint on geometrically corresponding intensity between captured images, namely the geotensity constraint, can be stated simply as E_i = 0. Since the measured intensity I_i(j) (j = 1, ..., n_j) includes noise in practice, the corresponding intensity between captured images, Î_i(j) (j = 1, ..., n_j), is estimated so as to minimize the error in Equation 11.
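A compact numerical sketch of Equations 9-11 might look as follows; it is illustrative only, assumes the coefficients a(j) and the measured intensities are already given, and simply excludes zero measurements as described in the text.

```python
import numpy as np

def geotensity_error(I_meas, a):
    """I_meas: length-n vector of measured intensities I_i(j);
    a: 3 x n matrix of linear coefficients a(j).
    Returns (basis intensities breve-I_i, error E_i of Eq. 11)."""
    nz = I_meas > 0                                    # use non-zero measurements only
    I_nz, a_nz = I_meas[nz], a[:, nz]
    # Eq. 9: least-squares solution for the three basis intensities
    basis = I_nz @ a_nz.T @ np.linalg.inv(a_nz @ a_nz.T)
    # Eq. 10: estimated intensities, clamped at zero
    I_hat = np.maximum(basis @ a, 0.0)
    # Eq. 11: sum of squared differences
    return basis, float(np.sum((I_meas - I_hat) ** 2))

a = np.random.default_rng(2).random((3, 6))
I_true = np.array([0.2, 0.5, 0.1]) @ a
print(geotensity_error(I_true, a)[1])   # near zero for consistent data
```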
4 Construction of Illumination Image Basis
In this section we propose a method to construct an illumination image basis from an image sequence by applying the geotensity constraint. The basic concept is to search for the depth, Z_i, of the surface of the object at each pixel, x_i, in a reference image while computing the error E_i in the geotensity constraint. The corresponding pixels are then searched for along the epipolar lines in the other images, as schematically depicted in Figure 3. The algorithm for constructing an illumination image basis is detailed below, preceded by descriptions of practical schemes for solving for the geometry and the intensity constraints.
4.1 Solving for Geometry
Computing geometric correspondence requires that M(j) and t(j) in Equation 5 be known. To recover these components we solve the well-known affine structure from motion problem using SVD [17]. An initial measurement matrix containing the coordinates of a few corresponding sample points is thus needed, and these must be provided by an independent mechanism. One way to sample proper points through a sequence is to employ a scheme to extract corners and find their correspondences automatically, as seen for example in [1,19]. For the obtained corresponding sample points, we solve not only for the motion parameters but also compute the correspondences between pixel intensities that are required for the next stage of solving for intensity.
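For orientation, a bare-bones version of this rank-3 factorization for affine structure from motion could be written as below (toy, noise-free data; no corner extraction, outlier handling, or metric upgrade). It is only a sketch of the standard technique, not the authors' implementation.

```python
import numpy as np

def affine_sfm(tracks):
    """tracks: (2*n_frames) x n_points measurement matrix of image coordinates.
    Returns affine motion (2*n_frames x 3) and structure (3 x n_points),
    up to an affine ambiguity; the per-row centroids play the role of t(j)."""
    t = tracks.mean(axis=1, keepdims=True)        # per-row translation
    W = tracks - t                                # registered measurement matrix
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    M = U[:, :3] * S[:3]                          # motion
    X = Vt[:3]                                    # structure
    return M, X, t

# toy data: random 3D points observed by random affine cameras
rng = np.random.default_rng(3)
P = rng.random((3, 20))
tracks = np.vstack([rng.random((2, 3)) @ P + rng.random((2, 1)) for _ in range(5)])
M, X, t = affine_sfm(tracks)
print(np.allclose(M @ X + t, tracks, atol=1e-8))
```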
4.2 Solving for Intensity
Using the sample corresponding points obtained in the process of solving for the motion parameters, we must now acquire the linear coefficients a(j) acting on the illumination image basis in Equations 9 and 10. We assume Lambertian surface properties, a convex object, and a single point light source. Observing n_i sample corresponding pixels in each image through n_j frames, we can form the matrix equation
\[ \mathbf{I} = \breve{\mathbf{I}}\,\mathbf{a}, \qquad (12) \]
Fig. 4. Re-assigned images in frontal view.
Fig. 5. Synthesized images under various illumination.
where I is an n_i × n_j matrix containing the corresponding sample pixel intensities I_i(j), ˘I is an n_i × 3 matrix containing a 3D basis for the sparse elements, ˘I(j) (j = 1, 2, 3), such that ˘I = [˘I(1) ˘I(2) ˘I(3)], and a is a 3 × n_j matrix whose columns are the linear coefficients a(j), i.e., a = [a(1) a(2) ... a(n_j)]. We record the corresponding sample pixel intensities, I_i(j), in I, which we call the illumination matrix. Equation 12 is then in the familiar form for solution by singular value decomposition to obtain a rank-3 approximation to the matrix I. Hence, the linear coefficients are acquired from the intensities of the small number of sample pixels in advance. In order to solve the singular value decomposition of I, in this case, at least four sample points are required through four images in the sequence.

The astute reader will have noticed that Equation 12 can only be used if all the pixels used in forming I were illuminated in all j images. This is easy to ensure, since under a single light source these pixels will all have non-zero intensity values, and thus it is not necessary for the images to have been taken with the light in the bright cell. In practice we employ a robust random sample consensus (RANSAC) technique to ensure that artifacts caused by an object not fulfilling the assumed conditions (e.g. specularities and self-shadowing) do not distort the correct solution. As is well known, the solution is unique up to an arbitrary invertible 3×3 transformation A, since
\[ \breve{\mathbf{I}}\,\mathbf{a} = (\breve{\mathbf{I}} A)(A^{-1}\mathbf{a}). \qquad (13) \]
However, we may leave A undetermined, since A is canceled when computing the estimated intensity using Equations 9 and 10.
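The rank-3 decomposition of the illumination matrix can be sketched as follows (toy data; as in the text, the invertible 3×3 ambiguity A of Equation 13 is simply left undetermined).

```python
import numpy as np

def factor_illumination_matrix(I):
    """I: n_i x n_j illumination matrix of sample pixel intensities.
    Returns a rank-3 factorization I ~= I_breve @ a (Eq. 12), unique only
    up to an invertible 3x3 transform A (Eq. 13)."""
    U, S, Vt = np.linalg.svd(I, full_matrices=False)
    I_breve = U[:, :3] * S[:3]      # n_i x 3 basis intensities
    a = Vt[:3]                      # 3 x n_j linear coefficients
    return I_breve, a

# toy illumination matrix of rank 3 (>= 4 sample points over >= 4 frames)
rng = np.random.default_rng(4)
I = rng.random((6, 3)) @ rng.random((3, 5))
I_breve, a = factor_illumination_matrix(I)
print(np.allclose(I_breve @ a, I))
```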
4.3 Constructing Illumination Image Basis
We describe an algorithm for constructing an illumination image basis by way of a dense search for the corresponding pixel intensities, applying the geotensity constraint. Note that we have acquired M(j), t(j) and a(j) through the techniques
discussed in the previous two sections. At each pixel in the reference image, x_i(1), we measure the corresponding pixel intensity, I_i(j), estimate Î_i(j), and compute the error, E_i, in the geotensity constraint at depth Z_i, at regular small intervals. The corresponding pixel intensities are computed from Equations 6, 9, and 10, armed with M(j), t(j) and a(j). As is apparent in Equation 6, the pixel coordinate where we acquire the intensity depends on the depth map. When the depth is correct we expect the error to approach zero. The effective search range for the depth is determined according to the pointalistic model of the object computed in Section 4.1, so that it is limited to the range where the object may exist in space.

By minimizing the error at each pixel in the reference frame, x_i(1), we acquire the intensities at the corresponding pixels in the other frames, I_i(j). With the acquired intensities at the pixels x_i(j), we can construct images of the object seen from the same view as the reference frame. Figure 4 shows the images constructed in this way from the four images in various views shown on the left of Figure 2. The viewing direction in the constructed images is aligned to that of the reference frame, and the images appear to be illuminated by a variable light source. We can either employ a set of three of the constructed images directly as an illumination image basis, or apply PCA to the constructed images and form an illumination basis from the three eigenvectors corresponding to the highest eigenvalues. The three images on the right in Figure 2 compose an illumination image basis obtained by applying PCA to the constructed images shown in Figure 4. Indeed, images under arbitrary illumination can be synthesized from the resulting illumination basis, as shown in Figure 5.

Given n_j images of an object in motion, the framework of the algorithm is summarized as follows:

1◦ We set a reference image as the first one, I(1).
2◦ For a pointalistic model, we compute the motion parameters M(j) (j = 1, ..., n_j) and t(j) (j = 1, ..., n_j), and the linear coefficients a(j) (j = 1, ..., n_j).
3◦ For a particular guess of depth, Z_i, at point x_i, we measure I_i(j) (j = 1, ..., n_j) using Equation 6 with M(j), t(j), and a(j).
4◦ These measurements are fed into Equation 9 to calculate the values ˘I_i(j) (j = 1, 2, 3).
5◦ The estimated values, Î_i(j) (j = 1, ..., n_j), are computed using Equation 10.
6◦ The error E_i can be obtained using Equation 11.
7◦ Do 3◦–6◦ while searching for a depth that minimizes the error E_i.
8◦ Do 3◦–7◦ for all pixels in the reference frame, x_i(1).
9◦ The corresponding pixel intensities I_i(j) (j = 2, ..., n_j) are aligned to the pixel in the reference image x_i(1).
10◦ We apply PCA to the aligned images and the reference frame. Then, an illumination basis is acquired as the three eigenvectors corresponding to the highest eigenvalues.

In process 10◦, alternatively, we may choose three images directly from the reference and aligned images as an illumination basis.
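A deliberately simplified prototype of the inner depth search (steps 3◦–7◦) is sketched below; it uses nearest-pixel sampling, a coarse user-supplied depth grid, and a least-squares fit over all frames, so it is an assumption-heavy illustration rather than the authors' implementation.

```python
import numpy as np

def search_depth(images, x_ref, M, t, a, depths):
    """Steps 3-7 for one reference pixel x_ref: try each depth hypothesis,
    gather intensities in all frames via Eq. 5/6, fit the basis (Eq. 9),
    and keep the depth minimizing the error of Eq. 11.
    images: list of HxW arrays; M: list of 2x3; t: list of 2-vectors; a: 3 x n."""
    best = (np.inf, None, None)
    for Z in depths:
        X = np.array([x_ref[0], x_ref[1], Z])
        I_meas = []
        for img, Mj, tj in zip(images, M, t):
            u, v = (Mj @ X + tj).round().astype(int)       # nearest pixel
            if not (0 <= v < img.shape[0] and 0 <= u < img.shape[1]):
                I_meas = None
                break
            I_meas.append(img[v, u])
        if I_meas is None:
            continue
        I_meas = np.array(I_meas, float)
        basis = I_meas @ a.T @ np.linalg.inv(a @ a.T)      # Eq. 9 (all frames)
        I_hat = np.maximum(basis @ a, 0.0)                 # Eq. 10
        err = np.sum((I_meas - I_hat) ** 2)                # Eq. 11
        if err < best[0]:
            best = (err, Z, I_meas)
    return best  # (error, depth, aligned intensities for this pixel)
```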
Fig. 6. Images of a face under variable lighting and an illumination basis.
Fig. 7. Input images of a face in motion and an illumination basis in frontal view.
Fig. 8. Images of another face under variable lighting and an illumination basis.
Fig. 9. Input images of a face in motion and an illumination basis in frontal view.
5 Experiments
In this section, we show several experiments to confirm the validity of the proposed algorithm. We apply the algorithm to a statue of Caesar and to human faces. First, we compare the results obtained by the proposed algorithm with those obtained in the conventional way. Then, we explore the possibility of utilizing the proposed method for face recognition by comparing the differences between the obtained illumination image bases for faces of different subjects. It is also shown that the method makes it possible to construct illumination image bases in various views according to the choice of the reference frame.
5.1 Illumination Image Bases of a Statue of Caesar and Human Faces
Let us begin by confirming the validity of the proposed method. Using a statue of Caesar and human faces as objects, we compare illumination bases constructed from images of the objects under motion with those constructed from images under variable illumination. Although the results for Caesar have already been referred to in the previous sections for explanation, we discuss more of the details here.
Table 1. (a) Similarities between the subspaces of the same subjects spanned by illumination image bases constructed with the conventional and proposed methods. (b) Similarities between persons A and B.
(a)
subjects       Caesar   person A   person B
similarities    0.99      0.99       0.99

(b)
person A \ B     illumination   object motion
illumination         0.94           0.95
object motion        0.95           0.94
In Figures 1, 6, and 8, the four images on the left are captured under variable lighting. We extract the central area including the eyes, nose, and mouth from the images. By applying PCA to the extracted areas, each illumination image basis is obtained as three eigenvectors. It is shown in the right-hand parts of Figures 1, 6, and 8, arranged according to the eigenvalues. As observed in [6], the first illumination base image is illuminated from the front, the second from the right or left, and the third from the top or bottom.

In Figures 2, 7, and 9, on the other hand, the four images on the left are input images of the objects in motion. The left image in frontal view is set to be the reference frame. By applying the geotensity constraint to the input images, we search for correspondences between the images, and then images in frontal view under variable illumination are obtained by re-aligning the intensities in process 9◦ of Section 4.3. The re-aligned images for Caesar are shown in Figure 4; the search for correspondences is evidently successful. These images are the analogues of the four left-hand images of the object under variable lighting shown in Figure 1. As in Figures 1, 6, and 8, PCA is applied to the re-aligned images to obtain an illumination image basis. The results are shown in the right-hand parts of Figures 2, 7, and 9. The first base image is illuminated from the front, the second from the right or left, and the third from the top or bottom. These aspects are similar to those in Figures 1, 6, and 8, except for the reversal of white and black, which means only that the signs of the elements of the base images are opposite. Hence, each of the base images in Figures 2, 7, and 9 appears to correspond with its counterpart in Figures 1, 6, and 8.

Mathematically, however, since there exist infinitely many illumination image bases which span the illumination subspace unique to one object in one pose, it is not a prerequisite that an illumination image basis constructed from images under object motion exactly correspond with the counterpart from images under varying illumination. In order to evaluate the illumination image basis acquired by the proposed method, we therefore investigate its closeness to the counterpart due to varying illumination by a comparison at the level of the subspaces. For quantitatively computing the similarities between the subspaces spanned by illumination image bases constructed with the conventional and proposed methods, we introduce a canonical angle θ. It represents an angle between two spaces, S1 and S2, and is defined by
\[ \cos^2\theta = \sup_{\substack{s_1\in S_1,\; s_2\in S_2 \\ \|s_1\|\neq 0,\; \|s_2\|\neq 0}} \frac{(s_1, s_2)^2}{\|s_1\|^2\,\|s_2\|^2}. \qquad (14) \]
If S1 exactly corresponds to S2, then cos²θ is equal to 1. The results of computing Equation 14 between the illumination subspaces represented by the illumination image bases shown in Figures 1 and 2, 6 and 7, and 8 and 9 are 0.99, as listed in Table 1 (a). From the results, it can be inferred that an illumination image basis can be constructed from object motion by the proposed algorithm as well as from variable illumination. As for the human faces, the second and third base images in Figures 7 and 9 have some destroyed parts in their appearance, which are seemingly caused by the non-rigidity of the subjects, especially around the eyes (the statue of Caesar, on the other hand, is rigid). Nevertheless, the resulting similarity figures in terms of the illumination subspaces are not deteriorated and thereby show the validity of the proposed algorithm.

We also explore the possibility of utilizing the illumination image bases obtained by the proposed method for face recognition, by computing the differences between the illumination subspaces for two different human subjects. The resulting similarities computed by Equation 14 for all possible combinations of the illumination subspaces derived from the two different human subjects are listed in Table 1 (b). The computed similarities for the two subjects, shown in Figures 6 through 9, are 0.94 or 0.95 for the different cases of utilizing varying illumination and object motion interchangeably. The values are smaller than those for the similarity between the illumination image bases of the same subject, 0.99. In the same way, illumination image bases were constructed for several different subjects, and similar results were obtained; that is, the similarities for the same subjects are 0.99 and for different subjects around 0.94. Hence, the difference between the two subjects can be regarded as reasonably well conserved. Although the difference between 0.94 and 0.99 may seem trivial, it is in practice a significant difference when considered in algorithms for face recognition. In the Fisherface method [2], for example, the ratio of the dispersion of images classified in a correct category to the dispersion of images classified in a false category is minimized. Since the minimization has the effect of emphasizing differences between categories, the observed difference will be feasibly recognized by the algorithm.
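For reference, a canonical-angle similarity of the kind defined by Equation 14 can be evaluated from the singular values of the product of orthonormalized bases; the sketch below is a generic illustration on random data, not the code used in the experiments.

```python
import numpy as np

def subspace_similarity(B1, B2):
    """B1, B2: d x 3 matrices whose columns span the two illumination subspaces.
    Returns cos^2 of the smallest canonical angle (cf. Eq. 14)."""
    Q1, _ = np.linalg.qr(B1)                 # orthonormal bases
    Q2, _ = np.linalg.qr(B2)
    s = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
    return float(s.max() ** 2)               # largest cosine, squared

rng = np.random.default_rng(5)
B = rng.random((1000, 3))
print(subspace_similarity(B, B))                       # approximately 1.0 for identical subspaces
print(subspace_similarity(B, rng.random((1000, 3))))   # smaller for different ones
```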
5.2 Illumination Image Bases in Various Views
In the proposed method, illumination image bases in various views can be constructed according to the selected reference frame. Here, we use the input images in Figure 7 again. The four input images are set to be the reference frame in turn, and illumination image bases are constructed. Then, we synthesize images under various illumination, represented by linear combinations of the obtained illumination base images with various linear coefficients. The synthesized images in the views corresponding to the reference frames are shown in Figure 10. It can be seen that the proposed method makes construction of illumination image bases in various views possible simply by replacing the reference image with other images in different views.
Fig. 10. Illumination bases in various views. The reference images are selected from the input images in Figure 7.
Therefore, constructing illumination image bases from object motion is useful for dealing with variations not only of illumination but also of view in appearance-based object recognition. When replacing the reference frame, we do not need to search for the small number of sample points again, nor to recompute the motion parameters and linear coefficients. Once the motion parameters and linear coefficients are computed, the new motion parameters for a different reference image are easily derived from the previous motion parameters, which represent the relative motion between the frames, and the linear coefficients can be used directly as they are.
6 Discussions and Conclusions
We have proposed a method to construct an illumination image basis from images of an object in motion under fixed illumination. The experiments showed the validity of the proposed method for a statue of Caesar and human faces. The advantages of the method are:

1. An illumination image basis is constructed from images of an object in motion without changing the illumination or camera view.
2. Images under arbitrary illumination are synthesized by linear combination of the obtained illumination base images, equivalently to the conventional method.
3. Illumination image bases in various poses are constructed from an image sequence only by replacing the reference frame.
4. By utilizing the illumination image bases in various poses, we can enjoy the advantages of object recognition algorithms that consider the variations of illumination and pose.

Although we dealt with the geotensity constraint in the case of a single light source for simplicity in this paper, extension to the case of multiple light sources is possible by applying the geotensity constraint under multiple light sources [9]. In general, illumination image bases in various views constructed by the proposed method can be utilized for image rendering [16] and for object recognition [12], although the occlusion problem is left as future work.
Acknowledgments. The authors wish to thank Hiroshi Hattori for providing the RANSAC program to find corner correspondence.
References

1. P.A. Beardsley, P. Torr, and A.P. Zisserman. 3D model acquisition from extended image sequences. In 4th ECCV, pages 683–695, Cambridge, UK, 1996.
2. P.N. Belhumeur, J.P. Hespanha, and D.J. Kriegman. Eigenfaces vs. fisherfaces: recognition using class specific linear projection. IEEE-PAMI, 19(7):711–720, 1997.
3. P.N. Belhumeur and D.J. Kriegman. What is the set of images of an object under all possible illumination conditions? IJCV, 28:3:245–260, 1998.
4. R. Epstein, P.W. Hallinan, and A. Yuille. 5 ± 2 Eigenimages suffice: an empirical investigation of low-dimensional lighting models. In Proc. IEEE Workshop on Physics-based Modeling in CV, 1995.
5. A.S. Georghiades, D.J. Kriegman, and P.N. Belhumeur. Illumination cones for recognition under variable lighting: Faces. In CVPR, pages 52–58, 1998.
6. P.W. Hallinan. A low-dimensional representation of human faces for arbitrary lighting conditions. In CVPR, pages 995–999, 1994.
7. R.I. Hartley. Euclidean reconstruction from uncalibrated views. In J.L. Mundy, A. Zisserman, and D. Forsyth, editors, Proc. Applications of Invariance in Computer Vision, pages 238–256. Springer-Verlag, 1994. Lecture Notes in Computer Science 825.
8. A. Maki, M. Watanabe, and C.S. Wiles. Geotensity: Combining motion and lighting for 3d surface reconstruction. In 6th ICCV, pages 1053–1060, 1998.
9. A. Maki and C. Wiles. Geotensity constraint for 3d surface reconstruction under multiple light sources. In 6th ECCV, pages 725–741, 2000.
10. Y. Moses, Y. Adini, and S. Ullman. Face recognition: the problem of compensating for changes in illumination direction. In J.O. Eckland, editor, 3rd ECCV, pages 286–296, Stockholm, Sweden, 1994. Springer-Verlag.
11. J.L. Mundy and A. Zisserman, editors. Geometric invariance in computer vision. The MIT Press, 1992.
12. H. Murase and S.K. Nayar. Visual learning and recognition of 3-d objects from appearance. IJCV, 14(1):5–24, 1995.
13. A.P. Pentland. Photometric motion. IEEE-PAMI, 13:9:879–890, 1991.
14. A. Shashua. Geometry and photometry in 3D visual recognition. PhD thesis, Dept. Brain and Cognitive Science, MIT, 1992.
15. A. Shashua. On photometric issues in 3d visual recognition from a single 2d image. IJCV, 21:99–122, 1997.
16. A. Shashua and T. Riklin-Raviv. The quotient image: Class-based re-rendering and recognition with varying illuminations. IEEE-PAMI, 23(2):129–139, 2001.
17. C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: a factorization method. IJCV, 9:2:137–154, 1992.
18. D. Weinshall and C. Tomasi. Linear and incremental acquisition of invariant shape models from image sequences. In 4th ICCV, pages 675–682, 1993.
19. C.S. Wiles, A. Maki, and N. Matsuda. Hyper-patches for 3d model acquisition and tracking. IEEE-PAMI, 23(12):1391–1403, 2001.
20. R.J. Woodham. Photometric method for determining surface orientation from multiple images. Optical Engineering, 19:139–144, 1980.
21. A. Yuille and D. Snow. Shape and albedo from multiple images using integrability. In CVPR, pages 158–164, 1997.
Diffuse-Specular Separation and Depth Recovery from Image Sequences†

Stephen Lin1, Yuanzhen Li1,2, Sing Bing Kang3, Xin Tong1, and Heung-Yeung Shum1

1 Microsoft Research, Asia
2 Chinese Academy of Science
3 Microsoft Research
Abstract. Specular reflections present difficulties for many areas of computer vision such as stereo and segmentation. To separate specular and diffuse reflection components, previous approaches generally require accurate segmentation, regionally uniform reflectance or structured lighting. To overcome these limiting assumptions, we propose a method based on color analysis and multibaseline stereo that simultaneously estimates the separation and the true depth of specular reflections. First, pixels with a specular component are detected by a novel form of color histogram differencing that utilizes the epipolar constraint. This process uses relevant data from all the stereo images for robustness, and addresses the problem of color occlusions. Based on the Lambertian model of diffuse reflectance, stereo correspondence is then employed to compute for specular pixels their corresponding diffuse components in other views. The results of color-based detection aid the stereo correspondence, which determines both separation and true depth of specular pixels. Our approach integrates color analysis and multibaseline stereo in a synergistic manner to yield accurate separation and depth, as demonstrated by our results on synthetic and real image sequences.
1 Introduction
The difference in behavior between diffuse and specular reflections poses a problem in many computer vision algorithms. While purely diffuse reflections ordinarily exhibit little variation in color from different viewing directions, specular reflections tend to change significantly in both color and position. A vast majority of techniques in areas such as stereo and image segmentation account only for diffuse reflection and disregard specularities as noise. While this diffuse assumption is generally valid for much of an image, processing of regions that contain specular reflections can result in significant inaccuracies. For instance, traditional stereo correspondence of specular reflections does not give the true depth of a scene point, but instead an incorrect virtual depth. To avoid such problems, methods for image preprocessing have been developed to detect and to separate specular reflections in images.

† This work was performed while the second author was visiting Microsoft Research, Asia.
1.1 Related Work
To distinguish between diffuse and specular image intensities, most separation methods utilize color images and the dichromatic reflectance model proposed by Shafer [15]. This model suggests that, in the case of dielectrics (non-conductors), diffuse and specular components have different spectral distributions. The spectral distribution of the specular component is similar to that of the illumination, while the distribution of the diffuse component is a product of illumination and surface pigments. The color of a given pixel can be viewed in RGB color space as a linear combination of a vector for object reflectance color and a vector for illumination color. All image points on a uniform-colored surface lie on a dichromatic plane which is spanned by these two vectors.

Rich literature exists on separation using the dichromatic model. Klinker et al. [7] developed a method based on the observation that the color histogram of a surface with uniform reflectance takes the shape of a “skewed T” with two limbs. One limb represents purely diffuse points while the other corresponds to highlight points. Their separation algorithm automatically identifies the two limbs, then computes the diffuse component of each highlight point as its projection on the diffuse limb. To estimate specular color, Tong and Funt [16] suggest computing the dichromatic planes of several uniform-reflectance regions and finding the line most parallel to these planes. Sato and Ikeuchi [14] also employed the dichromatic model for separation by analyzing color signatures produced from many images taken under a moving light source.

Besides color, polarization has also been an effective cue for specular separation. Wolff and Boult [17] proposed a polarization-based method for separating reflection components in regions of constant Fresnel reflectance. Nayar et al. [12] used polarization in conjunction with color information from a single view to separate reflection components, where a constraint is provided by neighboring pixels with similar diffuse color.

These previous approaches have produced good separation results, but they have requirements that limit their applicability. Nearly all of these methods assume some sort of consistency among neighboring pixels. For the color algorithms, color segmentation is needed to avoid problematic overlaps in color histograms among regions with different diffuse reflection color. Segmentation algorithms are especially unreliable in regions containing specularity, and errors in segmentation significantly degrade the results of dichromatic plane analysis. Additionally, these color methods generally assume uniform illumination color, so interreflections cannot be present. For typical real scenes, which have textured objects and interreflections, anatomizing the color histogram is essentially infeasible. The need for segmentation also exists for polarization methods, which additionally require polarizer rotation. The algorithm by Sato and Ikeuchi [14] notably avoids these assumptions, but the need for an image sequence taken under a moving light source restricts its use.
1.2 Stereo in the Presence of Specularities
Stereo in the presence of specular reflection has been a challenging problem. Bhat and Nayar [1] consider the likelihood of correct stereo matching by analyzing
the relationship between stereo vergence and surface roughness, and in [2] they further propose a trinocular system where only two images are used at a time in the computation of depth at a point. Jin et al. [5] pose this problem within a variational framework and seek to estimate a suitable shape model of the scene. Of these methods, only [2] has endeavored to recover true depth information for specular points, but it involves extra effort to determine a suitable trinocular configuration.
1.3 Our Approach: Color-Based Separation with Multibaseline Stereo
We circumvent the limiting assumptions made in previous works by framing the separation problem as one of stereo correspondence. By matching specular pixels to their corresponding diffuse points in other views, we can determine the diffuse components of the specularities using the Lambertian model of diffuse reflection. In this way, we employ multibaseline stereo in a manner that computes separation and depth together. Without special consideration of camera positions, we handle highlights in multibaseline stereo by forming correspondence constraints based on the following assumptions: diffuse reflection satisfies the Lambertian property; specular reflections vary in color from view to view; scene points having specular reflection exhibit purely diffuse reflection in some other views. The first assumption is commonplace in computer vision, and the second assumption is not unreasonable, because specular reflections change positions in the scene for different viewpoints. Even when a specular reflection lies on the same uniformly colored surface between two views, the underlying diffuse shading and the specular intensity will likely change, thus yielding different color measurements [9]. The third assumption is dependent on surface roughnesses in the scene, but is often satisfied by having a sufficiently long stereo baseline for the entire image sequence.

With epipolar geometry and the aforementioned viewing conditions, our proposed method first detects specular reflections using stereo-enhanced color processing. The color-based detection results are then used to heighten the accuracy of stereo correspondence, which yields simultaneous separation and depth recovery. It is this synergy between color and stereo that leads to better estimation of both separation and depth. With their combination, we develop an algorithm that is novel in the following aspects:

– The epipolar constraint is used to promote color histogram differencing (CHD) by correspondence of image rows.
– A technique we call Multi-CHD is introduced to handle color occlusions and to make full use of available images for robust specularity detection.
– Diffuse components and depth of specular reflections are computed by stereo correspondence constrained by epipolar geometry and detection results.

In comparison with previous separation approaches, our use of multibaseline stereo provides significant practical advantages. First, prior image segmentation and consistency among neighboring pixels are not assumed. Second, our method
is robust to moderate amounts of interreflection, which are present in most real scenes. Third, capture of image sets can potentially be done instantaneously, since we do not require polarizer rotations or changes in lighting conditions that prohibit possible frame-rate execution. These three properties make our method more feasible than any previous method. The remainder of this paper is organized as follows. Section 2 describes the details of specularity detection by color histogram differencing under epipolar constraints. These detection results are used in Section 3 to constrain stereo correspondence, which determines specularity separation and depth. The effectiveness of our algorithm is supported by experimental results presented in Section 4, and the paper closes with a discussion and some conclusions in Sections 5 and 6, respectively.
2 Color Histogram Differencing
For specularity detection, we develop a novel method that takes advantage of multibaseline stereo to improve color histogram differencing. CHD, introduced in [9], is based upon changes in the color of a specular reflection from view to view. Under the Lambertian property, the diffuse colors of a scene are viewpoint independent, with the exception of occlusions and disocclusions. Since the scene location of a specular reflection changes from view to view, its underlying diffuse reflection can differ in shading or color. The underlying diffuse color is a component of the overall specular color, so a specular reflection will change in color between different views. This difference in behavior between specular and purely diffuse reflections is exploited by CHD to detect specularities.

Suppose that we have two images I1 and I2 of the same scene taken from different viewpoints. For each image Ik, its pixel colors can be mapped into a binary RGB histogram Hk, such as those shown in Fig. 1. If a scene point contains purely diffuse reflection, then its positions in H1 and H2 will be identical. A specular reflection in H1, however, will have a shifted histogram position in H2. By subtracting points in H2 from H1, we can detect the specular pixels in I1:

H1,spec = H1 − H2 = {p | p ∈ H1, p ∉ H2}.

Because of image noise and other unmodelled effects, some leeway is allowed in the differencing, such that slight differences in histogram point positions between H1 and H2 do not indicate specularity.

In the remainder of this section, we present our method of specularity detection based on CHD. First we describe tri-view CHD, which extends standard CHD to a trinocular multibaseline stereo setting in a way that takes advantage of epipolar geometry and addresses color occlusion. Then we generalize tri-view CHD to make full use of the abundant stereo images in a method we call multi-CHD.
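A minimal version of this two-view differencing, with a coarse RGB quantization standing in for the noise tolerance, might look like the following (the helper names and the quantization step are our assumptions, not part of the paper).

```python
import numpy as np

def rgb_histogram(image, step=8):
    """Binary (occupancy) RGB histogram as a set of quantized color bins."""
    q = (image.reshape(-1, 3) // step).astype(np.int32)
    return set(map(tuple, q))

def chd(image1, image2, step=8):
    """Mark pixels of image1 whose quantized color is absent from image2's
    histogram: H1_spec = H1 - H2 (the quantization provides the 'leeway')."""
    h2 = rgb_histogram(image2, step)
    q1 = (image1.reshape(-1, 3) // step).astype(np.int32)
    mask = np.array([tuple(c) not in h2 for c in q1])
    return mask.reshape(image1.shape[:2])

rng = np.random.default_rng(6)
im1 = rng.integers(0, 256, (20, 20, 3))
im2 = im1.copy()
im2[5, 5] = [250, 250, 250]           # simulate a shifted specular color
print(chd(im2, im1).sum())            # pixels detected as specular in im2
```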
Fig. 1. Standard CHD. Since specular reflections generally change in color from one view to another, they can be detected by histogram differencing.
2.1 Tri-View CHD
Two major obstacles for standard CHD are histogram clutter and color occlusions. When parts of a histogram are crowded with points, CHD can fail because specular histogram points in one view can possibly be masked by other points, diffuse or specular, with similar color in another view. When this happens, specular pixels go undetected. The change in viewpoint also presents difficulties for CHD because diffuse colors present in one view may be occluded in another view by scene geometry or specular reflections. As a result, diffuse pixels are mistakenly detected as specular. To address both of these problems, we present a method called tri-view CHD.

In standard CHD, differencing is done between histograms formed from entire images. Because of the large number of pixels involved, there often exists much crowding in color histograms. To reduce this clutter, we take advantage of epipolar geometry, which allows us to decrease the number of histogram points by differencing image rows rather than entire images. In multibaseline stereo, the correspondence between image rows is simply the matching scanlines. For tri-view CHD, matching scanlines for three images are illustrated in Fig. 2 and are denoted as IC for the central (reference) viewpoint, IL for the left viewpoint, and IR for the right viewpoint. The histograms corresponding to these scanlines are respectively labelled HC, HL and HR.

We use three images to lessen the effects of occlusion caused by both geometry and specular reflections. Diffuse reflection in IC that is geometrically occluded in IL or IR is present in the histogram HL ∪ HR, because different parts of the scene are occluded when changing the viewpoint to the left or to the right. The same is true for diffuse reflection in IC that is occluded by specularity in IL or IR. Based on this observation, we can formulate the set of histogram points in HC that contain specular reflection as

HC,spec = HC − (HL ∪ HR).

The detected specular histogram points are backprojected to the image to locate the specular pixels, which we represent as a binary image:
\[ S_C(x,y) = \begin{cases} 0 & \text{if } I_C(x,y) \text{ is non-specular,} \\ 1 & \text{if } I_C(x,y) \text{ is specular.} \end{cases} \qquad (1) \]
Fig. 2. Tri-view CHD. In the multibaseline camera configuration, matching epipolar lines are used for histogram differencing.
This consideration of occlusion effects and histogram cluttering diminishes both the number of false positives and the number of missed detections. Although we can compute the detection using tri-view CHD, the results can be sensitive to the viewpoint interval. For intervals that are too small, the specularity color may not differ significantly, and for very large intervals visibility differences arise. Since an appropriate interval depends on scene geometry and surface roughness, it is impractical to compute. But since a multibaseline stereo sequence contains many viewpoints from which to form image triplets, we can make use of these available images for more robust detection.

2.2 Multi-CHD
We refer to the use of the entire stereo sequence for tri-view CHD as multi-CHD. To explain this technique, let us consider the camera configuration illustrated in Fig. 3, with (2n + 1) viewpoints from which we capture (2n + 1) images labelled {ILn, ..., IL2, IL1, IC, IR1, IR2, ..., IRn}. For the center reference image IC, we can form n different uniform-interval triplets {ILk, IC, IRk} for k = 1, 2, ..., n. Tri-view CHD can be performed on each of these triplets as described in the previous subsection to form n specular point sets SC,k. These sets are used to vote for the final detection result. Voting is an effective method for obtaining a more reliable result by combining data from a number of sources, and the voted output reflects a consensus or compromise of the information. Each image pixel is deemed specular if the number of votes from the SC,k exceeds a given threshold t, i.e., if \sum_{k=1}^{n} S_{C,k}(x, y) > t.
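The voting step itself is simple; the sketch below assumes that the per-triplet detection maps S_{C,k} have already been computed (for example with the tri-view routine above), and t is the voting threshold.

```python
import numpy as np

def multi_chd(s_c_maps, t):
    """Combine per-triplet detections S_{C,k} by voting.

    s_c_maps: list of n binary (H, W) arrays, one per triplet {I_Lk, I_C, I_Rk}.
    t: vote threshold; a pixel is declared specular if sum_k S_{C,k}(x, y) > t.
    """
    votes = np.sum(np.stack(s_c_maps, axis=0), axis=0)
    return (votes > t).astype(np.uint8)
```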
Fig. 3. Camera configuration for Multi-CHD.
Although we employ this basic voting scheme, it could be enhanced by weighting each vote according to the distance of its triplet's viewpoints from the reference viewpoint. Additionally, the reliability of voting can be improved by increasing the number of statistically independent voters [8]. This can be done by considering image triplets whose viewpoints are unevenly spaced, or by generalizing the multi-CHD concept to a 2D grid of cameras.
3 Stereo Correspondence for Specular Reflections
Once we detect the specular pixels in an image, we attempt to associate them with their corresponding diffuse points in other images to determine separation and depth. Because of the large differences in intensity and color, a correspondence cannot be computed directly between a specular pixel and its diffuse counterpart, so we instead correspond diffuse pixels of other images under the constraint of their disparity relationship to the specular pixel in the reference image. In our approach, we first use multibaseline stereo to compute initial disparity estimates for diffuse reflection. With these diffuse disparities, we impose a continuity constraint to determine a search range for estimation of specular disparities. This second round of stereo combines the use of spatially flexible windows and a dynamically selected subset of images with which to compute the matches. Upon determining the correspondence, a diffuse component is estimated to produce the separation, and the computed disparity gives the true depth.

3.1 Initial Depth Estimation for Diffuse Reflection
Our specular correspondence method begins with an initial depth estimate that employs the epipolar constraint. This preliminary estimate is reasonably accurate for purely diffuse pixels, and it is used to facilitate more precise depth
estimation for specular pixels. The uncertainty of this estimate can also be used as an additional independent cue for specularity detection. For a forward-facing multibaseline stereo configuration [13], disparity varies linearly with horizontal pixel displacement. In order to estimate the disparity for a pixel (x, y), we first aggregate the matching costs over a window as the sum of sum of squared differences (SSSD), namely

E_{SSSD}(x, y, d) = \sum_{k} \sum_{(u,v) \in W(x,y)} \rho\left( I_0(u, v) - \hat{I}_k(u, v, d) \right),   (2)

where ρ(·) is the per-pixel squared Euclidean distance in RGB between the reference image I_0 and \hat{I}_k (the image I_k warped at disparity d), and W(x, y) is a square window centered at (x, y). For each non-specular pixel (x, y) in the reference image, as determined by multi-CHD, the minimum of E_{SSSD} over the candidate disparity values determines the estimated disparity d and the uncertainty u of the estimation:

d(x, y) = \arg\min_{d} E_{SSSD}(x, y, d)   and   u(x, y) = \min_{d} E_{SSSD}(x, y, d).
Similar to confidence measures in other stereo work, u gauges uncertainty because match quality is poor when its value is high. This quantity will later be used within a confidence weight for disparity estimates, and additionally, it can be used as a cue for specular reflection: pixels with uncertainty above a specified threshold can be added to the multi-CHD detection results. To handle occlusions, which pose a problem in dense multi-view stereo, we incorporate shiftable windows and temporal image selection [6] to improve the matching of boundary pixels and semi-occluded pixels. Since specular points cannot be matched in this way, we correspond them more carefully using the continuity constraint.
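A sketch of this initial estimation for a forward-facing configuration, in which warping I_k at disparity d is approximated by a horizontal shift proportional to the camera index; the integer shift, the wrap-around of np.roll, and the disparity list are simplifications, and the shiftable-window and temporal-selection refinements of [6] are omitted here.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def sssd_disparity(images, ref_index, disparities, win=5):
    """Disparity d(x,y) = argmin_d E_SSSD and uncertainty u(x,y) = min_d E_SSSD (Eq. 2)."""
    ref = images[ref_index].astype(np.float64)
    h, w = ref.shape[:2]
    cost = np.zeros((len(disparities), h, w))
    for di, d in enumerate(disparities):
        for k, img in enumerate(images):
            if k == ref_index:
                continue
            shift = int(round((k - ref_index) * d))      # disparity is linear in baseline
            warped = np.roll(img.astype(np.float64), shift, axis=1)
            ssd = np.sum((ref - warped) ** 2, axis=-1)   # per-pixel squared RGB distance rho
            cost[di] += uniform_filter(ssd, size=win)    # aggregate (mean) over W(x, y)
    best = np.argmin(cost, axis=0)
    d_map = np.asarray(disparities)[best]
    u_map = np.min(cost, axis=0)
    return d_map, u_map
```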
3.2 Continuity Constraint
With the computed disparity estimates for diffuse pixels, we calculate a disparity search interval for specular pixels using the continuity constraint. Based on the cohesiveness of matter, the continuity constraint proposed by Marr and Poggio [10] states that disparity should vary smoothly throughout an image, given opaque material and piecewise continuous surfaces. The failure of this assumption at depth discontinuities can be remedied by using a large window for the SSSD computation. We can therefore safely impose this constraint to get an initial estimate and search range of disparity for specular points, even if they are located near depth discontinuities. We implement the continuity constraint using k-nearest neighbors (k-NN). For a particular specular point S(x, y), we find the k nearest non-specular points around it. Their disparity estimates d_l for l = 1, ..., k are accompanied by corresponding uncertainty values u_l that indicate estimation quality. The initial estimate for disparity is expressed in terms of the k nearest neighbor disparities weighted by the reciprocal of their uncertainty values, i.e.,

d_i(x, y) = \frac{\sum_l d_l u_l^{-1}}{\sum_l u_l^{-1}}.
From this initial estimate, we restrict the search range of possible disparities to be within a preset bound d_r, so that the disparity candidate interval D(x, y) for S(x, y) is

D(x, y) = [d_i(x, y) - d_r, \; d_i(x, y) + d_r].   (3)

In our experiments we take d_r equal to the standard deviation of the initial disparity estimates over the image.
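A sketch of this step, assuming disparity and uncertainty maps for the diffuse pixels (for example from the previous sketch) and the multi-CHD mask; as a simplification, d_r is taken here as the standard deviation of the diffuse estimates rather than of all initial estimates, and the k-NN search uses a KD-tree over pixel coordinates.

```python
import numpy as np
from scipy.spatial import cKDTree

def specular_disparity_intervals(spec_mask, d_map, u_map, k=8):
    """Initial disparity d_i and interval D = [d_i - d_r, d_i + d_r] for each specular pixel."""
    diffuse_yx = np.argwhere(spec_mask == 0)
    tree = cKDTree(diffuse_yx)
    d_r = np.std(d_map[spec_mask == 0])              # preset bound from diffuse estimates
    intervals = {}
    for (y, x) in np.argwhere(spec_mask == 1):
        _, idx = tree.query((y, x), k=k)              # k nearest non-specular neighbors
        ny, nx = diffuse_yx[idx, 0], diffuse_yx[idx, 1]
        w = 1.0 / (u_map[ny, nx] + 1e-9)              # weight by reciprocal uncertainty
        d_i = np.sum(w * d_map[ny, nx]) / np.sum(w)
        intervals[(y, x)] = (d_i - d_r, d_i + d_r)    # Eq. (3)
    return intervals
```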
3.3 Shiftable and Flexible Windows
Since pixels with specular reflection degrade area-based correlation, we employ shiftable and flexible windows, rather than a fixed window, to get a selectively aggregated matching error. The basic idea of shiftable windows is to examine several windows that include the pixel of interest, not just the window centered at that pixel. This strategy has been shown to deal effectively with occlusions [11,3]. We extend this idea by excluding pixels detected as specular in the previous section. This results in shiftable windows with flexible shapes, and we show that it improves the matching of pixels that are specular in some images and non-specular in others. It is furthermore effective in dealing with pixels near the boundary between specular and diffuse regions. In the formation of these flexible windows, we use the specular detection S_k from (1) for the respective images I_k. For correspondence between the image pair I_0 and I_k, we modify the n × n shiftable window W_{n×n}(x, y) that includes (x, y) into the window W_f(x, y), whose support is flexibly shaped to exclude specularity:

W_f(x, y) = \{ (u, v) \mid (u, v) \in W_{n \times n}(x, y), \; S_0(u, v) = 0, \; S_k(u, v) = 0 \}.   (4)

In the case where the number of valid pixels in a window falls below a threshold, we progressively increase the original window size. Over this flexible and shiftable window, we aggregate the raw matching cost to compute the SSD:

E_{SSD}(x, y, d, k) = \frac{\sum_{(u,v) \in W_f(x,y)} w(u, v) \, E_{raw}(u, v, d, k)}{\sum_{(u,v) \in W_f(x,y)} w(u, v)},
where w(u, v) is the support weight of each pixel in W_f(x, y) for (x, y), which we set to the constant 1 to obtain the mean.
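A sketch of the flexible-window cost for one candidate window centered at (x, y), with w(u, v) = 1; the raw cost is taken as the squared RGB difference between I_0 and the warped I_k, and the progressive window growth is only indicated.

```python
import numpy as np

def flexible_window_ssd(i0, ik_warped, s0, sk, x, y, n=9, min_valid=10):
    """E_SSD over the window W_f(x, y) that excludes pixels marked specular in either view."""
    h, w = s0.shape
    half = n // 2
    y0, y1 = max(0, y - half), min(h, y + half + 1)
    x0, x1 = max(0, x - half), min(w, x + half + 1)
    valid = (s0[y0:y1, x0:x1] == 0) & (sk[y0:y1, x0:x1] == 0)   # support of Eq. (4)
    if valid.sum() < min_valid:
        return None   # the caller would grow the window or skip this view
    diff = i0[y0:y1, x0:x1].astype(np.float64) - ik_warped[y0:y1, x0:x1].astype(np.float64)
    e_raw = np.sum(diff ** 2, axis=-1)            # per-pixel raw matching cost
    return e_raw[valid].mean()                    # weights w(u, v) = 1 give the mean
```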
3.4 Temporal Selection
Rather than summing the match costs over all views, a better approach would be to dynamically select a subset of views where the support window is believed to be mostly diffuse and unoccluded. Towards this end, we can formulate from the specular detection results a temporally selective aggregated matching error:

E_{SSSD}(x, y, d) = \frac{\sum_{k :\, C(x,y) > T} wt(k) \, E_{SSD}(x, y, d, k)}{\sum_{k :\, C(x,y) > T} wt(k)},

where

C(x, y) = \sum_{(u,v) \in W_f(x,y)} [1 - S_k(u, v)].   (5)
The constraint C(x, y) > T ensures that in the selected views the correlation window includes an appropriate number of diffuse points, where T is a percentage of the pixels in the original n × n shiftable window. The factors wt(k) are weights of E_{SSD}(x, y, d, k) which could normalize for the number of temporally selected views. We instead use these weights to deal with occlusions in the selected views. Views with a lower local SSD error E_{SSD}(x, y, d, k) are more likely to have visible corresponding pixels, so we set wt(k) = 1 for the best 50% of the images satisfying constraint (5), and wt(k) = 0 for the remaining 50%. This temporal selection rule is similar to that described in [6]. Finally, we adopt a winner selection strategy to compute the final disparity:

d(x, y) = \arg\min_{d \in D(x,y)} E_{SSSD}(x, y, d),
where D(x, y) is the candidate interval for disparity given in (3).
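A sketch of the temporal selection and winner selection steps; it assumes that the per-view costs E_SSD(x, y, d, k) and the diffuse-pixel counts C(x, y) are supplied by a routine such as the flexible-window sketch above, and the callable interface and disparity step are illustrative.

```python
import numpy as np

def temporally_selected_cost(e_ssd_per_view, c_per_view, T):
    """E_SSSD(x, y, d): mean E_SSD over the better half of the views with C(x, y) > T."""
    costs = np.array([e for e, c in zip(e_ssd_per_view, c_per_view)
                      if c > T and e is not None])
    if costs.size == 0:
        return np.inf
    keep = max(1, costs.size // 2)                 # wt(k) = 1 for the best 50% of views
    return np.sort(costs)[:keep].mean()

def specular_disparity(per_view_cost, interval, c_per_view, T, step=0.25):
    """Winner selection: argmin of E_SSSD over the candidate interval D(x, y) of Eq. (3).

    per_view_cost(d) must return the list of E_SSD(x, y, d, k) for all non-reference views k.
    """
    candidates = np.arange(interval[0], interval[1] + step, step)
    totals = [temporally_selected_cost(per_view_cost(d), c_per_view, T) for d in candidates]
    return candidates[int(np.argmin(totals))]
```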
3.5 Separation and Depth Estimate
Upon corresponding a specular point PS to a diffuse point PD in any other view, we can directly compute depth from the disparity and can theoretically take the color of PD as the diffuse component of PS , because of the Lambertian diffuse reflection assumption. The specular component can simply be computed as the difference in color between PS and PD . In reality, we must also account for noise and mismatches. So we find all possible corresponding diffuse points in the light field for PS , and then compute the mean value of the color measurements to obtain PD .
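A small sketch of this separation step; clipping the specular component to be non-negative is an added safeguard against noise and is not part of the description above.

```python
import numpy as np

def separate(p_s_color, corresponding_diffuse_colors):
    """Diffuse/specular components of a specular pixel from its diffuse correspondences."""
    p_d = np.mean(np.asarray(corresponding_diffuse_colors, dtype=np.float64), axis=0)
    specular = np.asarray(p_s_color, dtype=np.float64) - p_d
    return p_d, np.clip(specular, 0.0, None)   # guard against noise-induced negatives
```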
4 Experimental Results
In this section, we present results on synthetic and real image sequences to validate our approach. For both sequences, our algorithm settings are a multi-CHD voting threshold of 50% and a CHD differencing distance threshold equal to one standard deviation of the image noise. The depth interval bound d_r in (3) is 1.4 for the synthetic images and 2.2 for the real images.

4.1 Synthetic Image Sequence
Experiments on synthetic images are used for comparison to ground truth data. Our synthetic sequence consists of 57 images generated using Phong shading. The baseline distance between consecutive views is 3.125mm. Fig. 4 displays images of our results. An image taken from our sequence is shown in (a). For this image, ground truth and our own detection images are exhibited in (b) and (c) respectively. When the scene patch for a pixel contains more than one color, color blending occurs because of pixel integration. The
balance of this blending along color boundaries will change from view to view, so these pixels are mistakenly detected as specular by CHD. In (d-f), an image comparison of depth recovery is shown. Our depth estimation (f) more closely resembles the ground truth (d), because of our consideration of specular reflections. When specularities are disregarded as in (e), incorrect virtual depths are computed instead. The images (g-j) show the ground truth separation and our own. Our result is similar to the ground truth, even though several pixels of the fruit in the bowl contain specularity throughout our image sequence. When this happens, our method tends to match with neighboring pixels, which gives a reasonable approximation of the diffuse component.
4.2 Real Image Sequence
For a real sequence, we use 64 images with a baseline distance between consecutive views of 7.8125mm and a focal length of 60mm. For SSD matching, we use a window size of 9 × 9 and only eleven consecutive images for depth recovery, to avoid non-constant disparities that can result from large baselines. Fig. 5(a) displays an image from this sequence, and (b) exhibits our detection result by multi-CHD. There exist some false detections, mainly due to color blending, but the true specular areas are mostly found. For specular reflections on the wooden bowl, the watermelon, and the leaves, the depth recovery without consideration of specular reflections in (c) gives an inaccurate virtual depth, while our estimation in (d) appears to approximate the true depth. The apple presents difficulties due to lack of texture, which is problematic for all stereo methods. Our diffuse and specular separation results are shown in (e) and (f), respectively. Notice that although depth recovery is inaccurate for textureless regions like the apple, its separation result is reasonable, because correspondence is made with some other points in the textureless region, whose diffuse component is often close to being correct.

Differences between tri-view CHD and standard (full-image) CHD are demonstrated in Fig. 6. In (a), we show the result of standard CHD between the reference image and the image taken 7.8125cm to the left. Because of color occlusions due to differences in scene visibility, standard CHD detects many false positives on the right side of the image. Moreover, for the large specular area on the wooden bowl, there are many missed detections that result from histogram crowding. For standard CHD with the image taken 7.8125cm to the right, it is seen in (b) that color occlusions cause many false positives on the left side, and there are missed detections on both the wooden bowl and the watermelon. Tri-view CHD using these three images produces the result in (c), in which the color occlusion and histogram crowding problems are better handled. A combination of (a) and (b) cannot generate the result in (c), and we improve upon this result using all views in multi-CHD to obtain the detection shown in Fig. 5(b).
Fig. 4. Experimental results for synthetic scene: (a) original image; (b) ground truth detection of specular reflection, with white points representing specular and black ones representing diffuse; (c) our detection; (d) ground truth depth; (e) depth estimation without special processing for specularities; (f) depth estimation using our approach; (g) ground truth diffuse component; (h) our separated diffuse component; (i) ground truth specular component; (j) our separated specular component
Fig. 5. Experimental results for real scene: (a) original image; (b) detection of specular reflection, with white points representing specular and black ones representing diffuse; (c) depth estimation without special processing for specularities; (d) depth estimation using our approach; (e) separated diffuse component; (f) separated specular component
Fig. 6. Comparison of tri-view CHD with standard CHD: (a) standard CHD between the reference image and the 10th image to the left in the sequence; (b) standard CHD between the reference image and the 10th image to the right in the sequence; (c) tri-view CHD
5 Discussion
In previous separation work, there is a strong dependence on segmentation to provide constraints on the diffuse component of specular reflections. Rather than relying on segmentation to get this information, we transfer this burden to stereo, for which this problem is more tractable. Separation methods require segmentation to group pixels into uniform-reflectance regions that contain both specular and diffuse reflections. Because of the large and often sharp intensity differences between the two forms of reflection, such a segmentation is difficult to compute. In stereo, the problem becomes one of matching equal-intensity diffuse pixels under the constraint of the specularity position, which should be easier than matching diffuse and specular pixels.

In the experimental results, color mixing was mentioned as an obstacle to detection, but was nevertheless processed correctly for the separation result. Other image effects that can disrupt our algorithm include color saturation, which can lead to crowding in a small part of the color histogram. Saturation can be substantially reduced or eliminated by capturing high dynamic range images [4]. Another obstacle is image noise, which can reduce accuracy in both detection and correspondence, but this can be remedied by averaging multiple images for each view. Problems can also arise when our viewing assumptions do not hold. Diffuse reflection from very rough surfaces may not follow the Lambertian model, and this is a hindrance for specularity detection and stereo correspondence in general. Our second and third assumptions, that specular color changes among viewpoints and that a specular scene point in one view is diffuse in some other views, are typically violated when specular reflections exhibit little displacement among the stereo views. This may occur in areas of very high curvature and on surfaces with large roughness, because single pixels contain a broad range of surface normals in these cases. Although separation is difficult in such areas, the effect on stereo correspondence might not be substantial, since specularities behave similarly to fixed scene features.
6 Conclusion
We have described a new approach for identifying and separating specular components from an input image sequence. It integrates color analysis and multibaseline stereo in a single framework to produce accurate separation and dense true depth. The color analysis, which identifies specular regions, is in the form of color histogram differencing with three new enhancements for increased robustness: (1) extension to three views, (2) use of epipolar constraints, and (3) use of multiple triplets in a voting scheme. Once the specular pixels have been identified, we apply a dense stereo algorithm on the remaining pixels in order to extract the true depth. This stereo algorithm uses both shiftable windows and temporal selection in order to avoid the occlusion problem associated with depth discontinuities. Currently, the image sequences are acquired by moving the camera along a linear path with constant velocity. We plan to extend this work by using the
algorithm on sequences acquired with arbitrary camera motions. In addition, it would be interesting to learn how scene shape and lighting conditions affect the optimal camera motion for capture and subsequent specular extraction.
References

1. D.N. Bhat and S.K. Nayar. Binocular stereo in the presence of specular reflection. In ARPA, pages II:1305–1315, 1994.
2. D.N. Bhat and S.K. Nayar. Stereo in the presence of specular reflection. In ICCV, pages 1086–1092, 1995.
3. A.F. Bobick and S.S. Intille. Large occlusion stereo. IJCV, 33(3):1–20, Sept. 1999.
4. P.E. Debevec and J. Malik. Recovering high dynamic range radiance maps from photographs. Computer Graphics (SIGGRAPH), 31:369–378, 1997.
5. H. Jin, A. Yezzi, and S. Soatto. Variational multiframe stereo in the presence of specular reflections. Technical Report TR01-0017, UCLA, 2001.
6. S.B. Kang, R. Szeliski, and J. Chai. Handling occlusions in dense multi-view stereo. In CVPR, pages 103–110, Dec. 2001.
7. G.J. Klinker, S.A. Shafer, and T. Kanade. A physical approach to color image understanding. IJCV, 4(1):7–38, Jan. 1990.
8. L. Lam and C.Y. Suen. Application of majority voting to pattern recognition: An analysis of its behaviour and performance. IEEE Trans. on Systems, Man, and Cybernetics, 27(5):553–568, 1997.
9. S.W. Lee and R. Bajcsy. Detection of specularity using color and multiple views. Image and Vision Computing, 10:643–653, 1992.
10. D.C. Marr and T. Poggio. A computational theory of human stereo vision. In Lucia M. Vaina, editor, From the Retina to the Neocortex: Selected Papers of David Marr, pages 263–290. Birkhäuser, Boston, MA, 1991.
11. Y. Nakamura, T. Matsura, K. Satoh, and Y. Ohta. Occlusion detectable stereo - occlusion patterns in camera matrix. In CVPR, pages 371–378, 1996.
12. S.K. Nayar, X. Fang, and T.E. Boult. Removal of specularities using color and polarization. In CVPR, pages 583–590, 1993.
13. M. Okutomi and T. Kanade. A multiple baseline stereo. IEEE PAMI, 15, 1993.
14. Y. Sato and K. Ikeuchi. Temporal-color space analysis of reflection. J. of the Opt. Soc. of America A, 11, 1994.
15. S. Shafer. Using color to separate reflection components. Color Research and Applications, 10, 1985.
16. F. Tong and B.V. Funt. Specularity removal for shape from shading. In Proc. Vision Interface, pages 98–103, 1988.
17. L.B. Wolff and T.E. Boult. Constraining object features using a polarization reflectance model. IEEE PAMI, 13, 1991.
Shape from Texture without Boundaries

D.A. Forsyth
Computer Science Division, U.C. Berkeley, Berkeley, CA 94720
[email protected]
Abstract. We describe a shape from texture method that constructs a maximum a posteriori estimate of surface coefficients using only the deformation of individual texture elements. Our method does not need to use either the boundary of the observed surface or any assumption about the overall distribution of elements. The method assumes that texture elements are of a limited number of types of fixed shape. We show that, with this assumption and assuming generic view and texture, each texture element yields the surface gradient, unique up to a two-fold ambiguity. Furthermore, texture elements that are not from one of the types can be identified and ignored. An EM-like procedure yields a surface reconstruction from the data. The method is defined for orthographic views — an extension to perspective views appears to be complex, but possible. Examples of reconstructions for synthetic images of surfaces are provided, and compared with ground truth. We also provide examples of reconstructions for images of real scenes. We show that our method for recovering local texture imaging transformations can be used to retexture objects in images of real scenes. Keywords: Shape from texture, texture, computer vision, surface fitting
There are surprisingly few methods for recovering a surface model from a projection of a texture field that is assumed to lie on that surface. Global methods attempt to recover an entire surface model, using assumptions about the distribution of texture elements. Appropriate assumptions are isotropy [15] (the disadvantage of this method is that there are relatively few natural isotropic textures) or homogeneity [1,2]. Current global methods do not use the deformation of individual texture elements. Local methods recover some differential geometric parameters at a point on a surface (typically, normal and curvatures). This class of methods, which is due to Garding [5], has been successfully demonstrated for a variety of surfaces by Malik and Rosenholtz [9,11]; a reformulation in terms of wavelets is due to Clerc [3]. The method has a crucial flaw; it is necessary either to know that texture element coordinate frames form a frame field that is locally parallel around the point in question, or to know the differential rotation of the frame field (see [6] for this point, which is emphasized by the choice of textures displayed in [11]; the assumption is known as texture stationarity).
There is a mixed method, due to [4]. As in the local methods, image projections of texture elements are compared, yielding the cosine of slant of the surface at each texture element up to one continuous parameter. A surface is interpolated using an extremisation method, the missing continuous parameter being supplied by the assumption that the texture process is Poisson (as in global methods) — this means that quadrat counts of texture elements are multinomial. This method has several disadvantages: firstly, it requires the assumption that the texture is a homogeneous Poisson process; secondly, it requires some information about surface boundaries; thirdly, one would expect to extract more than the cosine of slant from a texture element.
1 A Texture Model
We model a texture on a surface as a marked point process, of unknown spatial properties; definitions appear in [4]. In our model, the marks are texture elements (texels or textons, as one prefers; e.g. [7,8] for automatic methods of determining appropriate marks) and the orientation of those texture elements with respect to some surface coordinate system. We assume that the marks are drawn from some known, finite set of classes of Euclidean equivalent texels. Each mark is defined in its own coordinate system; the surface is textured by taking a mark, placing it on the tangent plane of the surface at the point being marked, translating the mark’s origin to lie on the surface point being marked, and rotating randomly about the mark’s origin (according to the mark distribution). We assume that these texture elements are sufficiently small that they will, in general, not overlap, and that they can be isolated. Furthermore, we assume that they are sufficiently small that they can be modelled as lying on a surface’s tangent plane at a point.
2 Surface Cues from Orthographic Viewing Geometry
We assume that we have an orthographic view of a compact smooth surface and the viewing direction is the z-axis. We write the surface in the form (x, y, f(x, y)), and adopt the usual convention of writing f_x = p and f_y = q. Texture imaging transformations for orthographic views: Now consider one class of texture element; each instance in the image of this class was obtained by a Euclidean transformation of the model texture element, followed by a foreshortening. The transformation from the model texture element to the particular image instance is affine. This means that we can use the center of gravity of the texture element as an origin; because the COG is covariant under affine transformations, we need not consider the translation component further. Furthermore, in an appropriate coordinate system on the surface and in the image, the foreshortening can be written as

F_i = \begin{pmatrix} 1 & 0 \\ 0 & \cos\sigma_i \end{pmatrix},

where σ_i is the angle between the surface normal at mark i and the z-axis.
The transformation from the model texture element to the i'th image element is then

T_{M \to i} = R_G(i) \, F_i \, R_S(i),

where R_S(i) rotates the texture element in the local surface frame, F_i foreshortens it, and R_G(i) rotates the element in the image frame. From elementary considerations, we have that

R_G(i) = \frac{1}{\sqrt{p^2 + q^2}} \begin{pmatrix} p & q \\ -q & p \end{pmatrix}.

The transformation from the model texture element to the image element is not a general affine transformation (there are only three degrees of freedom). We have:

Lemma 1: An affine transformation T can be written as R_G F R_S, where R_G, R_S are arbitrary rotations and F is a foreshortening (as above), if and only if

det(T)^2 = tr(T^T T) - 1   and   0 \le det(T)^2 \le 1.

If these conditions hold, we say that this transformation is a local texture imaging transformation.

Proof: If the conditions are true, then T T^T has one eigenvalue equal to 1 and the other between zero and one. By choice of eigenvectors, we can diagonalize T T^T as R_G F^2 R_G^T, meaning that T = R_G F R_S for arbitrary R_S. The other direction is obvious.

Notice that, given an affine transformation A that is a texture imaging transformation, we know the factorization into components only up to a two-fold ambiguity. This is because A = R_G F R_S = (R_G(-I)) F ((-I)R_S), where I is the identity. The other square roots of the identity are ruled out by the requirement that cos σ_i be positive. Now assume that the model texture element(s) are known. We can then recover all transformations from the model texture elements to the image texture elements. We perform an eigenvalue decomposition of T_i T_i^T to obtain R_G(i) and F_i. From the equations above, these yield the value of p and q at the i'th point up to a sign ambiguity (i.e., (p, q) and (-p, -q) are both solutions).
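A sketch of Lemma 1 in code: test whether a 2×2 affine map is a local texture imaging transformation and, if it is, recover cos σ and the gradient (p, q) up to the two-fold sign ambiguity via the eigen-decomposition of T T^T; the numerical tolerance is illustrative.

```python
import numpy as np

def is_texture_imaging_transformation(t, tol=1e-6):
    """Check det(T)^2 = tr(T^T T) - 1 and 0 <= det(T)^2 <= 1 (Lemma 1)."""
    d2 = np.linalg.det(t) ** 2
    return abs(d2 - (np.trace(t.T @ t) - 1.0)) < tol and -tol <= d2 <= 1.0 + tol

def gradient_from_transformation(t):
    """Recover (p, q) up to sign from T = R_G F R_S via the eigen-decomposition of T T^T."""
    w, v = np.linalg.eigh(t @ t.T)        # eigenvalues ascending: cos^2(sigma), then 1
    cos_sigma = np.sqrt(max(w[0], 0.0))
    u = v[:, 1]                           # eigenvector for eigenvalue 1
    # The first column of R_G is (p, -q) normalized, and |grad f| = tan(sigma).
    tan_sigma = np.sqrt(max(1.0 / max(cos_sigma, 1e-12) ** 2 - 1.0, 0.0))
    p, q = tan_sigma * u[0], -tan_sigma * u[1]
    return (p, q), cos_sigma              # (-p, -q) is the other valid solution
```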
The Texture Element is Unambiguous in a Generic Orthographic View: Generally, the model texture element is not known. However, an image texture element can be used in its place. We know that an image texture element is within some (unknown) affine transformation of the model texture element, but this transformation is unknown. Write the transformation from image element j to image element i as T_{j \to i}, and recall that this transformation can be measured relatively easily in principle (e.g. [4,9,11]). If we are using texture element j as a model, there is a unique (unknown) affine transformation A such that T_{M \to i} = T_{j \to i} A for every image element i. The rotation component of A is of no interest. This is because rotating a model texture element simply causes the rotation on the surface, R_S, to change, and this term offers no shape information. Furthermore, A must have positive determinant, because we have excluded the possibility that the element is flipped by the texture imaging transformation. Finally, A must have a positive element in the top left hand corner, because we exclude the case where A is -I, since this again simply causes the rotation on the surface, R_S, to change without affecting the element shape. Assume that we determine A by searching over the lower diagonal affine transformations to find transformations such that T_{M \to i} = T_{j \to i} A is a texture imaging transformation for every i. It turns out that there is no ambiguity.

Lemma 2: Given that T_{M \to i} is a texture imaging transformation arising from a generic surface for i = 1, ..., N, then T_{M \to i} B is a texture imaging transformation for i = 1, ..., N, for B a lower-diagonal matrix of positive determinant and with B_{00} > 0, if and only if B is the identity.

Proof: Recall that only lower diagonal B are of interest, because a rotation in place does not affect the shape of the texture element. Recall that det(M) = det(M^T) and trace(M) = trace(M^T). This means that, for M a texture imaging transformation, both M and M^T satisfy the constraints above. We can assume without loss of generality that T_{M \to 1} = R_G F_1 (because we have a choice of coordinate frame up to rotation on the model texture element). This means that T_{M \to 1}^T T_{M \to 1} is diagonal (it is the square of the foreshortening), and so B^T T_{M \to 1}^T T_{M \to 1} B must also be diagonal and have 1 in the top left hand corner. This means B must be diagonal, and that B_{00} = 1 (the case B_{00} = -1 is excluded by the constraint above). So the only element of interest is B_{11} = \lambda. Now for some arbitrary j, representing the transform of another element, we can write

T_{M \to j}^T T_{M \to j} = \begin{pmatrix} a & b \\ b & c \end{pmatrix},

where (a + c - 1) = (ac - b^2). Now if T_{M \to j} B is a texture imaging transformation, then we have

B^T T_{M \to j}^T T_{M \to j} B = \begin{pmatrix} a & \lambda b \\ \lambda b & \lambda^2 c \end{pmatrix}.

This matrix must also meet the constraints to be a texture imaging transformation, so that we must have that (a + \lambda^2 c - 1) = \lambda^2 (ac - b^2) as well as (a + c - 1) = (ac - b^2). Rearranging, we have the system (a - 1) = ((a - 1)c - b^2)
and (a - 1) = \lambda^2((a - 1)c - b^2), which has solutions only when \lambda^2 = 1 (or when (a, b) = (1, 0), which is not generic). If \lambda = -1, then the transformation's determinant is negative, so it is not a texture imaging transformation; hence \lambda = 1. Notice that the case \lambda = -1 corresponds to flipping the model texture element in its frame and incorporating a flip back into the texture imaging transformation. Lemma 2 is crucial, because it means that, for orthographic views, we can recover the texture element independent of the surface geometry (whether we should is another matter). We have not seen Lemma 2 in the literature before, but assume that it is known — it doesn't appear in [10], which describes other repeated structure properties, in [12], which groups plane repeated structures, or in [13], which groups affine equivalent structures but doesn't recover normals. At heart, it is a structure from motion result.

Special cases: The non-generic cases are interesting. Notice that if (a, b) = (1, 0) for all image texture elements, then 0 \le \lambda \le \min(1/\cos\sigma_j), where the minimum is over all texture elements. The easiest way to understand this case is to study the surface gradient field. In particular, at each texture element there is an iso-depth direction, which is perpendicular to the normal. Now apply T_{M \to j}^{-1} to this direction for the j'th texture element, yielding a direction in the frame of the model texture element. This direction, and this alone, is not foreshortened. In the case that (a, b) = (1, 0), the same direction is not foreshortened for all texture elements. There are two cases. In the first, this effect occurs as a result of an unfortunate coincidence between view and texture field. It is a view property, because the gradient of the surface (which is determined by the view) is aligned with the texture field; this case can be dismissed by the generic view assumption. The more interesting case occurs when the texture element is circular; this means that the effect of rotations in the model frame is null, so that texture imaging transformations for circular spots are determined only up to rotation in the model frame. In turn, any distribution of circular spots on the surface admits a set of texture imaging transformations such that (a, b) = (1, 0). This texture is ambiguous, because one cannot tell the difference between a texture of circular spots in a generic view and a texture of unfortunately placed ellipses; furthermore, the fact that \lambda is indeterminate means that the surface may consist of ellipses with a high aspect ratio viewed nearly frontally, or ones with a low aspect ratio heavily foreshortened. Again, a generic view assumption allows only the first interpretation, so when a texture element appears to be an ellipse, we fix its shape as a circle. Reasoning about the iso-depth direction in the model texture element's frame allows us to understand Lemma 2 in somewhat greater detail. In effect, the reason B is generically unique is that it must fix many different directions in the model texture element's frame; the only transformations that do this are the identity and a flip.

Recovering geometric data for orthographic views of unknown texture elements: Now assume that the texture element is unknown, but each texture imaging transformation is known. Then we have an excellent estimate of the texture element. In the simplest case, we assume the image represents the
Fig. 1. The reconstruction process, illustrated for a synthetic image of a sphere. Left, an image of a textured sphere, using a texture element that is not circular. Center, the height of the sphere, rendered as an intensity field, higher points being closer to the eye; right shows the reconstruction obtained using the EM algorithm using the same map of height to intensity.
albedo (rather than the radiosity), and simply apply the inverse of each texture imaging transformation to its local patch and average the results over all patches. In the more interesting case, where we must account for shading variation, we assume that the irradiance is constant over the texture element. Write I_\mu for the estimate of the texture element, and I_i for the patch obtained by applying T_i^{-1} to the image texture element i. Then we must choose I_\mu and some set of constants \lambda_i to minimize

\sum_i \| \lambda_i I_\mu - I_i \|^2.

Now assume that we have an estimate of the model texture element, and an estimate of the texture imaging transformation for each image texture element. In these circumstances, it is possible to tell whether an image texture element represents an instance of the model texture element or not — it will be an instance if, by applying the inverse texture imaging transformation to the image texture element, we obtain a pattern that looks like the model texture element. This suggests that we can insert a set of hidden variables, one for each image texture element, which encode whether the image texture element is an instance or not. We now have a rather natural application of EM.

Practical details: For the i'th texture element, write \theta_{gi} for the rotation angle of the in-image rotation, \sigma_i for the foreshortening, \theta_{si} for the rotation angle of the on-surface rotation, and T_i for the texture imaging transformation encoded by these parameters. Write \delta_i for the hidden variable that encodes whether the image texture element is an instance of the model texture element or not. Write I_\mu for the (unknown) model texture element. To compare image and model texture elements, we must be careful about domains. Implicit in the definition of I_\mu is its domain of definition D — say an n × n pixel grid — and we can use this. Write T_i^{-1} I for the pattern obtained by applying T_i^{-1} to the domain T_i(D). This is most easily computed by scanning D and, for each sample point s = (s_x, s_y), evaluating the image at T_i^{-1} s.
We assume that imaging noise is normally distributed with zero mean and standard deviation \sigma_{im}. We assume that image texture elements that are not instances of the model texture element arise with uniform probability. We have that 0 \le \sigma_i \le 1 for all i, a property that can be enforced with a prior term. To avoid the meaningless symmetry where illumination is increased and albedo falls, we insert a prior term that encourages \lambda_i to be close to one. We can now write the negative log-posterior as

\frac{1}{2\sigma_{im}^2} \sum_i \left\{ \| \lambda_i I_\mu - T_i^{-1} I \|^2 \, \delta_i + (1 - \delta_i) K \right\} + \frac{1}{2\sigma_{light}^2} \sum_i (\lambda_i - 1)^2 + L,

where L is some unknown normalizing constant of no further interest. The application of EM to this expression is straightforward, although it is important to note that most results regarding the behaviour of EM apply to maximum likelihood problems rather than maximum a posteriori problems. We are aware of no practical difficulties that result from this observation. Minimisation of the Q function with respect to \delta_i is straightforward, but the continuous parameters require numerical minimisation. This minimisation is unusual in being efficiently performed by coordinate descent. This is because, for fixed I_\mu, each T_i can be obtained by independently minimizing a function of only three variables. We therefore minimize by iterating two sweeps: fix I_\mu and minimize over each T_i in turn; now fix all the T_i and minimize over I_\mu.
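The sketch below shows one such sweep in simplified form, assuming the rectified patches T_i^{-1} I have already been extracted by some resampling routine (not shown); the E-step for δ_i and the closed-form updates of λ_i and I_µ follow from the log-posterior above, while the numerical refinement of each T_i is only indicated, and all parameter values are illustrative.

```python
import numpy as np

def em_sweep(patches, i_mu, lam, sigma_im=0.05, sigma_light=0.2, K=50.0):
    """One EM-style sweep over the rectified patches T_i^{-1} I.

    E-step: expected instance indicators delta_i.
    M-step: closed-form updates of lambda_i and of the model element I_mu.
    The refinement of each T_i (three parameters) would follow by a small
    numerical search with I_mu held fixed, and is not shown here.
    """
    P = [np.asarray(p, dtype=np.float64) for p in patches]
    i_mu = np.asarray(i_mu, dtype=np.float64)
    lam = np.asarray(lam, dtype=np.float64)
    # E-step: compare the Gaussian residual term with the constant outlier cost K.
    resid = np.array([np.sum((l * i_mu - p) ** 2) for l, p in zip(lam, P)])
    z = resid / (2.0 * sigma_im ** 2) - K
    delta = 1.0 / (1.0 + np.exp(np.clip(z, -500.0, 500.0)))
    # M-step for lambda_i: the log-posterior is quadratic in each lambda_i.
    mu_sq = np.sum(i_mu ** 2)
    dots = np.array([np.sum(i_mu * p) for p in P])
    num = delta * dots / sigma_im ** 2 + 1.0 / sigma_light ** 2
    den = delta * mu_sq / sigma_im ** 2 + 1.0 / sigma_light ** 2
    lam_new = num / den
    # M-step for I_mu: weighted average of the rectified patches.
    top = sum(d * l * p for d, l, p in zip(delta, lam_new, P))
    bot = float(sum(d * l ** 2 for d, l in zip(delta, lam_new))) + 1e-12
    i_mu_new = top / bot
    return delta, lam_new, i_mu_new
```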
Fig. 2. An image of a running cheetah, masked to suppress distractions, and of a spotted dress, ditto. These images were used to reconstruct surfaces shown in figures below.
3 Surface Cues from Perspective Viewing Geometry
Shape from texture is substantially more difficult from generic perspective views than from generic orthographic views (unless, as we shall see, one uses the highly restrictive homogeneity assumption). We use spherical perspective for simplicity, imaging onto a sphere of unit radius. Texture imaging transformations for perspective views: Because the texture element is small, the transformation from the model texture element to the particular image instance is affine. This means that we can again use the center of gravity of the texture element as an origin; because the COG is covariant under affine transformations, we need not consider the translation
Fig. 3. Left, the gradient information obtained by the process described in the text for a synthetic image of a textured sphere. Since the direction of the gradient is not known, we illustrate gradients as line elements, do not supply an arrow head and extend the element forward and backward. The gradient is relatively small and hard to see on the many nearly frontal texture elements — look for the black dot at the center of the element. Gradients are magnified for visibility, and overlaid on texture elements; gradients shown with full lines are associated with elements with a value of the hidden data flag (is this a texture element or not) greater than 0.5; the others are shown with dotted lines. The estimated element is shown in the top left hand corner of the image. Right, a similar plot for the image of the spotted dress.
component further. The transformation from the model texture element to the i'th image element is then

T^{(p)}_{M \to i} = \frac{1}{r_i} R_G(i) \, F_i \, R_S(i),
where r_i is the distance from the focal point to the COG of the texture element, R_S(i) rotates the texture element in the local surface frame, F_i foreshortens it, and R_G(i) rotates the element in the image frame. We have superscripted this transformation with a (p) to indicate perspective. Again, R_G(i) is a function only of surface geometry (at this point, the form does not matter), and R_S(i) is a property of the texture field. This transformation is a scaled texture imaging transformation. This means it is a general affine transformation, which is most easily seen by noting that any affine transformation can be turned into a texture imaging transformation by dividing by the largest singular value. The Texture Element is Ambiguous for Perspective Views: Recall that we know the model texture element up to an affine transformation. If we have an orthographic view, the choice of affine basis is further restricted to a two-fold discrete ambiguity by Lemma 2 — we need to choose a basis in which each T_{M \to i} is a texture imaging transformation, and there are generically two such bases. There is no result analogous to Lemma 2 in the perspective case. This means
Fig. 4. The root mean square error for sphere reconstruction for five examples each of four different cases, as a percentage of the sphere’s radius. The horizontal axis gives the run, and the vertical axis gives the rms error. Note that in most cases the error is of the order of 10% of radius. Squares and diamonds correspond to the case where the texture element must be estimated (in each run, the image is the same, so we have five different such images), and circles and crossed circles correspond to the case where it is known (ditto). Squares and circles correspond to the symmetric error metric, and diamonds and crossed circles correspond to EM based reconstruction. Note firstly that in all but three cases the RMS error is small. Accepting that the large error in the first three runs using the symmetric error metric for the estimated texture element may be due to a local minimum, there is very little to choose between the cases. This suggests that (1) the process works well (2) estimating the texture element is successful (because knowing it doesn’t seem to make much difference to the reconstruction process) and (3) either interpolation mechanism is probably acceptable.
that any choice of basis is legal, and so, for perspective views, we cannot recover the texture element independent of the surface geometry. This does not mean that shape from texture in perspective views is necessarily ambiguous. It does mean that, for a generic texture, we cannot disconnect the process of determining the texture element (and so measuring a set of geometric data about the surface) from the process of surface reconstruction. This implies that reconstruction will involve a fairly complicated minimisation process. We demonstrate shape from texture for only orthographic views here; this is because the process of estimating texture imaging transformations and the fitting process can be decoupled. An alternative approach is to “eliminate” the shape of the model texture element from the problem. This was done by Garding [5], Malik and Rosenholtz [9, 11] and Clerc [3]; it can be done for the orthographic case, too [4], but there is less point. This can be done only under the assumption of local homogeneity.
4 Recovering Surface Shape from Transform Data
We assume that the texture imaging transformation is known uniquely at each of a set of scattered points. Now this means that at each point we know p and q up
to sign, leaving us with an unpleasant interpolation problem. We expect typical surface fitting procedures to be able to produce a surface for any assignment of signs, meaning that we require some method to choose between them — essentially, a prior.
Fig. 5. The reconstructed surface for the cheetah of figure 2 is shown in three different textured views; note that the curve of the barrel chest, turning in toward the underside of the animal and toward the flank is represented; the surface pulls in toward the belly, and then swells at the flank again. Qualitatively, it is a satisfactory representation of the cheetah.
The surface prior: The question of surface interpolation has been somewhat controversial in the vision community in the past — in particular, why would one represent data that doesn’t exist? (the surface components that lie between data points). One role for such an interpolate is to incorporate spatial constraints — the orientation of texture elements with respect to the view is unlikely to change arbitrarily, because we expect the scale of surface wiggles to be greater than the inter-element spacing. It is traditional to use a second derivative approximation to curvature as a prior; in our experience, this is unwise, because it is very
badly behaved at the boundary (where the surface is nearly vertical). Instead, we follow [4] and compute the norm of the shape operator and sum over the surface. This yields

\pi(\theta) \propto \exp\left( -\frac{1}{2\sigma_k^2} \int_R (\kappa_1^2 + \kappa_2^2) \, dA \right),

where the \kappa_i are the extremal values of the normal curvatures.

Robustness: We expect significant issues with outliers in this problem. In practice, the recovery process for texture imaging transformations tends to be unreliable near boundaries, because elements are heavily foreshortened and so have lost some detail. As figure 3 indicates, there are occasional large gradient estimates at or near the boundary. It turns out to be a good idea to use a robust estimator in this problem. In particular, our log-likelihoods all use \phi(x; \varepsilon) = x/(x + \varepsilon). We usually compose this function with a square, resulting in something proportional to the square for small arguments and close to constant for large x.

The data approximation problem: Even with a prior, we have a difficult approximation problem. The data points are scattered, and so it is natural to use radial basis functions. However, we have only the gradient of the depth and, which is worse, we do not know the sign of the gradient. We could either fit using only p^2 + q^2 — which yields a problem rather like shape from shading, which requires boundary information that is often not available — or use the orientation of the gradient as well. To exploit the orientation of the gradient, we have two options.

Option 1: The symmetric method. We can fit using only p^2 and q^2 — in which case we are ignoring information, because this data implies a four-fold ambiguity and we have only a two-fold ambiguity. A natural choice of fitting error (or negative log-posterior, in modern language) is

\frac{1}{2\sigma_f^2} \sum_i \phi\left( \left[ p_i^2 - (z_x(x_i, y_i; \theta))^2 \right]^2 + \left[ q_i^2 - (z_y(x_i, y_i; \theta))^2 \right]^2 ; \varepsilon \right) + \pi(\theta).

We call this the symmetric method because the negative log-posterior is invariant to the transformation (p_i, q_i) \to (-p_i, -q_i).

Option 2: The EM method. The alternative is to approach the problem as a missing data problem, where the missing data is the sign of the gradient. It is natural to try to attack this problem with EM, too. The mechanics are straightforward. Write the depth function as z(x, y; \theta), and the Gaussian curvature of the resulting surface as K(\theta). The log-posterior is
\frac{1}{2\sigma_f^2} \sum_i \Big\{ \phi\big( [p_i - z_x(x_i, y_i; \theta)]^2 + [q_i - z_y(x_i, y_i; \theta)]^2 \big) (1 - \delta_i) + \phi\big( [p_i + z_x(x_i, y_i; \theta)]^2 + [q_i + z_y(x_i, y_i; \theta)]^2 \big) \delta_i \Big\} + \pi(\theta),

where the \delta_i (the missing variables) are one or zero according as the sign of the data is correct or not. The log-posterior is linear in the hidden variables, so the Q function is simple to compute, but the maximization must be done numerically,
Fig. 6. The reconstructed surface for the dress of figure 2 is shown in two different textured views. In one view, the surface is superimposed on the image to give some notion of the relation between surface and image. Again, the surface appears to give a fair qualitative impression of the shape of the dress.
and is expensive because the Gaussian curvature must be integrated for each function evaluation (meaning that gradient evaluation is particularly expensive).

Approximating functions: In the examples, we used a radial basis function approximation with basis functions placed at each data point. We therefore have

z(x, y; \theta) = \sum_i a_i \sqrt{(x - x_i)^2 + (y - y_i)^2 + \nu},
where ν sets a scale for the approximating surfaces and ai are the parameters. The main disadvantage with this approach is that the number of basis elements — and therefore, the general difficulty of the fitting problem — goes up with the number of texture elements. There are schemes for reducing the number of basis elements, but we have not experimented with their application to this problem.
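A sketch of this surface model and of the symmetric fitting objective, using the multiquadric basis written above and a general-purpose quasi-Newton minimizer in place of whatever optimizer was actually used; the curvature prior π(θ) is omitted, and σ_f, ν, and the robustness scale are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def phi(x, eps=1.0):
    """Robust penalty phi(x; eps) = x / (x + eps)."""
    return x / (x + eps)

def make_surface(xi, yi, nu=1.0):
    """Multiquadric radial basis surface z(x, y; a) with one basis per data point."""
    def z(x, y, a):
        r = np.sqrt((x[:, None] - xi[None, :]) ** 2 + (y[:, None] - yi[None, :]) ** 2 + nu)
        return r @ a
    def grads(x, y, a):
        dx = x[:, None] - xi[None, :]
        dy = y[:, None] - yi[None, :]
        r = np.sqrt(dx ** 2 + dy ** 2 + nu)
        return (dx / r) @ a, (dy / r) @ a       # z_x and z_y at the query points
    return z, grads

def fit_symmetric(xi, yi, p_data, q_data, nu=1.0, sigma_f=0.1):
    """Fit coefficients a with the symmetric objective, which ignores the gradient sign."""
    _, grads = make_surface(xi, yi, nu)
    def objective(a):
        zx, zy = grads(xi, yi, a)
        err = phi((p_data ** 2 - zx ** 2) ** 2 + (q_data ** 2 - zy ** 2) ** 2)
        return err.sum() / (2.0 * sigma_f ** 2)  # curvature prior pi(theta) omitted here
    a0 = np.zeros(len(xi))
    return minimize(objective, a0, method="L-BFGS-B").x
```

For Option 2, this inner fit would be wrapped in an EM loop over the gradient signs, with the curvature prior added to the objective.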
Fig. 7. The texture on the surfaces of figure 6 implies that the surface can follow detail at a scale smaller than the elements on the dress; the figure on the left, which is a height map with lighter elements being closer to the eye, indicates that this is not the case — the reconstruction smooths the dress to about the scale of the inter-element spacing, as one would expect. On the right, texture remapping: because we know the texture imaging transformation for each detected element, we can remap the image with different texture elements. Note that a reconstruction is not required, and the ambiguity in gradient need not be resolved. Here we have replaced spots with rosettes. Missing elements, etc. are due to the relatively crude element detection scheme.
4.1 Experimental Results
We implemented this method for orthographic views in Matlab on an 867 MHz Macintosh G4. We used a crude template matcher to identify texture elements. In the examples where the texture element is a circular spot, the element was fixed at a circular spot; for the other examples, the element was estimated along with the texture imaging transformations. Determining texture imaging transformations for of the order of 60 elements takes of the order of tens of minutes. Reconstruction is achingly slow, taking of the order of hours for each of the examples shown; this is entirely because of the expense of computing the prior term, a computation that in principle is easily parallelised.

Estimating transformations: Figure 3 shows input images with gradient orientations estimated from texture imaging transformations superimposed. Notice that there is no arrow-head on these gradient orientations, because we don't know which direction the gradient is pointing. Typically, elements at or near the rim lead to poor estimates, which are discarded by the EM — a poor transformation estimate leads to a rectified element that doesn't agree with most others, and so causes the probability that the image element is not an instance of the model element to rise.

Reconstructions compared with ground truth: In figure 4, we compare a series of runs of our process under different conditions. In each case, we used
synthetic images of a textured sphere, so that we could compute the root mean square error of the radius of the reconstructed surface. We did five runs each of four cases — estimated (square) texture element vs. known circular texture element, and symmetric reconstruction vs. EM reconstruction. In each run of the known (resp. unknown) texture element case, we used the same image for symmetric vs. EM reconstruction, so that we could check the quality of the gradient information. In all but three of the 20 runs, the RMS error is about 10% of radius. This suggests that reconstruction is successful. In the other three runs, the RMS error is large, probably due to a local minimum. All three runs are from the symmetric reconstruction algorithm applied to an estimated element. We know that the gradient recovery is not at fault in these cases, because the EM reconstruction algorithm recovered a good fit in these cases. This means that estimating the texture element is not significantly worse than knowing it. Furthermore, it suggests that the reconstruction algorithms are roughly equivalent.

Reconstructions from Images of Real Scenes: Figures 5 and 6 show surfaces recovered from images of real textured scenes. In these cases, lacking ground truth, we can only argue qualitatively, but the reconstructed surface appears to have a satisfactory structure. Typically, mapping the texture back onto the surface makes the reconstruction look as though it has fine-scale structure. This effect is most notable in the case of the dress, where the reconstructed surface looks as though it has the narrow folds typical of cloth; this is an illusion caused by the shading, as figure 7 illustrates.

Texture Remapping: One amusing application is that, knowing the texture imaging transformation and an estimate of shading, we can remap textures, replacing the model element in the image with some other element. This does not require a surface estimate. Quite satisfactory results are obtainable (figure 7).
5 Comments
Applications: Shape from texture has tended to be a core vision problem — i.e. interesting to people who care about vision, but without immediate practical application. It has one potential application in image-based rendering — shape from texture appears to be the method with the most practical potential for recovering detailed deformation estimates for moving, deformable surfaces such as clothing and skin. This is because no point correspondence is required for a reconstruction, meaning that shape estimates are available from spotty surfaces relatively cheaply — these estimates can then be used to condition point tracking, etc., as in [14]. However, shape from texture has the potential advantage over Torresani et al.'s method that it does not require feature correspondences or constrained deformation models.

SFT=SFM: There is an analogy between shape from texture and structure from motion that appears in the literature — see, for example, [5,9,10,11,13] — but it hasn't received the attention it deserves. In essence, shape from texture is about one view of multiple instances of a pattern, and structure from motion is (currently) about multiple views of one instance of a set of points. Lemma
2 is, essentially, a structure from motion result, and if it isn’t known, this is because it treats cases that haven’t arisen much in practice in that domain. However, the analogy has the great virtue that it offers attacks on problems that are currently inaccessible from within either domain. For example, one might consider attempting to reconstruct a textured surface which does not contain repeated elements by having several views; lemma 2 then applies in structure from motion mode, yielding estimates of each texture element; these, in turn, yield normals, and a surface results. Acknowledgments. Much of the material in this paper is in response to several extremely helpful and stimulating conversations with Andrew Zisserman.
References

1. Y. Aloimonos. Detection of surface orientation from texture. I. The case of planes. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 584–593, 1986.
2. A. Blake and C. Marinos. Shape from texture: estimation, isotropy and moments. Artificial Intelligence, 45(3):323–80, 1990.
3. M. Clerc and S. Mallat. Shape from texture through deformations. In Int. Conf. on Computer Vision, pages 405–410, 1999.
4. D.A. Forsyth. Shape from texture and integrability. In Int. Conf. on Computer Vision, pages 447–452, 2001.
5. J. Garding. Shape from texture for smooth curved surfaces. In European Conference on Computer Vision, pages 630–8, 1992.
6. J. Garding. Surface orientation and curvature from differential texture distortion. In Int. Conf. on Computer Vision, pages 733–9, 1995.
7. T. Leung and J. Malik. Detecting, localizing and grouping repeated scene elements from an image. In European Conference on Computer Vision, pages 546–555, 1996.
8. J. Malik, S. Belongie, J. Shi, and T. Leung. Textons, contours and regions: cue integration in image segmentation. In Int. Conf. on Computer Vision, pages 918–925, 1999.
9. J. Malik and R. Rosenholtz. Computing local surface orientation and shape from texture for curved surfaces. Int. J. Computer Vision, pages 149–168, 1997.
10. J.L. Mundy and A. Zisserman. Repeated structures: image correspondence constraints and 3d structure recovery. In J.L. Mundy, A. Zisserman, and D.A. Forsyth, editors, Applications of Invariance in Computer Vision, pages 89–107, 1994.
11. R. Rosenholtz and J. Malik. Surface orientation from texture: isotropy or homogeneity (or both)? Vision Research, 37(16):2283–2293, 1997.
12. F. Schaffalitzky and A. Zisserman. Geometric grouping of repeated elements within images. In D.A. Forsyth, J.L. Mundy, V. diGesu, and R. Cipolla, editors, Shape, Contour and Grouping in Computer Vision, pages 165–181, 1999.
13. T. Leung and J. Malik. Detecting, localizing and grouping repeated scene elements from an image. In European Conference on Computer Vision, pages 546–555, 1996.
14. L. Torresani, D. Yang, G. Alexander, and C. Bregler. Tracking and modelling non-rigid objects with rank constraints. In IEEE Conf. on Computer Vision and Pattern Recognition, 2001. To appear.
15. A.P. Witkin. Recovering surface shape and orientation from texture. Artificial Intelligence, 17:17–45, 1981.
Statistical Modeling of Texture Sketch
Ying Nian Wu¹, Song Chun Zhu², and Cheng-en Guo²
¹ Dept. of Statistics, Univ. of California, Los Angeles, CA 90095, USA, [email protected]
² Dept. of Comp. and Info. Sci., Ohio State Univ., Columbus, OH 43210, USA, {szhu, cguo}@cis.ohio-state.edu
Abstract. Recent results on sparse coding and independent component analysis suggest that human vision first represents a visual image by a linear superposition of a relatively small number of localized, elongate, oriented image bases. With this representation, the sketch of an image consists of the locations, orientations, and elongations of the image bases, and the sketch can be visually illustrated by depicting each image base by a linelet of the same length and orientation. Built on the insight of sparse and independent component analysis, we propose a two-level generative model for textures. At the bottom-level, the texture image is represented by a linear superposition of image bases. At the top-level, a Markov model is assumed for the placement of the image bases or the sketch, and the model is characterized by a set of simple geometrical feature statistics. Keywords: Independent component analysis; Matching pursuit; Minimax entropy learning; Sparse coding; Texture modeling.
1
Introduction
As argued by Mumford (1996) and many other researchers, the problem of vision can be posed in the framework of statistical modeling and inferential computing. That is, top-down generative models can be constructed to represent a visual system’s knowledge in the form of probability distributions of the observed images as well as variables that describe the visual world; then visual learning and perception become a statistical inference (and model selection) problem that can be solved in principle by computing the likelihood or posterior distribution. To guide reliable inference, the generative models should be realistic, and this can be checked by visually examining random samples generated by the models. Recently, there has been some progress on modeling textures. Most of the recent models involve linear filters for extracting local image features, and the texture patterns are characterized by statistics of local features. In particular, inspired by the work of Heeger and Bergen (1996), Zhu, Wu, and Mumford (1997) and Wu, Zhu, and Liu (2001) developed a self-consistent statistical theory for texture modeling; borrowing results from statistical mechanics, they showed that a class of Markov random field models is a natural choice under the assumption
that texture impressions are decided by histograms of filter responses. Meanwhile, Portilla and Simoncelli (2000), using steerable pyramid, made a thorough investigation of feature statistics for representing texture patterns. These models appear to be very successful in capturing stochastic textures, but seem to be less effective for textures with salient local structures and regular spatial arrangements. Meanwhile, there have been major developments in sparse coding (Olshausen and Field, 1996) and independent component analysis (Bell and Sejnowski, 1997). These methods represent natural images (actually image patches) by linear superposition of image bases. By asking for sparseness or independence of the coefficients in the linear decomposition, localized, elongate and oriented image bases can be learned from natural images. The learned bases constitute a general and efficient vocabulary for image representation. The idea of sparseness has also been studied in wavelet community (e.g., Mallat and Zhang, 1993; Candes and Donoho, 2000). This inspired us to re-consider the role of linear filters in texture models. The filter responses are inner products between the image and localized image bases, and this is a bottom-up operation. However, generative models are for top-down representation, and therefore it appears to be more appropriate for the images bases to play a representational role as in sparse and independent component analysis. This consideration promoted us to construct texture models based on linear representation instead of linear operation. The resulting model is a twolevel top-down model. At the bottom level, the texture image is represented by linear superposition of image bases, and at the top level, the spatial arrangements of image bases and their coefficients are to be modeled. The sparse and independent component analysis suggests that with well-designed image bases, the coefficients should become much sparser and less dependent than the raw image intensities, and thus more susceptible to modeling. Guided by the top-down model, local structures and their organizations can be more sharply captured, because variable selection or explaining-away effect can be easily achieved in model-based context. To be more specific, the spatial model for the image bases and their coefficients incorporate sparseness in the sense that only a small number of image bases have non-zero coefficients or are active, so that the locations, orientations, and elongations of the active image bases capture the geometrical aspect of the texture image, or they tell us how to sketch the image, while the coefficients of these active bases capture the photometrical aspect of the image, or they tell us how to paint the image given the sketch. To visually illustrate the sketch, we can depict each active image base by a small line segment or a linelet of the same location, orientation, and length, as first proposed by Bergeaud and Mallat (1996). The sparse decomposition in general achieves about 100 folds of dimension reduction. The two-level top-down model is a hidden Markov model, with the coefficients of image bases being latent variables or missing data. While rigorous model fitting typically involves EM-type algorithms (Dempster, Laird, and Rubin, 1977),
in this article, we shall isolate the problem of modeling the sketch of the texture images, while assuming that the sketch can be obtained from the image. Our sketch model is a causal Markov chain model whose conditional distributions are characterized by a set of simple geometrical feature statistics automatically selected from a pre-defined vocabulary. Embellished versions of our model can be useful in the following regards. In computer vision, it can be used for image segmentation and perceptual grouping. In computer graphics, as the sketch captures the geometrical essence of images, it may be used for non-photo realistic rendering. For understanding human vision, it provides a model-based representational theory for Marr’s primal sketch (Marr 1982).
2 Sketching the Image
2.1 Sparse Coding and Independent Component Analysis
The essential idea of sparse coding (Olshausen and Field, 1996), independent component analysis (Bell and Sejnowski, 1999), and their combination (Lewiki and Olshausen, 1999) is the assumption that an image I can be represented as the superposition of a set of image bases. The bases are selected from an overcomplete basis (vocabulary) {b(x,y,l,θ,e) }, where (x, y) is the central position of the base on the image domain, and (l, θ, e) is the type of the base. l is the scale or length of the base, θ the orientation, and e the indicator for even/odd bases. The DC components of the bases are 0, and the l2 norm of the bases are 1. Therefore,
I = Σ_{(x,y,l,θ,e)} c(x,y,l,θ,e) b(x,y,l,θ,e) + N(0, σ²),    (1)
c(x,y,l,θ,e) ∼ p(c) = ρ δ₀ + (1 − ρ) N(0, τ²), independently.    (2)
The base coefficients c(x,y,l,θ,e) is assumed to be independently distributed according to a distribution p(c), which is the mixture of a point mass at 0 (then the base is said to be inactive) and a Gaussian distribution with a large variance (for active state). In general, the basis (vocabulary) can be learned from natural images by minimizing the number of active bases (i.e. sparse coding) or by maximizing a measure of independence between the coefficients. In our experiments, we select a set of bases for vocabulary b(x,y,l,θ,e) , as shown in Figure 1. We use 128 difference of offset Gaussian (DOOG) filters (Malik and Perona, 1989) at various scales and orientations, even and odd. There are 5 scales, and we adopt a curvelet design of Candes and Donoho (2000), that is, for smaller scales, the bases become more elongate and there are more orientations. We also use the center-surround Difference of Gaussian (DOG) filters at 8 scales to account for variations that cannot be efficiently captured by the DOOG bases.
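To make the representation concrete, the following short Python/NumPy sketch samples from the two-level prior in (1)-(2). It is an illustration only: the randomly placed "bases" and the values chosen for (ρ, τ, σ) are assumptions of this sketch, not the DOOG/DOG vocabulary of Figure 1.

import numpy as np

rng = np.random.default_rng(0)
H, W, K = 64, 64, 200               # image size and number of candidate bases (assumed)
rho, tau, sigma = 0.95, 10.0, 1.0   # P(inactive), slab std, noise std (assumed values)

def random_base(h, w):
    """A placeholder localized base with zero DC component and unit l2 norm."""
    b = np.zeros((h, w))
    y, x = rng.integers(8, h - 8), rng.integers(8, w - 8)
    patch = rng.standard_normal((9, 9))
    patch -= patch.mean()                      # zero DC
    patch /= np.linalg.norm(patch) + 1e-12     # unit l2 norm
    b[y - 4:y + 5, x - 4:x + 5] = patch
    return b

bases = np.stack([random_base(H, W) for _ in range(K)])

# Spike-and-slab coefficients as in (2): c = 0 with probability rho, else N(0, tau^2).
active = rng.random(K) > rho
coeffs = np.where(active, rng.normal(0.0, tau, K), 0.0)

# Image as in (1): sparse superposition of the bases plus Gaussian noise.
image = np.tensordot(coeffs, bases, axes=1) + rng.normal(0.0, sigma, (H, W))
print(f"{int(active.sum())} active bases out of {K}")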
Fig. 1. Image bases used in representation and modeling.
2.2
Matching Pursuit and Its MCMC Version
Since the bases are over-complete, the coefficients have to be inferred by sampling a posterior probability. In this section, we extend the heuristic matching pursuit algorithm of Mallat and Zhang (1993) to a more principled Markov chain Monte Carlo (MCMC) algorithm. We sample the posterior distribution of the coefficients p({c(x,y,l,θ,e)} | I) in Model (1) in order to find a symbolic representation or a sketch of image I. We assume the parameters (ρ, τ², σ²) are known for this model. First, let's consider the Gibbs sampler for posterior sampling. For simplicity, we use j or i to index (x, y, l, θ, e) and we define zj = 1 if bj is active, i.e., cj ≠ 0, and zj = 0 otherwise. Then the algorithm is as follows:
1. Randomly select a base bj. Compute R = I − Σ_{i≠j} ci bi, i.e., the residual image. Let rj = ⟨R, bj⟩ / (1 + σ²/τ²), and σ̃² = 1/(1/σ² + 1/τ²).
2. Compute the Bayes factor by integrating out cj,
γj = p(I | zj = 1, {ci, ∀ i ≠ j}) / p(I | zj = 0, {ci, ∀ i ≠ j}) = exp{ rj² / (2σ̃²) } √(σ̃²/τ²).
Then the posterior probability is
p(zj = 1 | {ci, ∀ i ≠ j}, I) = (1 − ρ)γj / (ρ + (1 − ρ)γj).
3. Sample zj according to the above probability. If zj = 0, then set cj = 0; otherwise, sample cj ∼ N(rj, σ̃²). Go back to [1].
The problem with this Gibbs sampler is that if σ² is small, the algorithm is too willing to activate the base bj even though the response rj is not that large, or in other words, the algorithm is too willing to jump into a local energy minimum. The idea of matching pursuit of Mallat and Zhang (1993) can be used to remedy this problem. That is, instead of randomly selecting a base bj to update, we randomly choose a window W on the image domain, and look at all the inactive bases within this window W. Then we sample one base by letting these bases compete for an entry. So we have the following windowed-Gibbs sampler or Metropolized matching pursuit algorithm:
1. Randomly select a window W on the image domain, let A be the number of active bases within W, and let B be the number of inactive bases within W. With probability pbirth (a pre-designed number), go to [2]. With probability pdeath = 1 − pbirth, go to [4].
2. For each inactive base j = (x, y, l, θ, e) with (x, y) ∈ W and zj = 0, compute γj as described in [1] and [2] of the Gibbs sampler. Then with probability
paccept = [ pdeath Σ_{j: zj = 0, (x,y) ∈ W} (1 − ρ)γj / ρ ] / [ pbirth (A + 1) ],
go to [3]; with probability 1 − paccept go back to [1].
3. Among all the inactive bases j with (x, y) ∈ W and zj = 0, sample a base j with probability proportional to γj, then let zj = 1 and sample cj as described in [3] of the Gibbs sampler. Go back to [1].
4. If A > 0, then randomly select an active base bk with zk = 1 and (x, y) ∈ W. Then temporarily turn off bk, i.e., set ck = 0, and A → A − 1 temporarily. Then for all the inactive bases j with zj = 0 and (x, y) ∈ W, including base k, do the same computation as [2] (as if ck = 0), and compute paccept as in [2].
5. With probability 1/paccept, accept the proposal of deleting the base k, i.e., set ck = 0. Go back to [1]. With probability 1 − 1/paccept, reject the proposal of deleting base k, i.e., recover the original ck. Go back to [1].
This is a Metropolis algorithm with two types of proposals, adding an inactive base and deleting an active base, and this is a birth and death move. In addition to that, we can easily incorporate updating schemes, including 1) perturbing the coefficient of an active base, i.e., updating cj; 2) moving an active base to a different position, i.e., updating (x, y); 3) shortening or stretching an active base by changing its length l; 4) rotating an active base to a different orientation, i.e., updating θ.
For the simple model (1), if we let σ 2 → 0, then the Metropolis algorithm described above goes to the windowed version of the matching pursuit algorithm. In this paper, we shall just use the latter algorithm for simplicity. We feel that the MCMC version of matching pursuit is useful in two aspects. Conceptually, it helps us to understand matching pursuit as a limiting MCMC for posterior sampling. Practically, we believe that the MCMC version, especially with the moves for updating coefficients, positions, lengths, orientations of the active bases, is useful for us to re-estimate the image sketch after we fit a better prior model for sketch.
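As a point of reference, the σ² → 0 limit used here is essentially the classical matching pursuit loop. The following Python sketch of that loop is illustrative only: it assumes the dictionary is stored as a matrix of flattened, zero-DC, unit-norm bases, and the stopping threshold mirrors the |cj| > 5 cut-off used in Experiments I below.

import numpy as np

def matching_pursuit(image, bases, coeff_thresh=5.0, max_iter=500):
    """bases: (n_pixels, n_bases) matrix of flattened, zero-DC, unit-norm bases."""
    residual = image.astype(float).ravel().copy()
    coeffs = np.zeros(bases.shape[1])
    for _ in range(max_iter):
        responses = bases.T @ residual          # inner products <R, b_j>
        j = int(np.argmax(np.abs(responses)))
        if abs(responses[j]) < coeff_thresh:    # no base responds strongly enough
            break
        coeffs[j] += responses[j]               # activate / update base j
        residual -= responses[j] * bases[:, j]  # subtract its contribution
    return coeffs                               # the sketch is {j : coeffs[j] != 0}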
Fig. 2. Sparse decomposition by overcomplete basis and the symbolic sketch.
Experiments I. Figure 2 shows some results of sparse decomposition with overcomplete basis, and more results are shown in Figures 3 and 4. Figure 2.a are observed images, Figure 2.b are reconstructed images with a small number of bases whose coefficients are larger than 5, |cj | > 5, the rate of compression is usually about 100 to 150 folds, e.g., for a 200× 200 image, we only need about 300 bases. Figure 2.c is a finer reconstruction with all bases whose |cj | > 2, the rate of compression is about 30. For each selected base b(x,y,l,θ,e) , we illustrate it symbolically by a linelet of the same length l and orientation θ, while ignoring the brightness c(x,y,l,θ,e) and the odd/even indicator e. All the linelets are of the same width. Figure 2.d shows the sketches of the images.
3
Modeling the Image Sketch
Now we shall improve the simple prior model (1) by a sophisticated sketch model that accounts for the spatial arrangements of image bases.
We may use the following two representations interchangeably for the sketch of a texture image I, and let's denote the sketch by S.
1. A list: let n be the number of active bases, then we have S = {st = (xt, yt, ct, lt, θt, et), t = 1, ..., n}.
2. A bit-map: let δx,y be the indicator of whether there is an active base at (x, y) or not, then S = {sx,y = (δx,y, cx,y, lx,y, θx,y, ex,y), ∀(x, y)}. If δx,y = 0, i.e., there is no active base, then all the (c, l, θ, e) take null values.
Using the first representation, the two-level top-down model is of the following form:
S = {st = (xt, yt, ct, lt, θt, et), t = 1, ..., n} ∼ sketch model(Λ),
I = Σ_{st ∈ S} ct b(xt, yt, lt, θt, et) + ε,
where ε is Gaussian noise as in model (1).
Clearly, the sparse and independent component analysis is a special case of the above model, where at the top level, (xt , yt ) follow Poisson point process, (lt , θt , et ) follow independent uniform distributions, and ct follows a normal distribution with a large variance. The sketch S is the latent variable or missing data, and Λ is the unknown parameter. So the overall model should be fitted by the EM algorithm or its stochastic versions. The general form of such a model fitting procedure is to iterate the following two steps. 1. Scene reconstruction: Estimate S from I, conditional on Λ. 2. Scene understanding: Estimate Λ from S. In our current work, we isolate the problem of modeling S, assuming that S can be inferred from I by matching pursuit algorithm and its MCMC version with the simple sparse and independent prior model on S. To further simplify the problem, we ignore c and e, i.e., the photometrical aspects, and only concentrate on the modeling of (l, θ), i.e., the geometrical aspects of the sketch S. 3.1
Previous Models on Textures
1. MRF and FRAME models. The FRAME model by Zhu, et al. (1997) incorporates the idea of large filters and histogram[10] into Markov random fields. p(I | Λ) =
(1/Z(Λ)) exp{ Σ_{x,y} Σ_{l,θ,e} λ_{l,θ,e}( ⟨ I, b(x,y,l,θ,e) ⟩ ) },    (3)
where Λ = {λl,θ,e ( ), ∀(l, θ, e)} is a collection of one-dimensional functions, and Z(Λ) is the normalizing constant. This model is of the Gibbs form. The λ functions can be quantitized into step functions. This model is derived and justified
as the maximum entropy distribution that reproduces the marginal distributions (or histograms in the quantitized case) of < Iobs , b(x,y,l,θ,e) > where Iobs is the observed image. If we compare model (3) with model (1), we can see that in FRAME model (3), the b(x,y,l,θ,e) are used for bottom-up operation, and the model is based on features < I, b(x,y,l,θ,e) >. In contrast, in model (1), the b(x,y,l,θ,e) are used for top-down representation, and the model is based on latent variables c(x,y,l,θ,e) . We believe that the feature-based models and latent-variable models are two major classes of models, and the former can be modified into the latter if we replace features by latent variables. Of course, this is in general not a trivial step. As to the spatial model for point process S = {st = (xt , yt , lt , θt ), t = 1, ..., n}, Guo, et al. (2001) proposed a Gestalt model of the Gibbs form p(S | Λ) =
(1/Z(Λ)) exp{ λ₀ n + Σ_t λ₁(st) + Σ_{t₁ ∼ t₂} λ₂(s_{t₁}, s_{t₂}) },    (4)
where t₁ ∼ t₂ means that s_{t₁} and s_{t₂} are neighbors. So this model is a pair-potential Gibbs point process model (e.g., Stoyan, Kendall, Mecke, 1985). Guo et al. (2000) further parameterized the model by introducing a small set of Gestalt features (Koffka, 1935) for spatial arrangement. Again, the model can be justified by the maximum entropy principle. The Gibbs models (3) and (4) are analytically intractable because of the intractability of the normalizing constant Z(Λ). Sampling and maximum likelihood estimation (MLE) have to be done by MCMC algorithms.
2. Causal Markov models. One way to get around this difficulty is to use a causal Markov model. The causal methods have been used extensively in early work on texture generation, most notably, the causal model of Popat and Picard (1993). In the causal model, the joint distribution of the image I is factorized into a sequence of conditional distributions by imposing a linear order on all the pixels (x, y), e.g.,
p(I) = ∏_{x=1}^{m} ∏_{y=1}^{n} p(Ix,y | IN(x,y)),
where x, y index the pixel, and N(x, y) is the set of neighboring pixels within a certain spatial distance that are scanned before (x, y). The causal model is analytically tractable because of the factorization form. The causal plan has been successfully incorporated in example-based texture synthesis by Efros and Leung (1999), Liang et al. (2001), and Efros and Freeman (2001). Hundreds of realistic textures have been synthesized.
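As an illustration of this factorization, the Python sketch below scores an image under a generic causal model by scanning pixels in raster order; the conditional term cond_logprob and the neighborhood radius are placeholders assumed for the example, not the Popat-Picard model itself.

import numpy as np

def causal_neighbourhood(radius):
    """Offsets within `radius` that a raster scan visits before the current pixel."""
    offs = []
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            earlier = dx < 0 or (dx == 0 and dy < 0)
            if earlier and dx * dx + dy * dy <= radius * radius:
                offs.append((dx, dy))
    return offs

def causal_loglik(image, cond_logprob, radius=3):
    """Sum of log p(I_{x,y} | I_{N(x,y)}) over a raster scan of the image."""
    m, n = image.shape
    offs = causal_neighbourhood(radius)
    total = 0.0
    for x in range(m):
        for y in range(n):
            nbrs = [image[x + dx, y + dy] for dx, dy in offs
                    if 0 <= x + dx < m and 0 <= y + dy < n]
            total += cond_logprob(image[x, y], nbrs)
    return total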
3.2 Modeling the Sketch Patterns
With the above preparation, we are ready to describe our model for image sketch. Let S be the sketch of I, and let SN (x,y) be the sketch of the causal neighborhood
of (x, y). Recall that both S and SN (x,y) have two representations. Our model is of the following causal form p(S) =
∏_{x=1}^{m} ∏_{y=1}^{n} p(sx,y | SN(x,y)).
The conditional distribution is
p(sx,y | SN(x,y)) = (1/Z(Λ | SN(x,y))) exp{ λ₀ δx,y + λ₁(lx,y, θx,y) + Σ_{st ∈ SN(x,y)} λ₂(lx,y, θx,y; lt, θt; xt − x, yt − y) },
where Z is the normalizing constant; when δx,y = 0, lx,y and θx,y take on the null values and λ₁( ) and λ₂( ) are 0. If λ₁( ) and λ₂( ) are always 0, then the model reduces to the simple Poisson model (1). Like in the FRAME model (Zhu, et al. 1997) and the Gestalt model (Guo, et al. 2001), this conditional distribution can be derived or justified as the maximum entropy distribution that reproduces the probability that there exists a linelet, the distribution of the orientation and length of the linelet, and the pair-wise configuration made up by this linelet and a nearby existing linelet. In this model, the probability that we sketch a linelet at (x, y) depends on the attributes of this linelet, and more importantly, on how this linelet lines up with existing linelets nearby, for instance, whether the proposed linelet connects with a nearby existing linelet, or whether the proposed linelet is parallel with a nearby existing linelet, etc. One can envisage this conditional model as modeling the way an artist sketches a picture by adding one stroke at a time. Similar maximum entropy models are also used in language [3]. We can also write the conditional model in a log-additive form:
log [ p(δx,y = 1, lx,y, θx,y | SN(x,y)) / p(δx,y = 0 | SN(x,y)) ] = λ₀ + λ₁(lx,y, θx,y) + Σ_{st ∈ SN(x,y)} λ₂(lx,y, θx,y; lt, θt; xt − x, yt − y).
One may argue that a causal model for a spatial point process is very contrived. We agree. A causal order is physically nonsensical. But for the purpose of representing visual knowledge, it has some conceptual advantages because of its analytical tractability. The situation is very similar to view-based methods for object recognition. Moreover, the model is also suitable for the purpose of coding and compression. Mathematically, one may view this model as a causal (or factorization) approximation to the Gestalt model of Guo et al. (2001). It is expected that the causal approximation loses some of the expressive power of the non-causal model, but this may be compensated by making the causal model more redundant. The current form of the model only accounts for pair-wise configurations of the linelets. We can easily extend the model to account for configurations that involve more than two linelets.
3.3 Feature Statistics, Model Fitting, and Feature Pursuit
Because the length and orientation have been taken care of by λ₁(lx,y, θx,y), we choose to parameterize the pair-wise configuration function λ₂( ) as λ₂(D(sx,y, st), A(sx,y, st)), for st ∈ SN(x,y), where D( ) is the smallest distance between the two linelets, and A( ) the angle between the two linelets. Note that there is some loss of information by this parameterization; in particular, it cannot distinguish between two crossing linelets and two touching linelets if the angles are the same, because in both cases the smallest distance is 0. But still, we would like to see how far we can go with this simple parameterization. We could further express λ₁(l, θ) = λ₁₁(l) + λ₁₂(θ) + λ₁₃(l, θ), i.e., decompose λ₁ into the main effects for l and θ respectively, and the interaction or combination between l and θ. This is the ANOVA (analysis of variance) decomposition, which is redundant unless we add some constraints. We do not need to worry about this issue because we use this ANOVA decomposition for the sake of feature selection to be described later, and the feature selection method will not select redundant feature statistics. Similarly, we can decompose λ₂( ) by λ₂(D, A) = λ₂₁(D) + λ₂₂(A) + λ₂₃(D, A). After that we choose to quantize l, θ, D, and A, i.e., each of these variables can only take a finite set of values. Then the functions can be reduced to vectors of function values. For instance, for a function λ(x), if x ∈ {x₁, ..., xₙ}, we can represent λ( ) by λ = (λ₁, ..., λₙ), so that λj = λ(xj). Therefore λ(x) = Σ_{j=1}^{n} λj 1_{x=xj} = ⟨λ, H(x)⟩, where H(x) is the vector of indicators (1_{x=xj}, j = 1, ..., n). Therefore, we can write the conditional distribution in our sketch model as
p(sx,y | SN(x,y)) = (1/Z(Λ | SN(x,y))) exp{ λ₀ δx,y + ⟨λ₁₁, H₁₁(lx,y)⟩ + ⟨λ₁₂, H₁₂(θx,y)⟩ + ⟨λ₁₃, H₁₃(lx,y, θx,y)⟩ + Σ_{st ∈ SN(x,y)} [ ⟨λ₂₁, H₂₁(D(sx,y, st))⟩ + ⟨λ₂₂, H₂₂(A(sx,y, st))⟩ + ⟨λ₂₃, H₂₃(D(sx,y, st), A(sx,y, st))⟩ ] }
= (1/Z(Λ | SN(x,y))) exp{ ⟨Λ, H(sx,y | SN(x,y))⟩ },
where Λ = (λ₀, λ₁₁, λ₁₂, λ₁₃, λ₂₁, λ₂₂, λ₂₃), and
H(sx,y | SN(x,y)) = ( δx,y, H₁₁(lx,y), H₁₂(θx,y), H₁₃(lx,y, θx,y), Σ_{SN(x,y)} H₂₁(D(sx,y, st)), Σ_{SN(x,y)} H₂₂(A(sx,y, st)), Σ_{SN(x,y)} H₂₃(D(sx,y, st), A(sx,y, st)) ).
In our work, H(sx,y | SN(x,y)) has one to two hundred components. It can be easily shown that p(sx,y | SN(x,y)) achieves maximum entropy under the constraints on ⟨H(sx,y | SN(x,y))⟩_{p(sx,y | SN(x,y))}, where ⟨ ⟩_p means expectation with respect to the distribution p. The full model is
p(S | Λ) = ∏_{x=1}^{m} ∏_{y=1}^{n} p(sx,y | SN(x,y))
= { ∏_{x=1}^{m} ∏_{y=1}^{n} 1/Z(Λ | SN(x,y)) } × exp{ ⟨Λ, Σ_{x=1}^{m} Σ_{y=1}^{n} H(sx,y | SN(x,y))⟩ }.
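For illustration, the following Python sketch assembles the feature vector H(sx,y | SN(x,y)) from quantized bins in the ANOVA-style decomposition above; the bin counts and the binning of (l, θ, D, A) are assumptions of this sketch (12 bins for angles, orientations, and distances as in Experiments II below, 5 length bins assumed), and the conditional potential is then simply the inner product ⟨Λ, H⟩.

import numpy as np

N_L, N_TH, N_D, N_A = 5, 12, 12, 12   # assumed bin counts

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

def feature_vector(l_bin, th_bin, neighbour_bins):
    """neighbour_bins: one (d_bin, a_bin) pair per linelet in the causal neighbourhood."""
    parts = [np.ones(1),                                  # delta_{x,y} = 1
             one_hot(l_bin, N_L),                         # H11(l)
             one_hot(th_bin, N_TH),                       # H12(theta)
             one_hot(l_bin * N_TH + th_bin, N_L * N_TH)]  # H13(l, theta)
    H21, H22, H23 = np.zeros(N_D), np.zeros(N_A), np.zeros(N_D * N_A)
    for d, a in neighbour_bins:                           # summed over S_N(x,y)
        H21[d] += 1.0
        H22[a] += 1.0
        H23[d * N_A + a] += 1.0
    return np.concatenate(parts + [H21, H22, H23])

# The (unnormalized) log-potential of a candidate linelet is then <Lambda, H>:
# log_potential = Lambda @ feature_vector(l_bin, th_bin, neighbour_bins)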
Now, let's consider model fitting. Let Sobs be the observed sketch of an image. Then we can estimate Λ by maximizing the log-likelihood
l(Λ | Sobs) = Σ_{x,y} { ⟨Λ, H(s^{obs}_{x,y} | S^{obs}_{N(x,y)})⟩ − log Z(Λ | S^{obs}_{N(x,y)}) }.
Statistical theories of exponential family models tell us that
∂l(Λ | Sobs)/∂Λ = Σ_{x,y} { H(s^{obs}_{x,y} | S^{obs}_{N(x,y)}) − ⟨H(sx,y | S^{obs}_{N(x,y)})⟩_{p(sx,y | S^{obs}_{N(x,y)}, Λ)} },
and
∂²l(Λ | Sobs)/∂Λ² = − Σ_{x,y} Var_{p(sx,y | S^{obs}_{N(x,y)}, Λ)} [ H(sx,y | S^{obs}_{N(x,y)}) ].
Therefore, the log-likelihood is concave, and the model can be fitted by the Newton-Raphson or, equivalently in this case, the Fisher scoring algorithm,
Λ_{t+1} = Λ_t − [ ∂²l(Λ | Sobs)/∂Λ² |_{Λt} ]^{−1} ∂l(Λ | Sobs)/∂Λ |_{Λt}.
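A minimal sketch of this fitting loop, written for a generic concave exponential-family log-likelihood, is given below; grad_fn and hess_fn stand in for the two sums displayed above and are assumptions of this illustration, not part of the original implementation.

import numpy as np

def fit_newton_raphson(lam0, grad_fn, hess_fn, n_iter=5):
    """grad_fn/hess_fn return the first and second derivatives of l(Lambda | S_obs);
    about 5 iterations usually suffice (see text)."""
    lam = np.array(lam0, dtype=float)
    for _ in range(n_iter):
        g = grad_fn(lam)                      # sum of observed minus expected features
        H = hess_fn(lam)                      # minus the sum of feature covariances
        lam = lam - np.linalg.solve(H, g)     # Newton-Raphson / Fisher scoring step
    return lam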
The convergence of Newton-Raphson is very fast; usually 5 iterations can already give a very good fit. Both the first and second derivatives of the log-likelihood are of the form Σ_{x,y} g(x, y). For each pixel (x, y), we need to evaluate the probabilities of all possible sx,y. So the computation is still quite costly, although much more manageable compared to MCMC types of algorithms. To further increase the efficiency, we choose to sample a small number of pixels instead of going through all of them. More specifically, for each (x, y), let πx,y ∈ [0, 1] be the probability that pixel (x, y) will be included in the sample. Then after we collect a sample of (x, y) by independent coin-flipping according to πx,y, we can approximate
Σ_{x,y} g(x, y) ≈ Σ_{(x,y) ∈ sample} g(x, y)/πx,y,
where the right hand side is the Horvitz-Thompson unbiased estimator of the left hand side. As to the choice of πx,y, if there is a linelet at (x, y) on Sobs, then we always let πx,y = 1. For other empty pixels (x, y), we can set πx,y according to our need for speed. Usually, even if we take πx,y = .01 for empty pixels, the algorithm can still give a satisfactory fit.
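The subsampling step can be written in a few lines; the sketch below is illustrative only, with πx,y equal to 1 at sketched pixels and an assumed constant (.01) elsewhere, and with the per-pixel terms g given as an array.

import numpy as np

def ht_sum(g, sketched_mask, p_empty=0.01, rng=None):
    """Horvitz-Thompson estimate of sum_{x,y} g(x, y) from a pixel subsample."""
    rng = rng or np.random.default_rng(0)
    pi = np.where(sketched_mask, 1.0, p_empty)   # pi = 1 at linelets, small elsewhere
    keep = rng.random(g.shape) < pi              # independent coin flips
    return float(np.sum(g[keep] / pi[keep]))     # reweighting keeps the sum unbiased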
It is often the case that some components of Σ_{x,y} H(s^{obs}_{x,y} | S^{obs}_{N(x,y)}) are 0, and if we implement the usual Newton-Raphson procedure, then the corresponding components of Λ will go to −∞, thereby creating hard constraints. While this is theoretically the right result, it can make the algorithm unstable. We choose to stop fitting such components as long as the corresponding components of Σ_{x,y} ⟨H(sx,y | S^{obs}_{N(x,y)})⟩_{p(sx,y | S^{obs}_{N(x,y)})} drop below a threshold, e.g., .5. For a specific observed image, we do not want to use all the one to two hundred dimensions in our model. In fact, we can just select a small number of components of H(sx,y | SN(x,y)) using some model selection methods such as the Akaike information criterion (AIC). While best-set selection is too time-consuming, we can consider a feature pursuit scheme studied by Zhu, et al. (1997), i.e., we start with only λ₀ and δx,y in our model. Then we repeatedly add one component of H at a time, so that the added component leads to the maximum increase in log-likelihood. Although the log-likelihood is analytically tractable for the causal model, the computation of the increase of log-likelihood for each candidate component of H is still quite costly. So as an approximation, we choose the component Hk for which
gk = { Σ_{x,y} Hk(s^{obs}_{x,y} | S^{obs}_{N(x,y)}) − Σ_{x,y} ⟨Hk(sx,y | S^{obs}_{N(x,y)})⟩_{p̂} }² / Σ_{x,y} Var_{p̂}[ Hk(sx,y | S^{obs}_{N(x,y)}) ]
is the largest, where p̂ is the currently fitted model. Intuitively, we choose a component that is worst fitted by the current model. Then we stop after a number of steps. Again, the Horvitz-Thompson estimator can be used for faster computation of gk.
Experiments II. Figures 3 and 4 show some experiments. For each input image (left), we first compute its sketch (middle), and then a sketch model is learned with 40-60 feature statistics selected by feature pursuit. The angles as well as orientations are quantized into 12 bins, and the distances are quantized into 12 bins too. Of course, the model fails to capture some features to which human vision is sensitive, which is expected of a pair-wise model. More sophisticated features should be used, and we leave this to future work.
4
Discussion
There are two major loose ends in our work. One is that the coefficients of the active bases are not modeled. The other is that the model fitting is not rigorous. The model can be further extended to incorporate more sophisticated local structures, such as local shapes and lighting, as well as more sophisticated organizations such as flows and graphs. The key is that these concepts should be understood in the context of a top-down generative model. For some stochastic textures, sparse decomposition may not be achievable, and therefore, we might have to stay with models built on pixel values such as the FRAME model. We would like to stress that our goal in this work is to find a top-down generative model for textures. We are not merely trying to come up with a linedrawing version of the texture image by some edge detector, and then synthesize
Fig. 3. Examples of learning sketch model. a) is an input image, b) is its computed sketch, and c) is the synthesized sketch as a random sample from the sketch model learned from b).
similar line-drawings. We would also like to point out that our work is inspired by Marr's primal sketch (Marr, 1982). Marr's method is bottom-up procedure-based, that is, there does not exist a top-down model to guide the bottom-up procedure. Our eventual goal is to find the top-down model as a visual system's conception of primal sketch, so that the largely bottom-up procedure will be model-based. The hope is that the model is unified and explicit like a language, with rich vocabularies for local image structures as well as their spatial organizations. When fitted to a particular image, an automatic model selection procedure will identify a low-dimensional sub-model as the most meaningful words. The model should lie between physics-based models (that are not explicit and unified) and
Fig. 4. Examples of learning sketch model. a) is an input image, b) is its computed sketch, and c) is the synthesized sketch as a random sample from the sketch model learned from b).
image-based synthesis (that does not involve dimension reduction). Needless to say, our current effort is merely a modest first step towards this goal.
References 1. A. J. Bell and T.J. Sejnowski, “An information maximization approach to blind separation and blind deconvolution”, Neural Computation, 7(6): 1129-1159, 1995. 2. F. Bergeaud and S. Mallat, “Matching pursuit: adaptive representation of images and sounds.” Comp. Appl. Math., 15, 97-109. 1996. 3. A. Berger, V. Della Pietra, and S. Della Pietra, “A maximum entropy approach to natural language processing”, Computational Linguistics, vol.22, no. 1 1996. 4. E. J. Cand`es and D. L. Donoho, “Curvelets - A Surprisingly Effective Nonadaptive Representation for Objects with Edges”, Curves and Surfaces, L. L. Schumaker et al. (eds), Vanderbilt University Press, Nashville, TN. 5. A. P. Dempster, N.M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm”, Journal of the Royal Statistical Society series B, 39:1-38, 1977. 6. A. A. Efros and T. Leung, “Texture synthesis by non-parametric sampling”, ICCV, Corfu, Greece, 1999. 7. A. A. Efros and W. T. Freeman, “Image Quilting for Texture Synthesis and Transfer”, SIGGRAPH 2001. 8. S. Geman and D. Geman. “Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images”. IEE Trans. PAMI 6. pp 721-741. 1984. 9. C. E. Guo, S. C. Zhu, and Y. N. Wu, “Visual learning by integrating descriptive and generative methods”, ICCV, Vancouver, CA, July, 2001. 10. D. J. Heeger and J. R. Bergen, “Pyramid-based texture analysis/synthesis”, SIGGRAPHS, 1995. 11. M. S. Lewicki and B. A. Olshausen, “A probabilistic framework for the adaptation and comparison of image codes”, JOSA, A. 16(7): 1587-1601, 1999. 12. L. Liang, C. Liu, Y. Xu, B.N. Guo, H.Y. Shum, “Real-Time Texture Synthesis By Patch-Based Sampling”, MSR-TR-2001-40, March 2001. 13. J. Malik, and P. Perona, “Preattentive texture discrimination with early vision mechanisms”, J. of Optical Society of America A, vol 7. no.5, May, 1990. 14. S. G. Mallat, “A theory for multiresolution signal decomposition: the wavelet representation”, IEEE Trans. on PAMI, vol.11, no.7, 674-693, 1989. 15. S. Mallat and Z. Zhang, “Matching pursuit in a time-frequency dictionary”, IEEE trans. on Signal Processing, vol.41, pp3397-3415, 1993. 16. D. Marr, Vision, W.H. Freeman and Company, 1982. 17. D. B. Mumford “The Statistical Description of Visual Signals” in ICIAM 95, edited by K.Kirchgassner, O.Mahrenholtz and R.Mennicken, Akademie Verlag, 1996. 18. B. A. Olshausen and D. J. Field, “Emergence of simple-cell receptive field properties by learning a sparse code for natural images” Nature, 381, 607-609, 1996. 19. K. Popat and R. W. Picard, “Novel Cluster-Based Probability Model for Texture Synthesis, Classification, and Compression.” Proc. of the SPIE Visual Comm. and Image Proc., Boston, MA, pp. 756-768, 1993. 20. J. Portilla and E. P. Simoncelli, “A parametric texture model based on joint statistics of complex wavelet coefficients”, IJCV, 40(1), 2000. 21. Y. Wu, S. C. Zhu, and X. Liu, (2000)“Equivalence of Julesz texture ensembles and FRAME models”, IJCV, 38(3), 247-265. 22. S. C. Zhu, Y. N. Wu and D. B. Mumford, “Minimax entropy principle and its application to texture modeling”, Neural Computation Vol. 9, no 8, Nov. 1997.
Classifying Images of Materials: Achieving Viewpoint and Illumination Independence Manik Varma and Andrew Zisserman Robotics Research Group Department of Engineering Science University of Oxford Oxford, OX1 3PJ {manik,az}@robots.ox.ac.uk
Abstract. In this paper we present a new approach to material classification under unknown viewpoint and illumination. Our texture model is based on the statistical distribution of clustered filter responses. However, unlike previous 3D texton representations, we use rotationally invariant filters and cluster in an extremely low dimensional space. Having built a texton dictionary, we present a novel method of classifying a single image without requiring any a priori knowledge about the viewing or illumination conditions under which it was photographed. We argue that using rotationally invariant filters while clustering in such a low dimensional space improves classification performance and demonstrate this claim with results on all 61 textures in the Columbia-Utrecht database. We then proceed to show how texture models can be further extended by compensating for viewpoint changes using weak isotropy. The new clustering and classification methods are compared to those of Leung and Malik (ICCV 1999), Schmid (CVPR 2001) and Cula and Dana (CVPR 2001), which are the current state-of-the-art approaches.
1
Introduction
The objective of this paper is the classification of materials from their imaged appearance. Texture is a material property of which we have only an intuitive, and not a sound mathematical, understanding. Therefore, classifying materials by their textural appearance in single images photographed under unknown viewing and illumination conditions is still quite an outstanding problem, though significant progress has been made recently [2,3,9,12,13,17,20]. In particular, Leung and Malik [13] made an important innovation in giving an operational definition of a texton. They defined a 2D texton as a cluster centre in filter response space. This not only enabled textons to be generated automatically from an image, but also opened up the possibility of a universal set of textons for all images. In the same paper Leung and Malik also made a serious attempt on the problem of classifying textures under varying viewpoint and illumination. Their solution was a 3D texton which is a cluster centre of filter responses over a stack of images with representative viewpoints and lighting.
Fig. 1. The change in image appearance of the same texture (# 30, Plaster B) with variation in imaging conditions. Top row: constant viewing angle and varying illumination. Bottom row: constant illumination and varying viewing angle. There is a considerable difference in the appearance across images.
The need to seriously address “3D effects” – where the appearance varies considerably with viewpoint and lighting – is illustrated in figure 1. The importance of such effects for classification [1,4,5] and synthesis [15,19] has also been noted by other researchers. Our approach to the classification problem is to model a texture as a distribution over textons, and learn the textons and texture models from training images. Classification of a novel image then proceeds by mapping the image to a texton distribution and comparing this distribution to the learnt models. Consequently this is quite a standard approach, but the originality is at two points: First, our texton clustering is in an extremely low dimensional space and is also rotationally invariant. The second innovation is to classify the texture from a single image, and we achieve this while representing each 3D texture by a small set of models. This is a considerable difference in approach to that of [13] to which we compare our results, and it is worth elaborating on the difference at this point. The main thrust of Leung and Malik’s work is towards classifying images using 3D textons. In the learning stage, 20 images of each texture are geometrically registered and mapped to a 48 dimensional filter response space. The registration is necessary because the clustering that defines the texton is in the stacked 20 × 48 = 960 dimensional space (i.e. the textons are 960-vectors), and it is important that each filter is applied to the same texture point as viewpoint and illumination vary. In the classification stage, 20 novel images of the same texture are presented. However, again these images must be registered and more significantly must have the same order as the original 20 (i.e. they must be taken from images with similar viewpoint and illumination to the original). In essence, the viewpoint and lighting are being supplied implicitly by this ordering. Leung and Malik also use an MCMC algorithm for classifying a single image under known imaging conditions. However, the classification rate of this method is 87%, far inferior to the 95.6% achieved by the multiple image method.
Fig. 2. Textures from the Columbia-Utrecht database. All images are converted to monochrome in this work, so colour is not used in discriminating different textures.
Here we present a texture classification method with superior performance to [13], but requiring only a single image as input and with no information (implicit or explicit) about the illumination and viewing conditions. Our approach is most closely related to that of Schmid [18] and Cula and Dana [2,3]. Schmid’s approach is rotationally invariant but the invariance is achieved in a different manner and texton clustering is in a higher dimensional space than here. Cula and Dana’s approach is not rotationally invariant and the method of model selection differs from here. The paper is organized as follows: section 2 is concerned with rotationally invariant clustering and the classification rates of four filter sets are compared. The sets include those used by Schmid [18] and Leung and Malik [13]. Then, in section 3, we describe how the number of models for each texture can be substantially reduced. We present results on the Columbia-Utrecht database [6], the same database used by [2,13]. The database contains 61 textures, and each texture has 205 images obtained under different viewing and illumination conditions. The variety of textures in this database is illustrated in figure 2.
Fig. 3. The LM filter bank has a mix of edge, bar and spot filters at multiple scales and orientations. It has a total of 48 filters - 6 oriented filters at 3 scales and 2 phases, 8 Laplacian of Gaussian filters and 4 Gaussian filters.
2
Rotationally Invariant Filters and Their Effect on Textons
In this section we investigate the advantages and disadvantages of clustering in a rotationally invariant filter response space. The use of such filters has already been espoused by Schmid [18]. To be definite, we will compare four filter sets: those of Leung and Malik which are not rotationally invariant; those of Schmid which are; and two reduced sets of filters proposed here using maximum response (which are again rotationally invariant). We will assess the filter sets by the classification rate they achieve based on textons clustered in their response space. 2.1
Description of the Four Filter Sets
The Leung-Malik (LM) set: consists of 48 filters, partitioned as follows: second derivative of Gaussian filters at 6 orientations, 3 scales and 2 phases (where phase refers to even and odd symmetry), making a total of 36; 8 Laplacian of Gaussian filters; and 4 Gaussians. The scales of the filters range between σ = 1 and σ = 10. They are shown in figure 3.
The Schmid (S) set: consists of 13 rotationally invariant filters of the form
F(r, σ, τ) = F0(σ, τ) + cos(πτr/σ) e^{−r²/(2σ²)},
where F0(σ, τ) is added to obtain a zero DC component, with the (σ, τ) pair taking values (2,1), (4,1), (4,2), (6,1), (6,2), (6,3), (8,1), (8,2), (8,3), (10,1),
Fig. 4. The S filter bank is rotationally invariant and has 13 isotropic, “Gabor-like” filters.
(10,2), (10,3) and (10,4). The filters are shown in figure 4. As can be seen all the filters have rotational symmetry. The Maximum Response (MR) sets: The MR4 filter bank consists of 14 filters but only four filter responses. The filter bank contains filters at multiple orientations but their outputs are “collapsed” by recording only the maximum filter response across all orientations. This achieves rotation invariance. In a similar spirit to [11,12] we have chosen the scale of our filters as that which gives the strongest response to the textures in the database, and only use a single scale for each filter (we return to the issue of scale below). The filter bank is shown in figure 5 and consists of a Gaussian and a Laplacian of Gaussian both with σ = 10 (these filters have rotational symmetry), an edge filter (σx = 4, σy = 12), and a bar filter (σx = 4, σy = 12). The latter two filters are oriented. Measuring only the maximum response reduces the number of responses from 14 (six orientations for two oriented filters, plus two isotropic) to 4. The MR8 filter bank is similar to the MR4 filter bank but the oriented edge and bar filters occur at 3 scales (σx ,σy )={(1,3), (2,6), (4,12)} thereby giving a total of six responses, three for the edge filter and three for the bar. The remaining two filters are the Gaussian and Laplacian of Gaussian with σ = 10 taken from the MR4 filter bank. The motivation for introducing these MR filters sets is twofold. First, we achieve rotation invariance, but are also able to record the angle of maximum response. This angular information is lost in the S filters. Thus, with the MR filters we are able to compute co-occurrence statistics on orientation, and such statistics may prove useful in discriminating isotropic from anisotropic textures. We return to this point in section 4. The second motivation concerns the low dimensionality of the filter response space. We would like to use the entire filter response of a texture for classification (in the manner of [12]), rather than its vector quantization into textons. Again, we return to this issue in section 4. 2.2
Pre- and Post-processing
When computing the filter responses the following three steps are followed:
Fig. 5. The MR4 filter bank consists of four filters - two rotationally symmetric ones (a Gaussian and a Laplacian of Gaussian), and two oriented ones (an edge and a bar filter, at six different orientations).
First, before any of the filters are applied, all the images are converted to grey scale and are intensity normalised to have zero mean and unit standard deviation. This normalization gives partial invariance to linear transformations in the illumination conditions of the images. Second, all 4 filter banks are L1 normalised so that the filter responses of each filter lie roughly in the same range. In more detail, each filter Fi in the filter bank is divided by ‖Fi‖₁ so that the filter has unit L1 norm. This helps vector quantization, when using Euclidean distances, as the scaling for each of the filter response axes becomes the same. Third, following [8] (and motivated by Weber's law), the filter response at each pixel x is (contrast) normalized as F(x) ← F(x) [log(1 + L/0.03)]/L, where L = ‖F(x)‖₂ is the magnitude of the filter response vector at that pixel.
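For concreteness, the three steps can be sketched as follows in Python; the use of scipy.ndimage for the convolution and the generic filter_bank argument are assumptions of this illustration, not a description of the authors' implementation.

import numpy as np
from scipy.ndimage import convolve

def filter_responses(image, filter_bank):
    """filter_bank: iterable of 2-D kernels (an assumption of this sketch)."""
    # Step 1: zero-mean, unit-standard-deviation intensity normalisation.
    img = (image.astype(float) - image.mean()) / (image.std() + 1e-12)
    # Step 2: L1-normalise each filter so the responses lie in comparable ranges.
    bank = [f / (np.abs(f).sum() + 1e-12) for f in filter_bank]
    resp = np.stack([convolve(img, f) for f in bank], axis=-1)   # (H, W, n_filters)
    # Step 3: Weber-law contrast normalisation of the response vector at each pixel.
    L = np.linalg.norm(resp, axis=-1, keepdims=True)
    return resp * np.log1p(L / 0.03) / (L + 1e-12)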
2.3 Textons by Clustering
We now consider clustering the filter responses in order to generate a texton dictionary. This dictionary will subsequently be used to define texture models based on learning from training images. For each filter set, we adopt the following procedure for computing a texton dictionary: For each texture, we select 13 images randomly (these images sample the variations in illumination and viewpoint), the filter responses over all these images are aggregated, and 10 texton cluster centres computed using the standard K-Means algorithm [7]. The learnt textons for each texture are then aggregated into a single dictionary. For example, if 5 textures are used then the dictionary will contain 50 textons. However, the classification accuracy achieved is not very sensitive to variations in the procedure for dictionary generation (such as the particular textures
Fig. 6. The first 100 textons recovered from 20 training textures using 13 images per texture: (a) S textons. (b) LM textons. (c) MR8 textons. Note that the S textons are rotationally symmetric.
and method of choosing images etc), see subsection 3.3. Examples of the textons for each filter set are shown in figure 6. Our clustering task is considerably simpler than that of Leung and Malik, and Cula and Dana (who use the same filter bank) as we are able to cluster in low, 4 and 8, dimensional spaces. This compares to 13 dimensional for S, and 48 dimensional for LM (we are not considering 3D textons at this point where the dimensionality is 960). Concerning the rotation properties of the LM and MR textons, consider a texture and an (in plane) rotated version of the same texture. Corresponding features in the original and the rotated texture will map to the same point in MR filter space, but to a different point in LM. It is therefore expected that more significant clusters will be obtained in the rotationally invariant case. Secondly, for the LM filter set, which is not rotationally invariant, it would be expected that its textons can not classify a rotated version of a texture unless the rotated version is included in the training set (both of these points are demonstrated in figures 7 and 8). This establishes that there is an advantage in rotation invariance in that rotated versions of the same texture can be represented by one histogram, whilst several are required for the LM textons. However, there is still the possibility that rotation invariance has the disadvantage that two different textures (which are not rotationally related) have the same histogram. We address this point next, where we compare classification rates over a variety of textures.
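The dictionary-generation procedure described above can be sketched as below; the hand-rolled K-means, the array shapes, and the way responses are supplied (already aggregated over each texture's training images) are illustrative assumptions of this sketch only.

import numpy as np

def kmeans(X, k, n_iter=30, rng=None):
    """Plain K-means on the rows of X (filter response vectors)."""
    rng = rng or np.random.default_rng(0)
    centres = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        d = ((X ** 2).sum(1)[:, None] - 2.0 * X @ centres.T
             + (centres ** 2).sum(1)[None, :])          # squared Euclidean distances
        labels = d.argmin(1)
        centres = np.stack([X[labels == j].mean(0) if np.any(labels == j)
                            else centres[j] for j in range(k)])
    return centres

def texton_dictionary(responses_per_texture, k_per_texture=10):
    """responses_per_texture: one (n_pixels, n_filters) array per training texture,
    aggregated over its (e.g. 13) training images.  Returns the stacked textons,
    e.g. 20 textures x 10 centres = 200 textons."""
    textons = [kmeans(resp, k_per_texture) for resp in responses_per_texture]
    return np.concatenate(textons)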
Fig. 7. Textons learnt from original and rotated textures: (a) textons learnt from the LM filter bank for an original texture image of ribbed paper and a version rotated by 90 degrees (shown in figure 8). (b) depicts the textons learnt by the MR4 filter bank. In each case, the 10 textons learnt from the original image are shown in the top 2 rows while the 10 learnt from the rotated image are shown in the bottom 2 rows. Note that the LM filter bank learns two sets of textons (where one set is essentially the other set rotated by 90 degrees) and therefore will have two very different histograms for the same texture. Conversely, the MR4 textons are virtually identical.
Fig. 8. Classification of rotated textures. Column (a) shows three images of the ribbed paper texture (the first two are from # 38B, the last from # 38). Column (b) shows the texton histograms using the LM filter bank. Column (c) shows the texton histograms obtained using the MR filter bank. Note that in the LM case, neither of the top two histograms matches the third (as seen through the eyes of the χ2 distance where small values in a bin are significant), i.e. the description is rotationally variant. In the MR case all the histograms are, more or less, similar.
2.4
Classification Method and Results
Here we perform three experiments to assess texture classification rates over 92 images for each of 20, 40 and 61 textures respectively. The first experiment, where we classify images from 20 textures, corresponds to the setup employed by Cula and Dana [2]. The second experiment, where 40 textures are classified, is modeled on the setup of Leung and Malik [13]. In the third experiment, we classify all 61 textures present in the Columbia-Utrecht database. The 92 images
are selected as follows: for each texture in the database, there are 118 images where the viewing angle θv is less than 60 degrees. Out of these, only those 92 are chosen for which a sufficiently large region could be cropped across all texture classes. Each experiment consists of three stages: texton dictionary generation; model generation, where models are learnt for the textures from training images; and, classification of novel images. The 92 images for each texture are partitioned into two sets. Images in the first (training) set are used for dictionary and model generation, classification accuracy is only assessed on the 46 images for each texture in the second (test) set. Each of the 46 images per texture defines a model for that texture as follows: the image is mapped (vector quantized) to a texton distribution (histogram). Thus, each texture is represented by a set of 46 histograms. An image from the test set is then classified by choosing the closest model histogram, and hence the corresponding texture class. The distance function used to define closest is the χ2 significance test [16]. In all three experiments we follow both [2] and [13], and learn the texton dictionary from 20 textures (using the procedure outlined before in subsection 2.3). The particular textures used are specified in figure 7 of [13]. In the first experiment, 20 novel textures are chosen (see figure 5 in [2] for a list of the novel textures) and 20 × 46 = 920 novel images are classified in all. In the second experiment, the 40 textures specified in figure 7 of [13] are chosen. Here, a total of 40 × 46 = 1840 novel images are classified. Finally, in the third experiment, all the textures in the Columbia-Utrecht database are classified using the same procedure. The results for all three experiments are presented in table 1. Table 1. Comparison of the classification rates for varying number of texture classes for each of the four filter sets. In all cases, 46 models are used for each texture class. filters S LM MR4 MR8
20 texture classes: 96.30%  96.08%  94.13%  97.50%
40 texture classes: 95.16%  93.75%  92.06%  96.30%
61 texture classes: 94.54%  93.44%  90.76%  96.07%
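For illustration, the classification stage described above can be sketched as follows; the helper names, the precomputed per-pixel responses, and the stored model histograms are assumptions of this sketch, and a common symmetric form of the χ² histogram statistic is used as the distance function.

import numpy as np

def texton_histogram(responses, textons):
    """responses: (n_pixels, n_filters); textons: (n_textons, n_filters)."""
    d = ((responses ** 2).sum(1)[:, None]
         - 2.0 * responses @ textons.T
         + (textons ** 2).sum(1)[None, :])          # squared Euclidean distances
    labels = d.argmin(1)                            # vector quantization to textons
    hist = np.bincount(labels, minlength=len(textons)).astype(float)
    return hist / hist.sum()

def chi2(h1, h2, eps=1e-12):
    return 0.5 * float(np.sum((h1 - h2) ** 2 / (h1 + h2 + eps)))

def classify(responses, textons, model_hists, model_labels):
    h = texton_histogram(responses, textons)
    dists = [chi2(h, m) for m in model_hists]       # compare to every stored model
    return model_labels[int(np.argmin(dists))]      # label of the closest model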
Discussion: Two points are notable in these results. First, the MR8 and S filters outperform the LM filters. This is a clear indicator that a rotationally invariant description is not a disadvantage (i.e. salient information for classification is not lost). Second, the fact that MR8 does better than S and LM is also evidence that clustering in a lower dimensional space is advantageous. The MR4 filter bank loses out because it only contains filters at a single scale and hence can't extract such rich features. What is also very encouraging with this classification
method is that as the number of texture classes increases there is only a small decrease in the classification rate.
3
Reducing the Number of Models
In this section, our objective is to reduce the number of models required for each texture class. In the previous section, the number of models was the same as the number of training images (and in effect [13] used 20 models/images for every texture). Here we want to reduce the number of models to that appropriate for each texture, independent of the number of training images. As is demonstrated in figure 9 some textures may require far fewer models than others. We will investigate two approaches for reducing the number of models. The first is selection of the histograms to determine the “natural” models for a particular texture. The second is pose normalization. The classification experiments in this section are slightly modified so as to maximise the total number of images classified. If only M models per texture are used for training, then the rest of the 46 − M training images are added to the test set so that they too may be classified. Thus, for example when classifying 61 textures, if only M = 10 models are used then a total of 82 images per texture are classified giving a total of 82 × 61 = 5002 images. The texton dictionary used is the same as in the previous section and has 200 textons.
Fig. 9. Models per texture: the top row shows four images of the same texture, ribbed paper, photographed under different viewing and lighting conditions. The images look very different. The bottom row shows images of rough paper taken under the same conditions as the images in the first row. These images don’t differ so markedly because the texture doesn’t exhibit surface normal effects. The consequence is that fewer models are required to represent rough paper over all viewpoints and lighting than for ribbed paper.
3.1 Selection to Determine Texture Models
We would expect that the number of different models that a texture requires is a function of how much the texture changes in appearance with imaging conditions, i.e. it is a function of the material properties of the texture. For example, if a texture is isotropic then the effect of varying the lighting azimuthal angle will be less pronounced than for one that is anisotropic. Thus other parameters (such as relief profile) being equal, fewer models would be required for the isotropic texture (than the anisotropic) to cover the changes due to lighting variation. However, if we are selecting models for the express purpose of classification, then another parameter, the inter class image variation, also becomes very important in determining the number of models. For example, even if a texture varies considerably with changing imaging conditions it can be classified accurately using just a few models if all the other textures look very different from it. Conversely, if two textures look very similar then many models may be needed to distinguish between them even if they do not show much variation individually. Here, we investigate two schemes for model reduction, both of which take into account the inter and intra class image variation. K-Medoid algorithm: Each histogram may be thought of as a point in Rn , where n is the number of bins in the histogram, so that the models for a particular texture class simply consist of a set of points in the Rn space. Therefore, given a distance function between two points, in our case χ2 , the K-Medoid algorithm may be used to cluster the histograms of all the training images (i.e. all the training images taken from all the texture classes) into representative centres (models). K-Medoid is a standard clustering algorithm [10] where the update rule always moves the cluster centre to the nearest data point in the cluster, but does not merge the points as in K-means. In fact, K-Meanscan only be applied to each texture class locally. It can not be applied here as it merges data points and thus the resultant cluster centres can not be identified distinctly with individual textures. Table 2a lists the results of classifying 20 textures using the four different filter banks with K = 60, 120 and 180, resulting in an average of 3, 6 and 9 models per texture. Table 2. Classification rates for each of the four filter sets when the models are automatically selected by (a) the K-Medoid algorithm, and (b) the Greedy algorithm. In all cases there are 20 texture classes. filters S LM MR4 MR8
           (a) K-Medoid algorithm                 (b) Greedy algorithm
           Average # of models per texture        Average # of models per texture
filters        3        6        9                    3        6        9
S          74.66%   86.40%   90.12%               88.37%   97.21%   98.01%
LM         74.16%   86.69%   90.36%               86.69%   95.99%   97.83%
MR4        69.94%   80.93%   85.36%               85.00%   93.66%   96.39%
MR8        79.10%   89.59%   93.49%               90.67%   98.14%   98.61%
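The K-Medoid clustering described above can be sketched as follows. This is a minimal illustration, assuming the texton histograms of all training images are supplied as rows of a NumPy array; the χ2 distance and the move-to-nearest-data-point update follow the text, while the function names and the iteration cap are our own.

```python
import numpy as np

def chi2(h1, h2, eps=1e-10):
    # Symmetric chi-squared distance between two normalised histograms.
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def k_medoid(histograms, k, n_iter=50, seed=0):
    """Cluster texton histograms into k medoids (models).

    histograms : (N, n_bins) array, one row per training image.
    Returns the indices of the selected medoid histograms.
    """
    rng = np.random.default_rng(seed)
    n = len(histograms)
    # Precompute the full pairwise chi-squared distance matrix.
    d = np.array([[chi2(histograms[i], histograms[j]) for j in range(n)]
                  for i in range(n)])
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(d[:, medoids], axis=1)   # nearest-medoid assignment
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members) == 0:
                continue
            # The medoid is the member with minimum summed distance to the
            # other members; being a data point, it remains identifiable
            # with a single texture class (unlike a K-Means mean).
            within = d[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids
```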
The classification rate with 9 K-Medoid selected models per texture is almost as good as using all 46 models (see column 1 in table 1). It should be noted here that, due to the increase in the size of the test set, many more novel images have to be classified correctly to achieve the same classification rate as in the previous section. Nevertheless, there is still significant room for improvement.
Greedy algorithm: An alternative to the K-Medoid clustering algorithm is a greedy algorithm designed to maximise the classification accuracy while minimising the number of models used. The algorithm is initialized by setting the number of models equal to the number of training images. Then, at each iteration step, one model is discarded. This model is chosen to be the one for which the classification accuracy decreases the least when it is dropped. This iteration is repeated until no more models are left. Figure 10 and table 2 show the resultant classification accuracy versus number of models for the four filter banks when classifying 20, 40 and 61 textures. Figure 11 shows the 9 textures that were assigned the most models as well as the 9 textures that were assigned the fewest models while classifying all 61 textures. A very respectable classification rate of over 97% correct is achieved for an average of 9 models per texture, even when all 61 texture classes are included. Furthermore, for most of the images that were misclassified, the correct texture class was ranked within the 5 most probable texture classes. It can therefore be hoped that incorporating extra information in the form of second order statistics or the co-occurrence of textons will lead to almost all the images being classified correctly (see future work). These results compare very favourably with those reported in [2] and [13].
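A sketch of the greedy backward-elimination loop just described. The evaluation function is a placeholder for the χ2 nearest-neighbour classifier of the previous section, which is assumed rather than implemented here.

```python
def greedy_model_selection(models, evaluate_accuracy):
    """Greedy backward elimination of texture models.

    models            : list of (texture_class, histogram) pairs, initially
                        one model per training image.
    evaluate_accuracy : callable returning classification accuracy for a
                        given model set (assumed to wrap the chi-squared
                        nearest-neighbour classifier; not defined here).
    Returns the accuracy recorded at every model-set size, so the accuracy
    at, say, an average of 9 models per texture can be read off afterwards.
    """
    history = []
    while len(models) > 1:
        # Drop the model whose removal hurts the accuracy least.
        best_acc, best_idx = -1.0, None
        for i in range(len(models)):
            trial = models[:i] + models[i + 1:]
            acc = evaluate_accuracy(trial)
            if acc > best_acc:
                best_acc, best_idx = acc, i
        models.pop(best_idx)
        history.append((len(models), best_acc))
    return history
```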
Fig. 10. Classification rates for models selected by the Greedy algorithm for 20, 40 and 61 textures.
Fig. 11. Models selected by the Greedy algorithm while classifying all 61 textures: The top row shows the 9 texture classes, and the corresponding number of models, that were assigned the most models by the Greedy algorithm, while the bottom row shows the 9 classes that were assigned the fewest models. Moving from left to right, the textures and the number of models assigned to each are: Artificial Grass (17), Sandpaper (15), Velvet (13), Plaster B (13), Rug A (13), Terrycloth (12), Aluminium Foil (12), Quarry Tile (12), White Bread (12), Lettuce Leaf (4), Sponge (4), Cracker A (3), Peacock Feather (3), Corn Husk (3), Straw (3), Painted Spheres (3), Roof Shingle (3) and Limestone (2).
Cula and Dana obtain a classification rate of 96% while effectively using 11 models per texture. Using 5 models per texture, they achieve a classification rate of roughly 87% (see figure 6 and table 2 in [2]). In contrast, again for 20 textures, the MR8 filter bank achieves a classification rate of 96% using 5 models per texture and achieves 98.4% using, on average, 10.4 models per texture. The main reason why the Greedy algorithm uses fewer models for an equivalent classification rate is that the models that Cula and Dana learn are general models and not geared specifically to classification. They ignore the inter class variability between textures and concentrate only on the intra class variability. The models for a texture are selected by first projecting all the training and test images into a low dimensional space using PCA. A manifold is fitted to these projected points, and then reduced by systematically discarding those points which least affect the “shape” of the manifold. The points which are left in the end correspond to the model images that define the texture. Since the models for a texture are chosen in isolation from the other textures, their algorithm ignores the inter class variation between textures. For 40 textures, Leung and Malik report an accuracy rate of 95.6% for classifying multiple (20) images and 87% for single images using an algorithm which effectively needs 20 models per texture. The MR8 filter bank achieves 95.6% accuracy on the same textures using only 6.05 models per texture, and furthermore achieves 98.01% accuracy using, on average, 9.07 models per texture.
3.2 Pose Normalization
In [17] it was demonstrated that, provided a texture has sufficient directional variation, it can be pose normalized by maximizing the weak isotropy of the second moment matrix of the gradients (a method originally suggested in [14]). The method is applicable in the absence of solid texture effects.
Fig. 12. The effect of pose normalization on a set of 12 images for two textures: “Rough Plastic” and “Plaster A”. The 12 images have been sorted according to increasing viewing angle and this is represented on the X axis. The Y axis is the χ2 distance between the model image and the given image. The pose normalized images consistently have a reduced χ2 distance which translates into better classification.
Here we investigate whether this normalization can be used to at least reduce the effects of changing viewpoint, and hence provide tighter clusters of the filter responses, or better still reduce the number of models needed to account for viewpoint change. In detail, if the normalization is successful, then for moderate changes in the viewing angle, two such “pose normalized” images of the same texture should differ from each other by only a similarity transformation. If there are no major changes in scale, the responses of a rotationally invariant filter bank (MR or S) to these images should be much the same. A preliminary investigation shows that this is indeed the case for suitable textures. Figure 12 shows results for two textures, “Plaster A” and “Rough Plastic”. Twelve images of each texture are selected to have similar photometric appearance, but monotonically varying viewing angle. The graph shows the χ2 distance between the texton histogram of one of the images (selected as the model image) and the rest, before and after pose normalization. As can be seen, the χ2 distance is reduced for the pose normalized images. This in turn translates into better classification as well. In experiments on 4 textures, using the same 12 image set and 1 model per texture, the classification rate increased from 81.81% before pose normalization to 93.18% afterwards.
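One way to read this normalization is as a whitening of the second moment matrix of the image gradients; the sketch below follows that reading and is only an approximation of the procedure in [17], which may differ in detail. Only NumPy and scipy.ndimage are used, for the gradient computation and the affine warp.

```python
import numpy as np
from scipy.ndimage import affine_transform

def pose_normalize(img):
    """Warp an image so that the second moment matrix of its gradients
    becomes (approximately) isotropic."""
    g_row, g_col = np.gradient(img.astype(float))
    # Second moment (structure) matrix of the gradients, in (row, col) order
    # to match the coordinate convention of affine_transform.
    M = np.array([[np.mean(g_row * g_row), np.mean(g_row * g_col)],
                  [np.mean(g_row * g_col), np.mean(g_col * g_col)]])
    w, V = np.linalg.eigh(M)
    # Inverse square root of M: the gradients of the warped image I(Wx)
    # then have an (approximately) isotropic second moment matrix.
    W = V @ np.diag(1.0 / np.sqrt(np.maximum(w, 1e-12))) @ V.T
    W /= np.sqrt(np.linalg.det(W))        # normalise so the area is preserved
    centre = 0.5 * np.array(img.shape, dtype=float)
    offset = centre - W @ centre          # keep the image centre fixed
    return affine_transform(img, W, offset=offset, order=1)
```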
3.3 Generalization and Scale
In this subsection, we investigate how various factors affect our algorithm’s classification rate. We first calculate a benchmark classification rate and then vary the images in the training set and also the size of the texton dictionary to see how performance is affected.
Fig. 13. Scaling the data results in new models: The histogram of texton labellings of (a) the original image, (b) the image scaled up by a factor of 3, and (c) the image scaled down by a factor of 3. All three models are substantially different, indicating that the texton vocabulary is sufficient, and it is the model that must be extended.
Initially, the texton dictionary is built by learning 10 textons from each of the 61 textures (using the procedure described in subsection 2.3) to give a total of 610 textons. Also, the 46 training images per texture from which the models will be generated are chosen by selecting every alternate image from the set of 92 available. Under these conditions, the MR8 filter bank achieves a classification accuracy rate of 98.3% when classifying all 61 textures using, on average, 8 models per texture. To see if this choice of training images is important for our algorithm, the classification experiment was repeated using 46 training images chosen randomly from the 92 available. The average classification accuracy, computed over 10000 trial runs, was 98.16% and had stabilized to that value within 84 runs. We also evaluated the “expressive power” of our texton dictionary by performing the experiment having compiled the textons from only 31 randomly chosen texture classes, giving a total of 310 textons. The classification rate decreased only slightly, to 98.19%, signifying that our textons are sufficiently versatile. The number of textons in the dictionary can be reduced further by merging textons which lie very close to each other. We pruned the dictionary down from 310 to 100 textons by selecting 80 of the most distinct textons (i.e. those textons that didn’t have any other textons lying close by) and running K-Means, with K = 20, on the rest. This procedure entailed another slight decrease in the classification accuracy, to 97.38%. Thus, while classifying 61 textures, the best classification rate achieved was 98.3%, obtained when all 610 textons were used, and the worst rate was 97.38%, when only 100 textons were used. We can therefore conclude that our algorithm is robust to the choice of training image set and texton vocabulary, with the classification rate not being affected much by changes in these parameters.
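A sketch of the dictionary-pruning step just described, assuming the textons are rows of a NumPy array in filter-response space. The "distinctness" test used here (distance to the nearest other texton) is one plausible reading of textons that have no close neighbours, and scikit-learn's KMeans stands in for the K-Means run on the remainder.

```python
import numpy as np
from sklearn.cluster import KMeans

def prune_textons(textons, n_keep=80, n_merge_clusters=20):
    """Prune a texton dictionary: keep the n_keep most distinct textons
    unchanged and merge the rest into n_merge_clusters K-Means centres.

    textons : (N, d) array of textons in filter-response space.
    """
    # Distance from each texton to its nearest neighbour in the dictionary.
    d = np.linalg.norm(textons[:, None, :] - textons[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn_dist = d.min(axis=1)

    # The most isolated (distinct) textons are kept as they are.
    order = np.argsort(-nn_dist)
    keep, rest = order[:n_keep], order[n_keep:]

    # The remaining (mutually close) textons are merged by K-Means.
    merged = KMeans(n_clusters=n_merge_clusters, n_init=10,
                    random_state=0).fit(textons[rest]).cluster_centers_
    return np.vstack([textons[keep], merged])
```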
Finally, a word about scale. It may be of concern that the MR4 filter bank does not have filters at multiple scales and hence will be unable to handle scale changes successfully. To test this, 25 images from 14 texture classes were artificially scaled, both up and down, by a factor of 3. The classification experiment was repeated using the original, normal sized, filter banks and texton dictionaries. We found that as long as models from the scaled images were included as part of the texture class definition, classification accuracy was virtually unaffected and classification rates of over 97% were achieved. However, if the choice of models was restricted to those drawn from the original sized images, then the classification rate dropped to 17%. It is evident from this that the filter bank and texton vocabulary are sufficient, and it is the model that must be extended.
4 Conclusions and Future Work
We have demonstrated that with a handful of models per texture, classification rates superior to [2,13] can be achieved. In particular, only a single image is required for classification, not twenty with registration, and there is no need to supply the imaging conditions, either implicitly or explicitly. By developing the MR set, we are now in a position where we will be able to compare the “texton clustering and distribution” method of discriminating textures with the Bayesian approach proposed by Konishi and Yuille [12]. Their method stores the joint PDF of filter responses (in suitably sized bins), and is completely infeasible if the dimension of the filter space is large – [12] use a six dimensional space. For example, it would not be possible to use this approach with the 48 dimensional filter space used by [13]. Konishi and Yuille make strong use of colour, but achieve impressive results in classifying outdoor scenes into a small number of texture classes. We also plan to investigate how the co-occurrence of textons (as in [18]) can lead to further improvements. Traditional rotation invariant filter banks are not very successful at measuring anisotropic features. However, the MR filter banks can detect such features accurately. Furthermore, recording the relative orientation at which the features occur will allow additional discriminability while preserving rotational invariance. Finally, one other interesting point to investigate is whether the dimensionality of the MR filter response space can be reduced still further by taking the maximum response not only over orientations but over all scales as well.
Acknowledgements. We are grateful to Oana Cula, Thomas Leung and Cordelia Schmid for supplying details of the algorithms used in their papers. We are very grateful to Frederick Schaffalitzky for numerous discussions and suggestions. Financial support was provided by a University of Oxford Graduate Scholarship in Engineering, an ORS award and the EC project CogViSys.
References
1. M. J. Chantler, G. McGunnigle, and J. Wu. Surface rotation invariant texture classification using photometric stereo and surface magnitude spectra. In Proc. BMVC, pages 486–495, 2000.
2. O. G. Cula and K. J. Dana. Compact representation of bidirectional texture functions. In Proc. CVPR, 2001.
3. O. G. Cula and K. J. Dana. Recognition methods for 3d textured surfaces. Proceedings of the SPIE, San Jose, Jan 2001.
4. K. Dana and S. Nayar. Histogram model for 3d textures. In Proc. CVPR, pages 618–624, 1998.
5. K. Dana and S. Nayar. Correlation model for 3D texture. In Proc. ICCV, pages 1061–1067, 1999.
6. K. J. Dana, B. van Ginneken, S. K. Nayar, and J. J. Koenderink. Reflectance and texture of real world surfaces. ACM Trans. Graphics, 18(1):1–34, 1999.
7. R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973.
8. C. Fowlkes, D. Martin, X. Ren, and J. Malik. Detecting and localizing boundaries in natural images. Technical report, University of California at Berkeley, 2002.
9. G. M. Haley and B. S. Manjunath. Rotation-invariant texture classification using a complete space-frequency model. IEEE PAMI, 8(2):255–269, 1999.
10. L. Kaufman and P. J. Rousseeuw. Finding Groups in Data. John Wiley & Sons, 1990.
11. S. Konishi, A. Yuille, J. Coughlan, and S. Zhu. Fundamental bounds on edge detection: An information theoretic evaluation of different edge cues. In Proc. CVPR, pages 573–579, 1999.
12. S. Konishi and A. L. Yuille. Statistical cues for domain specific image segmentation with performance analysis. In Proc. CVPR, 2000.
13. T. Leung and J. Malik. Representing and recognizing the visual appearance of materials using three-dimensional textons. IJCV, Dec 1999.
14. T. Lindeberg and J. Gårding. Shape-adapted smoothing in estimation of 3-d depth cues from affine distortions of local 2-d brightness structure. In Proc. ECCV, pages 389–400, May 1994.
15. X. Liu, Y. Yu, and H. Shum. Synthesizing bidirectional texture functions for real-world surfaces. In Proc. ACM SIGGRAPH, pages 117–126, 2001.
16. W. Press, B. Flannery, S. Teukolsky, and W. Vetterling. Numerical Recipes in C. Cambridge University Press, 1988.
17. F. Schaffalitzky and A. Zisserman. Viewpoint invariant texture matching and wide baseline stereo. In Proc. ICCV, Jul 2001.
18. C. Schmid. Constructing models for content-based image retrieval. In Proc. CVPR, 2001.
19. A. Zalesny and L. Van Gool. A compact model for viewpoint dependent texture synthesis. In Lecture Notes in Computer Science, volume 2018, pages 124–143. Springer, July 2000.
20. S. C. Zhu, Y. Wu, and D. Mumford. Filters, random-fields and maximum-entropy (FRAME): Towards a unified theory for texture modeling. IJCV, 27(2):107–126, March 1998.
Estimation of Multiple Illuminants from a Single Image of Arbitrary Known Geometry
Yang Wang and Dimitris Samaras
Computer Science Department, State University of New York at Stony Brook, NY 11794, USA
{yangwang, samaras}@cs.sunysb.edu
Abstract. We present a new method for the detection and estimation of multiple illuminants, using one image of any object with known geometry and Lambertian reflectance. Our method obviates the need to modify the imaged scene by inserting calibration objects of any particular geometry, relying instead on partial knowledge of the geometry of the scene. Thus, the recovered multiple illuminants can be used both for image-based rendering and for shape reconstruction. We first develop our method for the case of a sphere with known size, illuminated by a set of directional light sources. In general, each point of such a sphere will be illuminated by a subset of these sources. We propose a novel, robust way to segment the surface into regions, with each region illuminated by a different set of sources. The regions are separated by boundaries consisting of critical points (points where one illuminant is perpendicular to the normal). Our region-based recursive least-squares method is impervious to noise and missing data and significantly outperforms a previous boundary-based method using spheres [21]. This robustness to missing data is crucial to extending the method to surfaces of arbitrary smooth geometry, other than spheres. We map the normals of the arbitrary shape onto a sphere, which we can then segment, even when only a subset of the normals is available on the scene. We demonstrate experimentally the accuracy of our method, both in detecting the number of light sources and in estimating their directions, by testing on images of a variety of synthetic and real objects.
1. Introduction
In this paper we consider the problem of multiple illuminant direction detection from a single image, i.e. using an image of an object or scene to recover all real light sources in that environment. Knowledge of illuminant directions is necessary both in computer vision for shape reconstruction, and in image based computer graphics, in order to realistically manipulate existing images. This problem is particularly hard for diffuse (Lambertian) surfaces and directional light sources and cannot be solved using local information only. Lambertian reflectance is the most common type of reflectance. Previous methods that estimate multiple light sources require images of a calibration object of given shape (typically spheres)
which needs to be removed from the scene and might cause artifacts. Instead, our method relies on partial knowledge of the geometry of the scene and can be used on objects of arbitrary shape. This allows us to possibly use any diffuse object of the scene for illumination calibration. The interest in computing illuminant directions first arose from shape from shading applications, and was initially focused on recovering a single light source [5,12,9,22,17]. However, illumination in most real scenes is more complex and it is very likely to have a number of co-existing light sources in a scene. In the context of shape estimation, the ability to automatically detect the light sources allows for more practical methods. For example, there exist methods that use stereo images to compute a partially correct geometry, and then refine it by Shape from Shading under a single light source [18]. These methods can now be extended to multiple light sources, using the partial geometry obtained by stereo to compute the parameters of the light sources with the methods proposed in this paper. More recently, with the advent of image based rendering methods, it is crucial to have knowledge of the light. An early attempt to recover a more general illumination description [6] modeled multiple light sources as a polynomial distribution. A discussion of the various types of light sources can be found in [8]. With the advent of image based methods in computer graphics, estimation of illumination parameters from images is necessary, in order to compensate for illumination artifacts, and also to allow super-imposition of synthetic images of new objects into real scenes. Most such methods need to use a calibration object of fixed shape, typically a sphere. In [3] a specular sphere is used as a light probe to measure the incident illumination at the location where synthetic objects will be placed in the scene. Such a sphere, though, might have strong inter-reflections with other objects of the scene, especially if they are close to it. Using the Lambertian shading model, Yang and Yuille [20] observed that multiple light sources can be deduced from boundary conditions, i.e., the image intensity along the occluding boundaries and at singular points. Based on this idea, Zhang and Yang [21] show that the illuminant directions have a close relationship to critical points on a Lambertian sphere and that, by identifying most of those critical points, illuminant directions may be recovered if certain conditions are satisfied. Conceptually, a critical point is a point on the surface such that not all of its neighbors are illuminated by the same set of light sources. However, because the detection of critical points is sensitive to noise, the direction of the extracted real lights is not very robust to noisy data. In [13] a calibration object that comprises diffuse and specular parts is proposed. Inserting calibration objects in the scene complicates the acquisition process, as they either need to be physically removed before re-capturing the image, which is not always possible, or they need to be electronically removed as a post processing step, which might introduce artifacts in the image. Our proposed method can be applied to objects of known arbitrary geometry, as long as that shape contains a fairly complete set of normals for a least-squares evaluation of the light sources. Thus, it would be possible to estimate the illuminants from the image of a scene, using geometry that is part of the scene.
Although estimating the object geometry from the image would be more convenient, current single image computer vision methods do not provide the necessary accuracy for precise illuminant estimation. Such accuracy can be achieved only by more complex multi-image methods [10] that require precise calibration. The idea of using an arbitrary known shape can also be found in the approach of Sato et al. [19], which exploits information of a radiance distribution inside shadows cast by an object of known shape in the scene. Recently, under a signal processing approach [15,1], a comprehensive mathematical framework for the evaluation of illumination parameters through convolution has been described. Unfortunately, this framework does not provide a method to estimate high-frequency illumination such as directional light sources when the BRDF is smooth, as in the Lambertian case. Convolution is a local operation and the problem is ill-posed when only local information is considered [2]. Our method uses global information to overcome this problem, and in this sense, it is complementary to the methods of [15], [1], [11]. In this paper, we propose an illuminant direction detection process that minimizes global error (by a recursive least-squares algorithm [4]). In general, each point of a surface is illuminated by a subset of all the directional light sources in the scene. We propose a novel, robust way to segment the surface into regions (virtual “light patches”), with each region illuminated by a different set of sources. The regions are separated by boundaries consisting of critical points (where one illuminant is perpendicular to the normal of the surface [21]). We extract real lights based on the segmented virtual “light patches” instead of critical points, which are relatively sensitive to noise. Since there are more points in a region than on the boundary, the method’s accuracy does not depend on the exact extraction of the boundary and can tolerate noisy and missing data better. Furthermore, we can adjust and merge light patches to minimize the least-squares errors. The number of critical boundaries detected is sensitive to the threshold used in the Hough transform, especially when there are noisy or incomplete data. Fewer boundaries will cause lights to be missed; more boundaries will result in spurious lights being detected. This is not a problem for our method since the spurious lights will be eliminated during the merging stage that follows the Hough transform. The ability of our method to perform well when the data is not perfect is crucial for the extension of the method from spheres (for which we initially develop it) to arbitrary shapes. When the observed shape is not spherical we map its normals to a sphere, although a lot of normals will be missing. However, our method works well even for incomplete spheres, as long as there are enough points inside each light patch for the least-squares method to work correctly. Our method to detect multiple illuminant directions consists of the following five steps:
1. We identify critical points (defined in the same manner as [21]) in the image, using two consecutive windows and a standard recursive least-squares algorithm [4].
2. We group critical points into different critical boundaries by the Hough transform.
3. These critical boundaries segment the image into separate light patches and we calculate a virtual light for each light patch.
4. We adjust and merge critical boundaries by minimizing the average least-squares error of all light patches. If two critical boundaries are too close to each other, i.e. the angle between them is less than a threshold angle, then we merge these two critical boundaries.
5. We extract real lights based on these segmented virtual light patches instead of critical points, which are relatively sensitive to noise. The absolute intensities of real lights can be determined only if the albedo is known.
We performed a series of experiments which demonstrate the superior accuracy of our method. Experiments with synthetic data from a sphere and a vase model allow us to recover accurately up to fifteen light sources from a sphere image, with a 0.5 degree average error. We also show that our method performs well even with added noise (error less than 1 degree for five lights). We successfully detect five lights from the image of a real ball made of rubber, even though the surface is rather rough and at various parts deviates from the Lambertian reflectance assumption. Our results compare favourably with the boundary-based method in [21] for the sphere experiments. We perform experiments on two different non-spherical objects, the synthetic images of a vase and the real images of a toy duck under four illuminants. The algorithm performs accurately, despite the increased levels of noise, both in the reflectance and in the acquisition of the normals. The rest of this paper is structured as follows: Section 2 describes the notion of critical points and their properties as they pertain to our problem. Section 3 describes our basic algorithm and extensions that make it robust to noise and missing data. These properties of our algorithm allow us to introduce a mapping of normals from arbitrary geometry to the shape of a sphere. This is the key to applying our algorithm to objects of arbitrary shape in Section 4. We present our experiments on synthetic and real data in Section 5 and conclude with future work in Section 6.
2. Critical Points
Definition 1. Given an image, let L_i, i = 1, 2, . . ., be the light sources of the image. A point in the image is called a critical point if the surface normal at the corresponding point on the surface of the object is perpendicular to some light source L_i.
2.1. Observation Model
We assume that images are formed by perspective or orthographic projection and that the object in the image has a Lambertian surface with constant albedo, that is, each surface point appears equally bright from all viewing directions:
L = ρ I · n    (1)
where L is the scene radiance of an ideal Lambertian surface, ρ is the albedo, I represents the direction and amount of incident light, and n is the normal to the surface.
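To make the observation model concrete, here is a small sketch of the intensity of a surface point under several directional sources plus an ambient term (the multi-source form with the max(0, ·) clamp reappears later in Eqn. (10)); the function and argument names are illustrative only.

```python
import numpy as np

def lambertian_intensity(normal, albedo, lights, ambient=0.0):
    """Intensity of a surface point with unit normal `normal` and albedo
    `albedo`, lit by directional sources.

    lights  : iterable of (direction, intensity) pairs; a source only
              contributes when it faces the surface (max(0, .) clamp).
    ambient : optional constant ambient term, as in Eqn. (9).
    """
    n = np.asarray(normal, dtype=float)
    n /= np.linalg.norm(n)
    total = ambient
    for direction, intensity in lights:
        d = np.asarray(direction, dtype=float)
        d /= np.linalg.norm(d)
        total += intensity * max(float(np.dot(d, n)), 0.0)
    return albedo * total
```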
2.2. Sphere Model
Initially, we develop our algorithm using a sphere model. We will show later in this paper how to extend this to objects of arbitrary shape.
– We assume the observed object is a sphere with Lambertian reflectance properties whose physical size is already known.
– For light sources whose direction is nearly co-linear with the lens axis of the camera, i.e. whose angle with OS is less than ω (Fig. 1(a)), which depends on the resolution, the best possible result is their equivalent frontal light source L_frontal.
Fig. 1. (a) minimum detectable angle ω. (b) L and its projection L_P onto plane P.
It has been proven in [21] that it is not possible to recover the exact value of the intensity of any individual light source among four (or more) pairs of antipodal light sources. However, this kind of situation, i.e. an object illuminated by antipodal light sources, happens rarely, so for simplicity in the rest of this paper we will make the additional assumption that there are no antipodal light sources.
2.3. Sphere Cross Section with a Plane P
Let P be an arbitrary plane such that S, the center of the sphere, lies on it (Fig. 1(b)), let L_i, i = 1, 2, . . ., be the light sources of the image and (L_i)_P their projections on P. A point on the arc τ can be specified by its corresponding angle parameter in [α, β] using the following proposition [21]:
Proposition 1. Consider an angle interval [α, β] of a sphere cross section (Fig. 1(b)). We can always find a partition θ_0 = α < θ_1 < . . . < θ_n = β of the interval [α, β] such that in each [θ_{i−1}, θ_i] we have E(θ) = b_i sin θ + c_i cos θ for some constants b_i and c_i, 1 ≤ i ≤ n (Fig. 2(a)), where E(θ) is the intensity function along the arc τ.
Intuitively, (b_{i−1}, c_{i−1}) represents the virtual light source of the [θ_{i−2}, θ_{i−1}] part, and (b_i, c_i) of the neighboring [θ_{i−1}, θ_i] part. These two light sources will be different if each of these two parts is lit by a different illuminant configuration. More formally, Proposition 2 (from [21]) describes the difference between (b_{i−1}, c_{i−1}) and (b_i, c_i):
Fig. 2. (a) the xy-space and the θR-space for the case with two sources. (b) Illustration of real light pre-direction. Lr is the real light direction.
Proposition 2. In the configuration of Proposition 1, for any 1 ≤ i ≤ n, we define Λ_i as the index set of the real light sources contributing to the [θ_{i−1}, θ_i] part of the arc τ. Then the Euclidean distance between two (b_i, c_i) pairs is
$\sqrt{(b_i - b_{i-1})^2 + (c_i - c_{i-1})^2} = \Big\| \sum_{j \in \Lambda' \cup \Lambda''} (L_j)_P \Big\|$    (2)
where $\| \cdot \|$ is the Euclidean norm, $\Lambda' = \Lambda_{i-1} \setminus \Lambda_i$ (the index set of elements in Λ_{i−1} but not in Λ_i), $\Lambda'' = \Lambda_i \setminus \Lambda_{i-1}$, and $\sum_{j \in \Lambda_i} L_j$ is the virtual light source corresponding to [θ_{i−1}, θ_i]. Propositions 1 and 2 show that the difference between (b_{i−1}, c_{i−1}) and (b_i, c_i) will be maximized at a critical point for these two virtual light sources. So, possible critical points can be detected by thresholding $\| \sum_{j \in \Lambda' \cup \Lambda''} (L_j)_P \|$.
2.4. Properties of Critical Points
Let Σ be the set of all critical points and Ω be the space of the sphere image. Then intuitively Σ will cut Ω into a decomposition, i.e.
$\Omega = \Big( \bigcup_{i \in I} u_i \Big) \cup \Sigma$    (3)
where each u_i ⊂ Ω is a subset of R² which does not contain any critical points and I is an index set.
Proposition 3. Given a decomposition (3), for any u_i there exists an equivalent light source L such that the image formed by the corresponding surface part of u_i illuminated by L is exactly the same as the image on u_i.
Proposition 2 already provides us with a criterion to detect critical points on the sphere based on the distance between (b_i, c_i) pairs. Unfortunately, this criterion greatly depends on the intensities of the virtual light sources, $\| \sum_{j \in \Lambda' \cup \Lambda''} (L_j)_P \|$, which are projected on the plane with respect to each different cross section. To locate the critical points more accurately, we provide
another way to detect critical points on each cross section. Instead of using the distance between (b_i, c_i) pairs, we can use the tangent angles defined on the intensity curve (Fig. 3(a)).
Proposition 4. Along a sine curve, at a critical point θ_c, the inner angle γ between the two tangent lines on either side (T_1, T_2) will be larger than 180 degrees:
Fig. 3. (a) inner angle γ. (b) Angles between two tangent lines. (c) a part of arc τ , AB, is covered by two consecutive windows AW and WB.
Proof. We define the sine curve contributing to this critical point θ_c as x(θ) = κ sin(ξ(θ − θ_c)), (κ, ξ > 0) (Fig. 3(b)). Then the combined curve function will be
$f(\theta) = \begin{cases} \sin\theta & (\theta \le \theta_c) \\ \sin\theta + \kappa \sin(\xi(\theta - \theta_c)) & (\theta > \theta_c) \end{cases}$    (4)
We therefore get the tangent lines:
$T_1:\ f'_{\theta_c^-} = \cos\theta_c, \qquad T_2:\ f'_{\theta_c^+} = \cos\theta_c + \kappa \cos(\xi(\theta_c - \theta_c)) \cdot \xi = \cos\theta_c + \kappa\xi$    (5)
So the angle between T_1 and T_2 is γ′ = arctan(cos θ_c + κξ) − arctan(cos θ_c). Since κξ > 0,
$\arctan(\cos\theta_c + \kappa\xi) > \arctan(\cos\theta_c) \ \Rightarrow\ \gamma' > 0,$    (6)
that is, γ = 180° + γ′ > 180°.
3. Real Light Detection
3.1. Critical Point Detection
From Proposition 1 we know that, for every cross section of the sphere with a plane P such that S, the center of the sphere, lies on P (illustrated in figures 2(a), 3(c)), there is a partition θ_0 = α < θ_1 < . . . < θ_n = β of the angle interval [α, β] such that in each [θ_{i−1}, θ_i] we have E(θ) = b_i sin θ + c_i cos θ for some constants b_i and c_i, 1 ≤ i ≤ n. By applying a standard recursive least-squares
algorithm [4], we can use two consecutive windows to detect the local maximum points of the inner angles γ and of the distance defined by Eqn. (2). Starting from an initial point A, any point B on the same arc can be determined uniquely by the angle θ between SA and SB. We then cover this part AB by two consecutive windows AW and WB (Fig. 3(c)). With points B and W moving from the beginning point A of the visible part of the circle to its ending point C along the arc τ, we can estimate b_i and c_i from the data in the two consecutive windows AW and WB respectively. Once a local maximum point of Eqn. (2) is detected, it signifies that we have included at least a ‘critical point’ in the second window WB. Because the inner angle γ defined in Proposition 4 is very sensitive to noise, we use two different criteria simultaneously to detect critical points. First we examine the distance defined in Proposition 2; then, if the distance is above a threshold T_distance, we try to locate the critical point by searching for the maximum inner angle γ along the curve. In practice, for the distance criterion threshold T_distance, we use a ratio T_ratio instead of the direct Euclidean norm to normalize for the varying light intensities. T_ratio is calculated from Proposition 2:
$T_{ratio} = \frac{\sqrt{(b_i - b_{i-1})^2 + (c_i - c_{i-1})^2}}{\max\{\sqrt{b_{i-1}^2 + c_{i-1}^2},\; \sqrt{b_i^2 + c_i^2}\}}$    (7)
where (b_{i−1}, c_{i−1}) and (b_i, c_i) are calculated from the two consecutive windows AW and WB. Therefore, we can keep growing the first window AW to find a critical point p_c using the recursive least-squares algorithm again. Then we fix the initial point A at p_c and keep searching for the next ‘critical point’ until we exhaust the whole arc τ.
3.2. Segmenting the Surface
Definition 2. All critical points corresponding to one real light will be grouped into a cut-off curve which is called a critical boundary.
Intuitively, each critical boundary of the sphere in our model lies on a cross-section plane through the center of the sphere. Therefore, critical points can be grouped into critical boundaries using the Hough transform in a (ζ, θ) angle-pair parameter Hough space, i.e. we apply the cross-section plane equation in the following form:
$x \cdot n_x + y \cdot n_y + z \cdot n_z = 0, \qquad n_x = r\cos\theta,\; n_y = r\sin\theta\cos\zeta,\; n_z = r\sin\theta\sin\zeta$    (8)
where (x, y, z) is the position of each critical point, (n_x, n_y, n_z) is the normal of the cross-section plane, r is the radius of the sphere and ζ, θ ∈ [0, 180]. Typically, we use one-third of the highest vote count in the Hough transform as the threshold above which we detect a (ζ, θ) angle pair as a possible critical boundary.
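A sketch of the two-window detection along one arc. Batch least squares stands in for the recursive (RLS) update of [4], only the distance/ratio criterion of Eqn. (7) is shown (the inner-angle refinement of Proposition 4 is omitted), and the window size and threshold are illustrative.

```python
import numpy as np

def fit_sinusoid(thetas, intensities):
    # Least-squares fit of E(theta) = b*sin(theta) + c*cos(theta)
    # over one window of samples along the arc.
    A = np.column_stack([np.sin(thetas), np.cos(thetas)])
    (b, c), *_ = np.linalg.lstsq(A, intensities, rcond=None)
    return b, c

def detect_critical_points(thetas, intensities, window, t_ratio=0.2):
    """Scan an arc with two consecutive windows and flag window boundaries
    where the (b, c) fit changes by more than the ratio threshold of
    Eqn. (7)."""
    candidates = []
    for k in range(window, len(thetas) - window):
        b1, c1 = fit_sinusoid(thetas[k - window:k], intensities[k - window:k])
        b2, c2 = fit_sinusoid(thetas[k:k + window], intensities[k:k + window])
        dist = np.hypot(b2 - b1, c2 - c1)
        norm = max(np.hypot(b1, c1), np.hypot(b2, c2))
        if norm > 0 and dist / norm > t_ratio:
            candidates.append(k)      # a critical point lies near theta_k
    return candidates
```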
Although critical points provide information to determine the light source directions [21], they are relatively sensitive to noisy data. Since most real images are not noise free, if we only use the Hough transform to extract critical boundaries, we will very likely find more boundaries than the real critical boundaries, as can be seen from the data in Fig. 4. Noise can either introduce many spurious critical points or move the detected critical points away from their true positions. However, apart from the critical points, the non-critical points also provide important information for determining the light source directions, and this information is much more robust than using critical points only. In this section, we will explore this information in detail.
(a) Light sources
Light    X      Y      Z
1       0.03   0.87   0.49
2       0.71   0.66   0.23
3       0.72   0.12   0.68
4       0.30   0.72   0.63
5       0.40   0.61   0.68

(b) Potential boundaries
Boundary  Vote #    X      Y      Z
1          162     0.72   0.65   0.24
2          107     0.37   0.62   0.69
3           86     0.72   0.12   0.68
4           71     0.37   0.71   0.60
5           60     0.31   0.53   0.79
6           56     0.28   0.45   0.85
7           46     0.07   0.88   0.47
Fig. 4. For an image of a synthetic Lambertian sphere with the 5 lights listed in table (a), we extracted the potential boundaries in (b) by the Hough transform. The critical boundary closest to light 1 is only the seventh boundary to be detected.
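A sketch of the Hough grouping of Eqn. (8): each critical point votes for the (ζ, θ) plane normals it (approximately) lies on, and boundaries are kept above one third of the maximum vote. The one-degree bin size and the inlier tolerance are illustrative choices, and the sphere radius is dropped since it only scales the plane normal.

```python
import numpy as np

def hough_boundaries(points, n_bins=180, vote_fraction=1.0 / 3.0):
    """Group critical points (3-D positions on the sphere) into candidate
    critical-boundary plane normals by voting in (zeta, theta) space."""
    zetas = np.deg2rad(np.arange(n_bins))
    thetas = np.deg2rad(np.arange(n_bins))
    # Candidate cross-section plane normals, indexed by [zeta, theta].
    normals = np.stack([
        np.cos(thetas)[None, :].repeat(n_bins, 0),
        np.sin(thetas)[None, :] * np.cos(zetas)[:, None],
        np.sin(thetas)[None, :] * np.sin(zetas)[:, None],
    ], axis=-1)                                   # shape (n_bins, n_bins, 3)
    votes = np.zeros((n_bins, n_bins), dtype=int)
    for p in points:
        # A point votes for every plane it (approximately) lies on.
        votes += np.abs(normals @ p) < 0.02 * np.linalg.norm(p)
    threshold = vote_fraction * votes.max()
    zi, ti = np.where(votes >= threshold)
    return [(normals[z, t], int(votes[z, t])) for z, t in zip(zi, ti)]
```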
Definition 3. Critical boundaries will segment the whole sphere image into several regions and, intuitively, each segmented region corresponds to one virtual light. Each region is called a virtual light patch.
Once we get the patches corresponding to each virtual light, the directions of the virtual light sources can be calculated. Let A, B, C and D be four points in a patch corresponding to one virtual light source and n_A, n_B, n_C and n_D be their normals respectively. From the Lambertian Eqn. (1), augmented by an ambient light term, we have
$\begin{pmatrix} n_{Ax} & n_{Ay} & n_{Az} & 1 \\ n_{Bx} & n_{By} & n_{Bz} & 1 \\ n_{Cx} & n_{Cy} & n_{Cz} & 1 \\ n_{Dx} & n_{Dy} & n_{Dz} & 1 \end{pmatrix} \cdot \begin{pmatrix} L_x \\ L_y \\ L_z \\ \alpha \end{pmatrix} = \begin{pmatrix} I_A \\ I_B \\ I_C \\ I_D \end{pmatrix}$    (9)
where I_A, I_B, I_C and I_D are the brightnesses of the pixels in the source image corresponding to the four points A, B, C and D respectively. If n_A, n_B, n_C and n_D are non-coplanar, we can obtain the direction of the corresponding virtual light source L, [L_x, L_y, L_z]^T, and the ambient light intensity α by solving the system of equations in (9). Ideally, we would solve for the
directions of virtual light sources by using four non-coplanar points from the corresponding patches. Due to computation and rounding errors, four non-coplanar points are not always enough to get a numerically robust estimate of the direction of a virtual light source. Furthermore, we cannot always find several non-coplanar points in an interval of an arc in some plane P as described above. To solve these problems we apply our method in two dimensions instead of one dimension. Thus, we scan the image in both the x and y directions, and recover the two-dimensional patches that are separated by critical boundaries. Then, from each two-dimensional patch, we use the internal non-critical points of each virtual light patch to solve for the direction of the virtual light source¹.
Proposition 5. If a critical boundary separates a region into two virtual light patches with one virtual light each, e.g. L_1, L_2, then the difference vector between L_1 and L_2, L_pre = L_1 − L_2, is the real light pre-direction with respect to this critical boundary.
Since we have already assumed that there are no antipodal light sources, the real light direction will be either the pre-direction L_1 − L_2, or its opposite L_2 − L_1 (Fig. 2(b)). To find the true directions, we pick a number of points on the surface, e.g. P_1, P_2, ..., P_k, and their normals, e.g. N_1, N_2, ..., N_k; then the true directions will be the solution of:
$E(P_j) = \sum_{i \in \Lambda} \max(e_i L_i \cdot N_j, 0) + L_v \cdot N_j, \quad 1 \le j \le k.$    (10)
where L_v is the virtual light source of a possible frontal illuminant whose critical boundaries could not be detected and which is checked as a special case. Selecting points in the area inside the critical boundaries is a robust way to detect real lights. This can be done using standard least-squares methods [4], [14].
3.3. Finding the Real Lights
After we find all the potential critical boundaries, Proposition 5 provides a way to extract real lights by calculating the light difference vector of the two virtual light patches on either side of a critical boundary. However, one real light might be calculated many times by different virtual light patch pairs, and since our data will not be perfect, these estimates will not necessarily be exactly the same vector. We introduce an angle threshold to cluster the resulting light difference vectors into real light groups, each of which can be approximated by one vector. By minimizing the least-squares errors of the virtual light patches, we are able to merge the spurious critical boundaries detected by the Hough transform, by the following steps:
¹ We only use points that are at least 2 pixels away from the critical boundary for increased robustness to noise.
1. Find initial critical boundaries by the Hough transform, based on all detected critical points.
2. Adjust critical boundaries. We adjust every critical boundary by moving it by a small step; a reduction in the least-squares error indicates a better solution. We keep updating boundaries using a “greedy” algorithm in order to minimize the total error.
3. Merge spurious critical boundaries. If two critical boundaries are closer than a threshold angle T_mergeangle (e.g. 5 degrees), they can be replaced by their average, resulting in one critical boundary instead of two.
4. Remove spurious critical boundaries. We test every critical boundary by removing it temporarily; if the least-squares error does not increase, we consider it a spurious boundary and remove it completely. We test boundaries in increasing order of Hough transform votes (intuitively, we first test the boundaries that are less trustworthy).
5. Calculate the real lights along a boundary by subtracting neighboring virtual lights as described in Proposition 5.
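A sketch of the per-patch virtual-light solve (Eqn. (9)) and of step 5, where a real-light pre-direction is the difference of the virtual lights on either side of a boundary (Proposition 5). The data structures and the boundary/patch bookkeeping are placeholders, and the sign of each pre-direction still has to be resolved via Eqn. (10).

```python
import numpy as np

def virtual_light(normals, intensities):
    """Least-squares solve of Eqn. (9) for one virtual light patch.

    normals     : (N, 3) unit surface normals of interior patch points.
    intensities : (N,)   corresponding pixel brightnesses.
    Returns (L, alpha): the virtual light vector and the ambient term.
    """
    A = np.hstack([normals, np.ones((len(normals), 1))])
    x, *_ = np.linalg.lstsq(A, intensities, rcond=None)
    return x[:3], x[3]

def real_light_predirections(patches, boundary_pairs):
    """patches        : dict patch_id -> (normals, intensities).
    boundary_pairs : list of (patch_id_a, patch_id_b) sharing a boundary.
    Returns the pre-direction L_a - L_b for every boundary."""
    lights = {pid: virtual_light(*patches[pid])[0] for pid in patches}
    return [lights[a] - lights[b] for a, b in boundary_pairs]
```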
4. Arbitrary Shape
So far, our method requires photographs of a sphere. In this section we extend our method to work with any object of known shape. Obviously, there should exist enough non-coplanar points on the object illuminated by each light to allow for a robust least-squares solution. We assume no inter-reflections and no shadows. These issues will be addressed in future work. In order to extend our method, we need to address the following issues: unlike the sphere model, we can not uniquely determine a point B on the same arc by the angle θ, which is defined as the angle between two projected normals on plane P, N_AP and N_BP, in which A is an initial point. Furthermore, it is possible that in an arbitrary shape model several separated two dimensional patches share the same virtual light, as illustrated in Fig. 5(a). To avoid such problems, we map the normals from the surface of the arbitrary shape back to a sphere. We detect all potential critical points based on the points mapped on the sphere. As expected, not every point on the surface of the sphere will correspond to a normal on the surface of the arbitrary shape, so there will be many holes on the mapped sphere, e.g. the black area in Fig. 5(b). Thus, many critical points’ locations will be erroneously calculated even for noise-free data. Consequently, the critical boundaries calculated by the Hough transform based on these critical points might not be correct, or might even be far away from their correct positions in some cases. Since we can not recover these missing data from the original image, it is impossible to adjust the critical boundaries detected by the Hough transform itself. On the other hand, as long as the critical boundaries are not too far from the truth, the majority of the points in a virtual patch will still correspond to the correct virtual light (especially after the adjustment steps described in Sec. 3.3). Thus it is still possible, using sparse points on the sphere,
Fig. 5. (a) two dimensional patches containing points A and C respectively share the same virtual light, i.e. L_pA = L_pC. (b) Vase and its sphere mapping. Both image sizes are 400 by 400 and the black color represents points which are not mapped by the normals on the shape’s surface.
to calculate the true light for each virtual light patch based on Proposition 5. If two points have the same normal but different intensities, we use the brighter one (assuming that the other is in shadow).
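A sketch of the normal mapping just described: each surface point of the known shape is written to the pixel of a synthetic sphere image whose normal matches the point's normal, unfilled pixels are left as holes, and the brighter of two colliding intensities wins (the darker one being assumed in shadow). The orthographic projection and image size are illustrative.

```python
import numpy as np

def map_normals_to_sphere(normals, intensities, size=400):
    """Build a sphere image from the normals of an arbitrary known shape."""
    sphere = np.full((size, size), np.nan)   # NaN marks the holes
    r = 0.5 * (size - 1)
    for n, brightness in zip(normals, intensities):
        n = np.asarray(n, dtype=float)
        n /= np.linalg.norm(n)
        if n[2] <= 0:            # normal faces away from the camera
            continue
        # Orthographic projection of the unit normal onto the image plane.
        col = int(round(r + n[0] * r))
        row = int(round(r - n[1] * r))
        if np.isnan(sphere[row, col]) or brightness > sphere[row, col]:
            sphere[row, col] = brightness
    return sphere
```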
5. Experiments
5.1. Synthetic Sphere
We performed a number of experiments with synthetic sphere images to better understand our method. In order to successfully detect a virtual light patch there should exist enough points to obtain a least-squares solution for the system of equations in (9); we use at least 10 points in practice. Obviously, the larger the input image, the more lights can be detected from one image. For comparison purposes, we did experiments on images of five random light sources with up to 0.02 percent maximum noise level, and the resulting angle estimation errors are even less than the best reported result in [21] for four light sources (Fig. 6(a)). The following parameter values were chosen for the algorithm: sliding window width w = 14 pixels (approximately 8 degrees), distance ratio T_ratio = 0.2 and angle threshold for boundary merging (described in Sec. 3.3) T_mergeangle = 5.0 degrees. We have also successfully detected 15 lights on 320 by 320 images with a 0.45 degree average error and 1.64 degree maximum error. Results are shown in Fig. 7. In all the segmented light patch images in this report, detected critical points are denoted with green dots. Typical running times for moderate noise levels and seven light sources are in the order of 5 minutes on a Pentium 4 1.8 GHz processor. The more merging and adjustment steps the algorithm goes through, the slower it becomes. Methods to speed up the computation of the least-squares solution would improve the performance of the algorithm for noisy environments.
5.2. Real Sphere
We used a blue rubber ball illuminated with five light sources. The original photo, taken by an Olympus C3030 digital camera, was cropped to 456 by 456 pixels
Fig. 6. Error estimation for different noise levels with five random light sources: (a) Synthetic sphere (b) Synthetic vase
(a) original image, (b) generated image, (c) critical boundaries
Fig. 7. Synthetic sphere image: a synthetic Lambertian ball illuminated by fifteen light sources. Image size: 320x320. Green dots on (c) represent the detected critical points.
with the ball at the center of the image. Sampling the blue channel as brightness, the color image is converted to gray-scale with 256 levels. The following parameter values were used: sliding window width w = 30 pixels (approximately 12 degrees), distance ratio T_ratio = 0.5 and angle threshold for boundary merging (described in Sec. 3.3) T_mergeangle = 5.0 degrees. Although the surface did not conform very closely to the Lambertian model, the correct number of light sources was detected and the estimated lights appear to be reasonably close to the truth (no ground truth data was available).
5.3. Synthetic Arbitrary Shape
We used a vase mesh with 14995 vertices and 5004 triangles². The rendered image size is 320 by 320 and the mapped sphere is of the same size. For the vase images we did the same experiments as for the synthetic Lambertian sphere image. The following parameter values were chosen for the algorithm: sliding window width w = 14 pixels (approximately 8 degrees), distance ratio T_ratio = 0.2 and angle threshold for boundary merging (described in Sec. 3.3) T_mergeangle = 5.0 degrees. We have also successfully detected 15 lights on 320 by 320 images with a 0.56
² We used a VRML file downloaded from http://www.vit.iit.nrc.ca/3D/Pages HTML/3D Models.html
Fig. 8. Real sphere image: an almost Lambertian rubber ball with five light sources. Image size: 456x456. (a) the original image, (b) the generated image of a Lambertian ball with the five light sources extracted from (a), (c) the error image: darker color means higher error, (d) the initial eight boundaries and virtual light patches extracted by the Hough transform, and (e) the resulting critical boundaries and virtual light patches calculated by our algorithm; three out of the initial eight boundaries were automatically merged and the locations of the other five boundaries were automatically adjusted.
degree average error and 2.09 degree maximum error. Experiments on images of five random light sources with up to 0.02 percent maximum noise level are reported in Fig. 6(b).
5.4. Real Arbitrary Shape
We used a rubber duck illuminated by four light sources. The original image and 3D geometry were captured by the range scanner system described in [7]. In Fig. 10(a) we can see that there are some inaccuracies and noise in the recovered 3D shape. The cropped image is 531 by 594 pixels with the duck at the center of the image. Based on the size of the duck, the mapping sphere is a 400 by 400 Lambertian ball. The following parameter values were chosen for the algorithm: sliding window width w = 30 pixels (approximately 13.5 degrees), distance ratio T_ratio = 0.5 and angle threshold for boundary merging (described in Sec. 3.3) T_mergeangle = 5.0 degrees. In Fig. 10(e) we can see that in the rightmost region of the mapping sphere one light critical boundary is lost due to data sparseness (as the light is nearly frontal) and cannot be recovered by other critical points or the Hough transform itself. But if we examine the two outmost regions, it is still possible to recover
Fig. 9. Synthetic vase image: a synthetic Lambertian vase with fifteen light sources. Image size: 320x320. (a) the original image, (b) the generated image of a Lambertian vase with the fifteen light sources extracted from (a), (c) the resulting critical boundaries and virtual light patches calculated by our algorithm.
Fig. 10. Real arbitrary shape image experiment: a rubber duck illuminated by four light sources. Image size: 531x594 (duck), 400x400 (mapping sphere). (a) the 3D shape of the duck frontal surface, where the R,G,B color values represent the x,y,z components of the normal; notice that the recovered 3D shape has a high level of noise, (b) the original image, (c) the generated image of a Lambertian duck with the four light sources extracted from (b), where for completeness purposes the eyes and beak were copied from the original image manually as they have different albedo and non-Lambertian reflectance characteristics; the noise in the generated image is due to the inaccuracies in shape estimation, but illuminant estimation is nonetheless still possible, (d) the error image: darker color means higher error, (e) the initial mapping sphere, where blue color represents points which are not mapped by the normals on the shape’s surface, (f) the initial six boundaries extracted by the Hough transform, and (g) the resulting critical boundaries calculated by our algorithm; three of the initial six boundaries were automatically removed and the locations of the other three boundaries automatically adjusted. The two outmost regions A and B are a special case where missing lights are calculated based on virtual light patches.
the missing light(s). For example, in this experiment, after calculating all the real lights L_1 = 77.59 × (0.81, 0.49, 0.32), L_2 = 66.81 × (−0.72, 0.27, 0.64) and L_3 = 78.06 × (−0.95, 0.10, 0.30) along the three critical boundaries, we calculate the virtual lights L_vA and L_vB and then compute, for the two outmost virtual light patches A and B, L_4 = L_vB − L_1 = 61.90 × (−0.11, −0.33, 0.94) and L_5 = L_vA − L_2 − L_3 − L_4 = 0.00 × (0.00, 0.00, 0.00). L_4 is clearly a missing light whereas L_5 is not. Notice that the recovered 3D shape in Fig. 10(a) has a high level of noise. The noise in the generated image in Fig. 10(c) is due to the inaccuracies in shape estimation. Nonetheless, illuminant estimation is still possible. High resolution versions of the experiment images in this paper can be found at http://www.cs.sunysb.edu/ samaras/res/multilight.html
6. Conclusions and Future Work
In this paper we presented a method for the estimation of multiple illuminant directions from a single image. We do not require the imaged scene to be of any particular geometry (e.g. a sphere). This allows our method to be used with the existing scene geometry, without the need for special light probes, when the illumination of the scene consists of directional light sources. In our experiments with real images, the proposed algorithm performs well when the known geometry is imaged at a resolution of at least 300 by 300 pixels. Experiments on synthetic and real data show that the method is robust to noise, even when the surface is not completely Lambertian. Future work includes the study of the properties of arbitrary surfaces, so that we can avoid the intermediate sphere mapping, speeding up the least-squares method, and extending the method to non-Lambertian diffuse reflectance for rough surfaces. Furthermore, we intend to explore combinations of our method with shadow based light estimation methods and with specularity detection methods.
Acknowledgments. Support for this research was provided by DOE grant MO-068. The authors would like to thank Professors M. Subbarao and P. Huang and their graduate students for help with data acquisition and Professor G. Moustakides for useful discussions on adaptive estimation.
References
1. Basri, R., Jacobs, D.: Lambertian Reflectance and Linear Subspaces. ICCV (2001) 383–390
2. Chojnacki, W., Brooks, M.J., Gibbins, D.: Can the Sun’s Direction be Estimated prior to the Determination of Shapes? Australian Jnt. Conf. on A.I. (1994) 530–535
3. Debevec, P.: Rendering Synthetic Objects Into Real Scenes. SIGGRAPH (1998) 189–198
4. Haykin, S.: Adaptive Filter Theory. Prentice Hall, Englewood Cliffs (1986)
5. Horn, B.K.P., Brooks, M.J.: Shape and Source from Shading. IJCAI (1985) 932–936
6. Hougen, D.R., Ahuja, N.: Estimation of the Light Source Distribution and Its Use in Integrated Shape Recovery from Stereo and Shading. ICCV (1993) 29–34
7. Hu, Q.Y.: 3-D Shape Measurement Based on Digital Fringe Projection and Phase-Shifting Techniques. M.E. Dept. SUNY at Stony Brook (2001)
8. Langer, M.J., Zucker, S.W.: What Is A Light Source? CVPR (1997) 172–178
9. Lee, C.H., Rosenfeld, A.: Improved Methods of Estimating Shape from Shading Using the Light Source Coordinate System. AI 26(2) (1985) 125–143
10. Lin, H.Y., Subbarao, M.: A Vision System for Fast 3D Model Reconstruction. CVPR (2001) 663–668
11. Marschner, S.R., Greenberg, D.P.: Inverse Lighting for Photography. Fifth Color Imaging Conference (1997) 262–265
12. Pentland, A.P.: Finding the Illuminant Direction. JOSA 72 (1982) 448–455
13. Powell, M.W., Sarkar, S., Goldgof, D.: A Simple Strategy for Calibrating the Geometry of Light Sources. PAMI 23 (2001) 1022–1027
14. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes in C. Cambridge University Press (1992)
15. Ramamoorthi, R., Hanrahan, P.: A Signal-Processing Framework for Inverse Rendering. SIGGRAPH (2001) 117–128
16. Ramamoorthi, R., Hanrahan, P.: An Efficient Representation for Irradiance Environment Maps. SIGGRAPH (2001) 497–500
17. Samaras, D., Metaxas, D.: Coupled Lighting Direction and Shape Estimation from Single Images. ICCV (1999) 868–874
18. Samaras, D., Metaxas, D., Fua, P., Leclerc, Y.: Variable Albedo Surface Reconstruction from Stereo and Shape from Shading. CVPR (2000) I:480–487
19. Sato, I., Sato, Y., Ikeuchi, K.: Illumination Distribution from Brightness in Shadows. ICCV (1999) 875–883
20. Yang, Y., Yuille, A.: Sources From Shading. CVPR (1991) 534–439
21. Zhang, Y., Yang, Y.H.: Illuminant Direction Determination for Multiple Light Sources. CVPR (2000) 269–276
22. Zheng, Q., Chellappa, R.: Estimation of Illuminant Direction, Albedo, and Shape from Shading. PAMI 13(7) (1991) 680–702
The Effect of Illuminant Rotation on Texture Filters: Lissajous's Ellipses

M. Chantler¹, M. Schmidt², M. Petrou³, and G. McGunnigle¹

¹ Texture Lab., Heriot-Watt University, Edinburgh, Scotland
[email protected], http://www.cee.hw.ac.uk/texturelab/
² Darmstadt University of Technology, Germany
³ Department of Electronic and Electrical Engineering, University of Surrey, Guildford, Surrey, GU2 7XH, United Kingdom
[email protected], http://www.ee.surrey.ac.uk/Personal/M.Petrou
Abstract. Changes in the angle of illumination incident upon a 3D surface texture can significantly change its appearance. These changes can affect the output of texture features to such an extent that they cause complete misclassification. We present new theory and experimental results that show that changes in illumination tilt angle cause texture clusters to describe Lissajous’s ellipses in feature space. We focus on texture features that may be modelled as a linear filter followed by an energy estimation process e.g. Laws filters, Gabor filters, ring and wedge filters. This general texture filter model is combined with a linear approximation of Lambert’s cosine law to predict that the outputs of these filters are sinusoidal functions of illuminant tilt. Experimentation with 30 real textures verifies this proposal. Furthermore we use these results to show that the clusters of distinct textures describe different elliptical paths in feature space as illuminant tilt varies. These results have significant implications for illuminant tilt invariant texture classification. Keywords: Texture, illumination, texture features
1 Introduction
It has been shown that changes in the angle of illumination incident upon a 3D surface texture can change its appearance significantly, as illustrated in Fig. 1. Such changes in image texture can cause complete misclassification of surface textures [1]. Essentially the problem is that side-lighting, as used for instance in Brodatz's texture album [2], enhances the appearance of surface texture but produces an image which is a directionally filtered version of the surface height function. Furthermore, as the theory developed by Kube and Pentland [3] predicts, the axis of this filter is a function of the illumination's tilt angle¹. This is unfortunate, as many texture classifiers employ directional filters in their feature measures [4]. It is not surprising therefore that rotation of the directional filter formed by the illumination function can cause conventional image-based texture classifiers to fail dramatically. Hence the objective of this paper is to present theory and experimental results that allow the behaviour of texture features in classification space to be predicted as a function of the illumination's angle of tilt. This provides both an insight into the behaviour of existing classifiers and a basis for the development of tilt invariant classifiers.

¹ In the axis system we use, the camera axis is parallel to the z-axis, illuminant tilt is the angle the illuminant vector makes with the x-axis when it is projected into the x, y plane, and illuminant slant is the angle that the illuminant vector makes with the camera axis.

Fig. 1. Two images of the same surface texture sample captured using different illuminant tilt angles

Very little work has been published on this subject, with the exception of [5], which presents an informal proof that texture feature means are sinusoidal functions of the illumination's tilt angle. Dana, Nayar, van Ginneken and Koenderink established the Columbia-Utrecht database of real-world surface textures, which they used to investigate bidirectional texture functions [6]. Later they developed histogram [7,8] and correlation models [9] of these textures. Leung and Malik developed a texture classification scheme that identifies 3D 'textons' in the Columbia-Utrecht database for the purposes of illumination and viewpoint invariant classification [10,11]. Neither of these papers is concerned with developing a theory for the tilt response of texture features.

In this paper we:
1. present formal theory for the behaviour of texture feature means that also provides expressions for the coefficients of these sinusoidal functions;
2. provide experimental data from thirty real textures² that support this theory;
3. use this theory to show that the behaviours of texture clusters follow superelliptical trajectories in multi-dimensional feature spaces; and finally
4. show experimental results in 2D feature space that support the conjecture that changes in illumination tilt cause texture clusters to follow behaviours that may be described by Lissajous's ellipses.

² We have not used the Columbia-Utrecht database as the illumination was held constant while the viewpoint and orientation of the samples were varied during data capture.

We focus on the group of texture features classified as 'filtering approaches' by Randen [4], such as Gabor, Laws, and ring/wedge filters. These features may be described as a linear filter followed by an energy estimation function. This general feature model is combined with a linearised version of Lambert's cosine law due to Kube and Pentland [3]. The result is a theory that may be used to predict the behaviour of texture features as a function of illuminant tilt. This theory is presented in the next section, after which we describe results derived from experiments with 30 textures.
2 Theory

This section derives an expression for the mean value of a texture filter as a function of the illumination's tilt angle. The features are modelled simply as a linear filter followed by an energy estimation process, as described in the introduction. First, however, we give a short derivation of the linear illumination model that is necessary for the development of the theory.

2.1 A Linearised Model of Lambert's Cosine Law
We assume that the surface has Lambertian reflection, low slope angles, and that there is no significant shadowing or inter-reflection. Ignoring the albedo factor, Lambert's cosine rule may be expressed as:

$$ i(x, y) = \frac{-\cos(\tau)\sin(\sigma)\,p(x, y) - \sin(\tau)\sin(\sigma)\,q(x, y) + \cos(\sigma)}{\sqrt{p^2(x, y) + q^2(x, y) + 1}} \qquad (1) $$

where i(x, y) is the radiant intensity; p(x, y) is the partial derivative of the surface height function in the x direction; q(x, y) is the partial derivative of the surface height function in the y direction; τ is the tilt angle of the illumination; and σ is the slant angle of the illumination.
For p, q ≪ 1 we can use a truncated Taylor series to linearise this equation to:

$$ i(x, y) = -\cos(\tau)\sin(\sigma)\,p(x, y) - \sin(\tau)\sin(\sigma)\,q(x, y) + \cos(\sigma) $$

Transforming the above into the frequency domain and discarding the constant term, we obtain:

$$ I(\omega, \theta) = \big[-\cos(\tau)\sin(\sigma)\, i\omega\cos(\theta) - \sin(\tau)\sin(\sigma)\, i\omega\sin(\theta)\big] H(\omega, \theta) \;\Longleftrightarrow\; I(\omega, \theta) = -i\,\omega\sin(\sigma)\cos(\theta - \tau)\, H(\omega, \theta) \qquad (2) $$

where I(ω, θ) is the Fourier transform of the image intensity function; H(ω, θ) is the Fourier transform of the surface height function; and (ω, θ) are polar frequency coordinates.
In this paper it is more convenient to express equation 2 in its power spectrum form:

$$ I(\omega, \theta) = \omega^2 \cos^2(\theta - \tau)\sin^2(\sigma)\, H(\omega, \theta) \qquad (3) $$

where I(ω, θ) is the image power spectrum; and H(ω, θ) is the surface power spectrum.

Equations 2 and 3 are similar to the model first expressed by Kube & Pentland in [3]. In the context of this paper, the most important feature of this model is the cos(θ − τ) factor, which predicts that the imaging function acts as a directional filter of the surface height function.
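As a concrete illustration of equation 2, the following is a minimal NumPy sketch (our own, not part of the original paper) that applies the linearised frequency-domain imaging model to a surface height map; the function name, grid conventions and the returned constant term are our assumptions.

```python
import numpy as np

def render_linear_lambertian(h, tilt, slant):
    """Render a height map h under the linearised Lambertian model of equation 2:
    I(w, theta) = -i * w * sin(slant) * cos(theta - tilt) * H(w, theta)."""
    n = h.shape[0]
    H = np.fft.fft2(h)
    fy, fx = np.meshgrid(np.fft.fftfreq(n), np.fft.fftfreq(n), indexing="ij")
    omega = 2 * np.pi * np.hypot(fx, fy)      # radial frequency
    theta = np.arctan2(fy, fx)                # polar angle
    I = -1j * omega * np.sin(slant) * np.cos(theta - tilt) * H
    # take the real part and add back the discarded constant term cos(slant)
    return np.real(np.fft.ifft2(I)) + np.cos(slant)
```

Rendering the same height map at several tilt values and passing the images through a texture filter is one way to exercise the sinusoidal prediction derived below.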
2.2 The Output of Linear Texture Filters and Their Features
We define a Linear Texture Feature as a linear filter followed by a variance estimator[4]. The process formed by applying such a feature to an image is as shown in Fig. 2. Since the model of the illumination process (equation 3) is also linear we may exchange it with the linear filter (Fig. 3). We use A(ω, θ) to represent the notional power spectrum of the output of the linear texture filter applied directly to the surface height function.
Fig. 2. Feature generation: the surface H(ω, θ) is passed through the illumination process to give the illuminated image I(ω, θ), then through the linear filter F(ω, θ) to give the filtered image O(ω, θ); variance estimation then gives the feature output f(τ).

Fig. 3. The feature generation model with illumination and linear filter processes interchanged: the surface H(ω, θ) is passed through the linear filter F(ω, θ) to give the filtered surface A(ω, θ), then through the illumination process to give the filtered image O(ω, θ); variance estimation then gives the feature output f(τ).
Thus to determine the mean output of a Linear Texture Filter we simply have to apply Kube's model in the form of equation 3 to A(ω, θ) and develop an expression for the variance of the subsequent output. The mean output of a linear texture feature is the variance of the output o(x, y) of its linear filter:

$$ f(\tau) = \mathrm{VAR}\big(o(x, y)\big) \qquad (4) $$

If we assume that o(x, y) has a zero mean and that O(ω, θ) is its power spectrum expressed in polar co-ordinates, then we may express equation 4 as:

$$ f(\tau) = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} O(\omega, \theta)\, du\, dv \;\Longleftrightarrow\; f(\tau) = \int_{0}^{\infty}\!\int_{0}^{2\pi} \omega\, O(\omega, \theta)\, d\theta\, d\omega \qquad (5) $$
Using equation 3 we can express O(ω, θ) as follows:

$$ O(\omega, \theta) = |F(\omega, \theta)|^2\, \omega^2 \cos^2(\theta - \tau)\sin^2(\sigma)\, H(\omega, \theta) \;\Longleftrightarrow\; O(\omega, \theta) = \omega^2 \cos^2(\theta - \tau)\sin^2(\sigma)\, A(\omega, \theta) \qquad (6) $$

where F(ω, θ) is the transfer function of the linear filter; and A(ω, θ) = H(ω, θ)|F(ω, θ)|².
Substituting equation 6 into equation 5 we obtain:

$$ f(\tau) = \int_{0}^{\infty} \omega^3 \sin^2(\sigma) \int_{0}^{2\pi} \cos^2(\theta - \tau)\, A(\omega, \theta)\, d\theta\, d\omega \qquad (7) $$
In a previous paper [5] we used an informal argument to show that equation 7 is a sinusoidal function of τ. However, in this paper we make a simple trigonometrical substitution. Using cos²(x) = ½(1 + cos(2x)) and cos(x − y) = cos(x)cos(y) + sin(x)sin(y) gives:

$$ f(\tau) = \int_{0}^{\infty} \omega^3 \sin^2(\sigma) \int_{0}^{2\pi} \tfrac{1}{2}\big[1 + \cos(2\theta)\cos(2\tau) + \sin(2\theta)\sin(2\tau)\big]\, A(\omega, \theta)\, d\theta\, d\omega = a + b\cos(2\tau) + c\sin(2\tau) = a + d\cos(2\tau + \phi) \qquad (8) $$
where:

$$ a = \tfrac{1}{2}\sin^2(\sigma)\int_{0}^{\infty}\omega^3\int_{0}^{2\pi} A(\omega, \theta)\, d\theta\, d\omega $$

$$ b = \tfrac{1}{2}\sin^2(\sigma)\int_{0}^{\infty}\omega^3\int_{0}^{2\pi} \cos(2\theta)\, A(\omega, \theta)\, d\theta\, d\omega $$

$$ c = \tfrac{1}{2}\sin^2(\sigma)\int_{0}^{\infty}\omega^3\int_{0}^{2\pi} \sin(2\theta)\, A(\omega, \theta)\, d\theta\, d\omega $$

$$ d = \sqrt{b^2 + c^2}, \qquad \phi = \arctan(c/b) $$
The above parameters (a, b, etc.) are all functions of illuminant slant (σ) and of A(ω, θ), which is itself a function of the surface height function and the linear texture filter. None of them is a function of illuminant tilt (τ). Thus equation 8 predicts that the output of a texture feature based on a linear filter is a sinusoidal function of illuminant tilt³ with a period of π radians.

³ In the case of A(ω, θ) being isotropic (for instance if both the surface and the filter are isotropic) the response will degenerate to a sinusoid of zero amplitude, i.e. it will be a constant (straight-line) function of τ. However, if an isotropic filter is applied to a directional surface then A(ω, θ) will not be isotropic and the tilt response will be a sinusoidal function of tilt.
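As a rough numerical counterpart to these integrals, the sketch below (ours; the names and the FFT grid convention are assumptions) sums the integrands of equation 8 over a discretised A(ω, θ), omitting the constant grid-spacing factor.

```python
import numpy as np

def tilt_response_coefficients(A, slant):
    """A: power spectrum of the filtered surface, |F|^2 * H, on an FFT grid.
    Returns (a, d, phi) so that the predicted response is f(tau) = a + d*cos(2*tau + phi)."""
    n = A.shape[0]
    fy, fx = np.meshgrid(np.fft.fftfreq(n), np.fft.fftfreq(n), indexing="ij")
    omega = 2 * np.pi * np.hypot(fx, fy)
    theta = np.arctan2(fy, fx)
    # omega^3 dtheta domega becomes omega^2 du dv on a Cartesian grid
    w = 0.5 * np.sin(slant) ** 2 * omega ** 2
    a = np.sum(w * A)
    b = np.sum(w * np.cos(2 * theta) * A)
    c = np.sum(w * np.sin(2 * theta) * A)
    return a, np.hypot(b, c), np.arctan2(c, b)
```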
2.3 Behaviour in a Multi-dimensional Feature Space
As texture classifiers normally exploit the output of several features, it is important to investigate the behaviour of texture features in multi-dimensional decision space. If two different features are derived from the same surface texture, the results can be plotted in a two-dimensional x, y feature space. Using our sinusoidal prediction we obtain the general behaviour:

$$ x = f_1(\tau) = a_1 + b_1\cos(2\tau + \phi_1), \qquad y = f_2(\tau) = a_2 + b_2\cos(2\tau + \phi_2) $$

Since the frequency of the two cosines is the same, these two equations form two simple harmonic motion components. Therefore the trajectory in 2D feature space is in general a Lissajous ellipse. There are two special cases. If the surface is isotropic and the two filters are identical except for a difference in direction of 90°, the mean value and the oscillation amplitude of the two features are the same and the phase difference becomes 180°. Thus we predict that the scatter plot for isotropic textures and two identical but orthogonal filters is a straight line.
If the surface is isotropic and the two filters are identical except for a difference in direction of 45°, the mean value and the oscillation amplitude of the two features are again the same but the phase difference is now 90°. In this case we predict that the scatter plot is a circle. The line and the circle are the two special cases of all possible curves. In the general case of two or more filters the result is an ellipse or super-ellipse.

2.4 Summary of Theory
In this section we derived a formula describing the behaviour of texture features as a sinusoidal function of illuminant tilt. The parameters of this equation, f(τ) = a + d cos(2τ + φ), depend on the height map of the surface and on the linear texture filter. We also predicted that the resulting figure in a multi-dimensional feature space is a super-ellipse.
3 Assessing the Validity of the Sinusoidal Output Prediction
The theory in the previous section predicts that the outputs of texture features derived from linear filters are sinusoidal functions of the illuminant tilt angle and share a common form:

$$ f(\tau) = a + d\cos(2\tau + \phi) \qquad (9) $$

The purpose of this section is to present results from a set of experiments that were designed to assess the validity of this proposition. We used 30 different real surfaces and filtered them with six Gabor and two Laws filters. Two different slant angles were used and this resulted in a total of 324 datasets. Each dataset was assessed to see how closely it followed a sinusoidal function of 2τ using a goodness-of-fit error metric. In this section we summarise the results using a histogram of the values of this error metric and present upper quartile, median and lower quartile results. First, however, we will briefly describe the image-set, the texture features, and the error metric.

3.1 The Image-Set
Thirty physical texture samples were used in our experiments. 512×512 8-bit monochrome images were obtained from each sample using illumination tilt angles ranging between 0° and 180°, incremented in either 10° or 15° steps. All textures were illuminated at a slant angle of 45°. In addition, six surfaces were also illuminated at a slant angle of 60°. The final dataset contains over 600 images. Table 1 contains one example image of every texture.
3.2 The Texture Features
Gabor filters [12] have been very popular in the literature, so we have used six texture features of this type. In addition, two Laws filters [13] were also used as they are extremely simple to implement, are very effective, and were one of the first sets of filters to be used for texture classification.

Gabor Filters. We use the notation typeFΩAΘ to denote a Gabor filter with a centre frequency of Ω cycles per image-width, a direction of Θ degrees, and of type complex or real. Five complex Gabor filters (comF25A0, comF25A45, comF25A90, comF25A135, comF50A45) together with one real Gabor filter (realF25A45) were implemented. All of the Gabor filters were implemented in the frequency domain.

Laws Filters. Laws [13] developed a set of two-dimensional FIR filters derived from three simple one-dimensional spatial domain masks:

L3 = (1, 2, 1)   ⟹ "level detection"
E3 = (−1, 0, 1)  ⟹ "edge detection"
S3 = (−1, 2, −1) ⟹ "spot detection"

Laws reported that the most useful filters were a set of 5×5 filters which he obtained by convolving and transposing his 3×1 masks. We have used two here: L5E5 and E5L5.

$$ L5E5 = L5^\top * E5 = (L3 * L3)^\top * (L3 * E3) = \begin{pmatrix} -1 & -2 & 0 & 2 & 1 \\ -4 & -8 & 0 & 8 & 4 \\ -6 & -12 & 0 & 12 & 6 \\ -4 & -8 & 0 & 8 & 4 \\ -1 & -2 & 0 & 2 & 1 \end{pmatrix} $$

The E5L5 mask is simply the transpose of L5E5. These filters were implemented in the spatial domain and post-processed with a 31×31 variance estimator.
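The following is a minimal sketch (ours, assuming NumPy and SciPy) of the filter-plus-variance-estimator feature model used here: it builds the L5E5 mask from the 1D kernels and post-processes the filtered image with a 31×31 local variance estimator.

```python
import numpy as np
from scipy.ndimage import convolve, uniform_filter

L3 = np.array([1.0, 2.0, 1.0])    # level detection
E3 = np.array([-1.0, 0.0, 1.0])   # edge detection
L5 = np.convolve(L3, L3)          # (1, 4, 6, 4, 1)
E5 = np.convolve(L3, E3)          # (-1, -2, 0, 2, 1)
L5E5 = np.outer(L5, E5)           # 5x5 mask; E5L5 is its transpose

def laws_feature(image, mask=L5E5, window=31):
    """Filter the image and estimate the local variance over a window x window
    neighbourhood; the mean of this map is the feature output for the image."""
    o = convolve(image.astype(float), mask, mode="reflect")
    m = uniform_filter(o, size=window)
    return uniform_filter(o * o, size=window) - m ** 2
```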
3.3 The Error Metric
The output of each filter/texture combination was plotted against illuminant tilt angle τ. According to the preceding theory, each of these graphs should be a sinusoidal function of 2τ (equation 9). As we do not have the surface height functions, we cannot directly calculate the parameters of these equations. We therefore compute the best-fit sinusoid for each graph using Fourier analysis at the angular frequency of 2 radians/radian⁴. To determine how close a dataset is to our theoretical prediction we calculate an error metric: the mean squared difference between the experimental data and their corresponding best-fit estimates.

⁴ As the x-axis is the tilt angle, which may be measured in either radians or degrees, the period may be expressed as either π radians or 180°. Correspondingly, the angular frequency may be expressed as π/90 radians/degree or as 2 radians/radian.
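A hedged sketch of the goodness-of-fit test described above, assuming the tilt angles (in degrees) and the corresponding feature outputs are available as arrays; variable names are ours, and the absolute error values will also depend on how the outputs are normalised, which is not specified here.

```python
import numpy as np

def sinusoid_fit_error(tilts_deg, outputs):
    """Fit a + b*cos(2*tau) + c*sin(2*tau) to the measured outputs and return the
    mean squared residual together with the equivalent (a, d, phi) parameters."""
    tau = np.deg2rad(np.asarray(tilts_deg, float))
    y = np.asarray(outputs, float)
    X = np.column_stack([np.ones_like(tau), np.cos(2 * tau), np.sin(2 * tau)])
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)   # least squares at frequency 2
    a, b, c = coeffs
    error = np.mean((y - X @ coeffs) ** 2)
    return error, (a, np.hypot(b, c), np.arctan2(-c, b))
```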
3.4 Results (1D)
Figure 4 provides a graphical summary of our results. It is a histogram of the error metrics obtained from the experiments. It shows an asymmetric distribution in which errors range from 0.01, signifying an extremely good fit, up to three outliers at around 0.29. Figure 5 shows the behaviour of two complex Gabor filters over the complete range of thirty textures. It also contains the median, upper, and lower quartile lines (at 0.036, 0.056 and 0.024 respectively) determined over all of the results. In order to provide an insight into what these error values mean we have selected twelve sample graphs for display. Four results closest to the median error, four results closest to the lower quartile, and the four closest to the upper quartile, have been selected to represent typical, good, and poor results respectively. These are shown in figure 6, figure 7, and figure 8 respectively.
Fig. 4. Histogram of error metric values for the goodness of fit of sinusoidal functions to feature/tilt angle graphs. Low values, e.g. the eight occurrences of 0.01, indicate that very good sinusoidal fits were obtained for those graphs of feature output against illuminant tilt angle. High values, e.g. the three outliers around 0.28 indicate a bad fit
What is evident from these results is that even filter/texture combinations with ’poor’ error metrics (shown in figure 8) follow the sinusoidal behaviour quite closely. This is all the more surprising considering how many of our textures significantly violate the ’no shadow’ and ’low slope angle’ assumptions. Textures michael1 and michael5 for instance obviously contain many shadows and high slope angles (see table 1) yet the tilt responses shown in figure 8 show that they are still approximately sinusoidal.
Fig. 5. Bar-chart of error metric values for the complex Gabor filters F25A45 & F25A135 showing how the goodness of fit varies over the different texture samples. The dashed lines show the median, upper and lower quartile values. Sample images of the textures and7 to rock1 are shown in Table 1.
Fig. 6. Four datasets with error metrics closest to the median error of 0.036. (Each plot shows how one output of one feature varies when it is repeatedly applied to the same physical texture sample, but under varying illuminant tilt angles. Discrete points indicate measured output and the curves show the best-fit sinusoids of period 2τ )
Fig. 7. The four datasets closest to the lower quartile (0.024)
Fig. 8. The four datasets closest to the upper quartile (0.056)
Fig. 9. The behaviour of six textures in the comF25A0/comF25A45 feature space together with the best fit ellipses (each point on an ellipse denotes a different value of illuminant tilt)
3.5 Results (2D)
The 1D results show that a large number of feature/texture combinations follow a sinusoidal function of tilt, so the behaviour of cluster means in a multi-dimensional feature space should be of a super-elliptical form. In order to provide additional evidence for this proposition we plotted the results in a number of 2D feature spaces. We used the parameters derived from the best-fit sinusoids to determine the equations of the ellipses in these 2D spaces. Figure 9 shows one of these sets of plots. The behaviours of cluster means of six textures in the F25A0/F25A45 feature space are shown together with their best-fit ellipses. Because in this case the filters are identical except for a difference in direction of 45°, the curve of the isotropic surface rock1 is a circle. It can be seen that the ellipses of different textures cross each other, showing that illumination tilt invariant classification using a simple linear classifier is not possible.
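To visualise the predicted cluster behaviour, the small sketch below (ours) traces the Lissajous ellipse generated by two features whose means follow the fitted sinusoids; with equal amplitudes and a 90° phase difference it degenerates to the circle predicted for two identical filters 45° apart on an isotropic surface.

```python
import numpy as np

def lissajous_trajectory(a1, d1, phi1, a2, d2, phi2, n=180):
    """Return the 2D feature-space trajectory traced as the tilt tau varies over
    one period (pi radians)."""
    tau = np.linspace(0.0, np.pi, n)
    x = a1 + d1 * np.cos(2 * tau + phi1)
    y = a2 + d2 * np.cos(2 * tau + phi2)
    return x, y

# equal amplitudes, 90-degree phase difference: a circle
x, y = lissajous_trajectory(1.0, 0.3, 0.0, 1.0, 0.3, np.pi / 2)
```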
Table 1. One image of each of the thirty sample textures: and1–and7 (zoomed), bigdir45, iso45, radial45, rock45, slab45, slate45, stri45, twins45, wood, chips1, card1, beans1, rock1, stones2, and michael1–michael9.
4 Conclusions
We have presented new theory and new experimental results that support our proposal that the means of texture features derived from linear filters are sinusoidal functions of illuminant tilt (τ). That is:

$$ f(\tau) = a + d\cos(2\tau + \phi) $$

As these behaviours are of the same frequency, but differ in amplitude and phase, they form a set of simple harmonic components. The trajectories of cluster means therefore describe Lissajous's figures in a two dimensional feature space, and in general follow super-elliptical trajectories in multidimensional feature spaces. What is perhaps most surprising is that the empirical results agree well with these theoretical predictions given that many of our samples clearly violate the low-slope-angle and no-shadow assumptions.
These results explain why texture classifiers can fail when the tilt angle is changed and suggest that, given suitable training data, this behaviour may be exploited for the design of tilt invariant texture classifiers.
References
1. M.J. Chantler. Why illuminant direction is fundamental to texture analysis. IEE Proceedings on Vision, Image and Signal Processing, Vol. 142, No. 4, August 1995, pages 199–206.
2. P. Brodatz. Textures: A Photographic Album for Artists and Designers. Dover, New York, 1966.
3. P.R. Kube and A.P. Pentland. On the imaging of fractal surfaces. IEEE Trans. on Pattern Analysis and Machine Intelligence, 10(5):704–707, September 1988.
4. T. Randen and J.H. Husoy. Filtering for texture classification: A comparative study. IEEE Trans. on Pattern Analysis and Machine Intelligence, 21(4):291–310, April 1999.
5. M.J. Chantler and G. McGunnigle. The response of texture features to illuminant rotation. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Vol. III, pages 955–958, 2000.
6. K.J. Dana, S.K. Nayar, B. van Ginneken, and J.J. Koenderink. Reflectance and texture of real-world surfaces. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 151–157, 1997.
7. K.J. Dana and S.K. Nayar. Histogram model for 3d textures. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 618–624, 1998.
8. B. van Ginneken, J.J. Koenderink, and K.J. Dana. Texture histograms as a function of irradiation and viewing direction. International Journal of Computer Vision, 31(2/3):169–184, April 1999.
9. K.J. Dana and S.K. Nayar. Correlation model for 3d texture. In Proceedings of ICCV99: IEEE International Conference on Computer Vision, pages 1061–1067, 1999.
10. T. Leung and J. Malik. Recognizing surfaces using three-dimensional textons. In Proceedings of ICCV99: IEEE International Conference on Computer Vision, pages 1010–1017, 1999.
11. T. Leung and J. Malik. Representing and recognizing the visual appearance of materials using three-dimensional textons. International Journal of Computer Vision, 43(1):29–44, June 2001.
12. A.K. Jain and F. Farrokhnia. Unsupervised texture segmentation using Gabor filters. Pattern Recognition, 24(12):1167–1186, December 1991.
13. K.I. Laws. Textured Image Segmentation. PhD thesis, Electrical Engineering, University of Southern California, 1980.
On Affine Invariant Clustering and Automatic Cast Listing in Movies

Andrew Fitzgibbon and Andrew Zisserman

Visual Geometry Group, Department of Engineering Science, The University of Oxford
http://www.robots.ox.ac.uk/~vgg
Abstract. We develop a distance metric for clustering and classification algorithms which is invariant to affine transformations and includes priors on the transformation parameters. Such clustering requirements are generic to a number of problems in computer vision. We extend existing techniques for affine-invariant clustering, and show that the new distance metric outperforms existing approximations to affine invariant distance computation, particularly under large transformations. In addition, we incorporate prior probabilities on the transformation parameters. This further regularizes the solution, mitigating a rare but serious tendency of the existing solutions to diverge. For the particular special case of corresponding point sets we demonstrate that the affine invariant measure we introduced may be obtained in closed form. As an application of these ideas we demonstrate that the faces of the principal cast of a feature film can be generated automatically using clustering with appropriate invariance. This is a very demanding test as it involves detecting and clustering over tens of thousands of images with the variances including changes in viewpoint, lighting, scale and expression.
1 Introduction
Clustering and classification problems abound in the applied sciences, in applications from citation indexing to the study of gene function. The task of clustering is to divide a large amount of data into disjoint subsets or classes, such that some measure of distance is minimized within classes, and maximized between classes. In computer vision, several recent advances have incorporated clustering algorithms for the canonicalization of large data sets: selecting exemplars [30], building unsupervised object recognizers [1,2], texton generation [16,21], and learning in low-level vision [19]. What makes clustering hard is that the measurement of distance between data is invariably corrupted by measurement error in the data themselves. To overcome this, the distance measures must include a model of the noise process underlying the measurement errors, and the clustering algorithms must employ sophisticated search techniques in order to minimize the distance. In some problems—particularly those arising in vision applications—there is another common source of variation, caused when the observed data undergo a parametrized transformation. For example, changes in scale and rotation of a head in face recognition applications arise from variations in pose; see figure 1.

Fig. 1. Clustering on faces. A standard face detector has been run on 30000 frames of the movie “Groundhog Day”, some example detections being shown here. The distance metric developed in this paper is used in a standard clustering algorithm to extract the principal cast list from the several thousand faces detected in a typical feature film.
Clustering algorithms for computer vision must allow for, or be invariant to, such transformations. Our contribution in this paper is to develop a set of distance functions which take account of affine transformation of the data. These distance functions may either be invariant to the transformation or contain priors on the transformation parameters. Closed-form and numerical iterative solutions are given for these functions. This enables Bayesian maximum a posteriori (MAP) cluster estimation.

The rest of the paper is arranged as follows. First we specify the problem in detail, and review the related literature. Section 2 develops the affine-invariant distance with and without priors for a specific problem: the registration of 2D point sets. Section 3 develops the general form of the affine-invariant distance for image comparisons, and shows how the prior is incorporated. Then, in section 4, the application to image clustering is developed in the context of automatic cast indexing. This allows us to give details about the implementation for a concrete application. Finally we conclude with a discussion of current and future avenues for research.

The Problem. Before reviewing the literature on transformation-invariant clustering, it will prove useful to formally define the clustering problem we wish to solve. We have a data set S of n observations Xᵢ, and want to obtain a partition S = {S₁ … S_K} such that S = S₁ ∪ ⋯ ∪ S_K, and a set of cluster centres {X_c}, c = 1 … K, such that the cost

$$ \sum_{c=1}^{K} \sum_{X_i \in S_c} d(X_i, X_c) \qquad (1) $$
is minimized, where d(X, Y) is a distance function between X and Y. Thus a solution to a clustering problem always involves two aspects: (1) choosing a distance function d(·, ·), (2) determining the clusters given d(·, ·). Research into clustering and classification of data has a long literature [8], and many good algorithms
are available. Examples include K-means [7], normalized cuts [24], and minimal spanning trees. In this work, we use the “Partition among Medoids” algorithm of Kaufman and Rousseeuw [14], and treat the clustering routine as a black box. Some of these algorithms offer automatic computation of K, the number of cluster centres, and some require it as an input parameter. The distance metric which is the novel contribution of this paper is independent of these issues, and may be used with any clustering algorithm. Furthermore, although d(·, ·) imposes the structure of a metric space on the space of the X, it is a property of the clustering algorithm, not the distance, whether the X should live in a space which is also a vector space.

Background. The problem we investigate in this paper is the following: we wish to cluster a data set under an affine invariant distance function including priors on the affine transformation. Invariant approaches to unsupervised clustering have taken a number of routes. In a vector space, the techniques which have been used for robustification of principal components analysis [5] and to include some transformation invariance [9] could be applied to clustering, but these solutions are expensive to compute, and many interesting computer vision problems do not have data which may be linearly combined. In a metric space, attention must concentrate on the distance function in order to obtain invariance. Simard et al. [26,27] describe the modification required. The key idea is that the clustering will be independent of the transformation parameters if the distance metric is transformation invariant. In spaces which are not metric, it is possible [22,23] to obtain transformation invariance by artificially introducing transformed copies of each datum into the dataset. For example Schölkopf [23] adds transformation invariance to a support-vector classifier by carefully adding examples to the dataset which are transformed copies of the initially selected support vectors. However, this strategy is impractical—unless the range of transformations is very small indeed—because of the enormous expansion in the size of the feature space, and a consequent increase in the cost of the clustering algorithms. Also, although this technique can be adapted to other classifiers, it cannot easily be made to work for unsupervised clustering. The work in this paper concentrates on the common computer vision case, in which the points of interest may be considered to lie in a metric space.
2 Point Sets
In this section we will illustrate the problem for the case of a set of 2D points X = {xᵢ}, i = 1 … N, in correspondence with another set of 2D points X′ = {x′ᵢ}, i = 1 … N. The correspondences {xᵢ ↔ x′ᵢ} are known. We will demonstrate the behaviour of the affine invariant distance function in comparing and clustering shapes.

We are searching for a set of corresponding points {x̂ᵢ ↔ x̂′ᵢ} which are close to the supplied points {xᵢ ↔ x′ᵢ}, and are exactly mapped by an affine transformation x̂′ᵢ = Ax̂ᵢ + t, where A is a 2 × 2 matrix and t a 2-vector. For the moment we will neglect the prior p(A, t) on the affine parameters. Then our aim is to compute the distance function dA(X, X′):

$$ d_A(\{x_i\},\{x'_i\})^2 = \min_{\mathbf{a},\{\hat{x}_i\}} \sum_{i}^{N} (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 + (x'_i - \hat{x}'_i)^2 + (y'_i - \hat{y}'_i)^2 \qquad (2) $$
where a is a 6-vector specifying the 6 parameters of the affine transformation. Note, computing dA also involves estimating the points {x̂}, i = 1 … N, and the affine transformation a – a total of 2N + 6 parameters – and the distance is the minimum over all these parameters. We are only interested in the distance, so for our purposes the points {x̂} and transformation a are “nuisance parameters”. It will be seen below that the distance can be computed without explicitly solving for these nuisance parameters. In a practical situation this problem might arise in computing the affine transformation between two images of a planar object. The points {xᵢ}, {x′ᵢ} would be the measured points on the object in the first and second images respectively. If it is assumed the measured points are corrupted by additive Gaussian noise in their position, then minimizing the cost in (2) gives the maximum likelihood estimate of the transformation a and points {x̂ᵢ}, {x̂′ᵢ}; and the distance dA() is known as reprojection error [10]. The distance dA is a generalization from similarity to affine transformations of the “Procrustean” distance of Mardia [6]. If we concatenate the set of 2D points into a single 2N-vector, so that X = (x₁, x₂ … x_N) etc., then (2) becomes

$$ d_A(X, X')^2 = \min_{\mathbf{a}, \hat{X}} (X - \hat{X})^2 + (X' - \hat{X}')^2 \qquad (3) $$
Note a (2 × 2) affine transformation on the 2D space x does not result in a (2N × 2N) general affine transformation on the 2N-dimensional space X. Instead the corresponding transformation matrix is block diagonal with the 2 × 2 matrix A defining the block. The corresponding points X̂ ↔ X̂′ are constrained to lie on a six-dimensional hyperplane (a subspace) in R²ᴺ. Figure 2 gives a geometric picture of the vectors involved and the constraint surface. We now turn to computing the distance. It will be seen that the distance may be computed in closed form by a simple algorithm.

Algorithm for computing the distance dA({xᵢ}, {x′ᵢ}):
1. Translate the point set {xᵢ} so that its centroid is zero, i.e. compute the centroid (1/N) Σᵢ xᵢ and subtract it from each xᵢ.
2. Similarly translate the point set {x′ᵢ} so that its centroid is also zero.
3. Form the N × 4 matrix M with rows (xᵢᵀ, x′ᵢᵀ) = (xᵢ, yᵢ, x′ᵢ, y′ᵢ).
4. Form the Singular Value Decomposition M = UDVᵀ, where D is a diagonal matrix with diagonal elements σ₁, σ₂, σ₃, σ₄ (the singular values) followed by zeros.
5. Then the required distance is dA({xᵢ}, {x′ᵢ})² = σ₃² + σ₄².
6. If required, the points x̂ᵢ, x̂′ᵢ may be obtained as

$$ \begin{pmatrix} \hat{x}_i^\top & \hat{x}'^\top_i \end{pmatrix} = \begin{pmatrix} x_i^\top & x'^\top_i \end{pmatrix} \left( \mathsf{I}_{4\times 4} - v_3 v_3^\top - v_4 v_4^\top \right) $$

where vᵢ are the columns of V.
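A minimal NumPy sketch of the closed-form algorithm above (our own illustration, not the authors' code), assuming N ≥ 4 corresponding points supplied as (N, 2) arrays:

```python
import numpy as np

def affine_invariant_distance(X, Xp):
    """Closed-form affine invariant distance between corresponding 2D point sets."""
    X = np.asarray(X, float) - np.mean(X, axis=0)     # step 1: centre first set
    Xp = np.asarray(Xp, float) - np.mean(Xp, axis=0)  # step 2: centre second set
    M = np.hstack([X, Xp])                            # step 3: N x 4 matrix of rows (x, y, x', y')
    U, s, Vt = np.linalg.svd(M, full_matrices=False)  # step 4: singular values s1..s4
    d2 = s[2] ** 2 + s[3] ** 2                        # step 5: squared distance
    # step 6 (optional): project the rows onto the rank-2 subspace spanned by v1, v2
    P = np.eye(4) - np.outer(Vt[2], Vt[2]) - np.outer(Vt[3], Vt[3])
    Mhat = M @ P
    return np.sqrt(d2), Mhat[:, :2], Mhat[:, 2:]
```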
Fig. 2. Affine invariant distance measure (AIDM). The points X and X′ in R²ᴺ are the supplied vectors. We seek the points X̂ and X̂′ which are closest to X and X′ respectively, but are related exactly by an affine transformation. The hyperplane π represents all the points that can be reached by the affine action on X̂ (or equivalently on X̂′). It is the orbit of the affine group, specified by six parameters. The distance dA(X, X′)² = d² + d′². It is evident that the distance is minimized by finding the hyperplane spanned by the affine action which minimizes the perpendicular distance to the supplied points X and X′, and that the points X̂ and X̂′ are the projections of X and X′ respectively onto this plane.
The proof uses the property of the SVD that the closest fitting rank-2 column space to M is given by the first two columns of U, but is omitted here for lack of space. The result is closely related to the factorization algorithm [28], though here applied to a single view, instead of multiple views, and for structure restricted to a plane [12].

2.1 Application to Shape Matching

Suppose we wish to cluster the six polygonal shapes in figure 3(b). Each polygon (2 examples each of a unit square, a rotated and scaled square, and a triangle) is defined by four points, and the correspondence between the points of each polygon is known. If Euclidean distance is measured between the points (after registering the shapes' centroids) then there are three pairings which have a small distance, and these are the pairings of the square with the square, the triangle with the triangle and the rotated square with the rotated square (see table 1). However, if the affine invariant distance is measured between the shapes then all four squares may be clustered, but the triangles are still distinct. Indeed, if there is no noise added to the points then the affine invariant distance between the square and its rotated and scaled version is zero.
2.2 Including Priors on the Transformation
Of course, in practical computer vision applications, one is rarely concerned about invariance to all of the transformations which a particular parametrization admits. For example, in digit recognition one wishes to maintain invariance to small differences in rotation, scaling and translation of letters, but not to allow a “6” to be rotated into a

Fig. 3. Shape space. (a) Two shapes (“exemplars”) defined by four points. (b) Affine observations of the exemplars: the observations are generated by applying an affine transformation (which is the identity in four of the cases shown) and then adding Gaussian noise with σ = 0.05 independently to each point. There are six observations, shown in three superimposed pairs. The square and rotated square are in the same affine equivalence class.
“9” [27]. Thus we wish to include a priori knowledge in the form of prior probability distributions p(a) on the transformation parameters in the distance function. If the generative model which produces the point sets is an affine transformation followed by additive Gaussian noise on the points' locations, then the distance dA(X, X′) is the (negative) log-likelihood that X and X′ are observations of the object X̂ (which represents the affine equivalence class). As this is a log-likelihood, a prior must also be included in the distance function as a (negative) log-likelihood, i.e. as −log p(a). In this way, by Bayes, the distance is related to the posterior probability that the two observations are in the same affine equivalence class, given a prior on the transformation between observation pairs. If we use a zero-mean Gaussian prior on a, we may write p(a) = exp(−aᵀΛa). Although the process need not be zero-mean, we assume for simplicity that it is. Given this prior, the distance function of equation (3) is extended to

$$ d_A(X, X')^2 = \min_{\mathbf{a}, \hat{X}} (X - \hat{X})^2 + (X' - \hat{X}')^2 + \mathbf{a}^\top \Lambda \mathbf{a} \qquad (4) $$
In this synthetic example, we assume a more complex distribution for the affinities. The prior used is p(A, t) = exp(−|1 − det A|), where A is the 2 × 2 matrix of the affine transformation x̂′ᵢ = Ax̂ᵢ + t computed between the point sets. The prior in this case simply measures the relative change in areas of the shapes. Table 1 shows the effect of applying the distance function (3) between the pairs of shapes of figure 3b. Without the prior, the square and 'rotated and scaled square' are in the same equivalence class, but including the prior breaks the symmetry.
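As an illustration of how the area-change prior could be attached to the closed-form comparison, the sketch below (ours) estimates the least-squares affine map between the centred point sets and adds −log p(A, t) = |1 − det A| to the squared distance; using a plain least-squares fit of A, rather than the fitted points x̂ᵢ, is our own simplification, and it reuses the affine_invariant_distance sketch given earlier.

```python
import numpy as np

def distance_with_area_prior(X, Xp):
    X = np.asarray(X, float) - np.mean(X, axis=0)
    Xp = np.asarray(Xp, float) - np.mean(Xp, axis=0)
    d, _, _ = affine_invariant_distance(X, Xp)      # closed-form d_A from the earlier sketch
    At, *_ = np.linalg.lstsq(X, Xp, rcond=None)     # Xp ~ X @ A.T  (t = 0 after centring)
    return d ** 2 + abs(1.0 - np.linalg.det(At.T))  # d_A^2 - log p(A, t)
```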
3 Images
For many vision applications, the primary datum is the vector of greylevels representing an image or an image patch, with an associated parametrized transformation to which invariance is sought. Given an image represented as a column vector I, the image mapped by the transformation with parameters a is written T(a; I). Note that under many common transformations, this transformation is linear in the image greylevels, despite being nonlinear in a. As in (2), we seek to estimate an image I₀ which maps exactly under this induced transformation but minimizes the distance to the measured images I, I′:

$$ d_A(I, I')^2 = \min_{\mathbf{a}, \mathbf{a}', I_0} \big(I - T(\mathbf{a}; I_0)\big)^2 + \big(I' - T(\mathbf{a}'; I_0)\big)^2 \qquad (5) $$

Table 1. Image intensity represents the distance measured between pairs of shapes from figure 3b, where dark indicates a low distance. (Note the shapes are first translated so that their centroids are coincident.) (a) Euclidean distance. The only clusters are on the diagonal; for example, the two noisy versions of the square form a cluster, but the square and rotated square do not. (b) Affine invariant distance dA measured between the shapes. Note that the distances cluster into the two equivalence classes. (c) Affine distances with priors on the transformation parameters (see text). The effect of the prior is to break the symmetry (the affine equivalence of the sets). The distances now cluster again into three equivalence classes.
where we have specified the sub-space by a point I₀ which is transformed to the estimate Î by one transformation parametrized by a, and is transformed to the estimate Î′ by another transformation parametrized by a′. In order to cluster a set of images, we will compute dA pairwise in the set, and apply a standard metric-space clustering algorithm. Note that this formulation of the distance is essentially that of estimating the affine transformation which best aligns the two images, and reporting the squared error between the aligned and original images. This is a problem with a long history, and two basic approaches may be discerned: direct [13] and feature-based [29]. The distance metric in this paper is a purely direct method without any feature correspondences.

3.1 AIDM: Affine Invariant Distance Measure

In general, the transformation will not be linear in the parameters a, so we shall linearize about the origin. Expanding in terms of derivatives of I with respect to the 6 parameters of the affine transformation, we calculate an N × 6 Jacobian matrix D = ∇ₐT(a; I), and similarly from I′ calculate D′. Then to a first-order approximation (5) becomes

$$ d_A(I, I')^2 = \min_{\mathbf{a}, \mathbf{a}', I_0} \big(I - (I_0 + D\mathbf{a})\big)^2 + \big(I' - (I_0 + D'\mathbf{a}')\big)^2 \qquad (6) $$
Collecting the parameters I₀, a and a′ into a single vector of unknowns x = [I₀ᵀ, aᵀ, a′ᵀ]ᵀ, this is equal to

$$ d_A(I, I')^2 = \min_{\mathbf{x}} \left\| \begin{pmatrix} I \\ I' \end{pmatrix} - \begin{pmatrix} \mathbf{1} & D & 0 \\ \mathbf{1} & 0 & D' \end{pmatrix} \mathbf{x} \right\|^2 = \min_{\mathbf{x}} \| \mathbf{m} - \mathsf{M}\mathbf{x} \|^2 \qquad (7) $$
Where 1 is an N × N identity matrix. Finally, differentiating with respect to x and solving gives the solution for the minimizing x as x = M⁺m, where M⁺ = (MᵀM)⁻¹Mᵀ is the Moore-Penrose pseudoinverse of M. Substituting this x into (7) gives the desired distance. If the affine transformation is small, then this first-order approximation is often sufficient. In general though we are seeking the true solution to (5). This solution may be obtained in the standard manner by a pyramid search and the linearization employed on the various levels of the pyramid. An alternative is to use a good non-linear minimizer (such as Levenberg-Marquardt) to search for the global solution of (5) directly. In the next section we shall show how including the priors on the transformation parameters is analogous to using a Levenberg-Marquardt-like algorithm.

Advantages of Including the Prior. The transformation priors may be written xᵀΛₓx, so if priors are included the minimization (7) becomes instead

$$ d_A(I, I')^2 = \min_{\mathbf{x}} \| \mathsf{M}\mathbf{x} - \mathbf{m} \|^2 + \| \Lambda_x^{1/2}\mathbf{x} \|^2 \qquad (8) $$

A linear solution may be obtained as above by differentiating with respect to x and solving to give x = (MᵀM + Λ)⁻¹Mᵀm. Now the numerical advantages of including the prior become apparent: if M is poorly conditioned, the Newton step x = M⁺m can place x, and hence I₀, far from the data points I, I′, and assign a large value to d(I, I′). Whether one computes the pseudoinverse directly from the Moore-Penrose formula or via the SVD, poor conditioning of MᵀM will generally imply an erroneous step. With the prior, the computation inverts MᵀM + Λ rather than MᵀM. We assume that the information matrix Λ is positive definite (as it is the inverse of a covariance matrix), so the smallest eigenvalue of the matrix to be inverted is bounded away from zero. Hence the computed step is actively constrained to lie close to the initial estimate.

A useful comparison is with schemes for nonlinear optimization. The computation of (5) under the tangent distance approximation of [26,31] is effectively a Gauss-Newton update for the hidden parameters I₀, a and a′. The convergence properties of such schemes are well known [3,4,20]: where the approximation is good, it gives excellent convergence; but where the linearization is not valid, the update can be drastically wrong. Modern optimization techniques invariably regularize the Gauss-Newton update using trust region strategies [3,4], and thus gain improved convergence. Including the prior in (8) confers similar advantages on AIDM, as will be shown in the applications which follow, and in the distance matrices in figure 6.
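An illustrative sketch (ours, not the authors' Matlab) of a single linearised AIDM step of equations 7 and 8, assuming the two images are flattened to N-vectors and their N×6 warp Jacobians have already been computed; in practice this would be applied at a coarse pyramid level, since the linear system has N+12 unknowns.

```python
import numpy as np

def aidm_step(I, Ip, D, Dp, Lam=None):
    """Solve for x = [I0, a, a'] with an optional prior information matrix Lam,
    and return the (regularised) distance and the estimate x."""
    N = I.size
    m = np.concatenate([I, Ip])
    M = np.block([[np.eye(N), D, np.zeros_like(Dp)],
                  [np.eye(N), np.zeros_like(D), Dp]])
    A = M.T @ M if Lam is None else M.T @ M + Lam
    x = np.linalg.solve(A, M.T @ m)               # x = (M'M + Lam)^-1 M'm
    r = m - M @ x
    d2 = r @ r if Lam is None else r @ r + x @ (Lam @ x)
    return np.sqrt(d2), x
```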
Fig. 4. Test sequence. A selection of the 251 faces detected on a 2000 frame sequence from the film “Groundhog Day”. The start and end frame of the sequence are shown in figure 7. There are four principal cast members in this sequence, and some significant distractors. If the face detector fires on background clutter (e.g. the television on the bottom left) it will generally do so reliably, giving a large consistent cluster. Lighting and head orientation changes are significant, so the distance metric must be robust to these effects.
4 Application

As an example of the application of the proposed invariant distance function, we consider the problem of automatically extracting the principal cast members from a movie sequence. This application is a challenging analogue to the clustering of digits problem [15], and requires the full power of the affine invariant distance as well as the incorporation of motion priors in order to achieve useful results. This is because, although movies are generally well photographed, the actors' faces tend to be seen in quite varied positions and under lighting conditions that are not typical of more traditional mugshot or security applications. Figures 1, 4 and 5 show some examples. To give an idea of the magnitude of the task: a feature-length film contains of the order of 150K frames and a principal cast of perhaps 20, with each character appearing in 1000s of frames. In the examples shown here, we generally subsample temporally by a factor of five, so that we are dealing with datasets of the order of 30000 frames. The strategy of the algorithm is to detect faces in each frame of the movie, and cluster the detected faces to obtain a representative set which summarizes the faces of the cast, and of course, associates the cast members with the scenes in which they appear. In the following sections we will describe the steps involved in clustering for a test sequence of 2000 frames with 251 detected faces from the film “Groundhog Day”. See figure 4.

Face detection: Face detection is an area that has seen significant progress in recent years, and impressive systems have been built [11]. In this application we used a well-engineered local implementation [18] of the Schneiderman and Kanade [22] face detector. Parameters were set to obtain a true positive rate of about 80% of frontal faces, which induces a false detection rate of about 0.01 faces per frame. This detector obtains scale and translation invariance by oversampling. The output comprises the image plane translation of the face template centre, and the scale at which the maximum response was recorded.
Fig. 5. Pre-processing. Upper set: original (raw) extracted faces. Lower set: faces after bandpass filtering and feathering. Variations in lighting across the image are removed, and boundary effects are diminished.
Fig. 6. Distance matrices for the test sequence. For this sequence, the faces are naturally clustered, and large dark blocks in the distance matrix indicate the clusters. (a) Multiresolution tangent distance [31]. (b) AIDM without priors. (c) AIDM with deformation prior. In this case, the distance computed by the Newton methods (b,c) is visibly better defined than (a). The improvement due to deformation priors (c) over (b) is not visually obvious, but amounts to a small percentage reduction in the number of divergence failures (individual bright pixels in all matrices).
Fig. 7. Temporal coherence. (Left) Two example frames from an 80 sec extract from the test movie. (Right) Image-plane x coordinate of detected faces over that range. Although there are two actors in shot for much of the sequence, they are frequently not detected because the face goes into a profile view. However we can see that, starting at frame 3000, there are two faces in shot, “A” at x ≈ 200 and “B” at x ≈ 500 (to the right of the frame). For the first 400 frames, B faces the camera, giving a long consistent trace, and A alternates her gaze between B and the viewer. From 3400 to 3550, both talk to each other in profile, so there are no detections, and then A takes over the straight-to-camera slot.
The detected faces are rescaled to 81 × 81 pixels, and stored with their frame number and position. Examples of the output are shown in figure 1. Plots of the recovered image-plane position (shown in figure 7) illustrate the strong temporal coherence of the detector outputs, a constraint we will incorporate via priors into the clustering.

Variations in pose and lighting: The principal sources of variation in this application arise from changes in illumination and in facial pose. Images can be normalized for illumination by applying a bandpass filter to the extracted face regions, and scaling the windows' intensities to have mean zero and variance 1. In addition, faces are feathered by multiplying them with a Gaussian centred at the centre of the image. In detail the filters used are:
– Bandpass: B = I ∗ G_{σ=1} − I ∗ G_{σ=4}
– Feather: F(x, y) = B(x, y) exp(−r(x, y)²/.5)
Examples of this pre-processing are shown in figure 5.

Affine invariant clustering: The distance was computed using a multiscale technique [31] on a Gaussian pyramid, with the finest scale corresponding to σ = 2 pixels. The affine-invariant distance was computed by iterating the solution of (8), with up to five iterations allowed. Implemented in unoptimized Matlab, evaluation of the distance for a single pair of faces takes about half a second. With a typical movie returning several thousand face detections, computation of the complete distance matrix would be impractical. In order to accelerate the process, we pruned the face set using a Euclidean k-medoids algorithm [14] on the affinely distorted faces, with k chosen to return a few hundred faces. Then computation of the affine-invariant distance matrix on the reduced set requires compute time only of the order of a few hours—certainly less than the initial face detection stage. Finally, the k-medoids algorithm is applied to the affine-invariant distance matrix and the top K cluster centres extracted, for a user-specified K. Figure 6 shows some example distance matrices, computed with the tangent distance approximation to affine invariance, and our new Newton methods. In this example, the new metrics improve the signal-to-noise ratio of the clusters, as evidenced by the dark squares in the new distance matrices. Examples of the effect on the clustering output are shown in figure 9.
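A hedged sketch of the pre-processing described above (bandpass filtering, normalisation to zero mean and unit variance, and Gaussian feathering), assuming SciPy's Gaussian filter; the feathering width is our own choice, since the paper's exact constant is not recoverable from this copy.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def preprocess_face(face, feather_sigma=30.0):
    """face: 81x81 greylevel array; returns the bandpassed, normalised, feathered window."""
    f = face.astype(float)
    B = gaussian_filter(f, 1.0) - gaussian_filter(f, 4.0)   # bandpass
    B = (B - B.mean()) / (B.std() + 1e-8)                   # zero mean, unit variance
    h, w = B.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r2 = (yy - (h - 1) / 2) ** 2 + (xx - (w - 1) / 2) ** 2
    return B * np.exp(-r2 / (2 * feather_sigma ** 2))       # feather towards the border
```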
4.1 Clustering Including Priors
Two classes of priors will be included: priors on the transformation between any pair of frames; and priors on the transformation between contiguous frames. We will first describe these two classes, and how they are learnt from images, and then show their effectiveness in improving clustering. The resultant distance matrices are shown in figure 8.

Deformation priors: The prior on the affine transform parameters is learnt by manually selecting the eyes and the centre of the mouth in 200 randomly chosen faces, computing the affine transformations which map between these, and fitting a single 6D Gaussian distribution to the resulting transformation parameters. This prior is generic to all of the pairwise comparisons, and simply represents the likelihood that the face detector will tend to detect faces in only a small number of poses. Including this prior in the distance metric has only a small effect on the computed distance in almost all cases. However, in cases where the Newton method might diverge, producing an extreme affine correction, this prior will tend to regularize the computation. In the 2000-frame test sequence, this occurs in about 1% of the comparisons, all of which are repaired by the deformation prior.

Temporal coherence prior – speed: An important additional constraint in the analysis of image sequences is that provided by the motion of the detected faces. In movies particularly, the faces generally move little from frame to frame, so that there is an a priori restriction on the speed at which objects can move. A typical assumption might be that the image velocity vector ẋ = dx/dt is drawn from a zero-mean distribution which decays monotonically away from the origin. Figure 7 gives an indication of the stability of the face detector's position reports on a studio-bound shot. In outdoor or crowd scenes the variance is somewhat higher. To incorporate this knowledge, we can augment the distance computation between two faces. Each 81 × 81 face window in the pair I, I′ is associated with its location p or p′ in the full frame, and the time (frame number) at which it was detected, t or t′. Then, for any pair, a speed estimate is given by s(I, I′) = ‖p′ − p‖/|t′ − t|. This is converted to a log-likelihood via a kernel function E, and added to the image distance, which then contains three terms:
Fig. 8. Distance matrices with and without priors. (a) Affine-invariant distance metric with deformation priors only. (b) Matrix of speed prior contributions only. This can be precomputed without any iteration as it is based solely on the image locations reported by the face detector. (c) Combined deformation and speed priors. The notable contribution of the speed prior in this case (strong narrow white bars) is the increase in dissimilarity between detections which are spatially separated, but temporally close – e.g. two actors in the same scene.
$$ d_A(I, I')^2 = \min_{\mathbf{a}, \mathbf{a}', I_0} \big(I - T(\mathbf{a}; I_0)\big)^2 + \big(I' - T(\mathbf{a}'; I_0)\big)^2 - \log p(\mathbf{a}^{-1}\mathbf{a}') + E\big(s(I, I')\big) \qquad (9) $$

Because the dependence of s on a is weak in our examples, the motion cost is simply added to the AIDM matrix before clustering. It was empirically determined that an effective form for the error was an arctangent sigmoid E(x) = tan⁻¹x. Figure 8 shows the AIDM matrix before and after incorporating the motion constraint. An effect of this prior is to increase the distance between faces across shot boundaries, since generally faces are not in precisely the same position in such frames and consequently the speed is large. A prior could also be included on whether the same face appears in contiguous shots.

Clustering: The distance matrix has been computed, and incorporates both deformation and speed priors. Clustering is now performed using the “Partition among Medoids” algorithm [14]. Because this algorithm has a run time which is greater than quadratic in the number of data, a hierarchical strategy was employed. Given an n-element dataset and a requirement for k clusters, the algorithm divides the set into N subsets, which can be clustered in reasonable time. Each subset is clustered to obtain k medoids, and the process is repeated over the Nk cluster centres. Figure 10 shows the results of the algorithm on 2111 faces detected in “Groundhog Day”.
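A rough sketch (ours) of the temporal-coherence term: pairwise speed estimates passed through the arctangent kernel, giving a matrix that can be added to the precomputed AIDM distance matrix; the `scale` parameter that sets the units of the sigmoid is our own assumption.

```python
import numpy as np

def speed_prior_matrix(positions, frames, scale=1.0):
    """positions: (n, 2) face centres; frames: (n,) frame numbers.
    Returns E(s(I, I')) for every pair of detections."""
    p = np.asarray(positions, float)
    t = np.asarray(frames, float)
    dist = np.linalg.norm(p[:, None, :] - p[None, :, :], axis=-1)
    dt = np.abs(t[:, None] - t[None, :])
    speed = dist / np.maximum(dt, 1e-6)     # s(I, I') = ||p' - p|| / |t' - t|
    return np.arctan(speed / scale)         # E(x) = arctan(x)

# combined = aidm_matrix + speed_prior_matrix(positions, frames)
```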
5 Discussion
We have presented a new affine invariant distance metric which efficiently manages priors on the transformation parameters, and shown how the use of such priors regularizes
Fig. 9. Clusters for the test sequence. (a) Clusters computed using tangent distance. (b) Clusters computed using AIDM with both deformation and speed priors. The number of clusters was set a priori at four. The tangent distance clusters are poor, showing very little within-cluster coherence. The AIDM clusters correctly extract the main two characters, and the third cluster centre is the third character, although that cluster includes some spurious background.
clustering problems in a controllable way. Previous efficient approaches used ad hoc regularizers and did not include priors, while previous approaches incorporating priors were expensive to compute. The new technique is analogous to the use of "trust region" and Levenberg-Marquardt strategies in nonlinear optimization. The power of this metric for unsupervised clustering has been demonstrated by automatically extracting the principal cast from a movie. The primary difficulty with the current version of the algorithm is its poor tolerance to changes in the expression of the characters. Relaxing the distance threshold will result in the merging of clusters containing different characters. Future investigations will include improvements to the tolerance to expression change and other hard cases (see figure 12). In particular we intend to learn the noise distribution in the manner of [25]. Clustering invariant to certain classes of transformations is an interesting strategy to investigate for many vision problems. Here we have investigated face detection, which may be considered as a very strongly defined interest region operator. A similar example
Fig. 10. Principal cast list of “Groundhog Day”. Cluster centres under AIDM with priors. Duplicates have not been suppressed, so actors can appear multiple times, with different expressions.
Fig. 11. Principal cast list of “The Player”. Top 40 cluster centres from 2899 detected faces on every fifth frame (35641 frames).
Fig. 12. Difficult cases. Examples (from “The Player”) where the plain AIDM has difficulty. Lighting change is handled reasonably well by our preprocessing. Modification of (5) to include a robust kernel goes some way to handling the problems of occlusion.
would be a building facade detector, where invariant clustering would be expected to determine the sides of the building. More generally, affine invariant clustering on 2D appearance would (partially) remove 3D viewpoint effects and determine clusters corresponding
to the aspects of an object. Finally, affine invariant clustering on filter responses will enable viewpoint variant and invariant textons [17] to be learnt from single images of general curved surfaces, instead of, as now, from images of textured planes. Acknowledgements. We are very grateful to Krystian Mikolajczyk and Cordelia Schmid of INRIA Grenoble for supplying us with the face detection software. David Capel and Michael Black made a number of helpful comments. Funding was provided by the Royal Society and the EC CogViSys project.
References
1. Y. Amit and D. Geman. A computational model for visual selection. Neural Computation, 11(7):1691–1715, 1999.
2. M. C. Burl, M. Weber, and P. Perona. A probabilistic approach to object recognition using local photometry and global geometry. In ECCV (2), pages 628–641, 1998.
3. R. Byrd, R. B. Schnabel, and G. A. Shultz. A trust region algorithm for nonlinearly constrained optimization. SIAM J. Numer. Anal., 24:1152–1170, 1987.
4. A. R. Conn, N. I. M. Gould, and P. L. Toint. Trust-Region Methods. MPS/SIAM Series on Optimization. SIAM, Philadelphia, 2000.
5. F. De la Torre and M. J. Black. Robust principal component analysis for computer vision. In Proc. International Conference on Computer Vision, 2001.
6. I. Dryden and K. Mardia. Statistical Shape Analysis. John Wiley & Sons, New York, 1998.
7. R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973.
8. D. Fasulo. An analysis of recent work on clustering algorithms. Technical Report UW-CSE-01-03-02, University of Washington, 1999.
9. B. Frey and N. Jojic. Transformed component analysis: joint estimation of spatial transformations and image components. In Proc. International Conference on Computer Vision, pages 1190–1196, 1999.
10. R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521623049, 2000.
11. E. Hjelmås and B. K. Low. Face detection: A survey. Computer Vision and Image Understanding, 83(3):236–274, 2001.
12. M. Irani. Multi-frame optical flow estimation using subspace constraints. In ICCV, pages 626–633, 1999.
13. M. Irani and P. Anandan. About direct methods. In W. Triggs, A. Zisserman, and R. Szeliski, editors, Vision Algorithms: Theory and Practice, volume 1883 of LNCS, pages 267–277. Springer, 2000.
14. L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, NY, USA, 1990.
15. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
16. T. Leung and J. Malik. Recognizing surfaces using three-dimensional textons. In Proc. 7th International Conference on Computer Vision, pages 1010–1017, Kerkyra, Greece, September 1999.
17. T. Leung and J. Malik. Representing and recognizing the visual appearance of materials using three-dimensional textons. International Journal of Computer Vision, December 1999.
18. K. Mikolajczyk, R. Choudhury, and C. Schmid. Face detection in a video sequence – a temporal approach. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2001.
19. B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607–609, 1996.
20. W. Press, B. Flannery, S. Teukolsky, and W. Vetterling. Numerical Recipes in C. Cambridge University Press, 1988.
21. C. Schmid. Constructing models for content-based image retrieval. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2001.
22. H. Schneiderman and T. Kanade. A histogram-based method for detection of faces and cars. In Proc. ICIP, volume 3, pages 504–507, September 2000.
23. B. Schölkopf, C. Burges, and V. Vapnik. Incorporating invariances in support vector learning machines. In Artificial Neural Networks, ICANN'96, pages 47–52, 1996.
24. J. Shi and J. Malik. Normalized cuts and image segmentation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 731–743, 1997.
25. H. Sidenbladh and M. J. Black. Learning image statistics for Bayesian tracking. In Proc. International Conference on Computer Vision, pages II:709–716, 2001.
26. P. Simard, Y. Le Cun, and J. Denker. Efficient pattern recognition using a new transformation distance. In Advances in Neural Info. Proc. Sys. (NIPS), volume 5, pages 50–57, 1993.
27. P. Simard, Y. Le Cun, J. Denker, and B. Victorri. Transformation invariance in pattern recognition—tangent distance and tangent propagation. In Lecture Notes in Computer Science, Vol. 1524, pages 239–274. Springer, 1998.
28. C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: A factorization approach. International Journal of Computer Vision, 9(2):137–154, November 1992.
29. P. H. S. Torr and A. Zisserman. Feature based methods for structure and motion estimation. In W. Triggs, A. Zisserman, and R. Szeliski, editors, Vision Algorithms: Theory and Practice, volume 1883 of LNCS, pages 278–294. Springer, 2000.
30. K. Toyama and A. Blake. Probabilistic tracking in a metric space. In Proc. International Conference on Computer Vision, pages II:50–57, 2001.
31. N. Vasconcelos and A. Lippman. Multiresolution tangent distance for affine-invariant classification. In Advances in Neural Info. Proc. Sys. (NIPS), volume 10, pages 843–849, 1998.
Factorial Markov Random Fields
Junhwan Kim and Ramin Zabih
Computer Science Department, Cornell University, Ithaca, NY 14853
Abstract. In this paper we propose an extension to the standard Markov Random Field (MRF) model in order to handle layers. Our extension, which we call a Factorial MRF (FMRF), is analogous to the extension from Hidden Markov Models (HMM’s) to Factorial HMM’s. We present an efficient EM-based algorithm for inference on Factorial MRF’s. Our algorithm makes use of the fact that layers are a priori independent, and that layers only interact through the observable image. The algorithm iterates between wide inference, i.e., inference within each layer for the entire set of pixels, and deep inference, i.e., inference through the layers for each single pixel. The efficiency of our method is partly due to the use of graph cuts for binary segmentation, which is part of the wide inference step. We show experimental results for both real and synthetic images. Keywords: Grouping and segmentation, Layer representation, Graphical model, Bayesian inference, Markov Random Field, Factorial Hidden Markov Model
1 Introduction
Markov Random Fields (MRF's) have been extensively used in low level vision because they can naturally incorporate the spatial coherence of measures of interest (intensity, disparity, etc) [12]. However, MRF's cannot effectively combine information over disconnected spatial regions. Layer representations are a popular way of addressing this limitation [16,14,1]. The main contribution of this paper is to propose a new graphical model that can represent image layers, and to develop an efficient algorithm for inference on this graphical model. We extend the standard MRF model to several layers of MRF's, which is analogous to the extension from Hidden Markov Models (HMM's) to Factorial HMM's (see Figure (1) and Figure (2)) [6]. A Factorial HMM has the structure shown in Figure (1), and consists of i) a set of hidden variables, which are a priori independent, and ii) a set of observable variables, whose state depends on the hidden variables as shown in the figure. The inference algorithm for Factorial HMM's iterates between (exact) inference within each hidden variable (via the forward-backward algorithm), and (approximate) inference using indirect dependencies between hidden variables through the observables. Similarly, our algorithm alternates between wide inference and deep
inference. Wide inference, i.e., inference within each layer for the entire set of image pixels, utilizes what we call pseudo-observables, which we define in section 4.2. These play the same role as observables in standard MRF’s. Parallel to the forward-backward algorithm in HMM’s, we use graph cuts for binary segmentation [7], where the binary value signifies whether the object is present or not at the given pixel and layer. Deep inference, on the other hand, is inference through the layers for each single pixel. We develop an efficient EM-based algorithm to evaluate pseudo-observables by deep inference. The algorithm uses dependencies between observables and hidden variables (but not within the hidden variables), as well as the result of the binary segmentation from the wide inference step. The rest of the paper is organized as follows. Section 2 summarizes related work. In section 3, we define a Factorial MRF and show how this model can be used for layers. Our EM-based algorithm for inference on Factorial MRF’s is presented in section 4. We demonstrate the effectiveness of our algorithm by experimental results for both real and synthetic images in section 5. Some technical details, especially regarding transparent layers, are deferred to the appendix.
Fig. 1. Extension from HMM to factorial HMM: (a) A directed graph specifying conditional dependence relations for an HMM. (b) A directed graph for factorial HMM with three underlying Markov chains (borrowed from [6]). From now on, a gray node represents an observable variable, whereas a white node represents a hidden variable.
2 Related Work
Inference problems on probabilistic models are frequently encountered in computer vision and image processing. In the structured variational approximation [10], exact algorithms for probability computation on tractable substructures are combined with variational methods to handle the interaction between the substructures which make the system as a whole intractable. Factorial HMM’s [6] are a natural extension of HMM’s, in which the hidden state is given by the joint configuration of a set of independent variables. In this case, the natural tractable structure consists of the HMM’s for each hidden variable, for which the forward-backward algorithm can be used. On the other hand, in the case of
Fig. 2. Extension from MRF to factorial MRF: (a) A graph specifying conditional dependence relations for an MRF. (b) A graph for a factorial MRF with three underlying MRF’s. Some of the nodes and links are omitted from the drawing for legibility
the Hidden Markov decision trees, there are two natural tractable substructures, the "forest of chains approximation" and the "forest of trees approximation" [10]. Transformed HMM's [9] can be considered as Factorial HMM's with two hidden variables, i.e., a transformation variable and a class variable. They use the model to cluster unlabeled video segments and form a video summary in an unsupervised fashion. Besides the Factorial HMM, researchers have proposed other extensions such as Coupled HMM's for Chinese martial art action recognition [3], or Parallel HMM's for American sign language recognition [15]. Layer representations are known to be able to precisely segment and estimate motion for multiple objects, and to provide compact and comprehensive representations. Wang and Adelson [16] approached the problem by iteratively clustering motion models computed using optical flow. Each layer, ordered in depth, contains an intensity map and an alpha map, and the layers occlude each other according to image compositing rules. Ayer and Sawhney [1] proposed an EM algorithm for robust maximum-likelihood estimation of the multiple models and their layers of support. They also applied the minimum description length principle to estimate the number of models. Weiss [17] presented an EM algorithm that can segment image sequences by fitting multiple smooth flow fields to the spatiotemporal data. He showed how to estimate a single smooth flow field, which eventually leads to the multiple model estimation. The number of layers is estimated automatically using methods similar to the parametric approach. Torr et al. [14] concentrated on 3D layers, which consist of approximately planar layers that have arbitrary 3D positions and orientations. 3D layer representations can naturally handle parallax effects on the layer, as opposed to 2D approaches. Frey [5] recently proposed a Bayesian network for appearance-based layered vision which describes the occlusion process, and developed iterative probability propagation to recover the identity and position of the objects in the scene.
3 Problem Formulation
The observable image, i = {i_p | p ∈ P}, where P is the set of pixels, results from several layers of objects. Let f^l_p be a binary random variable which is 1 if an object exists at pixel p in layer l, and 0 otherwise. We assume that the layers are a priori independent of each other, and we model each layer as a Markov Random Field whose clique potentials involve pairs of neighboring pixels. This is the Ising model [12]:

P(f) = \prod_l P(f^l)   (1)
P(f^l) \propto \exp\Bigl( - \sum_{p} \sum_{q \in N_p} 2\theta^l_{\{p,q\}} \, \delta(f^l_p = f^l_q) \Bigr)   (2)
Here f^l denotes f^l = {f^l_p | p ∈ P}, and N = {N_p | p ∈ P} denotes a neighborhood system, where N_p ⊂ P. θ^l_{\{p,q\}} is a coefficient roughly signifying the degree of connection between two pixels p and q in layer l (we will describe this in more detail later). Given the configuration of layers, the likelihood of the observable image at each single pixel is independent of the other pixels:

P(i | f) = \prod_p P(i_p | f_p),   (3)
where f_p denotes f_p = {f^l_p | l ∈ L}, with L the set of layers. By convention we will number the layers so that the largest numbered layer is closest to the viewer. By equations (1) and (3), we have:

P(i, f) = \prod_p P(i_p | f_p) \prod_l P(f^l)   (4)
The likelihood of the observable, P(i_p | f_p), hinges on the characteristics of the layers. We will consider two kinds of layers, either transparent or opaque. For opaque layers, the observable image at each single pixel comes from the image containing an object that is closest to the viewer, as usual with layers. Let N(i; μ, C^{-1}) denote a Gaussian distribution with mean μ and covariance C; we have

P(i_p | f_p) = N(i_p; F(f_p), C(f_p)^{-1}),
F(f_p) = \sum_l W^l M^l_p,
C(f_p)^{-1} = \sum_l C^{l\,-1} M^l_p,
where W^l denotes the attributes of the object at layer l, which can be intensity, displacement, etc., and C^l is their covariance. M^l_p is a binary random variable which is 1 if at pixel p there is an object present (i.e., f^l_p = 1) and this object
is not occluded (i.e., f^{l'}_p = 0 for all l' > l); see Figure 2. For transparent layers, on the other hand, the observable image at each pixel results from a linear combination of the objects present at each layer, contaminated by Gaussian noise:

P(i_p | f_p) = N(i_p; F(f_p), C^{-1}),   F(f_p) = \sum_l W^l f^l_p.
From now on however, for the sake of readability, we will proceed with opaque layers. In the appendix we derive the formulas for the transparent layers.
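The two observation models can be made concrete with a short sketch. The Python code below assumes scalar per-layer intensities W^l and binary layer maps f of shape (L, H, W); the function names are ours, not the paper's.

```python
import numpy as np

def visibility(f):
    """M[l, p] = 1 iff an object is present at layer l and no closer layer occludes it.
    Layers are numbered so that the largest index is closest to the viewer."""
    L = f.shape[0]
    occluded = np.zeros_like(f[0], dtype=bool)
    M = np.zeros_like(f, dtype=bool)
    for l in range(L - 1, -1, -1):          # walk from the viewer towards the background
        M[l] = f[l].astype(bool) & ~occluded
        occluded |= f[l].astype(bool)
    return M

def mean_image_opaque(f, W):
    """F(f_p) = sum_l W^l M^l_p for the opaque model (scalar intensities assumed)."""
    return np.tensordot(W, visibility(f), axes=(0, 0))

def mean_image_transparent(f, W):
    """F(f_p) = sum_l W^l f^l_p for the transparent model."""
    return np.tensordot(W, f.astype(float), axes=(0, 0))
```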
4 EM Algorithm
Given the probabilistic model, the inference problem is to compute the posterior probabilities of the hidden variables (the objects in each layer, f), given the observations (the image, i). In some other cases, we may want to infer the single most probable state of the hidden variables. The parameters of a factorial MRF can be estimated via the Expectation Maximization algorithm [4], which iterates between assuming the current parameters to compute posterior probabilities over the hidden states (E-step), and using these probabilities to maximize the expected log likelihood of the parameters (M-step). The EM algorithm starts by defining the expected log likelihood of the complete data:

Q(\phi^{\mathrm{new}} | \phi) = E\{ \log P(i, f | \phi^{\mathrm{new}}) \,|\, \phi, i \}   (5)
For the factorial MRF model, the parameters are φ = {W^l, C^l, θ^l_{\{p,q\}}}. For the sake of simplicity, however, we will proceed with setting θ^l_{\{p,q\}} = const.
4.1 M-Step
The M-step for each parameter is obtained by setting the derivative of Q with respect to that parameter to zero and solving. For the opaque layers, we have:

W^{l\,\mathrm{new}} = \frac{\sum_p i_p \langle M^l_p \rangle}{\sum_p \langle M^l_p \rangle}, \qquad
C^{l\,\mathrm{new}} = \sum_p i_p i_p^{\top} \langle M^l_p \rangle - \sum_{p,l} W^l \langle M^l_p \rangle i_p^{\top}
Note that, for L = 1, these equations reduce to the standard single-layered MRF with similar observable. (See the appendix for the derivation.)
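A hedged sketch of the closed-form M-step for scalar pixel values: the W^l update follows the formula above, while the C^l update is written as the corresponding visibility-weighted variance rather than a literal transcription of the expression in the text. The names and the small eps guard are illustrative.

```python
import numpy as np

def m_step_opaque(images, M_exp, eps=1e-12):
    """images: (P,) observed pixel values i_p; M_exp: (L, P) expectations <M^l_p>."""
    weights = M_exp.sum(axis=1) + eps
    W = (M_exp @ images) / weights                 # W^l = sum_p i_p <M^l_p> / sum_p <M^l_p>
    diff2 = (images[None, :] - W[:, None]) ** 2
    C = (M_exp * diff2).sum(axis=1) / weights      # per-layer (visibility-weighted) variance
    return W, C
```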
Fig. 3. Structural approximation: (a)Hidden variables retain their Markov structure within each chain, but are independent across chains [6]. (b)Hidden variables retain their Markov structure within each layer, but are independent across layers.
4.2 E-Step
As in the factorial HMM [6], the factorial MRF is approximated by L uncoupled MRF's (see Figure 3). We approximate the posterior distribution P by a tractable distribution Q. We write the structured approximation as:

Q(f) = \prod_p \prod_l h^l_p \prod_l P(f^l)   (6)
Comparing to Eq. (4), we note that h^l_p has taken the place of P(i_p | f^l_p), which h^l_p is expected to approximate. From now on, we will call h^l_p a pseudo-observable, because it plays a similar role to the observables in the standard MRF. We can safely regard h^l_p as the likelihood of f^l_p, i.e., h^l_p = 0.2 implies that the likelihood of f^l_p = 1 (present) is 0.2, and that of f^l_p = 0 (absent) is 0.8. We describe how to evaluate h^l_p in the next subsection, which leads to the calculation of each variable's maximum a posteriori (MAP) estimate (f*), which in turn leads to the evaluation of each variable's expectation ⟨f⟩. Pseudo-observable: To evaluate the pseudo-observable h^l_p for the entire set of layers, we calculate h^l_p for each layer under the assumption that the probabilities of f^{l'}_p for the other layers are given. In that case, we can evaluate the effect on the observable of setting f^l_p for the desired layer. Thus we have:

h^l_p = P(i_p | f^l_p) = \sum_{l'} P(i_p | M^{l'}_p, f^l_p) P(M^{l'}_p | f^l_p) = \sum_{l'} P(i_p | M^{l'}_p) P(M^{l'}_p | f^l_p),
where

P(M^{l'}_p \mid f^l_p) = \begin{cases} \prod_{l''=l'}^{L} (1 - f^{l''}_p) & \text{if } l' \ge l \\ 0 & \text{otherwise} \end{cases}
MAP Estimate: The above fixed-point equations for h^l_p require the evaluation of ⟨f^l_p⟩ in terms of h^l_p. Unlike the case of factorial HMM's, where there is a tractable forward-backward algorithm, we do not have a tractable algorithm to evaluate ⟨f^l_p⟩ from h^l_p. This is because inference in even a single MRF is intractable unless the joint distribution is also Gaussian, in which case an analytic solution is available. Thus we use an algorithm based on the MAP estimates, which can be rapidly obtained via graph cuts [7]. Parallel to the case of factorial HMM's, we find the MAP estimates considering the h^l_p as the observables. In our experiments, we used the graph cut algorithm of [2], which is specifically designed for the kind of graphs that arise in computer vision problems.

Expectation: Given the MAP estimates f* obtained above, we can approximate ⟨f⟩ as

\langle f^l_p \rangle = \sum_{f^l_p} f^l_p P(f^l_p \mid i) \approx \sum_{f^l_p} f^l_p \delta(f^l_p = f^{l*}_p) = f^{l*}_p,

where we assume that P(f^l_p | i) ≈ δ(f^l_p = f^{l*}_p). Unfortunately, this is too crude an approximation (it is basically binary thresholding). We can make a better approximation using the MRF priors:

\langle f^l_p \rangle = \sum_{f^l_p} f^l_p P(f^l_p \mid i)
= \sum_{f^l_p} \sum_{f^l_{N_p}} f^l_p P(f^l_p \mid f^l_{N_p}) P(f^l_{N_p} \mid i)
\approx \sum_{f^l_p} \sum_{f^l_{N_p}} f^l_p P(f^l_p \mid f^l_{N_p}) \delta(f^l_{N_p} = f^{l*}_{N_p})
= \sum_{f^l_p} f^l_p P(f^l_p \mid f^{l*}_{N_p}),

where we assume that P(f^l_{N_p} | i) ≈ δ(f^l_{N_p} = f^{l*}_{N_p}). It is possible to relax the crudeness of this assumption by setting P(f^l_{N_{N_p}} | i) ≈ δ(f^l_{N_{N_p}} = f^{l*}_{N_{N_p}}), or P(f^l_{N_{N_{N_p}}} | i) ≈ δ(f^l_{N_{N_{N_p}}} = f^{l*}_{N_{N_{N_p}}}), but there is a price to pay in terms of
computational load. However, we find that the above 1st-order neighbor approximation suffices. To summarize, we need the pseudo-observable (h) in order to evaluate the expectation (f ), whereas we need the expectation (f ) in order to evaluate the pseudo-observable (h). However the evaluation is done indirectly through the MAP estimates (f ∗ ). Thus the E step eventually consists of iteration through above three steps until convergence. Algorithm 1 shows the overall algorithm.
Algorithm 1 Overall algorithm
1: Initialize parameters and hidden variables randomly
2: repeat
3:   Update W^l, C^l
4:   repeat
5:     Calculate pseudo-observable (h) from ⟨f⟩
6:     Calculate MAP estimates (f*) from h by graph cuts for binary segmentation
7:     Calculate expectation ⟨f⟩ from f* by 1st-order neighbor approximation and MRF prior
8:   until convergence
9: until convergence
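Algorithm 1 can be summarized as the following Python skeleton. The four model-specific steps are passed in as callables, since the wide-inference step in the paper relies on the graph-cut solver of [2], which is not reproduced here; all names are illustrative rather than the authors' implementation.

```python
import numpy as np

def factorial_mrf_em(images, n_layers, m_step, pseudo_obs, map_step, expectation,
                     n_outer=10, n_inner=5, seed=0):
    """Schematic version of Algorithm 1. The callables supply the M-step, the
    deep-inference (pseudo-observable) step, the graph-cut MAP step, and the
    1st-order neighbor expectation, respectively."""
    rng = np.random.default_rng(seed)
    f_exp = (rng.random((n_layers, images.size)) < 0.5).astype(float)  # line 1
    params = None
    for _ in range(n_outer):                        # outer loop (lines 2-9)
        params = m_step(images, f_exp)              # line 3: update W^l, C^l
        for _ in range(n_inner):                    # inner loop (lines 4-8)
            h = pseudo_obs(images, f_exp, params)   # line 5: deep inference
            f_map = map_step(h)                     # line 6: graph cuts in the paper
            f_exp = expectation(f_map)              # line 7: neighbor approximation
    return f_exp, params
```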
Convergence: One of the underlying assumptions of the theory of the EM algorithm is the use of an exact E-step [4]. The exact EM algorithm maximizes the log likelihood with respect to the posterior probability distribution over the hidden variables given the parameters. The structural approximation algorithm, on the other hand, does the same job with the additional constraint that the posterior probability distribution over the hidden variables is of a particular tractable form [6], such as Q in Eq. (6) for instance. The convergence argument for our approach is slightly more complicated than for the standard structural approximation algorithm. In our case, the E-step is not exact even in a single MRF layer, as opposed to the exact E-step in a single HMM chain used in factorial HMM's. We can use Monte Carlo sampling methods based on Markov chains for each MRF layer [8], which offer the theoretical assurance that the sampling procedure will ultimately converge to the correct posterior distribution. This is not a particularly attractive approach though, since inference on a single MRF layer is not the overall goal of our algorithm, but a subroutine. Although we do not have a formal justification of convergence per se (which remains for future work), the experimental results offer strong evidence for the convergence of our approach.
Table 1. Two different decompositions: Depending on the initial guess of the intensity in the M-step, the algorithm can end up with two different decompositions.
Table 2. E-steps: The bottom rows show the final results after convergence. (a) box-box-box synthetic image. (b) box-box-grid synthetic image. (c) face-box-box synthetic image with an appearance model for the face.
5 Experiments
5.1 Synthetic Image
We did experiments on a couple of synthetic images, shown in (a) and (b) of Table 2. Without any prior knowledge, depending on the initial guess of the intensity in the M-step, the algorithm can end up with two different decompositions (see Table 1). In this case, result 1 is more "natural", but without prior knowledge, result 2 is not terrible either. The question of how to incorporate some bias toward the more natural decomposition (i.e., the closer, the brighter) remains for future work. Also, it is relatively straightforward to incorporate an appearance model into our probabilistic formulation, as shown in (c) of Table 2.

5.2 Real Image
We used the disparity maps of the flower garden images and the Tsukuba stereo images from a recent algorithm for computing motion and stereo with occlusions [11], as shown in Table 3 and Table 4 respectively. Notice that the result depends on the number of layers (which the user must specify), as shown in Table 3. Although both give a reasonable layer decomposition, the result for three layers gives
a better result, i.e., the final composition is closer to the input. Our algorithm deals only with layer occupancy, not layer texture, as shown in Table 3 (c) and (e), and Table 4 (d) and (f). We can either (1) incorporate texture into our formulation, or (2) use an image inpainting method such as [13].
Table 3. Layer analysis result on the flower garden sequence. (a) Left image. (b) Disparity map adopted from [11]. We run our algorithm on this disparity map, not on the original image. (c),(d) Extracted layers for the two-layer decomposition. (e),(f),(g) Extracted layers for the three-layer decomposition. Depending on the number of layers that the user specifies, the algorithm can end up with two different decompositions.
6 Conclusion
We have proposed an extension to the standard Markov Random Field (MRF) model suitable for layer representation. This new graphical model is a natural model of the image composition process of several objects. We also presented an efficient EM-based algorithm for inference. Our inference algorithm makes use of the fact that each MRF layer is a natural substructure of the whole factorial MRF. Although inference on each MRF layer is intractable in a strict sense, we developed an efficient inference procedure based on the graph-cuts algorithm for binary segmentation. As Tables 3 and 4 show, our algorithm can decompose an image composed of several layers in a reasonable way, as long as the user specifies an appropriate number of layers, which makes automatic determination of the number of layers a natural extension of this work.
Acknowledgments. We wish to thank Yuri Boykov and Vladimir Kolmogorov for the use of their software, and Rick Szeliski for providing us with imagery. This research was supported by NSF grants IIS-9900115 and CCR-0113371, and by a grant from Microsoft Research.

Table 4. Layer extraction result on the Tsukuba dataset: (a) Left image. (b) Disparity map adopted from [11]. (c) First layer. (d) Second layer. (e) Third layer. The left part demonstrates an example where our algorithm gives an arbitrary result for a completely occluded region without any prior knowledge. (f) Fourth layer. Notice a small area with unrecovered texture because layer 1 occludes it. (g) Fifth layer. (h) Sixth layer. We do not show the background layer here, which covers the entire image.
7 Appendix
7.1 M Step
Opaque layer: We start by expanding Q:

Q = -\frac{1}{2} \sum_p \Bigl[ i_p^{\top} \langle C(f_p)^{-1} \rangle i_p - 2 i_p^{\top} \langle C(f_p)^{-1} F(f_p) \rangle + \langle F(f_p)^{\top} C(f_p)^{-1} F(f_p) \rangle \Bigr] + \sum_l \sum_{\{p,q\} \in \varepsilon_N} -2\theta_{\{p,q\}} \langle \delta(f^l_p = f^l_q) \rangle - \log Z   (7)
From the model of opaque layers we have:

P(i_p | f_p) = N(i_p; F(f_p), C(f_p)^{-1}),
F(f_p) = \sum_l W^l M^l_p,
C(f_p)^{-1} = \sum_l C^{l\,-1} M^l_p,
M^l_p = f^l_p \prod_{l'=l+1}^{L} (1 - f^{l'}_p)
The average values are given by:

\langle C(f_p)^{-1} \rangle = \sum_l C^{l\,-1} \langle M^l_p \rangle,
\langle C(f_p)^{-1} F(f_p) \rangle = \sum_l C^{l\,-1} W^l \langle M^l_p \rangle,
\langle F(f_p)^{\top} C(f_p)^{-1} F(f_p) \rangle = \sum_l \mathrm{tr}\{ W^{l\top} C^{l\,-1} W^l \, \mathrm{diag}\{\langle M^l_p \rangle\} \}
By taking derivatives of Q with respect to the parameters, we get the parameter estimation equations:

\frac{\partial Q}{\partial W^l} = \sum_p C^{l\,-1} i_p \langle M^l_p \rangle - \sum_p C^{l\,-1} W^l \langle M^l_p \rangle = 0 \;\Rightarrow\; W^{l\,\mathrm{new}} = \frac{\sum_p i_p \langle M^l_p \rangle}{\sum_p \langle M^l_p \rangle}

\frac{\partial Q}{\partial C^{l\,-1}} = -\frac{1}{2}\sum_p i_p i_p^{\top} \langle M^l_p \rangle + \sum_p W^l \langle M^l_p \rangle i_p^{\top} - \frac{1}{2}\sum_p W^l W^{l\top} \langle M^l_p \rangle = 0 \;\Rightarrow\; C^{l\,\mathrm{new}} = \sum_p i_p i_p^{\top} \langle M^l_p \rangle - \sum_{p,l} W^l \langle M^l_p \rangle i_p^{\top}
Transparent layer: Similar to the case of opaque layers, we expand Q:

Q = -\frac{1}{2} \sum_p \Bigl[ i_p^{\top} C^{-1} i_p - 2 i_p^{\top} C^{-1} \langle F(f_p) \rangle + \langle F(f_p)^{\top} C^{-1} F(f_p) \rangle \Bigr] + \sum_l \sum_{\{p,q\} \in \varepsilon_N} -2\theta_{\{p,q\}} \langle \delta(f^l_p = f^l_q) \rangle - \log Z   (8)
From the model of transparent layers we have P(i_p | f_p) = N(i_p; F(f_p), C^{-1}), where F(f_p) = \sum_l W^l f^l_p. The average values are given by:

\langle F(f_p) \rangle = \sum_l W^l \langle f^l_p \rangle,
\langle F(f_p)^{\top} C^{-1} F(f_p) \rangle = \Bigl\langle \Bigl(\sum_l W^l f^l_p\Bigr)^{\top} C^{-1} \Bigl(\sum_{l'} W^{l'} f^{l'}_p\Bigr) \Bigr\rangle = \sum_{l,l'} \mathrm{tr}\{ W^{l\top} C^{-1} W^{l'} \langle f^l_p f^{l'}_p \rangle \}
By taking derivatives of Q with respect to the parameters, we get the parameter estimation equations:

\frac{\partial Q}{\partial W^l} = \sum_p \Bigl[ \sum_{l'} W^{l'} \langle f^{l'}_p f^l_p \rangle - i_p \langle f^l_p \rangle \Bigr] = 0 \;\Rightarrow\; W^{\mathrm{new}} = \Bigl(\sum_p i_p \langle f_p \rangle^{\top}\Bigr)\Bigl(\sum_p \langle f_p f_p^{\top} \rangle\Bigr)^{+}

\frac{C}{2} + \frac{\partial Q}{\partial C^{-1}} = \sum_p \Bigl[ \sum_l i_p \langle f^l_p \rangle W^{l\top} - \frac{1}{2} i_p i_p^{\top} - \frac{1}{2} \sum_{l,l'} W^l \langle f^l_p f^{l'}_p \rangle W^{l'\top} \Bigr] = 0 \;\Rightarrow\; C^{\mathrm{new}} = \sum_p i_p i_p^{\top} - \sum_{p,l} W^l \langle f^l_p \rangle i_p^{\top}
References
1. S. Ayer and H. Sawhney. Layered representation of motion video using robust maximum-likelihood estimation of mixture models and MDL encoding. In Proceedings of the International Conference on Computer Vision, pages 777–784, 1995.
2. Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in computer vision. In Proceedings of the International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, volume 2134 of LNCS, pages 359–374, 2001.
3. M. Brand, N. Oliver, and A. Pentland. Coupled hidden Markov models for complex action recognition. Technical Report 407, Vision and Modeling Group, MIT Media Lab, November 1996.
4. A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. B, 39:185–197, 1977.
5. B. J. Frey. Filling in scenes by propagating probabilities through layers and into appearance models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages I:185–192, 2000.
6. Z. Ghahramani and M. I. Jordan. Factorial hidden Markov models. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems, volume 8, pages 472–478. The MIT Press, 1996.
7. D. Greig, B. Porteous, and A. Seheult. Exact maximum a posteriori estimation for binary images. J. R. Statist. Soc. B, 51(2):271–279, 1989.
8. W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.
9. N. Jojic, N. Petrovic, B. Frey, and T. Huang. Transformed hidden Markov models: Estimating mixture models of images and inferring spatial transformations in video sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2000.
10. M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. An introduction to variational methods for graphical models. In M. I. Jordan, editor, Learning in Graphical Models. MIT Press, 1999.
11. V. Kolmogorov and R. Zabih. Computing visual correspondence with occlusions using graph cuts. In Proceedings of the International Conference on Computer Vision, pages 508–515, 2001.
12. S. Z. Li. Markov Random Field Modeling in Computer Vision. Springer-Verlag, Tokyo, 1995.
13. M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester. Image inpainting. In SIGGRAPH, pages 417–424, 2000.
14. P. H. S. Torr, R. Szeliski, and P. Anandan. An integrated Bayesian approach to layer extraction from image sequences. PAMI, 23(3):297–303, March 2001.
15. C. Vogler and D. Metaxas. Parallel hidden Markov models for American Sign Language recognition. In Proceedings of the International Conference on Computer Vision, pages 116–122, 1999.
16. J. Y. A. Wang and E. H. Adelson. Representing moving images with layers. IEEE Transactions on Image Processing, 3(5):625–638, September 1994.
17. Y. Weiss. Smoothness in layers: Motion segmentation using nonparametric mixture estimation. In Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., pages 520–526, 1997.
Evaluation and Selection of Models for Motion Segmentation
Kenichi Kanatani
Department of Information Technology, Okayama University, Okayama 700-8530, Japan
[email protected]
Abstract. We first present an improvement of the subspace separation for motion segmentation by newly introducing the affine space constraint. We point out that this improvement does not always fare well due to the effective noise it introduces. In order to judge which solution to adopt if different segmentations are obtained, we test two measures using real images: the standard F test, and the geometric model selection criteria.
1 Introduction
Segmenting individual objects from backgrounds is one of the most important computer vision tasks. A significant clue is provided by motion: humans can easily discern independently moving objects by seeing their motions without knowing their identities. To solve this problem, Costeira and Kanade [2] presented a segmentation algorithm based on feature tracking. Since then, various extensions have been proposed: Gear [3] used the reduced row echelon form and graph matching, Ichimura [4] applied the discrimination criterion of Otsu [14] and the QR decomposition for feature selection [5], and Inoue and Urahama [6] introduced fuzzy clustering. Costeira and Kanade [2] attributed their algorithm to the Tomasi-Kanade factorization [17], but the underlying principle is a simple fact of linear algebra: the image motion of points moving rigidly in the scene belongs to a 4-dimensional subspace [3,8,10]. Directly exploiting this subspace constraint, Kanatani [8,10] introduced model selection to it and showed that his method, which he called subspace separation, far outperforms the Costeira-Kanade algorithm and Ichimura's method [4]. His method has also been used for analyzing the effect of illumination on moving objects by Maki and Wiles [12] and Maki and Hattori [13]. In reality, the subspace constraint is a weak condition: the data points belong to a 3-dimensional affine space within that 4-dimensional subspace [3]. Using this stronger affine space constraint is expected to lead to more accurate segmentation, but all the Costeira-Kanade-type algorithms [2,3,4,5,6] are unable to exploit this constraint. This is because they rely on zero/nonzero discrimination of the elements of a matrix computed from the data. In contrast, the subspace separation [8,10] is based on the analysis in the original data space rather than particular matrix elements, so it can easily incorporate this constraint.
In this paper, we first present a new segmentation algorithm based on the affine space constraint, which we call affine space separation, and demonstrate by simulation that it indeed outperforms the subspace separation. We next show a rather surprising fact that the affine space separation does not always fare well in real situations. We assert that this is because for real data the degree of deviation from the affine space constraint is different from the degree of deviation from the subspace constraint; one may be larger or smaller than the other depending on situations. We clarify this relationship by analysis and simulation. Then, a question arises: if the subspace separation and the affine space separation produce different solutions for the same image sequence, which should we believe? To answer this question, one needs to evaluate the reliability of individual segmentations. Here, we test two measures using real images: the standard F test, and the geometric model selection criteria [9]. In Sec. 2, we summarize the subspace separation [8,10]. In Sec. 3, we describe our affine space separation and compare its performance with the subspace separation. In Sec. 4, we explain the performance difference in terms of the effective noise associated with the constraint. In Sec. 5, we present criteria for a posteriori evaluation of segmentation. In Sec. 6, we test these criteria using real images. In Sec. 7, we give our conclusion.
2 Subspace Separation
We first summarize the subspace separation [8,10], on which our new algorithm is built.

2.1 Subspace Constraint
We track N rigidly moving feature points over M images and let (x_{κα}, y_{κα}) be the image coordinates of the αth point in the κth frame. If all the image coordinates are stacked vertically into a 2M-dimensional vector in the form

p_α = ( x_{1α}  y_{1α}  x_{2α}  y_{2α}  ···  y_{Mα} )^{\top},   (1)

the image motion of the αth point is represented by a single point p_α in a 2M-dimensional space R^{2M}. The XYZ camera coordinate system is regarded as the world coordinate system with the Z-axis taken along the optical axis. Fix an arbitrary object coordinate system to the object, and let t_κ and {i_κ, j_κ, k_κ} be, respectively, its origin and orthonormal basis in the κth frame. Let (a_α, b_α, c_α) be the object coordinates of the αth point. Its position in the κth frame with respect to the world coordinate system is given by

r_{κα} = t_κ + a_α i_κ + b_α j_κ + c_α k_κ.   (2)

If orthographic projection is assumed, we have

(x_{κα}, y_{κα})^{\top} = t̃_κ + a_α ĩ_κ + b_α j̃_κ + c_α k̃_κ,   (3)
where t̃_κ, ĩ_κ, j̃_κ, and k̃_κ are the 2-dimensional vectors obtained from t_κ, i_κ, j_κ, and k_κ, respectively, by chopping off the third components. If the vectors t̃_κ, ĩ_κ, j̃_κ, and k̃_κ are stacked over the M frames vertically into 2M-dimensional vectors m_0, m_1, m_2, and m_3, respectively, the vector p_α has the form

p_α = m_0 + a_α m_1 + b_α m_2 + c_α m_3.   (4)
This implies that the N points {p_α} should belong to the 4-dimensional (linear) subspace spanned by the vectors {m_0, m_1, m_2, m_3}. This subspace constraint holds for all affine camera models including weak perspective and paraperspective [15], because eq. (2) holds irrespective of the metric condition [15,17] that demands {i_κ, j_κ, k_κ} be orthonormal.
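As a quick numerical check of the subspace constraint, the following Python sketch stacks (x, y) tracks into the 2M-dimensional vectors p_α of eq. (1) and measures how much energy lies outside the best rank-4 subspace; subtracting the mean and using r = 3 tests the affine space constraint instead. The function names are ours, not the paper's.

```python
import numpy as np

def trajectory_vectors(tracks):
    """Stack (x, y) tracks of shape (M, N, 2) into columns p_alpha of a (2M, N) matrix."""
    M, N, _ = tracks.shape
    return tracks.transpose(1, 0, 2).reshape(N, 2 * M).T

def subspace_residual(P, r=4):
    """Fraction of energy outside the best rank-r subspace of the columns of P;
    near zero under the affine camera model of eq. (4)."""
    s = np.linalg.svd(P, compute_uv=False)
    return (s[r:] ** 2).sum() / (s ** 2).sum()

# affine-space version: centre the data first and use r = 3
def affine_space_residual(P):
    return subspace_residual(P - P.mean(axis=1, keepdims=True), r=3)
```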
2.2 Subspace Separation Theorem
It follows that the motions of the feature points are segmented into independently moving objects if the N points in R^n (n = 2M) are grouped into distinct 4-dimensional subspaces. Inspired by the Tomasi-Kanade factorization [17], Costeira and Kanade [2] found that this can be done by zero/nonzero discrimination of the elements of a matrix computed from the data. Kanatani [8,10] pointed out that the underlying principle can be proved as a pure mathematical theorem independent of the Tomasi-Kanade factorization. We first reiterate this fact here. Let {p_α} be N points that belong to an r-dimensional subspace L ⊂ R^n. Define an N × N matrix G = (G_{αβ}) by

G_{αβ} = (p_α, p_β),   (5)
where (a, b) denotes the inner product of vectors a and b. Let λ_1 ≥ ··· ≥ λ_N be the eigenvalues of G, and {v_1, ..., v_N} the orthonormal system of the corresponding eigenvectors. Define an N × N matrix Q = (Q_{αβ}) by

Q = \sum_{i=1}^{r} v_i v_i^{\top}.   (6)
Divide the index set I = {1, ..., N } into m disjoint subsets Ii , i = 1, ..., m, and let Li be the subspace defined by the ith set {pα }, α ∈ Ii . If the m subspaces Li , i = 1, ..., m, are linearly independent, the following holds: Theorem 1. The (αβ) element of Q = (Qαβ ) is zero if the αth and βth points belong to different subspaces: Qαβ = 0,
α ∈ I_i,  β ∈ I_j,  i ≠ j.   (7)
This can be proved without reference to the Tomasi-Kanade factorization [17] as follows. For N (> n) vectors {p_α}, there exist infinitely many sets of numbers {c_1, ..., c_N}, not all zero, such that \sum_{α=1}^{N} c_α p_α = 0, but if the points {p_α}
belong to two subspaces L_1 and L_2 such that L_1 ⊕ L_2 = R^n, the set of such "annihilating coefficients" {c_α} ("null space" to be precise) is generated by those for which \sum_{p_α \in L_1} c_α p_α = 0 and those for which \sum_{p_α \in L_2} c_α p_α = 0. A formal proof is given in [8,10]. Maki and Wiles [12] and Maki and Hattori [13] found that in this abstract form this theorem allows one to analyze the effect of illumination on moving objects. In the presence of noise, however, all the elements of Q are nonzero in general. But if we progressively group points p_α and p_β for which |Q_{αβ}| is large and interchange the corresponding rows and columns of Q, we should end up with an approximately block-diagonal matrix. Costeira and Kanade [2] proposed this type of strategy, known as the greedy algorithm. Whatever realigning strategy is taken, however, all such algorithms [2,3,4,5,6] are severely vulnerable to noise, because segmentation is based on zero/nonzero discrimination of matrix elements, for which we do not know how small the nonzero elements might be in the absence of noise, yet small nonzero elements and large nonzero elements have the same meaning as long as they are nonzero. In the presence of noise, a small error in one datum can affect all the elements of the matrix in a complicated manner, so finding a suitable threshold is difficult even if the noise is known to be Gaussian with a known variance [3]. To avoid this difficulty, Kanatani [8,10] worked in the original data space rather than the matrix elements derived from it. Using synthetic and real images, he demonstrated that his method far outperforms the Costeira-Kanade algorithm [2] and Ichimura's method [4].
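The construction of eqs. (5)-(6) and the block structure asserted by Theorem 1 are easy to verify numerically. The sketch below is our own illustration, not code from [8,10]; it builds Q for points drawn from two independent 4-dimensional subspaces and checks that the cross block is numerically zero in the noise-free case.

```python
import numpy as np

def interaction_matrix(P, r):
    """Q = sum of the top-r eigenvector outer products of G = P^T P (eqs. (5)-(6))."""
    G = P.T @ P
    w, V = np.linalg.eigh(G)                 # eigenvalues in ascending order
    Vr = V[:, np.argsort(w)[::-1][:r]]       # top-r eigenvectors
    return Vr @ Vr.T

# toy check: 6 + 6 points from two independent 4-D subspaces of R^10
rng = np.random.default_rng(0)
P = np.hstack([rng.standard_normal((10, 4)) @ rng.standard_normal((4, 6))
               for _ in range(2)])
Q = interaction_matrix(P, r=8)
print(np.abs(Q[:6, 6:]).max())               # ~1e-15: Q_{ab} = 0 across the two subspaces
```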
3 Affine Space Separation
We now present a new segmentation algorithm based on a stronger constraint. 3.1
Affine Space Constraint
We have overlooked a crucial fact about eq. (4): the coefficient of m0 is identically 1 , implying that the N points {pα } should belong to a 3-dimensional affine space within the 4-dimensional subspace. This affine space constraint has long been known [3] but so far has not been utilized in any way. This is because no theorem corresponding to Theorem 1 is available for this constraint; as long as segmentation is based on zero/nonzero discrimination of matrix elements, there is no way to exploit this constraint. However, it can immediately be incorporated into the subspace separation [8,10], in which model selection and robust fitting are introduced in the original data space. We now replace the subspace constraint involved there by the affine space constraint; we call the resulting algorithm the affine space separation. Doing simulation, we will demonstrate that it indeed outperforms the subspace separation.
Evaluation and Selection of Models for Motion Segmentation
3.2
339
Affine Space Merging
Initially regarding individual points as groups consisting of one element, we successively merge two groups by fitting a 3-dimensional affine space to them. Let A i and A j be candidate affine spaces to merge, and let Ni and Nj be the respective numbers of points in them. Let JiA and JjA be the individual residuals, i.e., the sums of the squared A be distances of the data points to the fitted affine spaces A i and A j . Let Ji⊕j the residual that would result if a single affine space is fitted to the Ni + Nj A points. It is reasonable not to merge the two groups if Ji⊕j is much larger than A A A Ji + Jj . But how large should Ji⊕j be for this judgment? In fact, we always A have Ji⊕j ≥ JiA + JjA because a single affine space has fewer degrees of freedom to adjust than two affine spaces. It follows that we need to balance the increase in the residual against the decreases in the degree of freedom. For this purpose, Kanatani used his geometric AIC [7]. His method is translated into the affine space constraint as follows. We assume that the tracked feature points are disturbed from their true positions by independent Gaussian noise of mean zero and standard deviation , which we call the noise level . Since a 3-dimensional affine space in Rn has 4(n − 3) degrees of freedom1, the geometric AIC has the following form [7]: A = J + 2 3(N + N ) + 4(n − 3) 2 . (8) G-AICA i j i⊕j i⊕j If two 3-dimensional affine spaces are fitted to the Ni points and the Nj points separately, the degree of freedom is the sum of those for individual affine spaces. Hence, the geometric AIC is given as follows [7]: A A G-AICA = J + J + 2 3(N + N ) + 8(n − 3) 2 . (9) i j i j i,j A Merging A i and A j is reasonable if G-AICA i⊕j < G-AICi,j . However, this criterion can work only for Ni + Nj > 4. Also, all the affine spaces are included in subspaces of higher dimensions, so the matrix Q still provides information about the possibility of merging. Following Kanatani [8,10], we integrate these two criteria to define the following similarity measure between the affine spaces A i and A j : G-AICA i,j max |Qαβ |. (10) sij = A p ∈A ,p ∈A G-AICi⊕j α i β j
Two affine spaces with the largest similarity are merged successively until the number of affine spaces becomes a specified number m. If some of the resulting affine spaces contain less than four elements, they are taken as first candidates to be merged. For evaluating the geometric AIC, we need to estimate the noise level . This can be done using the knowledge that the points {pα } should belong to a 1
It is specified by four points in Rn , but they can move within that affine space into three directions. So, the degree of freedom is 4n − 4 × 3.
340
K. Kanatani
(a)
(b) Fig. 1. Image sequences of points in (a) 3-dimensional motion and (b) 2-dimensional motion.
(4m − 1)-dimensional affine space in Rn in the absence of noise. Let JtA be the residual for fitting a (4m − 1)-dimensional affine space to {pα }. Then, JtA /2 is subject to a χ2 distribution with (n − 4m + 1)(N − 4m) degrees of freedom [7], so we obtain the following unbiased estimator of 2 : ˆ2A = 3.3
JtA . (n − 4m + 1)(N − 4m)
(11)
Dimension Correction and Robust Fitting
We also incorporate two addtional techniques which proved very effective in the subspace separation [8,10]. The first is dimension correction: as soon as more than four elements are grouped together, we optimally fit a 3-dimensional affine space to them and replace the points with their projections onto the fitted affine space for computing the matrix Q. This effectively reduces the noise in the data if the local grouping is correct. The second technique is an a posteriori reallocation. Since a point that is misclassified in the course of merging never leaves that class, we attempt to remove outliers from the m resulting classes A1 , ..., Am by robust fitting. Points near the origin may be easily misclassified, so we select from each class A i half (but not less than four) of the elements that have large norms. We fit 3dimensional affine spaces A1 , ..., Am to them again and select from each class A i half (but not less than four) of the elements whose distances to the closest affine space A j , j =i, are large. We fit 3-dimensional affine spaces A1 , ..., Am to them again and allocate each data point to the closest one. Finally, we fit 3-dimensional affine spaces A 1 , ..., Am to the resulting point sets by LMedS [16]. Each data point is reallocated to the closest one.
Evaluation and Selection of Models for Motion Segmentation 40
40
35
35 30
30 1 2 3 4
25 20
25 20 15
15
1 2 3 4
10
10
5
5 0
341
0 0
0.2
0.4
0.6
(a)
ε
0.8
1
0
0.2
0.4
0.6
(b)
ε
0.8
1
Fig. 2. Misclassification ratio of segmentation vs. the noise level (pixels) for the 3-D motion of Fig. 1(a) and the 2-D motion of Fig. 1(b): (1) greedy algorithm, (2) Ichimura’s method, (3) subspace separation, and (4) affine space separation.
3.4 Experiments
Fig. 1(a) is a sequence of five simulated images (512 × 512 pixels) of 20 points in the background and 14 points in an object. They are independently moving in three dimensions. Fig. 1(b) is a sequence of five simulated images (512 × 512 pixels) of 20 points in the background and 5 points in an object. They are independently moving in two dimensions. We added Gaussian noise of mean 0 and standard deviation ε (pixels) to the coordinates of all the points and classified them into two groups. If the motion is 2-dimensional, the vector m_3 in eq. (4) can be taken to be identically zero, so the N points {p_α} should belong to a 2-dimensional affine space within the 3-dimensional subspace spanned by {m_0, m_1, m_2}. The algorithm described in the preceding sections can easily be tailored to this case (we omit the details). Fig. 2 plots the average ratio of misclassified points over 500 independent trials for different ε for the 3-dimensional motion (a) and the 2-dimensional motion (b): we compared (1) the greedy algorithm, (2) Ichimura's method [4], (3) the subspace separation [8,10], and (4) our affine space separation. As we see, the subspace separation indeed outperforms both the greedy algorithm and Ichimura's method, but the affine space separation further improves the accuracy. The improvement is particularly remarkable for the 3-dimensional motion; for the 2-dimensional motion the subspace separation already gives a nearly ideal solution. Thus, one is tempted to conclude that the affine space separation is better than the subspace separation. We now assert that this is not necessarily so.
4 Effective Noise
4.1 Effective Noise Level
The performance gain of the affine space separation is brought about by using a stronger constraint. In general, the stronger the constraint, the better the inference based on it provided it strictly holds. However, any constraint is an
Fig. 3. The perspective effects of the third frame of Fig. 1(a) for the angles of view α = 0◦ (orthographic projection), 40◦ , and 80◦ .
idealization of the phenomenon, and slight deviations from it are regarded as "noise", which we call effective noise to distinguish it from other sources of data inaccuracy (e.g., insufficient image resolution). Generally, the stronger the constraint, the less strictly it holds. In other words, the effective noise is usually larger for a stronger constraint than for a weaker constraint. Suppose we have correctly segmented the N points into m groups by subspace separation, and let J^S_1, ..., J^S_m be the residuals of fitting d-dimensional subspaces to the individual groups, where d = 3 or 4 depending on whether the motion is 2-dimensional or 3-dimensional. The effective noise level ε_S is defined in such a way that random noise of that magnitude would produce these residuals. Using the same logic as for deriving eq. (11), we obtain

\epsilon_S^2 = \frac{\sum_{i=1}^{m} J^S_i}{(n-d)(N-d)}.   (12)

Similarly, if J^A_1, ..., J^A_m are the residuals of fitting (d − 1)-dimensional affine spaces to the individual groups, the effective noise level ε_A is estimated from

\epsilon_A^2 = \frac{\sum_{i=1}^{m} J^A_i}{(n-d+1)(N-d)}.   (13)
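The effective noise levels of eqs. (12) and (13) are straightforward to compute from the per-group residuals; a minimal sketch (names ours):

```python
import numpy as np

def effective_noise_levels(J_sub, J_aff, n, N, d):
    """eps_S and eps_A from the subspace and affine-space residuals, as in eqs. (12)-(13)."""
    eps_S = np.sqrt(sum(J_sub) / ((n - d) * (N - d)))
    eps_A = np.sqrt(sum(J_aff) / ((n - d + 1) * (N - d)))
    return eps_S, eps_A
```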
4.2 Perspective Effects
Since both the subspace constraint and the affine space constraint are based on the affine camera model of eq. (4), the effective noise is not zero for a perspective camera even when no image noise exists. Moreover, it should be different for the subspace constraint and for the affine space constraint, since their strengths are different. To confirm this, we added perspective effects to the image sequence of Fig. 1(a) by changing the angle of view α with the field of view fixed. Fig. 3 shows the changes of the third frame of Fig. 1(a) for α = 0◦ , 40◦ , and 80◦ (α = 0◦ corresponds to orthographic projection). Fig. 4(a) plots the effective noise level vs. the angle of view α for the affine space constraint (solid line) and the subspace constraint (dashed line). We can see that the presence of perspective effects is equivalent to much larger noise for the affine space constraint than the subspace constraint, which is relatively insensitive to the angle of view.
Fig. 4. (a) Effective noise level vs. the angle of view α (degrees) for the affine space constraint (solid line) and the subspace constraint (dashed line). (b) Misclassification ratio of segmentation vs. the noise level (pixels) for affine space separation (solid lines) and subspace separation (dashed lines). For each, the angles of view are α = 0◦ , 40◦ , and 80◦ from bottom to top.
Fig. 4(b) shows the average misclassification ratio of the affine space separation and the subspace separation when Gaussian noise of standard deviation ε is added to the image coordinates for the angles of view α = 0°, 40°, and 80°. The result is quite different from Fig. 2(a), because in the presence of the perspective effects the noise level on the horizontal axis of Fig. 2(a) should be read as the sum of the effective noise and the random noise. For the affine space constraint, the effective noise is already high even when random noise is small, so the misclassification ratio is very high as compared with the subspace constraint. As random noise increases, however, the misclassification ratio of the subspace separation grows more rapidly than that of the affine space separation, because the latter has a higher capability of canceling random noise. Thus, we are compelled to conclude that neither the affine space separation nor the subspace separation is universally superior; their performance depends on the balance between the perspective effects and the data accuracy.
5 A Posteriori Evaluation of Segmentation
If neither of the two methods is always better than the other, and if the two methods produce different solutions, which should we believe? We present two criteria for evaluating the reliability of individual segmentations.

5.1 F Test for Segmentation
Suppose the N points {p_α} are segmented into m groups having N_i points, i = 1, ..., m. We first consider the subspace separation. Let J^S_i be the residual of fitting a d-dimensional subspace L_i to the ith group. The subspace L_i has codimension n − d and d(n − d) degrees of freedom. Hence, J^S_i/ε² should be subject to a χ² distribution with

\phi^S_i = (n-d)N_i - d(n-d) = (n-d)(N_i - d)   (14)

degrees of freedom [7].
Fig. 5. The residual of subspace fitting.
If we let J^S_t be the residual of fitting an md-dimensional subspace L_t to the entire N points {p_α}, J^S_t/ε² should be subject to a χ² distribution with

\phi^S_t = \sum_{i=1}^{m} (n-md)N_i - md(n-md) = (n-md)(N-md)   (15)
degrees of freedom irrespective of the correctness of the segmentation. The residual J^S_i is the sum of squared distances of the points of the ith group to the subspace L_i, which is the sum of their squared distances to the subspace L_t and the squared distances of their projections onto L_t to the subspace L_i (Fig. 5). Let us call the former the external distances, and the latter the internal distances. The sum of the squared internal distances for all the points is \sum_{i=1}^{m} J^S_i − J^S_t; the sum of the squared external distances is J^S_t. If the segmentation is correct (the null hypothesis), (\sum_{i=1}^{m} J^S_i − J^S_t)/ε² should also be subject to a χ² distribution. The noise that contributes to the internal distances and the noise that contributes to the external distances are orthogonal to each other, hence independent. So, (\sum_{i=1}^{m} J^S_i − J^S_t)/ε² has
\sum_{i=1}^{m} \phi^S_i - \phi^S_t = (m-1)d(N-md)   (16)
degrees of freedom [7]. It follows that

F_S = \frac{\bigl(\sum_{i=1}^{m} J^S_i - J^S_t\bigr)/(m-1)d(N-md)}{J^S_t/(n-md)(N-md)}   (17)
should be subject to an F distribution with (m−1)d(N−md) and (n−md)(N−md) degrees of freedom. If this segmentation is not correct (the alternative hypothesis), the internal distances will increase on average while the external distances are not affected. It follows that if

F^{(m-1)d(N-md)}_{(n-md)(N-md)}(\alpha) < F_S,   (18)

this segmentation is rejected with significance level α, where F^{(m-1)d(N-md)}_{(n-md)(N-md)}(\alpha) is the upper αth percentile of the F distribution with (m − 1)d(N − md) and (n − md)(N − md) degrees of freedom.
We can do the same for the affine space separation. Let J_i^A be the residual of fitting a (d − 1)-dimensional affine space A_i to the ith group. The affine space A_i has codimension n − d + 1 and d(n − d + 1) degrees of freedom. Hence, J_i^A/ε² should be subject to a χ² distribution with

$$\phi_i^A = (n - d + 1)N_i - d(n - d + 1) = (n - d + 1)(N_i - d) \tag{19}$$
degrees of freedom. If we let J_t^A be the residual of fitting an (md − 1)-dimensional affine space A_t to the entire N points {p_α}, J_t^A/ε² should be subject to a χ² distribution with

$$\phi_t^A = \sum_{i=1}^{m} (n - md + 1)N_i - md(n - md + 1) = (n - md + 1)(N - md) \tag{20}$$

degrees of freedom irrespective of the correctness of the segmentation. If this segmentation is correct, $(\sum_{i=1}^{m} J_i^A - J_t^A)/\varepsilon^2$ should also be subject to a χ² distribution with

$$\sum_{i=1}^{m} \phi_i^A - \phi_t^A = (m - 1)d(N - md) \tag{21}$$
degrees of freedom. It follows that

$$F^A = \frac{\left(\sum_{i=1}^{m} J_i^A - J_t^A\right)/(m - 1)d(N - md)}{J_t^A/(n - md + 1)(N - md)} \tag{22}$$
should be subject to an F distribution with (m − 1)d(N − md) and (n − md + 1)(N − md) degrees of freedom. Hence, if

$$F^{(m-1)d(N-md)}_{(n-md+1)(N-md)}(\alpha) < F^A, \tag{23}$$

this segmentation is rejected with significance level α.

5.2 Model Selection for Segmentation
The result of an F test depends on the significance level, which we can arbitrarily set. In contrast, model selection can dispense with any thresholds. Here, we apply the geometric AIC and the geometric MDL [9] (see [18] for other types of model selection criterion). We first consider the subspace separation. The geometric AIC and the geometric MDL for the model that the segmentation is correct are, respectively, $\sum_{i=1}^{m} \text{G-AIC}_i^S$ and $\sum_{i=1}^{m} \text{G-MDL}_i^S$, where G-AIC_i^S and G-MDL_i^S are given as follows:

$$\text{G-AIC}_i^S = J_i^S + 2\left(dN_i + d(n-d)\right)\varepsilon^2,$$
$$\text{G-MDL}_i^S = J_i^S - \left(dN_i + d(n-d)\right)\varepsilon^2 \log\left(\frac{\varepsilon}{L}\right)^2. \tag{24}$$

Here, L is a reference length, for which we can use an arbitrary value whose order is approximately the same as the data, say the image size; the model selection result is not affected as long as it has the same order of magnitude [9].
The geometric AIC and the geometric MDL for the general model that the points {p_α} can be somehow segmented into m motions are given as follows:

$$\text{G-AIC}_t^S = J_t^S + 2\left(mdN + md(n-md)\right)\varepsilon^2,$$
$$\text{G-MDL}_t^S = J_t^S - \left(mdN + md(n-md)\right)\varepsilon^2 \log\left(\frac{\varepsilon}{L}\right)^2. \tag{25}$$

Since the latter model is correct irrespective of the segmentation result, we can estimate the noise level from it, using

$$\hat{\varepsilon}^2 = \frac{J_t^S}{(n - md)(N - md)}. \tag{26}$$
The condition that the segmentation is not correct is $\text{G-AIC}_t^S < \sum_{i=1}^{m} \text{G-AIC}_i^S$ or $\text{G-MDL}_t^S < \sum_{i=1}^{m} \text{G-MDL}_i^S$, which are rewritten, respectively, as

$$2 < F^S, \qquad -\log\left(\frac{\hat{\varepsilon}}{L}\right)^2 < F^S, \tag{27}$$

where F^S is the F statistic given by eq. (17). Thus, model selection reduces to the F test, the difference being that the threshold is given without specifying any significance level. When the noise is small, −log(ε̂/L)² is usually larger than 2, so the geometric AIC is more conservative than the geometric MDL, which is more confident of the particular result.

Next, we consider the affine space separation. The geometric AIC and the geometric MDL of the ith group are

$$\text{G-AIC}_i^A = J_i^A + 2\left((d-1)N_i + d(n-d+1)\right)\varepsilon^2,$$
$$\text{G-MDL}_i^A = J_i^A - \left((d-1)N_i + d(n-d+1)\right)\varepsilon^2 \log\left(\frac{\varepsilon}{L}\right)^2. \tag{28}$$

The geometric AIC and the geometric MDL for the general model are

$$\text{G-AIC}_t^A = J_t^A + 2\left((md-1)N + md(n-md+1)\right)\varepsilon^2,$$
$$\text{G-MDL}_t^A = J_t^A - \left((md-1)N + md(n-md+1)\right)\varepsilon^2 \log\left(\frac{\varepsilon}{L}\right)^2. \tag{29}$$
The noise level is estimated from the general model, using

$$\hat{\varepsilon}^2 = \frac{J_t^A}{(n - md + 1)(N - md)}. \tag{30}$$

The condition that the segmentation is not correct is $\text{G-AIC}_t^A < \sum_{i=1}^{m} \text{G-AIC}_i^A$ or $\text{G-MDL}_t^A < \sum_{i=1}^{m} \text{G-MDL}_i^A$, which reduces to

$$2 < F^A, \qquad -\log\left(\frac{\hat{\varepsilon}}{L}\right)^2 < F^A, \tag{31}$$

where F^A is the F statistic given by eq. (22).
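Because both criteria reduce to comparing the F statistic against a fixed threshold, the model selection step of eqs. (27) and (31) can be sketched as follows. This is an illustrative snippet, not the author's code; the function, its arguments and the reference length L are assumptions for the example.

```python
import numpy as np

def segmentation_accepted(F_stat, J_t, n, d, m, N, L, affine=False):
    """Conditions (27)/(31): a criterion rejects the segmentation when its
    threshold falls below the F statistic of eq. (17) or eq. (22)."""
    codim = (n - m * d + 1) if affine else (n - m * d)
    eps2 = J_t / (codim * (N - m * d))          # noise level estimate, eq. (26)/(30)
    aic_threshold = 2.0                          # geometric AIC threshold
    mdl_threshold = -np.log(eps2 / L**2)         # geometric MDL: -log(eps_hat/L)^2
    return {"geometric AIC": not (aic_threshold < F_stat),
            "geometric MDL": not (mdl_threshold < F_stat)}
```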
Fig. 6. Real images of moving objects (above) and the selected feature points (below).
6 Real Image Experiments
Fig. 6 shows a sequence of perspectively projected images (above) and feature points manually selected from them (below). Three objects are fixed in the scene, so they are moving rigidly with the scene, while one object is moving relative to the scene. The image size is 512 × 768 pixels. We applied the greedy algorithm and Ichimura's method [4] and found that the resulting segmentations contained some errors. However, the subspace separation and the affine space separation both resulted in the correct segmentation. We then evaluated the reliability of this segmentation and observed the following:
1. The effective noise levels computed from eqs. (12) and (13) give ε_S = 0.64 (pixels) and ε_A = 0.94 (pixels), confirming our prediction that the affine space separation has larger effective noise than the subspace separation.
2. The F statistics of eqs. (17) and (22) are F^S = 0.893 and F^A = 1.483. The corresponding upper 5th percentiles are 1.346 and 1.293, respectively, so with significance level 5% the subspace separation is not rejected but the affine space separation is rejected.
3. From eqs. (27) and (31), the geometric AIC selects both segmentations as correct. The subspace separation passes the test with a larger margin than the affine space separation.
4. Letting L = 600, we have −log(ε̂/L)² = 13.5 and 13.1 for the subspace separation and the affine space separation, respectively, so the geometric MDL again selects both segmentations as correct. The geometric MDL accepts the segmentation result much more leniently than the geometric AIC.
All these tests indicate that for this image sequence the subspace separation is more reliable than the affine space separation. In other words, although both segmentations happen to be correct for the current occurrence of noise, the subspace separation has a larger chance of being correct than the affine space separation if the noise occurred differently. To test this prediction, we did a bootstrap experiment [1]: we added random Gaussian noise of mean 0 and standard deviation 1 (pixel) to the coordinates of the detected feature points in Fig. 6 and did the segmentation 500 times, each time using different noise. We found that the subspace separation was correct 100% of the time while the affine space separation was correct only for 90.9% of the trials. We then increased the standard deviation to 2 (pixels) and found that the subspace
separation was still correct 100% of the time while the affine space separation was correct for only 85.3% of the trials. Thus, we have confirmed that if the two methods produced different segmentations we would have to believe the subspace segmentation for this image sequence, for which the perspective effect is relatively strong while the feature detection is relatively accurate (since we did manual selection).
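The bootstrap procedure described above is easy to reproduce schematically. In the sketch below, `segment` is a hypothetical stand-in for either separation method and `true_labels` for the correct grouping; neither corresponds to code from the paper.

```python
import numpy as np

def bootstrap_success_rate(points, true_labels, segment, sigma=1.0, trials=500, seed=0):
    """Add zero-mean Gaussian noise of standard deviation `sigma` (pixels) to the
    image coordinates and count how often the method recovers the correct grouping."""
    rng = np.random.default_rng(seed)
    successes = 0
    for _ in range(trials):
        noisy = points + rng.normal(0.0, sigma, size=points.shape)
        labels = segment(noisy)                    # subspace or affine space separation
        if np.array_equal(labels, true_labels):    # assuming a fixed label ordering
            successes += 1
    return successes / trials
```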
7 Concluding Remarks
In this paper, we first presented an improvement, in theory, of the subspace separation [8,10] for motion segmentation by incorporating the affine space constraint and demonstrated by simulation that it outperforms the subspace separation. Next, we pointed out that this gain in performance is due to the stronger constraint based on the affine camera model and that this introduces effective noise that accounts for the deviation from the model. We have confirmed this by simulation. Then, we presented two criteria for evaluating the reliability of a given segmentation: the standard F test, and the geometric model selection criteria [9]. Using real images, we have demonstrated that these criteria enable us to judge which solution to adopt if the two methods result in different segmentations. We have observed that the affine space separation is more accurate than the subspace separation when the perspective effects are weak and the noise in the data is large, while the subspace separation is more suited for images with strong perspective effects with small noise. In practice, we should apply both methods, and if the results are different, we should select the more reliable one indicated by our proposed criteria. In this paper, we have assumed that the number of independent motions is known. Model selection can also provide a powerful tool for estimating it (see [10,11] for some results).

Acknowledgements. The author thanks Noriyoshi Kurosawa of FSAS Network Solutions Inc., Japan, and Chikara Matsunaga of FOR-A, Co. Ltd, Japan, for doing numerical simulations and real image experiments. He also thanks Atsuto Maki of Toshiba Co., Japan, and Naoyuki Ichimura of AIST, Japan, for helpful comments. This work was supported in part by the Ministry of Education, Culture, Sports, Science and Technology, Japan, under a Grant in Aid for Scientific Research C(2) (No. 13680432), the Support Center for Advanced Telecommunications Technology Research, and Kayamori Foundation of Informational Science Advancement.
References

1. B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap, Chapman & Hall, New York, 1993.
2. J. P. Costeira and T. Kanade, A multibody factorization method for independently moving objects, Int. J. Comput. Vision, 29-3 (1998), 159–179.
3. C. W. Gear, Multibody grouping from motion images, Int. J. Comput. Vision, 29-2 (1998), 133–150.
4. N. Ichimura, Motion segmentation based on factorization method and discriminant criterion, Proc. 7th Int. Conf. Comput. Vision, September 1999, Kerkyra, Greece, pp. 600–605.
5. N. Ichimura, Motion segmentation using feature selection and subspace method based on shape space, Proc. 15th Int. Conf. Pattern Recog., September 2000, Barcelona, Spain, Vol. 3, pp. 858–864.
6. K. Inoue and K. Urahama, Separation of multiple objects in motion images by clustering, Proc. 8th Int. Conf. Comput. Vision, July 2001, Vancouver, Canada, Vol. 1, pp. 219–224.
7. K. Kanatani, Statistical Optimization for Geometric Computation: Theory and Practice, Elsevier Science, Amsterdam, The Netherlands, 1996.
8. K. Kanatani, Motion segmentation by subspace separation and model selection, Proc. 8th Int. Conf. Comput. Vision, July 2001, Vancouver, Canada, Vol. 2, pp. 301–306.
9. K. Kanatani, Model selection for geometric inference, plenary talk, Proc. 5th Asian Conf. Comput. Vision, January 2002, Melbourne, Australia, Vol. 1, pp. xxi–xxxii.
10. K. Kanatani, Motion segmentation by subspace separation: Model selection and reliability evaluation, Int. J. Image Gr., Vol. 2 (2002), to appear.
11. K. Kanatani and C. Matsunaga, Estimating the number of independent motions for multibody segmentation, Proc. 5th Asian Conf. Comput. Vision, January 2002, Melbourne, Australia, Vol. 1, pp. 7–12.
12. A. Maki and C. Wiles, Geotensity constraint for 3D surface reconstruction under multiple light sources, Proc. 6th Euro. Conf. Comput. Vision, June–July 2000, Dublin, Ireland, Vol. 1, pp. 725–741.
13. A. Maki and H. Hattori, Illumination subspace for multibody motion segmentation, Proc. IEEE Conf. Comput. Vision Pattern Recog., December 2001, Kauai, Hawaii, Vol. 2, pp. 11–17.
14. N. Otsu, A threshold selection method from gray-level histograms, IEEE Trans. Sys. Man Cyber., 9-1 (1979), 62–66.
15. C. J. Poelman and T. Kanade, A paraperspective factorization method for shape and motion recovery, IEEE Trans. Pat. Anal. Mach. Intell., 19-3 (1997), 206–218.
16. P. J. Rousseeuw and A. M. Leroy, Robust Regression and Outlier Detection, Wiley, New York, 1987.
17. C. Tomasi and T. Kanade, Shape and motion from image streams under orthography—A factorization method, Int. J. Comput. Vision, 9-2 (1992), 137–154.
18. P. H. S. Torr, A. Zisserman and S. J. Maybank, Robust detection of degenerate configurations while estimating the fundamental matrix, Comput. Vision Image Understand., 71-3 (1998), 312–333.
Surface Extraction from Volumetric Images Using Deformable Meshes: A Comparative Study

Jussi Tohka

DMI / Institute of Signal Processing, Tampere University of Technology, P.O. Box 553, FIN-33101, Finland
[email protected]
Abstract. Deformable models are by their formulation able to solve the surface extraction problem from noisy volumetric images. This is because they use image-independent information, in the form of internal energy or internal forces, in addition to image data to achieve the goal. However, it is not a simple task to deform initially given surface meshes into a good representation of the target surface in the presence of noise. Several methods to do this have been proposed, and in this study a few recent ones are compared. Basically, we supply an image and an arbitrary but reasonable initialization and examine how well the target surface is captured with different methods for controlling the deformation of the mesh. Experiments with synthetic images as well as medical images are performed, and the results are reported and discussed. With synthetic images, the quality of the results is also measured quantitatively. No optimal method was found, but the properties of the different methods in distinct situations were highlighted.
1 Introduction
Deformable models are by their formulation able to solve ill-posed problems in image processing and analysis. This is due to additional image-independent soft and hard constraints that are set. For the surface extraction problem, these state, in the basic setting, that the resulting surface should be smooth. The ability to deal with noisy and poor-contrast images makes three-dimensional deformable surface models particularly interesting for automated medical image analysis: Obviously, the use of model-based techniques is necessary for analyzing medical images because of imaging artifacts and noise. On the other hand, the large variability of biological shapes restricts the use of shape models with only a few parameters. Consequently, deformable surface meshes - which can be used to extract surfaces of highly variable shapes - are popular tools in automated medical image analysis. For deformable surface models in general and their applications to medical images, see the surveys [8,9]. Deformable surface meshes are polygon meshes whose geometry evolves driven by image data and shape regularity constraints starting from given initial
meshes. There are two basic approaches to controlling the evolution of the mesh. The force-based approach assigns an equation of motion to each vertex of the mesh (or mexel) and lets it move in a force field formed by internal and external forces. Internal forces derive from the geometry of the mesh itself. They guarantee the smoothness of the final mesh. External forces are derived from the image data and draw the deformable mesh towards salient image features. On the other hand, the energy-based approach assigns an energy to each mesh belonging to the set of admissible meshes. The energy is minimized and its minimum argument is the resulting mesh. Again, the energy is divided into external and internal energies. The internal energy is derived from the geometry of the mesh and the external energy is derived from the image data.

If the initializations for the processes are not in the close vicinity of the target, advanced solutions for both approaches are required. Especially when images are noisy, the energy function is likely to have multiple local minima, and likewise the equations of motion may stabilize before the correct surface has been found. Methods to minimize the energy globally have been proposed, as well as techniques for constructing external forces so that they are sensitive neither to image noise nor to the positioning of the initial surface.

In this paper, a few recent methods for minimizing the energy and techniques for advanced construction of external forces are compared experimentally. We fix the internal energy/force model and the image data and see how the different methods cope given a generic (i.e. automatic) but reasonable initialization. The chosen deformable meshing techniques are compared with synthetic images in order to obtain quantitative results and with medical images to link the findings to real-world situations. To our knowledge, this kind of empirical study does not exist in the context of deformable surface meshes. With 2-D deformable models some quantitative comparisons have been reported, at least in [3,15].

There are approaches to the problem which are relevant to the study but are intentionally left out of it. For example, algorithms based on multi-resolution image representations (see e.g. [7]) are not considered in order to keep the size of the study reasonable. The three-dimensional generalization of the snakes algorithm [6] is not included, since its limitations are well known. Furthermore, we do not consider algorithms that automatically adapt the topology of the mesh; hence the topology of the target surface as well as the resolution of the final mesh are to be fixed in advance.
2 Deformable Surface Mesh
In this study, simplex meshes [1] are applied for surface representation. In a simplex mesh each vertex of the mesh, or mexel, has exactly three neighbors. In other words, their graphs are 3-regular. Triangular meshes, which are topological duals of simplex meshes, are also often applied in deformable meshing, see e.g. [7]. Meshes are denoted by M = (V, G), where V = {v_1, . . . , v_N} ⊂ R³ is the set of mexels and G is the graph of the mesh. The three mexels adjacent to v_i are denoted by v_{i_j}, j = 1, 2, 3. The unit normal to the mesh at v_i is denoted by n_i.
The input for a deformable mesh, image I, is a preprocessed version of an image to be analyzed. In I, voxels are given an intensity value based on their saliency, which depends on the application. For example, I can be an edge-detected version of the original image if the interest is in finding the boundary of a volumetric target. The image I is normalized to have intensity values from 0 to 1, the voxel of the highest saliency receiving the intensity value of 1.
3 Force Based Approach
The evolution of the mesh can be controlled by assigning an equation of motion to each mexel. In the general case the equation of motion is Newtonian:

$$m\ddot{v}_i(t) + \gamma\dot{v}_i(t) = \alpha F_{int}(v_i(t)) + \beta F_{ext}(v_i(t)), \tag{1}$$

where α, β, γ, m ∈ R and t is time. In discrete form this is written as (assuming m = 1)

$$v_i^{t+1} = v_i^t + (1 - \gamma)(v_i^t - v_i^{t-1}) + \alpha F_{int}(v_i^t) + \beta F_{ext}(v_i^t), \tag{2}$$

with the mesh at time t = 0 given. When γ = 1, (2) reduces to a Lagrangian equation of motion. The internal force is

$$F_{int}(v_i) = \frac{1}{3}\sum_{j=1}^{3} v_{i_j} - v_i. \tag{3}$$
The internal force (3) causes the surface to shrink to a point if no external forces are present. This is sometimes useful, because it reduces sensitivity to the initialization, provided that initial meshes are set outside the target. Semi-implicit evolution equations, such as in snakes [6], are not popular with 3-D meshes. This is because in the 3-D case they involve multiplication of large matrices, which do not have a banded structure as in the 2-D case.
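For concreteness, one Lagrangian iteration of (2) with the internal force (3) can be written as follows. This is a minimal sketch assuming the mesh is stored as an N×3 array of mexel positions and an N×3 integer array of neighbour indices, with `f_ext` standing for whichever external force model is used; it is not the implementation evaluated in this study.

```python
import numpy as np

def internal_force(V, neighbors):
    """Eq. (3): pull each mexel towards the centroid of its three neighbours."""
    return V[neighbors].mean(axis=1) - V

def lagrangian_step(V, neighbors, f_ext, alpha=0.3, beta=1.0):
    """One iteration of eq. (2) with gamma = 1 (Lagrangian motion):
    v_i <- v_i + alpha*F_int(v_i) + beta*F_ext(v_i)."""
    return V + alpha * internal_force(V, neighbors) + beta * f_ext(V)
```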
3.1 Delingette's External Force
A local external force model was introduced by Delingette [1]. These forces rely on a good initialization or, as in our case, on the shrinking effect caused by the internal forces to produce acceptable results. The external force is

$$F_{ext}(v_i) = F_{grad}(v_i) + F_{edge}(v_i). \tag{4}$$

The gradient force is F_grad(v_i) = β_grad((m_i − v_i) · n_i) n_i, where m_i is the centre of the voxel of greatest intensity in the s×s×s neighbourhood of the voxel containing v_i. We set s = 5. The edge force is F_edge(v_i) = β_edge(e_i − v_i),
where e_i is the center of the closest edge voxel to v_i along the normal n_i. For computing F_edge, a binary edge image, where each voxel is classified as either edge or non-edge, has to be created. In this study, the optimal thresholding method is used for automatically binarizing I, as described in [2]. After the thresholding, isolated edge voxels are removed in order to reduce sensitivity to noise. Delingette also suggests the use of a two-stage technique for recovering the target surface. At the first stage, a coarse mesh with a large α is used for getting close to the target surface. Thereafter, a finer mesh with a smaller α is used for capturing its details. We also adopted this procedure after a few failed attempts to use a single-resolution mesh.
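The two force components of (4) can be sketched as follows. This is only an illustration of the definitions above (border handling and efficiency are ignored, and the (z, y, x) voxel-index coordinate convention is an assumption), not Delingette's implementation.

```python
import numpy as np

def delingette_external_force(V, normals, image, edge_mask,
                              beta_grad=0.15, beta_edge=0.0625, s=5, r_max=8):
    """Eq. (4): F_ext = F_grad + F_edge for each mexel. `image` is the saliency
    volume I and `edge_mask` the binarized edge volume; positions are assumed
    to be stored in (z, y, x) voxel-index order."""
    F = np.zeros_like(V)
    half = s // 2
    for i, (v, n) in enumerate(zip(V, normals)):
        z, y, x = np.round(v).astype(int)
        # gradient force: towards the brightest voxel m_i of the s^3 neighbourhood
        patch = image[z-half:z+half+1, y-half:y+half+1, x-half:x+half+1]
        dz, dy, dx = np.unravel_index(np.argmax(patch), patch.shape)
        m_i = np.array([z - half + dz, y - half + dy, x - half + dx], dtype=float)
        F[i] += beta_grad * np.dot(m_i - v, n) * n
        # edge force: towards the closest edge voxel e_i along the normal n_i
        e_i = None
        for t in np.arange(0.5, r_max, 0.5):
            for sign in (+1.0, -1.0):
                p = np.round(v + sign * t * n).astype(int)
                if edge_mask[tuple(p)]:
                    e_i = p.astype(float)
                    break
            if e_i is not None:
                break
        if e_i is not None:
            F[i] += beta_edge * (e_i - v)
    return F
```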
3.2 Gradient Vector Flow External Forces
In [16,15] methods for deriving global external forces from image data were proposed. With gradient vector flow (GVF) [16] the external force is defined as the vector field F_ext = f = (f_1, f_2, f_3) : R³ → R³ minimizing the expression

$$\int_{\mathbb{R}^3} \mu \sum_{i=1}^{3} \|\nabla f_i(x)\|^2 + \|\nabla I(x)\|^2 \|f(x) - \nabla I(x)\|^2 \, dx, \tag{5}$$

where µ is a real scalar. A solution can be found by introducing a time variable t and finding a steady-state solution of

$$\frac{\partial f(x;t)}{\partial t} = \mu \nabla^2 f(x;t) - (f(x;t) - \nabla I(x))\|\nabla I(x)\|^2, \quad f(x;0) = \nabla I(x), \tag{6}$$

where the Laplace operator ∇² applies only to the spatial component (i.e. x) of f. Generalized gradient vector flow (GGVF) [15] is defined as the equilibrium solution of the following partial differential equation (PDE):

$$\frac{\partial f(x;t)}{\partial t} = g(\|\nabla I(x)\|)\nabla^2 f(x;t) - h(\|\nabla I(x)\|)(f(x;t) - \nabla I(x)), \tag{7}$$
$$f(x;0) = \nabla I(x), \tag{8}$$

where $g(x) = e^{-(x/\kappa)^2}$, κ ∈ R, and h(x) = 1 − g(x). For finding numerical solutions to the above PDEs, the guidelines given in [16,15] are followed. After the numerical solutions are found, the external forces are normalized to have a maximal magnitude of 1. From the (discrete) numerical solutions, continuous external force fields are obtained by trilinear interpolation.
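A rough explicit iteration of the GGVF equations (7)–(8) looks as follows. This sketch does not follow the exact numerical schemes of [15,16]; the time step, iteration count and global normalization are illustrative choices only.

```python
import numpy as np
from scipy.ndimage import laplace

def ggvf(I, kappa=0.05, iters=200, dt=0.15):
    """Explicit time-stepping of the GGVF PDE (7)-(8) on a (normalized) saliency
    volume I; returns a force field of shape (3, Z, Y, X)."""
    grad = np.stack(np.gradient(I), axis=0)          # f(x;0) = grad I
    mag = np.sqrt((grad ** 2).sum(axis=0))
    g = np.exp(-(mag / kappa) ** 2)                  # g(|grad I|)
    h = 1.0 - g                                      # h(|grad I|)
    f = grad.copy()
    for _ in range(iters):
        for c in range(f.shape[0]):                  # update one component at a time
            f[c] += dt * (g * laplace(f[c]) - h * (f[c] - grad[c]))
    norm = np.sqrt((f ** 2).sum(axis=0))             # normalize: max magnitude = 1
    return f / max(norm.max(), 1e-12)
```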
4 Energy Based Approach
With the energy-based approach one minimizes the energy of the deformable mesh:

$$E(\mathcal{M}) = \frac{1}{N}\sum_{i=1}^{N} \left(\lambda E_{int}(v_i) + (1 - \lambda)E_{ext}(v_i)\right), \tag{9}$$

where λ ∈ [0, 1].
The internal energy of v_i is defined as

$$E_{int}(v_i) = \frac{\left\|v_i - \frac{1}{3}\sum_{j=1}^{3} v_{i_j}\right\|^2}{A(\mathcal{M})}, \tag{10}$$

where A(M) is a measure of the area of M. It is computed as the average area of the faces of the mesh. The areas of the faces of simplex meshes are computed by triangulating them. Note that the internal energy is invariant to scalings, translations and rotations of the mesh. A comparison of (3) and (10) reveals that (3) is the negative of the gradient of the numerator of E_int(v_i) for each i. However, a similar result does not hold if we consider the numerator of the internal energy of M as a whole, because E_int(v_{i_j}), j = 1, 2, 3, also depend on v_i. The external energy is defined as

$$E_{ext}(v_i) = 1 - I(v_i). \tag{11}$$
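The energy (9) with the terms (10) and (11) can be evaluated as in the following sketch (illustrative only; `faces_area_mean` stands for A(M) computed elsewhere, and the image is sampled at the continuous mexel positions by trilinear interpolation).

```python
import numpy as np
from scipy.ndimage import map_coordinates

def mesh_energy(V, neighbors, faces_area_mean, image, lam=0.3):
    """Energy (9) of a simplex mesh: internal term (10) plus external term (11)."""
    centroid = V[neighbors].mean(axis=1)                           # (1/3) sum of v_ij
    e_int = ((V - centroid) ** 2).sum(axis=1) / faces_area_mean    # eq. (10)
    I_at_v = map_coordinates(image, V.T, order=1)                  # trilinear sampling
    e_ext = 1.0 - I_at_v                                           # eq. (11)
    return float(np.mean(lam * e_int + (1.0 - lam) * e_ext))       # eq. (9)
```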
4.1 Dual Surface Minimization
The dual surface minimization (DSM) algorithm was introduced in [12] and refined in [10] for the minimization of the energy of deformable surface meshes. It is a generalization of an energy minimization method for active contours [4]. The initialization for the process consists of two surface meshes differing only in size. One of these is placed inside the sought surface and the other is placed outside of it. Thereafter, the energy of the mesh with the higher energy is minimized locally by the greedy algorithm [13] with the constraint that the outer mesh must shrink and the inner mesh must grow in size. If the mesh of higher energy is stuck in a local minimum, its energy function is modified by adding a penalty for the current position of the mesh [10]. The penalty is increased gradually until the mesh is forced to move from its current position. The whole process continues iteratively until the areas of the inner mesh and the outer mesh are approximately equal. It is necessary to equip the meshes with a global position parameter (or moving reference point) if the earlier implementation is used [12]. We tested whether the use of the global position parameter can improve results with the refined implementation [10]. It was found that using meshes equipped with the global position parameter was not worthwhile in our experiments, because it did not improve the quality of the results and is computationally expensive.
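The overall DSM loop can be summarized schematically as below. All arguments (`energy`, `greedy_move`, `add_penalty`, `area`) are hypothetical callables standing in for the components described above; this outline is not the implementation of [10] or [12].

```python
def dual_surface_minimization(inner, outer, energy, greedy_move, add_penalty,
                              area, tol=0.02):
    """Schematic DSM outline: repeatedly move the higher-energy mesh with the
    constrained greedy algorithm, penalizing stuck positions, until the two
    meshes have approximately equal area."""
    while abs(area(outer) - area(inner)) > tol * area(outer):
        # pick the mesh with the higher (possibly penalized) energy
        mesh, direction = ((outer, "shrink") if energy(outer) > energy(inner)
                           else (inner, "grow"))
        moved = greedy_move(mesh, direction)   # constrained local minimization
        if not moved:                          # stuck in a local minimum
            add_penalty(mesh)                  # gradually force it off this position
    return inner, outer                        # both now approximate the target
```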
4.2 Minimization Method Based on Genetic Algorithms
A hybrid of real-coded genetic algorithms [5] (RCGA) and the greedy algorithm [13] for mesh optimization was used in [11]. The algorithm is abbreviated here as GAGR and it is as follows:
1. Global minimization of the energy of the mesh by an RCGA;
2. Greedy minimization of the energy of the mesh obtained from the first step;
3. Adaptation of the resolution of the mesh;
4. Minimization of the energy of the adapted mesh for the final solution by the greedy algorithm.
The RCGA uses BLX-0.3 crossover [5], tournament selection and an initial population just like in [11]. Our implementation differs from the original in that the crossover operator is constrained to produce meshes within the image domain. In the original implementation such a reasonable constraint was not used and a penalty function was applied to solve the problem. Only 100 generations of the RCGA are used in the first step for efficiency, as in [11].
Fig. 1. Surfaces used in the first experiment: sphere, metasphere A, metasphere B.
5 Experiments

5.1 Material
Three types of tests were performed for the evaluation. The first experiment was a standard one. First, metaspheres [14], see Fig. 1, were drawn on a 64 × 64 × 64 grid and then white Gaussian noise was added to the images. Thereafter, the images were filtered with a Gaussian filter with a 3 × 3 × 3 kernel and standard deviation 1.0. The intensity of the true surfaces was 1. The images for the second kind of test consisted of a true surface of intensity one and a few false surfaces of intensity 0.5. Some cross sections of the images are shown in Fig. 2. In the third experiment, boundaries of the brain were extracted from two FDG (fluorodeoxyglucose) PET (positron emission tomography) images. An example of a PET image is shown in Fig. 3. Salient image features in this case mean edges, which were found by applying a three-dimensional Sobel operator [17]. An edge (magnitude) map from the PET image in Fig. 3 is also shown in Fig. 3.
5.2 Error Measures
With synthetic images, it is possible to evaluate extraction errors also quantitatively. Here two criteria were used: Denote the set of voxels belonging to true
Fig. 2. Central cross sections in xy planes of images for the second experiment (Image 1, Image 2, Image 3).
(digital) surface by TS and the set of voxels belonging to the extracted surface by ES. Furthermore, let the set of voxels inside the true surface be TV and the set of voxels inside the extracted digital surface be EV. Then,

$$\epsilon_1 = 1 - \frac{|TV \cap EV|}{|TV \cup EV|}, \tag{12}$$
$$\epsilon_2 = \max_{X \in TS} \min_{Y \in ES} \mathrm{dist}(X, Y), \tag{13}$$
where the distance between voxels X and Y, dist(X, Y), is the Euclidean distance between their integer coordinates. The first error measure captures well the goodness of the overall result. (Note that the quantity 1 − ε₁ is known as the Tanimoto coefficient.) The second one also reacts to very local reconstruction errors. This makes it useful in determining whether high-curvature parts of a surface are extracted well. However, the second measure is not necessarily good for determining how well the surface extraction has succeeded in a global sense.
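The two measures (12) and (13) are straightforward to compute from voxel sets, for example as in the following sketch (an illustration, not the code used in this study).

```python
import numpy as np
from scipy.spatial import cKDTree

def extraction_errors(true_vol, extracted_vol, true_surf, extracted_surf):
    """Error measures (12) and (13). The *_vol arguments are boolean volumes of
    voxels inside the respective surfaces; the *_surf arguments are boolean
    volumes of surface voxels."""
    inter = np.logical_and(true_vol, extracted_vol).sum()
    union = np.logical_or(true_vol, extracted_vol).sum()
    eps1 = 1.0 - inter / union                    # 1 - Tanimoto coefficient, eq. (12)
    ts = np.argwhere(true_surf)                   # integer voxel coordinates of TS
    es = np.argwhere(extracted_surf)              # integer voxel coordinates of ES
    dists, _ = cKDTree(es).query(ts)              # min over ES of dist(X, Y) per X
    eps2 = dists.max()                            # eq. (13)
    return eps1, eps2
```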
5.3 Experiment Settings and Implementation
Resolution of Meshes. Meshes of 1280 mexels were used with DSM, GVF and GGVF. With GAGR, at the first stage, meshes of 320 mexels were used. Then the mesh resulting from the resolution adaptation step (Step 3 in the algorithm) was constrained to have 1280 mexels. The same applies to Delingette's method. Both algorithms adapt the mesh resolution by repeated application of the T22 operator defined in [1].

Initializations. With synthetic images, every algorithm except GAGR was tested with five different initializations. The initializations for the force-based methods were spheres, all completely outside the true surface. Initializations inside the true surface were also tested with GVF and GGVF, but the results were not good unless the initialization was very close to the target. This is due to the shrinking effect of the internal force. For DSM, the outer initial surfaces were the same as the ones for the force-based methods and the inner initial surfaces were obtained from these by scaling. With GAGR, the initial population is random, and only one population was used for each image. Similar settings apply for the experiments with PET images, except that three ellipsoid initializations per image were tested.
Fig. 3. PET images: Top row (left to right): transaxial, sagittal and coronal views of a FDG-PET image used in the experiments. Bottom row: 3-D Sobel edges from the image in the top row, left to right: Transaxial, sagittal and coronal views.
Parameter Values. The selection of parameter values is often a fundamental issue in implementing deformable models. Here, the values of as many parameters as possible were fixed before any experiments by following the suggestions of the authors of the algorithms. Basically, only the values of λ and α were tuned during the experiments. For each method, we tried to find a λ or α that would be suitable for every experiment. Sometimes this was impossible, and a change in α or λ between experiments improved results significantly. However, within an experiment all parameter values were fixed, since a method is not really automatic if it requires different parameter values for each input in order to succeed.

The Lagrangian motion was used for the force-based techniques in all tests. The Newtonian motion with γ = 0.65 was also studied, but the results were generally worse than with the Lagrangian motion. However, the use of the Newtonian motion can offer a speed-up if a large α is applied. Results with two versions of GVF external forces are reported, one with µ = 0.05 and the other with µ = 0.1, because they were clearly different. Other values of µ (0.2 and 0.4) were also tested but they did not seem to work. With GGVF, the value κ = 0.05 suggested in [15] worked well, so there was no reason to change it. Generally, the value of α was 0.1 for GVF and 0.3 for GGVF. With all force-based models, β was set so that the maximal magnitude of the external forces was 1. With Delingette's method, α in the first stage was generally 0.6 and in the second stage 0.1. The values of the other parameters were: β_edge = 0.0625, β_grad = 0.15, β = 1. The closest edges to a voxel were searched for only within a radius of 8 voxels from the voxel. Sometimes, especially with PET images, the binarization of the input image failed completely. Then no edge external force was used and β_grad was set
to 0.3. For the energy-based approach, only the value of λ was adjusted. It was usually 0.3. For DSM the rest of the parameters (relating mainly to the search spaces) were selected as advised in [10,12].

Complexity. Time consumptions of the algorithms are not given because no attention was paid to how to implement the algorithms efficiently, and so listing time consumptions would be misleading. On average, the method of Delingette and DSM are the fastest algorithms and GAGR is the slowest one. The complexity of the GVF and GGVF methods depends heavily on the size of the underlying image, in contrast to the other methods. This makes it difficult to evaluate the speed of the GVF and GGVF methods compared to the others. On average, the use of GVF or GGVF external forces leads to solutions more slowly than DSM or Delingette's external forces but faster than GAGR.
6 Results

6.1 Results with Synthetic Images
Quantitative results of the first experiments are given in Table 1. The values listed in Table 1 are medians of the error criteria values resulting from different initializations. Some examples of reconstructed surfaces are in Fig. 4. The results were not surprising. All algorithms performed fairly well with the images containing a sphere. In this case, an increased noise level most clearly degraded the results obtained by DSM. On average, GGVF and GAGR were the best methods. With more complex surfaces, DSM performed nearly as well as GGVF and GAGR, which were again the best methods. However, there was a larger deviation of error criteria values resulting from different initializations with DSM than with the other methods. The performance of both GVF-based algorithms suffered heavily from noise with metaspheres in the image. As can be seen from Fig. 4, no really good surface reconstruction of Metasphere A was obtained using any of the algorithms when the noise variance was 0.6. However, the results obtained by GAGR, DSM, GGVF and the external force of Delingette were acceptable. This was not the case with GVF external forces. Delingette's external force led to the surface shrinking to a point from some of the initializations with noiseless images, although it produced quite good results with noisy images. This can be avoided by carefully tuning the parameter α, but that might require a priori knowledge about the signal-to-noise ratio of the image.

The error criteria values from the second experiment can be found in Table 2. Again, the values are medians of the error criteria values resulting from different initializations. The findings of the experiment were interesting. Meshes with GGVF external forces got stuck on the outermost false surface. Changing the values of κ and α did not help. GVF with µ = 0.1 produced the best results in this experiment. The results with GVF with µ = 0.05 and α = 0.1 were worse, as can be seen from Table 2. However, GVF with µ = 0.05 and α = 0.3 gave results similar to those with GVF and µ = 0.1.
The method of Delingette was also quite good in this experiment. Surprisingly, although the binarization algorithm labeled all surfaces, true or false, as edges, the results with the edge external force were better than those with no edge external force and β_grad = 0.3. DSM was successful with Image 1 and to some extent with Image 3. The extraction of the true surface from Image 2 failed. The reason for the failure was that the algorithm failed to converge because of oscillation. This problem was also noted in [10], and an effort was made to prevent oscillations in this study. GAGR optimization produced results almost as good as those obtained with GVF and µ = 0.1. Only with Image 3 did the surface extraction fail slightly, as the ε₂ value in Table 2 reveals.
Table 1. Error criteria values from the first set of tests. Medians of values from different initializations are shown. Symbol σ² stands for the variance of the Gaussian noise. GVF 0.05 means GVF with µ = 0.05, and similarly GVF 0.1 means GVF with µ = 0.1. Parameter values were as listed in Sect. 5.3.

measure  surface       σ²    GVF 0.05  GVF 0.1  GGVF  Delingette  DSM    GAGR
ε₁       sphere        0     0.03      0.03     0.03  0.08        0.03   0.05
ε₁       sphere        0.2   0.09      0.11     0.05  0.07        0.06   0.06
ε₁       sphere        0.6   0.09      0.12     0.06  0.08        0.14   0.06
ε₁       sphere        1     0.13      0.17     0.06  0.08        0.17   0.07
ε₁       sphere        1.4   0.13      0.17     0.07  0.13        0.21   0.07
ε₁       metasphere A  0     0.03      0.03     0.02  0.09        0.03   0.06
ε₁       metasphere A  0.2   0.11      0.16     0.05  0.09        0.02   0.07
ε₁       metasphere A  0.6   0.16      0.24     0.10  0.11        0.10   0.08
ε₁       metasphere A  1.0   0.20      0.26     0.14  0.12        0.14   0.11
ε₁       metasphere B  0     0.07      0.07     0.06  1           0.07   0.10
ε₁       metasphere B  0.2   0.30      0.37     0.13  0.19        0.11   0.16
ε₁       metasphere B  0.6   0.27      0.38     0.19  0.21        0.18   0.20
ε₁       metasphere B  1.0   0.27      0.36     0.24  0.26        0.32   0.22
ε₂       sphere        0     1.4       1.4      1.4   1.4         2.2    2.2
ε₂       sphere        0.2   1.7       2.2      1.4   1.7         2.0    1.4
ε₂       sphere        0.6   2.0       2.2      2.0   2.4         2.4    2.2
ε₂       sphere        1     2.8       3.0      1.7   3.0         4.5    1.7
ε₂       sphere        1.4   2.7       3.0      1.4   3.6         4.2    2.2
ε₂       metasphere A  0     1.4       1.4      1.4   4.9         2.2    1.7
ε₂       metasphere A  0.2   4.0       5.1      3.0   2.4         2.0    2.2
ε₂       metasphere A  0.6   4.6       7.0      5.8   3.3         3.2    3.0
ε₂       metasphere A  1.0   5.4       6.3      5.7   5.4         6.1    4.4
ε₂       metasphere B  0     3         2.8      2.2   –           3.2    3.5
ε₂       metasphere B  0.2   9.3       10       6.1   6.4         3.0    8.5
ε₂       metasphere B  0.6   8.6       10       7.3   8.5         6.7    7.3
ε₂       metasphere B  1.0   6.5       10       10    9.1         10.3   7.3
Table 2. Error values from the second set of tests. Listed values are medians of the error criteria obtained from different initializations. Parameter values were as listed in Sect. 5.3.

measure  image no.  GVF 0.05  GVF 0.1  GGVF  Delingette  DSM   GAGR
ε₁       1          0.16      0.03     0.50  0.07        0.04  0.05
ε₁       2          0.06      0.05     0.51  0.07        0.34  0.05
ε₁       3          0.07      0.02     0.42  0.09        0.12  0.10
ε₂       1          3.6       1.4      4.4   3.0         1.7   2.0
ε₂       2          2.0       2.2      10    1.7         9.6   2.2
ε₂       3          5.7       1.4      7.3   3.2         4.6   7.7
6.2 Results with Medical Images
For PET images there were two methods that stood above the others. GGVF and DSM produced excellent results with all initializations and both images. However, with GGVF the success was achieved only after changing the value of α to 0.1. GGVF with α = 0.3 worked well only with one of the images. DSM was good with both images and the pre-defined set of parameters. The GGVF results were slightly more accurate than the ones obtained with DSM. See Fig. 5 for examples. With GAGR we had some success in this experiment without tuning parameters; however, the resulting surfaces were clearly less accurate (see Fig. 5) than with the methods mentioned above. Using λ = 0.4 improved the resulting surfaces, but not much. Delingette's technique produced some moderate results after tuning the value of α in both stages of the two-stage algorithm. But further adjustment of parameter values would still be required in order to have acceptable results. Also, the best parameter values varied from image to image and from initialization to initialization. See Fig. 6. It was necessary to use the gradient external force only, since the thresholding algorithm did not perform well. However, thresholding the original PET image and then finding edges is probably a better alternative than simply thresholding the gradient image (as we did). GVF external forces had the worst performance of the methods in this experiment; even adjusting the value of α did not help much. See Fig. 5 for the best result obtained by GVF.
7 Discussion
We have experimentally compared a few recent algorithms based on deformable meshes for surface extraction from noisy volumetric images. As one could have expected, no universally best method for controlling the deformation of the mesh was found. Which algorithm is best depends on the application. In general, GAGR seems to be the most reliable choice: it gave good results in the synthetic image experiments and the errors in the PET image experiment were not large. However, if it is known in advance that the target surface is the
Fig. 4. Extracted Metasphere A from the image of noise variance 0.6 using different methods (GVF µ = 0.05, GVF µ = 0.1, GGVF, Delingette, DSM, GAGR). Surfaces are from initializations that led to the median value of ε₁.
outermost surface in the image, as was the case in the PET image experiment, then algorithms like GGVF or DSM - which can actually gain from the situation - are the ones to be preferred. GAGR, as a general-purpose optimization algorithm, does not have the capability to take advantage of structural a priori information about the image unless it is coded into the energy function. DSM and GGVF were also reliable choices, but they also have weaknesses, as the second set of tests with synthetic images proved. GGVF external forces can cope with random noise sufficiently well, but are not able to handle false surfaces in the image. In the case of false surfaces, meshes with GGVF external forces tended to get stuck on the outermost "strong enough" surface in the image. The problem of DSM was that it was not always able to converge: the mesh with the higher energy may get stuck oscillating between a few surface positions. This is clearly an implementational problem, but solving it completely and efficiently could be challenging. The implementation of DSM from [12] probably does not suffer from oscillation problems but has other disadvantages compared to the implementation from [10]. GVF external forces worked well when a simple surface, such as a sphere, was to be extracted. However, when the surface in the image was more complex and the image was even slightly noisy, we were not able to get good results with GVF external forces. This is due to the regularizing component of (5), which tends to blur the surface details, see [15]. GVF external forces failed with PET images,
Fig. 5. Cross-sections of the extracted brain surfaces from a PET image (rows: GGVF, DSM, GAGR, GVF). From left to right: transaxial view, sagittal view, coronal view. The result with GVF was obtained by using the best set of parameter values (α = 0.05, µ = 0.05) and the best initialization. The other results were obtained with pre-defined parameter values, and results from other initializations look similar.
where the problem is the intensity inhomogeneity of the target surface in addition to noise. However, GVF external forces worked well in extracting the correct surface from images where false surfaces were also present. The external forces of Delingette turned out to do well in the synthetic image experiments. The extraction results were good but generally not the most accurate ones. In the PET image experiment, we were not able to obtain good results
Fig. 6. Results with external forces of Delingette. a), b) and c): The result from the best initialization with a PET image; transaxial, sagittal and coronal views. Parameter α = 0.3 at the first stage and α = 0.1 at the second stage. d) The transaxial view of a result with the same initialization but a different image than for the result shown in a), b) and c).
with this external force model. The results could perhaps be improved by using a more advanced thresholding method than the one used in this study and by fine-tuning the parameter values. Indeed, the results obtained by using Delingette's external forces were sensitive to the parameter values for the deformable mesh. In fact, this is also the major difference between the force-based and the energy-based approach. The energy-based scheme is less sensitive to the values of parameters, which makes it easier to select them well. On the other hand, the force-based approach has the advantage of not discretizing the search space, which often leads to visually more pleasant surface reconstructions.

To conclude, there is no optimal deformable meshing scheme which outperforms the others in all applications. This study has highlighted strengths and weaknesses of recent methods to guide the deformation process in a general context. Because of the generic nature of our experiments, we have been able to observe common characteristics of each method. Hopefully, the study will be a useful reference to those applying deformable meshes.

Acknowledgments. The author would like to thank Turku PET Centre for providing the PET images and Ph.D. Ulla Ruotsalainen for comments on earlier drafts. During the preparation of this paper financial support was obtained from Tampere Graduate School of Information Science and Engineering, Academy of Finland, KAUTE-Foundation and Jenny and Antti Wihuri's Foundation.
References

[1] H. Delingette. General object reconstruction based on simplex meshes. International Journal of Computer Vision, 32:111–142, 1999.
[2] R.C. Gonzales and R.W. Woods. Digital Image Processing. Addison-Wesley, Reading, MA, 1992.
[3] S.R. Gunn. Dual Active Contour Models for Image Feature Extraction. PhD thesis, University of Southampton, 1996.
[4] S.R. Gunn and M.S. Nixon. A robust snake implementation: a dual active contour. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1), 1997.
[5] F. Herrera, M. Lozano, and J.L. Verdegay. Tackling real-coded genetic algorithms: Operators and tools for behavioural analysis. Artificial Intelligence Review, 12(4), 1998.
[6] M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. International Journal of Computer Vision, 1(4):321–331, 1988.
[7] J.-O. Lachaud and A. Montanvert. Deformable meshes with automated topology changes for coarse-to-fine three-dimensional surface extraction. Medical Image Analysis, 3(2), 1998.
[8] T. McInerney and D. Terzopoulos. Deformable models in medical image analysis: A survey. Medical Image Analysis, 2(1), 1996.
[9] J. Montagnat, H. Delingette, N. Scapel, and N. Ayache. Representation, shape, topology and evolution of deformable surfaces. Application to 3D medical image segmentation. Technical report, INRIA, 2000.
[10] U. Ruotsalainen, J. Mykkänen, J. Luoma, J. Tohka, and S. Alenius. Methods to improve repeatability in quantification of brain PET images. In World Congress on Neuroinformatics: Part II Proceedings, ARGESIM Report no. 20, pages 659–664, 2001.
[11] J. Tohka. Global optimization of deformable surface meshes based on genetic algorithms. In Proc. of 11th International Conference on Image Analysis and Processing, ICIAP01, pages 459–464, 2001.
[12] J. Tohka and J. Mykkänen. Deformable mesh for automated surface extraction from noisy images. Submitted to IEEE Transactions on Medical Imaging, 2001. Available online: http://www.cs.tut.fi/~jupeto/dmnoisy.pdf.
[13] D.J. Williams and M. Shah. A fast algorithm for active contours and curvature estimation. CVGIP: Image Understanding, 55(1):14–26, 1992.
[14] C. Xu. Deformable Models with Application to Human Cerebral Cortex Reconstruction from Magnetic Resonance Images. PhD thesis, The Johns Hopkins University, 1999.
[15] C. Xu and J.L. Prince. Generalized gradient vector flow external forces for active contours. Signal Processing, 71(2), 1998.
[16] C. Xu and J.L. Prince. Snakes, shapes and gradient vector flow. IEEE Transactions on Image Processing, 7(3), 1998.
[17] S. Zucker and R. Hummel. A three-dimensional edge operator. IEEE Transactions on Pattern Analysis and Machine Intelligence, 3(3), 1981.
DREAM²S: Deformable Regions Driven by an Eulerian Accurate Minimization Method for Image and Video Segmentation. Application to Face Detection in Color Video Sequences

Stéphanie Jehan-Besson¹, Michel Barlaud¹, and Gilles Aubert²

¹ Laboratoire I3S, CNRS-UNSA, 2000 route des Lucioles, 06903 Sophia Antipolis, France
{jehan,barlaud}@i3s.unice.fr
² Laboratoire J.A. Dieudonné, CNRS-UNSA, Parc Valrose, 06108 Nice Cedex 2, France
[email protected]
Abstract. In this paper, we propose a general Eulerian framework for region-based active contours named DREAM²S. We introduce a general criterion including both region-based and boundary-based terms where the information on a region is named "descriptor". The originality of this work is twofold. Firstly we propose to use shape optimization principles to compute the evolution equation of the active contour that will make it evolve as fast as possible towards a minimum of the criterion. Secondly, we take into account the variation of the descriptors during the propagation of the curve. Indeed, a descriptor is generally globally attached to the region and thus "region-dependent". This case arises for example if the mean or the variance of a region are chosen as descriptors. We show that the dependence of the descriptors with the region induces additional terms in the evolution equation of the active contour that have never been previously computed. DREAM²S gives an easy way to take such a dependence into account and to compute the resulting additional terms. Experimental results point out the importance of the additional terms to reach a true minimum of the criterion and so to obtain accurate results. The covariance matrix determinant appears to be a very relevant tool for homogeneous color regions segmentation. As an example, it has been successfully applied to face detection in real video sequences.
1 Introduction
Active contours are powerful tools for image and video segmentation. Since the original work on snakes [1], extensive research has been performed, leading today to the use of "region-based active contours". Originally, active contours were boundary-based methods. Snakes [1], balloons [2] or geodesic active contours [3] are driven towards the edges of an image. The evolution equation is then computed from a criterion that only includes local information on the boundary of the object to segment.
The key idea of region-based active contours, first proposed in [4,5,6,7] and further developed in [8,9,10,11,12,13], is to introduce global information on the different regions to segment, in addition to the boundary-based information, to make the active contour evolve. However, it is not trivial to compute the evolution equation of the active contour that will make it evolve towards a minimum of a criterion including both region-based and boundary-based terms. Recently, many papers have addressed this problem. Some of these works do not compute the theoretical expression of the velocity vector of the active contour but choose the displacement that will make the criterion decrease [7,9]. Other works propose the computation of the velocity vector by reducing the whole problem to boundary integrals [6,8] or by using the level set method [10,11]. They then use the Euler-Lagrange equations to compute the evolution equation. However, the information on the different regions, that we call here the "descriptor" of the corresponding region, is generally globally attached to the region. Indeed this case arises for statistical descriptors such as the mean, the variance or the histogram of the region. The main drawback of previous works on region-based active contours is that they do not take into account this possible variation of the descriptors to compute the evolution equation of the active contour.

We propose here a general Eulerian framework for region-based active contours named DREAM²S (Deformable Regions driven by an Eulerian Accurate Minimization Method for Segmentation). The main contribution of our work is to propose a theoretical framework based on shape optimization principles. With such an approach, we can readily take into account the variation of the descriptors that are globally attached to the evolving regions (named region-dependent descriptors). We show that the variation of these region-dependent descriptors during the evolution of the active contour induces additional terms in the evolution equation of the active contour. These additional terms have never been previously reported in the literature. Some examples of unsupervised spatial segmentation using the variance of the regions show the importance of the additional terms for the accuracy of the segmentation results. For color image segmentation, we propose to use the covariance matrix determinant, which appears to be a very relevant tool for homogeneous color region segmentation. This tool is successfully applied to face detection in real video sequences.

The Eulerian framework DREAM²S is described in Sec. 2. This framework is then applied to the unsupervised segmentation of homogeneous regions using the covariance matrix determinant in Sec. 3.
2 Setting a General Framework for Region-Based Active Contours
Here, we describe the equations addressing the issue of the segmentation of an image I into two regions. An image I(x, y) is a function defined for (x, y) ∈ Ω ⊂ IR², where Ω is the image domain. The image domain is considered to be made up of two parts: Ω_in, the region containing the objects to segment, and Ω_out, the background region. Their common boundary is noted Γ (see Fig. 1).

Fig. 1. The two regions of an image
We search for the partition of the image which minimizes the following criterion:

$$J(\Omega_{in}, \Omega_{out}, \Gamma) = \int_{\Omega_{out}} k^{(out)}(x, y, \Omega_{out})\,dx\,dy + \int_{\Omega_{in}} k^{(in)}(x, y, \Omega_{in})\,dx\,dy + \int_{\Gamma} k^{(b)}(x, y)\,ds \tag{1}$$

where k^(out) is named the "descriptor" of the background region, k^(in) the "descriptor" of the object region and k^(b) the "descriptor" of the contour. A descriptor is a function that measures the homogeneity of a region. Most relevant statistical descriptors depend themselves on the region (in that case they are called "region-dependent descriptors"). This arises when statistical features of a region are selected as region descriptors (for example the mean or the variance). The purpose is then to make an active contour evolve towards a partition of the image, (Ω_out, Ω_in, Γ), that minimizes the criterion (1). We propose here a new Eulerian proof for the computation of the evolution equation of the contour. We compute the Eulerian derivative of the criterion by using shape optimization principles. This method provides the advantage that the variation of region-dependent descriptors with the evolution of the active contour can be readily taken into account. The computation of the evolution equation of the active contour is performed in three main steps:
1. Introduction of a dynamical scheme,
2. Derivation of the criterion using shape optimization principles,
3. From the derivative, computation of the evolution equation.
This general framework can be applied straightforwardly to any descriptors. It has notably been applied to moving object detection in video sequences acquired either with a static [14] or a mobile camera [15]. Descriptors may also be sequentially used to improve the final result, as has been proposed in [16].
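As a simple discrete illustration of a criterion of the form (1), the sketch below evaluates it for one fixed partition, using variance-like region-dependent descriptors (squared deviation from each region's mean) and a constant contour descriptor. These particular descriptor choices and the function interface are assumptions made for the example, not the paper's definition.

```python
import numpy as np

def two_region_criterion(image, inside_mask, contour_mask, lam=1.0):
    """Evaluate a criterion like (1) for a given partition of a grayscale image.
    k^(in) and k^(out) are squared deviations from each region's own mean (so
    they depend on the region), and k^(b) is a constant length penalty `lam`."""
    inside, outside = image[inside_mask], image[~inside_mask]
    j_in = ((inside - inside.mean()) ** 2).sum()     # integral of k^(in) over Omega_in
    j_out = ((outside - outside.mean()) ** 2).sum()  # integral of k^(out) over Omega_out
    j_b = lam * contour_mask.sum()                   # boundary term over Gamma
    return j_in + j_out + j_b
```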
2.1 Introduction of a Dynamical Scheme
We search for the two domains Ω_out and Ω_in which minimize the criterion J given by (1). Since the set of all domains does not have the structure of a vector space, we cannot compute the derivative of the previous criterion with respect to the domains. Therefore, to compute an optimal solution, a dynamical scheme is introduced where each domain becomes continuously dependent on an evolution parameter τ. Such a method has also been used in [12,13] using the distribution theory, but region-dependent descriptors have not been taken into account. To formalize this idea, we may suppose that the evolution process is totally determined by the existence of a family of applications T_τ that transforms the initial domains Ω_in(0) and Ω_out(0) into the current domains Ω_in(τ) and Ω_out(τ):

$$\Omega_i(0) \xrightarrow{T_\tau} \Omega_i(\tau), \qquad \Gamma(0) \xrightarrow{T_\tau} \Gamma(\tau), \qquad \text{where } i = in \text{ or } out.$$
Ωin (τ )
Γ (τ )
The functional J(Ωout (τ ), Ωin (τ ), Γ (τ )) is noted J(τ ), and the descriptors k (in) (x, y, Ωin (τ )) and k (out) (x, y, Ωout (τ )) are respectively noted k (in) (x, y, τ ) and k (out) (x, y, τ ). In contrast with other active contours approaches, the dynamical scheme is here directly introduced in the criterion. With such a scheme, the introduction of an active contour evolving with τ is straightforward. Hence we consider that Γ (τ ) is modelled as an active contour that converges towards the final expected segmentation. Let Γ (0) be the initial curve defined by the user. We recall that we search for Γ (τ ) as a curve evolving according to the following PDE: ∂Γ (s, τ ) =v ∂τ
with Γ (0) = Γ0 ,
(3)
where v is the velocity vector of the active contour and s may be the arc length of the curve. The main problem lies in finding the velocity v from the criterion (1) in order to get the fastest curve evolution towards the final segmentation. 2.2
Computation of the Derivative of the Criterion
In order to obtain the evolution equation, the criterion J(τ ) must be differentiated with respect to τ . The integral bounds depend on τ and the descriptors k (out) () and k (in) () may also depend on τ .
DREAM2 S: Deformable Regions
Let us define the functional k(x, y, τ) such that:

$$k(x, y, \tau) = \begin{cases} k^{(out)}(x, y, \tau) & \text{if } (x, y) \in \Omega_{out}(\tau) \\ k^{(in)}(x, y, \tau) & \text{if } (x, y) \in \Omega_{in}(\tau) \end{cases}. \tag{4}$$

The functionals k^(out)(·) and k^(in)(·) are defined on the whole image domain Ω. Then, the criterion J(τ) writes as:

$$J(\tau) = \int_{\Omega} k(x, y, \tau)\,dx\,dy + \int_{\Gamma(\tau)} k^{(b)}(x, y)\,ds = J_1(\tau) + J_2(\tau). \tag{5}$$
In order to compute the derivative of the region integral J₁(τ), we first recall a general theorem concerning the derivative of a time-dependent criterion. Let us define Ω_i(τ) as a region included in Ω.

Theorem 1. Let k_i be a smooth function on Ω × (0, T), and let $J(\tau) = \int_{\Omega_i(\tau)} k_i(x, y, \tau)\,dx\,dy$; then:

$$\frac{dJ}{d\tau} = J'(\tau) = \int_{\Omega_i(\tau)} \frac{\partial k_i}{\partial \tau}\,dx\,dy - \int_{\partial\Omega_i(\tau)} k_i\,(V \cdot N_{\partial\Omega_i})\,ds$$

where V is the velocity of ∂Ω_i(τ) and N_{∂Ω_i} is the unit inward normal to ∂Ω_i(τ).

The derivative of J with respect to τ is the Eulerian derivative of J(Ω_i) in the direction of V, whose computation is given in appendix A. The variation of J is due to the variation of the functional k_i(x, y, τ) and to the motion of the integral domain Ω_i(τ). As a corollary of Theorem 1, we get:

Corollary 1. Let us suppose that the domain Ω(τ) is made up of two parts, Ω_in(τ) and Ω_out(τ), separated by a moving interface Γ(τ) whose velocity is v. The function k(x, y, τ) is supposed to be separately continuous in Ω_in(τ) and Ω_out(τ) but may be discontinuous across Γ(τ). We note k^(in) and k^(out) the value of k in respectively Ω_in(τ) and Ω_out(τ). Thus the derivative of J(τ) writes as:

$$J'(\tau) = \int_{\Omega(\tau)} \frac{\partial k}{\partial \tau}\,dx\,dy - \int_{\partial\Omega(\tau)} k\,(w \cdot N_{\partial\Omega})\,ds + \int_{\Gamma(\tau)} [[k]]\,(v \cdot N)\,ds$$

where [[k]] represents the jump of k across Γ(τ): [[k]] = k^(out) − k^(in), N the unit normal of Γ(τ) directed from Ω_out(τ) to Ω_in(τ), N_{∂Ω} the unit inward normal to Ω(τ) and w the velocity vector of Ω(τ).

Proof: We can apply Theorem 1 to the domain Ω_in(τ) and to the domain Ω_out(τ). Adding the two equations, we obtain the corollary.

It is now straightforward to get the derivative of the criterion J₁(τ): we take Ω(τ) = Ω, the image domain (Fig. 2). Then, by explicitly taking the discontinuities into account thanks to the corollary, the derivative of J₁(τ) is given by:

$$J_1'(\tau) = \int_{\Omega_{in}(\tau)} \frac{\partial k^{(in)}}{\partial \tau}\,dx\,dy + \int_{\Omega_{out}(\tau)} \frac{\partial k^{(out)}}{\partial \tau}\,dx\,dy + \int_{\Gamma(\tau)} [[k]]\,(v \cdot N)\,ds - \int_{\partial\Omega} k\,(w \cdot N_{\partial\Omega})\,ds \tag{6}$$
Fig. 2. The domains and the vectors involved in the derivation
Obviously, the second term of the derivative (6) is zero since the external boundary ∂Ω of the image is fixed. The derivative of J_2(τ) is classical [3] and thus, replacing [[k]] by its expression, we find the derivative of the whole criterion:

J'(\tau) = \int_{\Omega_{in}(\tau)} \frac{\partial k^{(in)}}{\partial \tau}\,dx\,dy + \int_{\Omega_{out}(\tau)} \frac{\partial k^{(out)}}{\partial \tau}\,dx\,dy + \int_{\Gamma(\tau)} \big(k^{(out)} - k^{(in)} - k^{(b)}\,\kappa + \nabla k^{(b)} \cdot N\big)(v \cdot N)\,ds   (7)

where κ(x, y, τ) is the curvature of Γ(x, y, τ). In order to compute the velocity vector v, we have to make it appear in the first two domain integrals of (7) by expressing them as functions of the velocity v. Here we take the general case where the descriptors may depend on features globally attached to the region and so may depend on τ. We model each descriptor as a combination of features globally attached to the evolving regions Ω_in(τ) or Ω_out(τ):

k^{(in)}(x,y,\tau) = g^{(in)}\big(x,y,G_1^{(in)}(\tau), G_2^{(in)}(\tau), \ldots, G_p^{(in)}(\tau)\big)
k^{(out)}(x,y,\tau) = g^{(out)}\big(x,y,G_1^{(out)}(\tau), G_2^{(out)}(\tau), \ldots, G_m^{(out)}(\tau)\big)   (8)

where:

G_j^{(\cdot)} = \int_{\Omega_\cdot(\tau)} \psi_j^{(\cdot)}(x,y,\tau)\,dx\,dy \quad \text{with } (\cdot) = (in) \text{ or } (out).

Let us first compute the derivative of k^(in) according to τ. We find:

\frac{\partial k^{(in)}}{\partial \tau} = \sum_{j=1}^{p} \frac{\partial g^{(in)}}{\partial G_j^{(in)}}\big(x,y,G_1^{(in)}(\tau), \ldots, G_p^{(in)}(\tau)\big)\,\frac{\partial G_j^{(in)}}{\partial \tau}(\tau).   (9)

We now have to compute the derivative of G_j^(in) according to τ. We apply Theorem 1 and we find:

\frac{\partial G_j^{(in)}}{\partial \tau} = \int_{\Omega_{in}(\tau)} \frac{\partial \psi_j^{(in)}}{\partial \tau}\,dx\,dy - \int_{\Gamma(\tau)} \psi_j^{(in)}(v \cdot N)\,ds.   (10)
Replacing the derivatives of G_j^(in) by their expressions and computing the derivatives of G_j^(out) in the same manner, we obtain the general expression for the derivative of J(τ):

J'(\tau) = \int_{\Gamma(\tau)} \big(k^{(out)} - k^{(in)} - k^{(b)}\,\kappa + \nabla k^{(b)} \cdot N\big)(v \cdot N)\,ds
  - \sum_{j=1}^{p} A_j^{(in)} \int_{\Gamma(\tau)} \psi_j^{(in)}(v \cdot N)\,ds + \sum_{j=1}^{m} A_j^{(out)} \int_{\Gamma(\tau)} \psi_j^{(out)}(v \cdot N)\,ds
  + \sum_{j=1}^{p} A_j^{(in)} \int_{\Omega_{in}(\tau)} \frac{\partial \psi_j^{(in)}}{\partial \tau}\,dx\,dy + \sum_{j=1}^{m} A_j^{(out)} \int_{\Omega_{out}(\tau)} \frac{\partial \psi_j^{(out)}}{\partial \tau}\,dx\,dy   (11)

where:

A_j^{(in)} = \int_{\Omega_{in}(\tau)} \frac{\partial g^{(in)}}{\partial G_j^{(in)}}\big(x,y,G_1^{(in)}(\tau), \ldots, G_p^{(in)}(\tau)\big)\,dx\,dy,
A_j^{(out)} = \int_{\Omega_{out}(\tau)} \frac{\partial g^{(out)}}{\partial G_j^{(out)}}\big(x,y,G_1^{(out)}(\tau), \ldots, G_m^{(out)}(\tau)\big)\,dx\,dy.
2.3 From the Derivative Towards the Evolution Equation of the Active Contour
In order to deduce the velocity vector v of the active contour from the derivative of the criterion, we have to make the velocity vector appear in the last two domain integrals by expressing them as boundary integrals. This can easily be done, in the same manner as it has been done for the functionals k^(in) and k^(out), since domain integrals can always be expressed as boundary integrals. However, for simplicity, let us consider the cases where these last two integrals are equal to zero, which happens for most region-dependent descriptors. In that case, according to the Cauchy-Schwarz inequality, the fastest decrease of J(τ) is obtained from (11) by choosing v = F N. The following evolution equation is obtained:

\frac{\partial \Gamma(\tau)}{\partial \tau} = \Big[\, k^{(in)} - k^{(out)} + k^{(b)}\,\kappa - \nabla k^{(b)} \cdot N + \sum_{j=1}^{p} A_j^{(in)} \psi_j^{(in)} - \sum_{j=1}^{m} A_j^{(out)} \psi_j^{(out)} \,\Big] N.   (12)
We can notice that the dependence of the descriptors on τ , and so on the curve evolution, induces additional terms in the evolution equation of the active contour. The additional terms found in (12) have not been previously computed. Thanks to these additional terms, we ensure the fastest evolution of the curve towards a minimum of the criterion. These additional terms may also be computed by reducing the whole problem to boundary-based terms and then computing the
Gâteaux derivative, but the computation is much more difficult and less natural since the region formulation is not kept. In this general framework, we can model the active contour by using an explicit parameterization (Lagrangian formulation) or an implicit one (Eulerian formulation) to implement the evolution equation of the active contour. See [17] and [18] for an interesting comparison between the two methods. Another interesting review on the different active contour methods is provided in [19]. Here, we use the level set method proposed by Osher and Sethian [20] and further developed by Caselles et al. [21].
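For illustration only (not part of the original implementation), the evolution equation in the level set framework can be sketched as a standard explicit update of an embedding function. The sketch below assumes a level-set function phi that is negative inside Ω_in, a user-supplied callable speed(phi) standing for the whole bracketed term of (12) evaluated with the current regions, and an arbitrary time step; reinitialization, narrow-band acceleration and CFL control are omitted.

import numpy as np

def curvature(phi):
    # kappa = div(grad phi / |grad phi|), with central differences
    gy, gx = np.gradient(phi)
    norm = np.sqrt(gx ** 2 + gy ** 2) + 1e-12
    nx, ny = gx / norm, gy / norm
    return np.gradient(nx, axis=1) + np.gradient(ny, axis=0)

def evolve(phi, speed, n_iter=200, dt=0.1):
    # phi: level-set embedding of Gamma(tau); speed(phi): bracketed term of (12)
    # (region descriptors plus k_b * kappa), recomputed from the current regions.
    for _ in range(n_iter):
        gy, gx = np.gradient(phi)
        grad_norm = np.sqrt(gx ** 2 + gy ** 2)
        F = speed(phi)
        phi = phi + dt * F * grad_norm   # d phi / d tau = F |grad phi| moves Gamma with velocity F N
    return phi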
2.4 Importance of the Additional Terms
Region-dependent descriptors based on the minimization of the variances of the two considered regions are implemented for the segmentation of homogeneous regions. We show that the additional terms found in (12) have to be considered in order to make the active contour evolve towards a minimum of the criterion and so to achieve the expected segmentation. The intensity for greyscale images is represented by a function I : Ω ⊂ IR^2 → IR. We propose to minimize the variances of the two considered homogeneous regions in order to segment them. The following descriptors are suggested:

k^{(out)} = \varphi(\sigma_{out}^2), \quad k^{(in)} = \varphi(\sigma_{in}^2) \quad \text{and} \quad k^{(b)} = \lambda,   (13)

where φ(r) is a positive function of class C^1(IR), λ a positive constant and:

\mu_\cdot(\tau) = \frac{1}{V_\cdot} \int_{\Omega_\cdot(\tau)} I\,dx\,dy \quad \text{with } V_\cdot = \int_{\Omega_\cdot(\tau)} dx\,dy,
\sigma_\cdot^2(\tau) = \frac{1}{V_\cdot} \int_{\Omega_\cdot(\tau)} (I - \mu_\cdot(\tau))^2\,dx\,dy, \quad \text{with } \cdot = in \text{ or } out.

This velocity vector is computed by replacing the terms of (11) by their expressions, with (·) = (in) or (out):

k^{(\cdot)}(x,y,\tau) = g^{(\cdot)}\big(x,y,G_1^{(\cdot)}(\tau), G_2^{(\cdot)}(\tau)\big) = \varphi\Big(\frac{G_1^{(\cdot)}}{G_2^{(\cdot)}}\Big),
G_1^{(\cdot)}(\tau) = \int_{\Omega_\cdot(\tau)} \psi_1^{(\cdot)}(x,y,\tau)\,dx\,dy \quad \text{with } \psi_1^{(\cdot)}(x,y,\tau) = (I(x,y) - \mu_\cdot(\tau))^2,
G_2^{(\cdot)}(\tau) = \int_{\Omega_\cdot(\tau)} \psi_2^{(\cdot)}(x,y,\tau)\,dx\,dy \quad \text{with } \psi_2^{(\cdot)}(x,y,\tau) = 1.

Since the domain integrals of ∂ψ_1^(·)/∂τ and ∂ψ_2^(·)/∂τ are equal to zero, we can directly compute the velocity vector of the active contour from (12) and we find:

\frac{\partial \Gamma(\tau)}{\partial \tau} = \Big[\, \varphi(\sigma_{in}^2) - \varphi(\sigma_{out}^2) + \lambda\kappa + \varphi'(\sigma_{in}^2)\big((I - \mu_{in})^2 - \sigma_{in}^2\big) - \varphi'(\sigma_{out}^2)\big((I - \mu_{out})^2 - \sigma_{out}^2\big) \,\Big] N.   (14)
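As an illustration (not code from the paper), the speed of (14) can be evaluated pixelwise from the current regions; the sketch assumes phi < 0 inside Ω_in, uses φ(r) = log(1 + r) as in the experiments below, and leaves out the smoothing term λκ, which can be added from the level-set curvature as in the earlier sketch.

import numpy as np

def variance_speed(I, phi):
    # Bracketed speed of (14) without the lambda*kappa term; phi < 0 inside Omega_in.
    inside = phi < 0
    mu_in, mu_out = I[inside].mean(), I[~inside].mean()
    var_in, var_out = I[inside].var(), I[~inside].var()
    f = lambda r: np.log(1.0 + r)        # phi(r) = log(1+r)
    df = lambda r: 1.0 / (1.0 + r)       # phi'(r) = 1/(1+r)
    return (f(var_in) - f(var_out)
            + df(var_in) * ((I - mu_in) ** 2 - var_in)
            - df(var_out) * ((I - mu_out) ** 2 - var_out))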
Fig. 3. Figure a: The initial curve, Figure b: The final contour obtained using the PDE (15) without the additional terms, Figure c: The final contour obtained using the PDE (14) including the additional terms.
The single parameter we need to adjust is the smoothing parameter λ. In order to evaluate the importance of the additional terms for the segmentation, the active contour evolves through the PDE (14), including the additional terms. On the other hand, we also make it evolve with the following wrong evolution equation that does not include the additional terms:

\frac{\partial \Gamma(\tau)}{\partial \tau} = \big[\, \varphi(\sigma_{in}^2) - \varphi(\sigma_{out}^2) + \lambda\kappa \,\big] N.   (15)

For the experiments, we take φ(r) = log(1 + r), which gives φ'(r) = 1/(1 + r), and we choose λ = 10. We use a synthetic image made up of a homogeneous square of intensity 120 on a background of intensity 160. A Gaussian noise of variance 20 is added to this synthetic image. The initial contour is given in Fig. 3.a. The final contour obtained using the PDE (14) is given in Fig. 3.c, while the one obtained using (15) is given in Fig. 3.b. We can remark that when the PDE includes the additional terms, the square is well segmented, whereas when the PDE does not include these additional terms we obtain a circle instead of the expected square. The previous results show that it is of prime necessity to take the additional terms into account in order to reach a true minimum of the criterion.
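The synthetic test image described above can be reproduced as follows; the image size, square size and square placement are not given in the paper and are chosen arbitrarily here.

import numpy as np

def synthetic_square(size=128, side=48, seed=0):
    # Homogeneous square of intensity 120 on a 160 background, plus Gaussian noise
    # of variance 20 (standard deviation sqrt(20)).
    rng = np.random.default_rng(seed)
    img = np.full((size, size), 160.0)
    c0 = (size - side) // 2
    img[c0:c0 + side, c0:c0 + side] = 120.0
    img += rng.normal(0.0, np.sqrt(20.0), img.shape)
    return img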
3
Segmentation of Homogeneous Regions Based on the Covariance Matrix for Color Images: Application to Face Detection in Video Sequences
This part deals with the segmentation of homogeneous regions in multispectral images [22,23,24]. As an example, we show that the covariance matrix can be a very powerful tool for segmentation of homogeneous color regions. In fact, the determinant of the covariance matrix proves to be a relevant color region homogeneity measurement [25]. Minimizing this quantity means that we want
to decrease the complexity of a region [26,27]. This tool may be used for face detection in image sequences. The intensity of color images is represented by a function I : Ω ⊂ IR^2 → IR^3 where I = [I^1, I^2, I^3]^T. By analogy with the study for greyscale images, we propose to minimize the determinant of the covariance matrix of the considered regions. Let us note Σ_in and Σ_out the covariance matrices of respectively Ω_in(τ) and Ω_out(τ):

\Sigma_\cdot = \begin{pmatrix} \sigma_\cdot^{11} & \sigma_\cdot^{12} & \sigma_\cdot^{13} \\ \sigma_\cdot^{21} & \sigma_\cdot^{22} & \sigma_\cdot^{23} \\ \sigma_\cdot^{31} & \sigma_\cdot^{32} & \sigma_\cdot^{33} \end{pmatrix}   (16)

where:

\sigma_\cdot^{ij} = \frac{1}{V_\cdot} \int_{\Omega_\cdot(\tau)} (I^i - \mu_\cdot^i)(I^j - \mu_\cdot^j)\,dx\,dy, \quad \mu_\cdot^i = \frac{1}{V_\cdot} \int_{\Omega_\cdot(\tau)} I^i(x,y)\,dx\,dy, \quad \text{with } \cdot = in \text{ or } out.
We obviously have σ_·^{ij} = σ_·^{ji}. As a color region homogeneity measurement, a function of the covariance matrix determinant is used as region descriptor. We choose the following set of descriptors:

k^{(out)} = \Phi(\det(\Sigma_{out})), \quad k^{(in)} = \Phi(\det(\Sigma_{in})) \quad \text{and} \quad k^{(b)} = \lambda,   (17)
where Φ(r) is a positive function of class C^1(IR) and λ a positive constant. Let us now compute the evolution equation of the active contour by computing the different terms of (11) [28]. We obtain the following evolution equation:

\frac{\partial \Gamma(\tau)}{\partial \tau} = \Big[\, \Phi(\det(\Sigma_{in})) - \Phi(\det(\Sigma_{out})) + \lambda\kappa
  + \Phi'(\det(\Sigma_{in})) \sum_{k,l=1}^{3} (I^k - \mu_{in}^k)(I^l - \mu_{in}^l)\,\det(M_{in}^{kl})(-1)^{k+l}
  - \Phi'(\det(\Sigma_{out})) \sum_{k,l=1}^{3} (I^k - \mu_{out}^k)(I^l - \mu_{out}^l)\,\det(M_{out}^{kl})(-1)^{k+l}
  - 3\det(\Sigma_{in})\,\Phi'(\det(\Sigma_{in})) + 3\det(\Sigma_{out})\,\Phi'(\det(\Sigma_{out})) \,\Big] N.   (18)

The matrix M_·^{kl} is deduced from the covariance matrix Σ_· by suppressing the k-th row and the l-th column. The last four lines of the equation are the additional terms coming from the variation of the descriptors with τ. As it has been previously proved for greyscale images, these terms have to be considered in order to make the active contour converge towards a minimum of the criterion. For the experiments, we take Φ(r) = log(1 + r^2), which gives Φ'(r) = 2r/(1 + r^2). We choose λ = 10.
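As an illustration (not the authors' code), the bracketed speed of (18) can be evaluated for a three-channel image with the cofactor sums written via the adjugate; the sketch assumes the channels are stacked as I[..., 0..2] in (Y, Cb, Cr), that the region covariances are non-singular, and it leaves out the smoothing term λκ, to be added from the level-set curvature as before.

import numpy as np

def cov_det_speed(I, phi):
    # Bracketed speed of (18) without the lambda*kappa term; phi < 0 inside Omega_in.
    Phi = lambda r: np.log(1.0 + r ** 2)
    dPhi = lambda r: 2.0 * r / (1.0 + r ** 2)
    speed = np.zeros(phi.shape)
    for region, sgn in ((phi < 0, 1.0), (phi >= 0, -1.0)):
        X = I[region].reshape(-1, 3)
        mu = X.mean(axis=0)
        Sigma = np.cov(X, rowvar=False, bias=True)     # region covariance (assumed invertible)
        d = np.linalg.det(Sigma)
        cof = np.linalg.inv(Sigma).T * d               # cofactors (-1)^(k+l) det(M^kl)
        diff = I - mu
        quad = np.einsum('hwk,kl,hwl->hw', diff, cof, diff)
        speed += sgn * (Phi(d) + dPhi(d) * quad - 3.0 * d * dPhi(d))
    return speed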
Fig. 4. Visualization of the evolution of the contour (a: initial contour, b: iteration 500, c: final contour) and of the velocity (d: initial velocity, e: iteration 500, f: final velocity), without the boundary-based term, for face segmentation using the covariance matrix.
The color space selected is (Y, C_b, C_r), where I^1 = Y represents the luminance and I^2 = C_b and I^3 = C_r represent the two chrominances. We propose to use the descriptors based on the covariance matrix to detect human faces in video sequences. This detection may be used for video coding to encode the human face selectively. For a given compression ratio, the face can be transmitted at a higher rate to the detriment of the background. This interesting property is valuable for videoconferences, where the most important and most variable information is located on the face [29]. In order to segment a homogeneous region, we start with an initial curve inside the region of interest. The curve is then supposed to expand until it reaches the boundary of the homogeneous region. The algorithm using the covariance matrix is performed on an image of the video sequence "akiyo" in order to detect the speaker's face. The evolution of the curve is given in Fig. 4 a,b,c. The contour evolves and finally converges on the boundaries of the face (Fig. 4.c). The face is then accurately segmented. The amplitude of the velocity is also given in Fig. 4 d,e,f and we can observe that the face region is well separated from other regions. The amplitude of the velocity is normalized between 0 and 255. In order to track the face on the whole sequence, we first initialize the first frame by a circle inside the face to track. Then, we make the active contour evolve using (18). The final contour of the previous image is then chosen as an initial curve for the next image. The results for the sequence "akiyo" are given in Fig. 5 (λ = 10) and, for the sequence "erik", in Fig. 6 (λ = 20). The face is well detected and tracked on the whole sequence.
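The tracking scheme described above (the final contour of one frame initializes the next frame) can be sketched as follows; evolve_to_convergence is an assumed helper standing for the covariance-driven evolution (18) run until the contour stabilizes, not a routine from the paper.

def track_faces(frames, phi0, evolve_to_convergence):
    # frames: sequence of (Y, Cb, Cr) images; phi0: level-set of a circle placed
    # inside the face in the first frame.
    contours = []
    phi = phi0
    for I in frames:
        phi = evolve_to_convergence(I, phi)   # the final contour of the previous frame
        contours.append(phi.copy())           # is reused as the initial curve here
    return contours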
Fig. 5. Detection of the face on the video sequence “akiyo” (final contour and extracted face).
Fig. 6. Detection of the face on the video sequence “erik” (final contour and extracted face).
4
Conclusion
In this paper, we propose a new Eulerian minimization method to compute the velocity vector of an active contour that ensures its fastest evolution towards a minimum of a criterion including both region-based and boundary-based terms. With our approach, we can readily take into account the case of regiondependent descriptors that are globally attached to the evolving regions and so that vary during the propagation of the curve. We show that the variation of these descriptors induces additional terms in the evolution equation of the
active contour. As far as homogeneous color region segmentation is concerned, we take the covariance matrix determinant as a descriptor. This statistical region-dependent descriptor appears to be a very relevant tool and so it has been successfully applied for face detection in video sequences.

Appendix A. The issue is to differentiate the following domain functional:

J(\Omega_i) = \int_{\Omega_i} k_i(x,y,\Omega_i)\,dx\,dy.

Since the set of all domains does not have a structure of vector space, let us make the regions evolve through a family of transformations (T(τ, ·))_{τ≥0}, smooth and bijective. For a point p = [x, y]^T, we note: p(τ) = T(τ, p) with T(0, p) = p, and Ω_i(τ) = T(τ, Ω_i) with T(0, Ω_i) = Ω_i. Let us then define the velocity vector field V such that V(τ, p(τ)) = ∂T/∂τ(τ, p). As we are interested in small deformations for the computation of the derivative, we expand the transformation according to a first-order Taylor formula:

T(\tau, p) = T(0, p) + \tau \frac{\partial T}{\partial \tau}(0, p) = p + \tau V(p), \quad \text{where } V(p) = \frac{\partial T}{\partial \tau}(0, p).

We then introduce three main definitions:

1. The Eulerian derivative of J(Ω_i) (in the direction of V):

dJ(\Omega_i, V) = \lim_{\tau \to 0} \frac{J(\Omega_i(\tau)) - J(\Omega_i)}{\tau}   (19)

2. The material derivative of k_i(p, Ω_i):

\dot{k}_i(p, \Omega_i, V) = \lim_{\tau \to 0} \frac{k_i(p + \tau V(p), \Omega_i + \tau V(p)) - k_i(p, \Omega_i)}{\tau}   (20)

3. The shape derivative of k_i(p, Ω_i):

k_i'(p, \Omega_i, V) = \lim_{\tau \to 0} \frac{k_i(p, \Omega_i + \tau V(p)) - k_i(p, \Omega_i)}{\tau}   (21)

Obviously, by expanding (20) according to a first-order Taylor formula, we have:

\dot{k}_i(p, \Omega_i, V) = k_i'(p, \Omega_i, V) + \nabla k_i(p, \Omega_i) \cdot V(p)   (22)

We then compute more precisely the Eulerian derivative of J(Ω_i). We have:

\frac{J(\Omega_i(\tau)) - J(\Omega_i)}{\tau} = \frac{1}{\tau}\Big[ \int_{\Omega_i(\tau)} k_i(p(\tau), \Omega_i(\tau))\,dp - \int_{\Omega_i} k_i(p, \Omega_i)\,dp \Big]   (23)
In the first integral, we make the variable change p(τ) = p + τV(p), where V(p) = [V_1(x, y), V_2(x, y)]^T, which gives:

\int_{\Omega_i(\tau)} k_i(p(\tau), \Omega_i(\tau))\,dp = \int_{\Omega_i} k_i(p + \tau V(p), \Omega_i + \tau V(p))\,|\det J_\tau(p)|\,dp

where J_τ(p) is the following Jacobian matrix:

J_\tau(p) = \begin{pmatrix} 1 + \tau \frac{\partial V_1}{\partial x} & \tau \frac{\partial V_1}{\partial y} \\ \tau \frac{\partial V_2}{\partial x} & 1 + \tau \frac{\partial V_2}{\partial y} \end{pmatrix}

and so we have det J_τ(p) = 1 + τ div(V(p)) + τ^2 det(∇V(p)). Then we obtain lim_{τ→0} (det J_τ(p) − 1)/τ = div(V(p)). Equation (23) then becomes:

\frac{J(\Omega_i(\tau)) - J(\Omega_i)}{\tau} = \frac{1}{\tau}\Big[ \int_{\Omega_i} k_i(p + \tau V(p), \Omega_i + \tau V(p))\,|\det J_\tau(p)|\,dp - \int_{\Omega_i} k_i(p, \Omega_i)\,dp \Big]   (24)

We can suppose that det(J_τ(p)) ≠ 0, ∀τ, ∀p. We may then take det(J_τ(p)) > 0, and we develop the expression (24) as follows by adding a term in the second integral while suppressing the same term in the first integral:

\frac{J(\Omega_i(\tau)) - J(\Omega_i)}{\tau} = I_1 + I_2   (25)

where:

I_1 = \int_{\Omega_i} \frac{k_i(p + \tau V(p), \Omega_i + \tau V(p)) - k_i(p, \Omega_i)}{\tau}\,\det(J_\tau(p))\,dp   (26)
I_2 = \int_{\Omega_i} k_i(p, \Omega_i)\,\frac{\det(J_\tau(p)) - 1}{\tau}\,dp   (27)

Let us make τ tend towards 0. Using (22) and Definitions (20), (21), we get:

\lim_{\tau \to 0} I_1 = \int_{\Omega_i} \dot{k}_i(p, \Omega_i, V)\,dp = \int_{\Omega_i} k_i'(p, \Omega_i, V)\,dp + \int_{\Omega_i} \nabla k_i(p, \Omega_i) \cdot V(p)\,dp
\lim_{\tau \to 0} I_2 = \int_{\Omega_i} k_i(p, \Omega_i)\,\mathrm{div}(V)\,dp

And so, for the Eulerian derivative, we find:

dJ(\Omega_i, V) = \int_{\Omega_i} k_i'(p, \Omega_i, V)\,dp + \int_{\Omega_i} \big(\nabla k_i(p, \Omega_i) \cdot V(p) + k_i(p, \Omega_i)\,\mathrm{div}(V(p))\big)\,dp
             = \int_{\Omega_i} k_i'(p, \Omega_i, V)\,dp + \int_{\Omega_i} \mathrm{div}(k_i V)\,dp   (28)
When applying the Green-Riemann theorem in (28), we finally obtain:

dJ(\Omega_i, V) = \int_{\Omega_i} k_i'(p, \Omega_i, V)\,dp - \int_{\partial\Omega_i} k_i\,(V \cdot N_{\partial\Omega_i})\,ds   (29)

where N_{∂Ω_i} is the unit inward normal to ∂Ω_i. The Eulerian derivative is noted J'(τ) in the paper, and the shape derivative k_i' is noted ∂k_i/∂τ.
Remark: For simplicity, we choose the following variation in the proof: p(τ) = p + τV(p). The proof may also be developed with more general variations such as p(τ) = f(p, τ), with f a smooth function and f(·, τ) bijective.
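As a sanity check, not part of the paper, formula (29) can be verified numerically in a simple special case: k_i fixed (so the shape derivative vanishes) and Ω_i a disk of radius r expanding with unit radial speed, for which dJ/dr must equal the boundary integral of k_i. The sketch below uses a plain polar quadrature; the test function in the comment is an arbitrary example.

import numpy as np

def check_disk_derivative(k, r=1.0, h=1e-4, n=2000):
    th = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    def J(radius):
        # integral of k over the disk of given radius (midpoint rule in rho)
        rho = (np.arange(n) + 0.5) * radius / n
        R, T = np.meshgrid(rho, th, indexing='ij')
        return np.sum(k(R * np.cos(T), R * np.sin(T)) * R) * (radius / n) * (2.0 * np.pi / n)
    numeric = (J(r + h) - J(r - h)) / (2.0 * h)                       # finite-difference dJ/dr
    boundary = np.sum(k(r * np.cos(th), r * np.sin(th))) * r * (2.0 * np.pi / n)
    return numeric, boundary   # e.g. k = lambda x, y: 1.0 + x**2 gives matching values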
References 1. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. Int. J. Computer Vision 1 (1988) 321–332 2. Cohen, L.: On active contour models and balloons. Computer Vision, Graphics and Image Processing : Image Understanding 53 (1991) 211–218 3. Caselles, V., Kimmel, R., Sapiro, G.: Geodesic active contours. Int. Journal of Computer Vision 22 (1997) 61–79 4. Cohen, L., Bardinet, E., Ayache, N.: Surface reconstruction using active contour models. In: SPIE Conf. on Geometric Methods in Comp. Vis., San Diego, CA (1993) 5. Ronfard, R.: Region-based strategies for active contour models. Int. Journal of Computer Vision 13 (1994) 229–251 6. Zhu, S., Yuille, A.: Region competition: unifying snakes, region growing, and bayes/MDL for multiband image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 18 (1996) 884–900 7. Chakraborty, A., Staib, L., Duncan, J.: Deformable boundary finding in medical images by integrating gradient and region information. IEEE Trans. Medical Imag. 15 (1996) 859–870 8. Paragios, N., Deriche, R.: Geodesic active regions for motion estimation and tracking. In: Int. Conf. on Computer Vision, Corfu Greece (1999) 9. Chesnaud, C., R´efr´egier, P., Boulet, V.: Statistical region snake-based segmentation adapted to different physical noise models. IEEE PAMI 21 (1999) 1145–1156 10. Samson, C., Blanc-F´eraud, L., Aubert, G., Zerubia, J.: A level set model for image classification. International Journal of Computer Vision 40 (2000) 187–197 11. Chan, T., Vese, L.: Active contours without edges. IEEE Transactions on Image Processing 10 (2001) 266–277 12. Debreuve, E., Barlaud, M., Aubert, G., Darcourt, J.: Space time segmentation using level set active contours applied to myocardial gated SPECT. IEEE Transactions on Medical Imaging 20 (2001) 643–659 13. Amadieu, O., Debreuve, E., Barlaud, M., Aubert, G.: Inward and outward curve evolution using level set method. In: International Conference on Image Processing, Japan (1999) 14. Jehan-Besson, S., Barlaud, M., Aubert, G.: Video object segmentation using eulerian region-based active contours. In: Int. Conf. on Computer Vision, Vancouver (2001) 15. Jehan-Besson, S., Barlaud, M., Aubert, G.: Region-based active contours for video object segmentation with camera compensation. In: Int. Conf. on Image Processing, Thessaloniki, Greece (2001)
16. Jehan-Besson, S., Barlaud, M., Aubert, G.: A 3-step algorithm using region-based active contours for video objects detection. EURASIP Journal on Applied Signal Processing, Special issue on Image Analysis for Multimedia Interactive Services (To appear, june 2002) 17. Sethian, J.: Numerical algorithms for propagating interfaces: Hamilton-jacobi equations and conservation laws. J. of Differential Geometry 31 (1990) 131–161 18. Delingette, H., Montagnat, J.: Topology and shape constraints on parametric active contours. Computer Vision and Image Understanding 83 (2001) 140–171 19. Montagnat, J., Delingette, H., Ayache, N.: A review of deformable surfaces: topology, geometry and deformation. Image and Vision Computing 19 (2001) 1023–1040 20. Osher, S., Sethian, J.: Fronts propagating with curvature-dependent speed: Algorithms based on hamilton-jacobi formulation. J. of Comp. Physics 79 (1988) 12–49 21. Caselles, V., Catt´e, F., Coll, T., Dibos, F.: A geometric model for active contours in image processing. Numerische Mathematik 66 (1993) 1–31 22. Tu, Z., Zhu, S., Shum, H.: Image segmentation by data driven markov chain monte carlo. In: International Conference on Computer Vision, Vancouver, Canada (2001) 23. Kimmel, R., Malladi, R., Sochen, N.: Images as embedded maps and minimal surfaces: movies, color, texture, and volumetric medical images. International Journal of Computer Vision 39 (2000) 111–129 24. Kimmel, R., Sochen, N.: Geometric-variational approach for color image enhancement and segmentation. In: Scale-space theories in computer vision, Lecture Notes in Computer Science. Volume 1682. (1999) 294–305 25. Sangwine, S., Horne, R., eds.: The colour image processing handbook. Chapman & Hall (1998) 26. Gray, R., Young, J., Aiyer, A.: Minimum discrimination information clustering: modeling and quantization with gauss mixtures. In: International Conference on Image Processing, Thessaloniki, Greece (2001) 27. Pateux, S.: Spatial segmentation of color images according to MDL formalism. In: ´ Int. Conf. on Color in Graphics and Image Processing, St Etienne, France (2000) 28. Jehan-Besson, S., Barlaud, M., Aubert, G.: DREAM2 S: Deformable regions driven by an eulerian accurate minimization method for image and video segmentation. Technical Report RR 2001-14, Laboratoire I3S (2001) 29. Amonou, I., Duhamel, P.: Iterative backward segmentation for hierarchical wavelet image coding. In: ICIP, Vancouver, Canada (2000)
Parsing Images into Region and Curve Processes Zhuowen Tu and Song-Chun Zhu Department of Computer and Information Science The Ohio State University {ztu,szhu}@cis.ohio-state.edu
Abstract. Natural scenes consist of a wide variety of stochastic patterns. While many patterns are represented well by statistical models in two dimensional regions as most image segmentation work assume, some other patterns are fundamentally one dimensional and thus cause major problems in segmentation. We call the former region processes and the latter curve processes. In this paper, we propose a stochastic algorithm for parsing an image into a number of region and curve processes. The paper makes the following contributions to the literature. Firstly, it presents a generative rope model for curve processes in the form of Hidden Markov Model (HMM). The hidden layer is a Markov chain with each element being an image base selected from an over-complete basis, such as Difference of Gaussians (DOG) or Difference of Offset Gaussians (DOOG) at various scales and orientations. The rope model accounts for the geometric smoothness and photometric coherence of the curve processes. Secondly, it integrates both 2D region models, such as textures, splines etc with 1D curve models under the Bayes framework. Because both region and curve models are generative, they compete to explain input images in a layered representation. Thirdly, it achieves global optimization by effective Markov chain Monte Carlo methods in the sense of maximizing a posterior probability. The Markov chain consists of reversible jumps and diffusions driven by bottom up information. The algorithm is applied to real images with satisfactory results. We verify the results through random synthesis and compare them against segmentations with region processes only.
1
Introduction
Fig. 1. An example of image parsing and decomposition: a) a color image, b) a point process, c) a curve/line process, d) a color region, e) two texture regions, f) faces and words.

Natural images consist of a wide variety of stochastic visual patterns. As an example, Figure 1 illustrates how an image is decomposed into point, line, curve processes, regions of coherent color and textures, and objects. Parsing an image into its constituent components is a fundamental problem in image understanding. It augments the current pixel-based image representation to a semantic object-based description, and thus has broad impacts on many applications, including image compression, photo editing, image database retrieval, object recognition in vision, and non-photo-realistic rendering (NPR) in graphics. Solving the image parsing problem needs 1) a number of types of probabilistic models which can characterize various visual patterns and are compatible (or comparable) with each other, and 2) inference algorithms which can effectively handle
different types of probabilistic models and achieve globally optimal solution. These are general topics of this paper. In the literature, image segmentation work usually assume that images contain regions of homogeneous properties, such as textures, textons (attributed points), colors, and shading. While many visual patterns are represented well by such 2D models, there are many other visual patterns which are fundamentally one dimensional and thus cause major problems in segmentation. For example, see those lines, trees, grids in Figures 6,8, and 10. We call the 2D patterns region processes and 1D patterns curve processes. In this paper, we propose a stochastic algorithm for parsing an image into a number of region and curve processes. The paper makes contributions in the following aspects. Firstly, it presents a generative rope model for curve patterns. The rope model is in the form of hidden Markov model. The hidden layer is a chain of connected knots. Each knot has 1-3 image bases selected from an over-complete basis[9], such as DOG (difference of Gaussians) or DOOG (difference of offset Gaussians) at various scales and orientations. The Markov model accounts for not only the geometric smoothness as in the SNAKE[10] or Elastica models[8], but also the photometric coherence along the curve. Secondly, it integrates both 2D region and 1D curve models for image parsing. We adopt a generative image model which represents an image I as a superposition of two layers I = Ir + Ic . Ir is partitioned into disjoint regions and Ic is a linear sum of some image bases which are grouped into curves. The algorithm consists of two parts. Part I is image segmentation on Ir , and this is done by a recent data driven Markov chain Monte Carlo (DDMCMC) algorithm[11]. Part II infers the curves patterns from Ic , and is the main focus of this paper.
By virtue of the generative image models, the cooperation of the two parts is governed in the posterior probability. Thirdly, it achieves global optimization by effective Markov chain Monte Carlo. Due to the use of different families of models, the solution space consists of many subspaces of varying dimensions. Therefore the Markov chain consists of reversible jumps [4,5] and stochastic diffusions to explore the complex solution space. The jump dynamics realize the death/birth, split and merge of regions and curves, the switching of models, and so on, while the diffusion process realizes region growing/competition [12] and curve deformation [10]. The algorithm is applied to a set of real images and some results are shown in Figures 8 and 10. We verify the results through random synthesis from the computed solution and compare the results against segmentations with region processes only. We organize this paper as follows. We first formulate the problem in Section 2. Section 3 briefly overviews the region models used by region processes. Section 4 discusses the rope model for curve patterns. Section 5 discusses the structure of the solution space. Then we present the integrated algorithm in Section 6. Some experiments are shown in Section 7. We conclude the paper with a discussion in Section 8.
2
Problem Formulation in Bayes Statistics
Let Λ = {(i, j) : 1 ≤ i ≤ L, 1 ≤ j ≤ H} be an image lattice, and I be an intensity image defined on Λ. We assume that the image I is a superposition of two layers,

I = I^r + I^c,   (1)

where I^r and I^c are called the region layer and the curve layer respectively. We assume that the region layer I^r consists of a number of K^r disjoint regions which form a partition of the lattice Λ:

\cup_{i=1}^{K^r} R_i = \Lambda, \quad R_i \cap R_j = \emptyset, \ \forall i \neq j.

Let I^r_R denote the image intensities in a region R; a region is said to be coherent in the sense that I^r_R is a realization from a probabilistic model p_ℓ(I_R; Θ). ℓ indexes the model or a stochastic process, and Θ is a vector-valued parameter of the model. We will discuss the families of region models shortly. Therefore the region processes are represented by a vector W^r of unknown dimension:

W^r = (K^r, \{(R_i, \ell_i, \Theta_i);\ i = 1, 2, \ldots, K^r\}).   (2)

We assume that the curve layer I^c is a linear sum of a number of N image bases, following the literature of image coding [7,9]:

I^c = \sum_{j=1}^{N} \alpha_j B_j, \quad B_j \in \Delta,   (3)
where α_j is the coefficient and the base B_j is selected from an over-complete basis or dictionary Δ; for example, Δ includes Difference of Gaussians (DoG) and Difference of Offset Gaussians (DOOG) over a group of transforms (scaling, rotation, and translation). These image bases are grouped into a number of K^c ≤ N curves C_i, i = 1, ..., K^c, based on a probabilistic curve model that we deliberate in Section 4.2. C_i is a list of bases with certain geometric and photometric regularities. Thus the curve processes are denoted by a vector W^c and I^c is a deterministic function of W^c:

W^c = (K^c, \{C_i;\ i = 1, 2, \ldots, K^c\}), \quad I^c = I^c(W^c).   (4)

In a Bayesian framework, our objective is to make inference about W = (W^r, W^c) from I that maximizes a posterior probability,

W^* = (W^r, W^c)^* = \arg\max_{\Omega_W} p(W^r, W^c \mid I) = \arg\max_{\Omega_W} p(I - I^c(W^c) \mid W^r)\, p(W^r)\, p(W^c).
In the computation, the region and curve processes W r , W c are coordinated by the generative models expressed in equations (1) and (3), i.e. Ir = I − Ic (W c ). In a language of neuroscience, the two layers W r , W c (supposed they are represented by two cortical areas) have some mutual inhibition as they compete to explain the observed image, like the lateral inhibition between adjacent neurons (which are bases Bj in our representation). This enables the use of multiple families of image models either 2D or 1D , and distinguishes the generative methods from the discriminative methods for image segmentation (e.g. [6]). In the following two sections, we discuss the mathematical models for the region and curve processes.
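For illustration only, the two-layer generative model of (1) and (3) can be sketched as follows: a few oriented DoG-like bases are summed into a curve layer and added to a region layer. The base parameterization below is a stand-in chosen here, not the 134-element dictionary of the paper.

import numpy as np

def dog_base(shape, x0, y0, theta, sigma=2.0, elong=4.0):
    # An elongated difference-of-Gaussians base at (x0, y0) with orientation theta
    # (illustrative stand-in for an element of the dictionary Delta).
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    u = (xx - x0) * np.cos(theta) + (yy - y0) * np.sin(theta)
    v = -(xx - x0) * np.sin(theta) + (yy - y0) * np.cos(theta)
    g = lambda s: np.exp(-(u / (elong * s)) ** 2 - (v / s) ** 2)
    return g(sigma) - g(1.6 * sigma)

def compose(I_r, bases, alphas):
    # I = I^r + I^c with I^c = sum_j alpha_j B_j  (equations (1) and (3))
    I_c = sum(a * B for a, B in zip(alphas, bases))
    return I_r + I_c, I_c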
3
Probabilistic Models for Region Processes
In this section, we briefly overview the probabilistic models for region processes, following the DDMCMC work in [11].

1. The likelihood for the region layer I^r. This is a product of individual region models,

p(I^r \mid W^r) = \prod_{i=1}^{K^r} p_{\ell_i}(I^r_{R_i}; \Theta_i).

ℓ_i ∈ {1, 2, 3, 4} indexes the following four families of intensity models.
Family 1 (ℓ = 1). This is a simple Gaussian model for flat regions. It assumes that pixel intensities in a region R are constant subject to independently and identically distributed (i.i.d.) Gaussian noise.
Family 2 (ℓ = 2). This is a non-parametric model for cluttered regions. It assumes that pixel intensities are i.i.d. distributed according to a histogram which is discretized as a step function expressed by a vector Θ = (h_1, h_2, ..., h_G).
Family 3 (ℓ = 3). This is a Markov random field model (FRAME) for textured regions. We choose a set of 8 filters in FRAME and formulate the model in pseudo-likelihood form.
Family 4 (ℓ = 4). This is a spline model for regions with gradually changing intensities, such as lighting areas.
We refer to [11] for the detailed specification of these models. In summary, the algorithm can switch between the families to search for a good fit in terms of a high likelihood function for each region I^r_R.

2. The prior model for the region process W^r. The prior model p(W^r) penalizes model complexity and ensures boundary smoothness. Let A_i = |R_i| be the area (pixel number) of R_i, and Γ_i = ∂R_i be the region boundary; then

p(W^r) = p(K^r) \prod_{i=1}^{K^r} p(R_i)\,p(\ell_i)\,p(\Theta_i \mid \ell_i) = p(K^r) \prod_{i=1}^{K^r} p(A_i)\,p(\Gamma_i)\,p(\ell_i)\,p(\Theta_i \mid \ell_i)   (5)

p(K^r) \propto e^{-\lambda_0 K^r}, \quad p(A_i) \propto e^{-\gamma A_i^c}, \quad p(\Gamma_i) \propto e^{-\mu \int_{\Gamma_i} ds}, \quad p(\Theta_i) \propto e^{-\nu\,\mathrm{len}(\Theta_i)}.

In these probabilities, γ, c, ν, µ are some constants and len(Θ_i) is the number of parameters in Θ_i. A DDMCMC algorithm using this set of priors and image models was tested on a large set of images with satisfactory results [11]. It was also tested on a benchmark dataset of 50 natural images by the Berkeley group, and achieved the best results among the algorithms that have been tested. See http://www.cs.berkeley.edu/~dmartin/segbench/BSDS100/html/benchmark. However, it is evident in these experiments that such region models are not suitable for 1D curve processes. This motivates our curve models below.
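The structure of the prior (5) can be illustrated as a log-probability up to an additive constant; the constants λ_0, γ, c, µ, ν are the unspecified parameters of the paper, and p(ℓ_i) is taken uniform over the four families here as an assumption.

import numpy as np

def log_region_prior(regions, lam0=1.0, gamma=1.0, c=0.9, mu=1.0, nu=1.0):
    # regions: list of dicts with 'area', 'boundary_length', 'n_params' for each R_i.
    K = len(regions)
    logp = -lam0 * K                              # p(K^r) ∝ exp(-lambda0 K^r)
    for R in regions:
        logp += -gamma * R['area'] ** c           # p(A_i) ∝ exp(-gamma A_i^c)
        logp += -mu * R['boundary_length']        # p(Gamma_i) ∝ exp(-mu ∮ ds)
        logp += -nu * R['n_params']               # p(Theta_i) ∝ exp(-nu len(Theta_i))
        logp += -np.log(4.0)                      # uniform p(ell_i) over the four families (assumption)
    return logp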
4
Probabilistic Models for Curve Processes
In this section, we present a rope model for curve processes, and we start with a brief review of existing curve models.
Fig. 2. Four existing curve models. a). SNAKE (Kass et al 1988), b). Random curves sampled from the Elastica model (Mumford, 1994). c). A random curve sampled from a MRF curve model (Zhu, 1999), d). A profile model for modeling faces (Cootes et al, 1995) and contour tracking (Isard and Blake, 1996).
4.1 Previous Curve Models
There are four interesting curve models in the literature, as Figure 2 shows. The first one is the SNAKE model [10]. Let C(s), s ∈ [a, b], be a continuous curve; a SNAKE model has a smoothness term plus an image gradient term for intensity discontinuity:

p(C) \sim \exp\Big\{- \int_a^b \big[-|\nabla I|^2 + \alpha \dot{C}^2(s) + \beta \ddot{C}^2(s)\big]\,ds\Big\}.

Mumford studied an Elastica model in [8], motivated by an Uhlenbeck process for a moving particle with friction; let κ(s) be the curvature, then

p(C) \sim \exp\Big\{- \int_a^b \big[\beta + \alpha \kappa^2(s)\big]\,ds\Big\}.

Figure 2.b shows some typical curves sampled from the above model. The third model was proposed by Zhu in [13]; it integrates Gestalt properties and symmetry in a Markov random field model for closed contours, and is learned by a maximum entropy principle. Figure 2.c shows a typical curve sampled from this model. The fourth model is used by Cootes et al. [3] for face contours and is also used in tracking [2]. It is a smooth contour, but it also measures image profiles along lines perpendicular to the contour, as Figure 2.d shows. This profile is used to locate edges nearby. Recently, August and Zucker [1] introduced a so-called curve indicator random field model. As we can see, the gradient term in the SNAKE model and the profile measure are both feature detectors; they are not generative models for image intensities.
4.2 A Generative Rope Model of Curve Processes
In real images, the curve patterns are generated by elongated objects, such as trees, stems, cables, strings, rails, and so on. Instead of being curves of constant intensity, such objects often have interesting intensity profiles due to lighting and are blended with the background regions through image formation.
Fig. 3. A rope model of curves consists of a chain of knots. A knot has 1-3 image bases shown by the ellipses.
The curve processes W^c have a number of curves {C_i; i = 1, 2, ..., K^c}, and each curve C is a chain of n knots as shown in Figure 3:

C = (n, \zeta_1, \zeta_2, \ldots, \zeta_n).

Each knot ζ consists of 1-3 image bases. One major base, shown by a large ellipse, is often associated with 0-2 small minor bases to achieve smooth blending with background regions, thus

\zeta = \big((\alpha_1, B_1), \ldots, (\alpha_k, B_k)\big), \quad k \le 3.

α_1, ..., α_k are the coefficients of these bases and thus specify the photometric properties, and B_1, ..., B_k are bases selected from an over-complete basis or dictionary Δ. We utilize 134 base functions shown in Figure 4, which are Gaussian, DoG, and DOOG bases at various orientations and scales, and can be translated to any location.

Fig. 4. Base functions used for the dictionary Δ.

For each curve C we have a 2nd-order Markov chain model for the knots:

p(C) = p(n)\,p(\zeta_1)\,p(\zeta_2 \mid \zeta_1) \prod_{i=3}^{n} p(\zeta_i \mid \zeta_{i-1}, \zeta_{i-2}),

where p(n) controls the length of the curve. The conditional probabilities p(ζ_2 | ζ_1) and p(ζ_i | ζ_{i-1}, ζ_{i-2}) are defined so that the curve is smooth in geometry (by energy terms for co-linearity and co-circularity of adjacent bases, as in the SNAKE/Elastica models) and in appearance (by an energy term for the change of base types and coefficients). This model can be learned by a minimax entropy method as in [13]. We draw a set of random curves (ropes) from the model p(C) in Figure 5. Figure 5.a shows the geometric curves for the ropes, and the knots of these chains are shown in Figure 5.b. Then Figure 5.c is the curve layer I^c, which is a linear sum of all bases in the ropes, as equation (3) defines. To summarize, the curve model for p(W^c) is

p(W^c) = p(K^c) \prod_{i=1}^{K^c} p(C_i), \quad I^c = I^c(W^c).
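The structure of the chain model p(C) can be illustrated as a log-probability over a simplified knot representation. The paper learns its conditional probabilities by minimax entropy; the pairwise energies below (a co-linearity term and a coefficient-smoothness term) are placeholders that only show the shape of the 2nd-order chain, not the learned model.

import numpy as np

def log_rope_prob(knots, lam_len=0.1, a_geom=1.0, a_phot=0.5):
    # knots: list of (x, y, alpha) for the major base of each knot (simplification).
    n = len(knots)
    logp = -lam_len * n                                  # p(n): exponential length penalty (assumption)
    for i in range(2, n):
        (x0, y0, a0), (x1, y1, a1), (x2, y2, a2) = knots[i - 2], knots[i - 1], knots[i]
        d1 = np.array([x1 - x0, y1 - y0])
        d2 = np.array([x2 - x1, y2 - y1])
        bend = 1.0 - np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2) + 1e-12)
        logp += -a_geom * bend                           # geometric smoothness (co-linearity)
        logp += -a_phot * (a2 - a1) ** 2                 # photometric coherence of coefficients
    return logp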
Fig. 5. Random ropes sampled from the prior model p(C). a). The geometric curves for the sampled ropes. b). A symbolic representation of the sampled ropes. c). Curve layer I c by the ropes.
5
The Solution Space Ω
Given the image models in the previous sections, we now briefly analyze the solution space Ω for W = (W^r, W^c). As K^r, K^c are unknown numbers, the solution space Ω is

\Omega = \big[\cup_{K^r} \Omega^r_{K^r}\big] \times \big[\cup_{K^c} \Omega^c_{K^c}\big],

where Ω^r_{K^r} is the space for W^r with exactly K^r regions and Ω^c_{K^c} is the space for W^c with exactly K^c curves.

1. Solution space for Ω^r_{K^r}. In image segmentation, there are two ways of defining a region. One represents a region boundary Γ(s) = ∂R as a continuous contour parameterized by s and treats the image domain as a 2D plane. The other considers a segmentation as a label map ψ_Λ, and a region R_i is a set of pixels sharing the same label, say n,

R_i = \{(x, y) : \psi(x, y) = n,\ (x, y) \in \Lambda\}, \quad \text{for } i = 1, 2, \ldots, K.

In this paper, we adopt the label map notation and define the solution space Ω^r_{K^r} as a product of a K^r-partition space and K^r spaces for the image models:

\Omega^r_{K^r} = \big[\, \pi_{K^r} \times \underbrace{\Theta \times \cdots \times \Theta}_{K^r} \,\big],

where Θ = ∪_{i=1}^{4} Θ_{g_i} is the union of the parameter spaces of the four model families. The set of all K^r-partitions, π_{K^r}, is a quotient space of the set of all possible K^r-labelings divided by a permutation group, PG, for the labels. It can be denoted as

\pi_{K^r} = \{(R_1, R_2, \ldots, R_{K^r}) :\ \cup_{i=1}^{K^r} R_i = \Lambda,\ R_i \neq R_j\ \forall i \neq j,\ |R_i| > 0,\ \forall i = 1, 2, \ldots, K^r\}/\mathrm{PG},

where π_{K^r} = (R_1, R_2, ..., R_{K^r}) is a set of K^r regions of Λ.

2. Solution space for Ω^c_{K^c}. Recall the definition of the curve layer in Section 4.2. The curve layer is composed of an unknown number of curves; each curve consists of an unknown number of knots, which are made up of 1-3 bases. In correspondence to this definition, the solution space for the curve layer is defined as

\Omega^c_{K^c} = \underbrace{C \times \ldots \times C}_{K^c}.

The space for each curve C is of the form

C = \cup_{n=0}^{|\Lambda|} \big[\underbrace{\zeta \times \ldots \times \zeta}_{n}\big],

where ζ is the space for each knot and is defined as ζ = B × B × B. In the curve layer, the space for the bases is defined as B = {∅} ∪ {(b_i, B_i), i = 1, 2, ..., G_b × N_B × |Λ|}, where G_b is the number of possible values for the coefficient b_i, N_B is the number of base functions in the over-complete basis (equal to 134 in this paper), and |Λ| is the size of the lattice.
6
Integration of Region and Curve Processes by MCMC
Now we turn to the design of the algorithm, which forms an ergodic Markov chain with reversible jumps [4,5] and diffusions to explore the solution space Ω.

6.1 Dynamics Design
We use the Metropolis-Hastings algorithm to realize the jump processes. To achieve the detailed balance equation, the acceptance rate is computed as

\alpha(W \to dW') = \min\Big(1, \frac{G(W' \to dW)\,p(W' \mid I)\,dW'}{G(W \to dW')\,p(W \mid I)\,dW}\Big),   (6)
where G(W → dW ) and G(W → dW ) are the two proposal probabilities for the Markov chain to jump between solution W and W . The data-driven techniques are used to propose important proposals to guide the Markov chain in traveling in the solution space Ω more efficiently. The diffusion dynamics are realized by steepest ascent algorithms. In the following, we briefly discuss different types of these dynamics for region and curve processes.
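A generic accept/reject step corresponding to (6) can be written in the log domain; the posterior and proposal terms are assumed to be supplied by the surrounding sampler, and only the acceptance test itself is shown.

import numpy as np

def mh_accept(rng, log_post_cur, log_post_new, log_q_forward, log_q_backward):
    # alpha = min(1, [G(W'->W) p(W'|I)] / [G(W->W') p(W|I)]), evaluated in logs
    # for numerical stability; rng is a numpy Generator (e.g. np.random.default_rng()).
    log_alpha = (log_q_backward + log_post_new) - (log_q_forward + log_post_cur)
    return np.log(rng.random()) < min(0.0, log_alpha)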
Jump and Diffusion Dynamics for Region Processes

I. Boundary diffusion. This is a diffusion process used to adjust the region boundaries, which are represented by continuous curves evolving to maximize the posterior probability through a region competition equation [12].

II. Model adaptation. This is simply to fit the parameters of a region by a steepest descent equation,

\frac{d\Theta_i}{dt} = \frac{\partial \log p(I^r_{R_i}; \Theta_i)}{\partial \Theta_i}.

III and IV. Split a region into two and Merge two regions into one. These are a pair of reversible jumps. Suppose at a certain time step a region R_k with model Θ_k is split into two regions R_i and R_j with models Θ_i, Θ_j, or vice versa; this realizes a jump between two states W and W':

W = (K, (R_k, \ell_k, \Theta_k), W_-) \longleftrightarrow (K + 1, (R_i, \ell_i, \Theta_i), (R_j, \ell_j, \Theta_j), W_-) = W',

where W_- denotes the remaining variables that are unchanged during the move.

V. Switch image models. This switches the image model within the four families for a region R_i, for example from a texture description to a spline surface:

W = (\ell_i, \Theta_i, W_-) \longleftrightarrow (\ell_i', \Theta_i', W_-) = W'.

Jump and Diffusion Dynamics for Curve Processes

VI. Curve diffusion. This is a diffusion process by curve deformation [10].

VII and VIII. Create a new curve and Delete a curve. They are a pair of reversible jumps. At a time step, a new curve could be created with 1-3 knots, or a curve with less than three knots could be killed. The two solution states before and after the change are denoted as

W = (K_C, W_-) \longleftrightarrow (K_C + 1, C_{K_C+1}, W_-) = W'.   (7)

IX and X. Split a curve into two and Merge two curves into one. They are a pair of reversible jumps as well. At a step, two curves could be merged into a new curve, or a curve having at least two knots could be split into two curves. We have the two states defined as

W = (K_C, C_k, W_-) \longleftrightarrow (K_C + 1, C_i, C_j, W_-) = W'.   (8)

A great deal of work has been done in the area of perceptual organization on how to group 1D elements, such as straight line segments, which potentially belong to the same object. These grouping methods are considered bottom-up techniques and could be utilized in designing proposals to help achieve a fast convergence rate.

XI and XII. Engage a new knot to a curve and Kill a knot from a curve. They are two complementary moves. At a step, a new knot is proposed to be engaged to one of a curve's two extremes, or one of a proposed curve's two extreme knots could be killed. The two states can be written as

W = ((\zeta_{i1}, \ldots, \zeta_{in}), W_-) \longleftrightarrow ((\zeta_{i1}, \ldots, \zeta_{in}, \zeta_{i(n+1)}), W_-) = W'.   (9)
In the following, we give an example of how to compute the proposal probabilities G(W → dW') and G(W' → dW) for computing the acceptance rate α(W → dW') in equation (6). The reader is referred to [11] for a detailed discussion of computing proposal probabilities for region processes. The two proposal probabilities for move type IX (split a curve into two curves) can be computed as

G(W \to dW') = q(\mathrm{IX})\,q(C_k)\,q(C_i \mid C_k)\,q(C_j \mid C_k)\,dW', \quad \text{and}
G(W' \to dW) = q(\mathrm{X})\,\big(q(C_i)\,q(C_j \mid C_i) + q(C_j)\,q(C_i \mid C_j)\big)\,dW,
Initialization by Image Pyramid and Matching Pursuit
To initialize the algorithm, we decompose an image into two layers I = Iro +Ico by a simply Laplacian pyramid method, as Figure 6 shows. Iro is a lowpass filtered version of the input image I, and Ico = I − Iro is the residual image. In general, Iro contains regions, and Ico contains mostly high frequency components for the curves as well as region boundaries. Then we adopt a match pursuit method by Mallat and Zhang[7] to quickly compute an initial set of image bases from image Ico . In particular, we select the ridge type of bases with coefficients larger than a certain significant threshold τ . Ico = αj Bj + Io . αj ≥τ
These selected bases are often good seeds for the curves. The reconstruction residual image Io is added to the region layer Iro ← Iro + Io . For the region layer Iro , we conduct some edge detection and clustering as the DDMCMC algorithm did in [11].
a) I
b). Iro
c). Ico
Fig. 6. An input image I (a) is initially decomposed into two components: Iro (b) is a lowpass filtered version, and Ico is the residue for high frequency components.
404
6.3
Z. Tu and S.-C. Zhu
Summary of the Algorithm
To summarize, we list the algorithm below. Initialize Ic = Ico and Ir = Iro by the Laplacian pyramid method. Initial W c by match pursuit, and W r by clustering. Select a diffusion dynamic or a jump dynamic at random. If a diffusion dynamic is selected, run the following dynamics at random: • For type I, a region is randomly proposed to execute region competition for its boundaries. • For type II, a region is randomly selected and its corresponding model parameter is adapted by a steepest descent method. • For type VI, a curve is randomly selected and a deformation process is executed as in [10]. 5. If a jump dynamic is selected, • a new solution W is randomly sampled according to the dynamic picked at random below. − For type III, a region is randomly picked to be split into two new regions which are randomly proposed by one of the partition maps. − For type IV, two neighboring regions are randomly chosen to form a new region. − For type V, a different model is randomly picked for a randomly chosen region. − For type VII, 1-3 knots are randomly proposed to form a new curve. − For type VIII, a curve is randomly chosen and is then killed. − For type IX, a curve is randomly picked and is then split into two new curves. − For type X, two neighboring curves are randomly proposed to form a new curve. − For type XI, a new knot is proposed to engage to one of a randomly picked curve’s two extreme knots. − For type XII, one of a randomly picked curve’s two extreme knots is proposed to be killed. • The overall posterior probability, p(W |I)dW , is computed. • The proposal probabilities, G(W → dW ) and G(W → dW ) are then computed. • A decision of acceptance is made according the acceptance probability α(W → dW ) in equation 6. 6. Repeat the above steps to draw samples W from p(W |I).
1. 2. 3. 4.
Fig. 7. The algorithm that integrates the region and curve processes.
7
Experiments
We test the algorithm on many natural images and report some results in Figures 8, 9 and 10. Figure 8 shows the results for a sea shore image where regions and curves are parsed. Figure 8.b shows the regions, and we sample a synthesized image from the likelihood Irsyn ∼ p(I|W r ) given the computed W r , this illustrates the computed region models. Clearly the shading effects in the water
and the sky are captured by the spline model in family 4. Figure 8.c is the sketch for the computed curves W^c, and Figure 8.f is the image I^c = I^c(W^c). Thus we have an overall synthesis I^syn = I^r_syn + I^c shown in Figure 8.d. By comparing I^r_syn and I^c to the initial decomposition I^r_o, I^c_o in Figure 6, we clearly see that it is the more sophisticated region and curve models that refine the decomposition. Figure 9 shows the segmented regions W^r and I^r_syn by the previous DDMCMC algorithm, which assumes region processes only.

Fig. 8. Result: parsing a sea shore image into regions and curves.
Fig. 9. The segmentation and synthesis with region processes only for Fig. 8 (a: synthesis, b: region processes).
Figure 10 displays more examples in a similar way. The improvement of the synthesized images for all the examples demonstrates the advantages of engaging curve models. The parameters for the two methods are set to be the same to have a fair comparison. Although Figure 10.a by the region processes only successfully segments many tree trunks out, it disconnects the background. The synthesis by region and curve processes for Figure 10.b is more similar to the original image than that by the region processes; thus, the new algorithm achieves a better visual effect.
Fig. 10. More image parsing results (rows a, b, c): input images and results by the algorithm with the integration of region and curve processes, and results by the previous DDMCMC algorithm with region processes only.
The semantic representation W by region and curve models can largely reduce the coding length of the image. The table below lists the number of bytes used by W for the synthesized images in comparison with the jpeg algorithm for the original images.
Table 1. A comparison of coding lengths in bytes for the example images by the proposed algorithm and the jpeg algorithm.

                              The image in fig. 8   Image a in fig. 10   Image b in fig. 10   Image c in fig. 10
Region and Curve processes    2,387                 2,266                1,347                922
jpeg                          17,001                10,620               10,620               6,656

8
Future Work
In future study, we plan to engage generative models for faces and texton process to realize the image parsing as Figure 1 demonstrates and apply the algorithm for non-photo-realistic rendering to render stylish paintings. Acknowledgements. This work is supported partially by two NSF grants IIS 98-77-127 and IIS-00-92-664, and an ONR grant N000140-110-535.
References 1. J. August and S.W. Zucker. A generative model of curve images with a completelycharacterized non-gaussian joint distribution. In Proc. of Workshop on Stat. and Comp. Theories of Vis., July, 2001. 2. M. Isard and A. Blake. Contour tracking by stochastic propagation of conditional density. Proc. ECCV, 1996. 3. T.F. Cootes, D. Cooper, C.J. Taylor and J. Graham. Active shape models - their training and application. Computer Vision and Image Understanding, vol.61, no. 1, Jan. pages 38-59,1995. 4. U. Grenander and M. I. Miller. Representation of knowledge in complex systems. J. R. Stat. Soc., B, vol 56, 549-603, 1994 5. P. J. Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, vol. 82, 711-732, 1995. 6. J. Malik, S. Belongie, T. Leung, and J. Shi. Contour and texture analysis for image segmentation. IJCV, vo. 43, no.1, June 2001. 7. S.G. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Tranc. on Signal Processing, vol. 41, no. 12, Dec. 1993. 8. D. Mumford. Algebraic geometry and its applications, chaper elastic and computer vision, pp. 491-506. Springer-Verlag, 1994. 9. B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: a strategy employed by v1?. Vision Res., vol. 37, no. 23 3311-25 1997. 10. M. Kass, A. Witkin and D. Terzopoulos. Snakes: active contour models. IJCV, 1, 1988. 11. Z. Tu, S.C. Zhu., and H.Y. Shum. Image segmentation by data driven markov chain monte carlo. Proc. ICCV, 2001. 12. S. C. Zhu and A. L. Yuille. Region Competition: Unifying Snakes, Region Growing, and Bayes/MDL for Multiband Image Segmentation. IEEE Trans. PAMI vol.18, No.9, 1996. 13. S.C. Zhu. Embedding gestalt laws in markov random fields- a theory for shape modeling and perceptual organization. PAMI, vol. 21, no.11, Nov. 1999.
Yet Another Survey on Image Segmentation: Region and Boundary Information Integration J. Freixenet, X. Mu˜ noz, D. Raba, J. Mart´ı, and X. Cuf´ı University of Girona. Institute of Informatics and Applications. Campus de Montilivi s/n. 17071. Girona, Spain {jordif,xmunoz,draba,joanm,xcuf}@eia.udg.es
Abstract. Image segmentation has been, and still is, a relevant research area in Computer Vision, and hundreds of segmentation algorithms have been proposed in the last 30 years. However, it is well known that elemental segmentation techniques based on boundary or region information often fail to produce accurate segmentation results. Hence, in the last few years, there has been a tendency towards algorithms which take advantage of the complementary nature of such information. This paper reviews different segmentation proposals which integrate edge and region information and highlights 7 different strategies and methods to fuse such information. In contrast with other surveys which only describe and compare qualitatively different approaches, this survey deals with a real quantitative comparison. In this sense, key methods have been programmed and their accuracy analyzed and compared using synthetic and real images. A discussion justified with experimental results is given and the code is available on Internet. Keywords: grouping and segmentation, region based segmentation, boundary based segmentation, cooperative segmentation methods.
1
Introduction
One of the most important operations in Computer Vision is segmentation [1]. The aim of image segmentation is the domain-independent partition of the image into a set of regions which are visually distinct and uniform with respect to some property, such as grey level, texture or colour. The problem of segmentation has been, and still is, an important research field and many segmentation methods have been proposed in the literature (see surveys [2,3,4]). Many segmentation methods are based on two basic properties of the pixels in relation to their local neighbourhood: discontinuity and similarity. Methods based on some discontinuity property of the pixels are called boundary-based methods, whereas methods based on some similarity property are called region-based methods. Unfortunately, both techniques, boundary-based and region-based, often fail to produce accurate segmentation results [5]. With the aim of improving the segmentation
This work was partially supported by the Departament d’Universitats, Recerca i Societat de la Informaci´ o de la Generalitat de Catalunya.
process, a large number of new algorithms which integrate region and boundary information have been proposed over the last few years. Among other features, one of the main characteristics of these approaches is the time of fusion: embedded in the region detection or after both processes [6]. – Embedded integration can be described as integration through the definition of new parameters or a new decision criterion for segmentation. In the most habitual strategy, firstly, the edge information is extracted and is then used within the segmentation algorithm which is mainly based on regions. For example, edge information can be used to define the seed points from which regions are grown. The aim of this integration strategy is to use boundary information as the means of avoiding many of the common problems of region-based techniques. – Post-processing integration is performed after processing the image using the two different approaches (boundary-based and region-based techniques). Edge and region information are extracted independently in a preliminary step. A posterior fusion process tries to exploit the dual information in order to modify, or refine, the initial segmentation obtained by a single technique. The aim of this strategy is the improvement of the initial results and the production of a more accurate segmentation. Although many surveys on image segmentation have been published, as stated above, none of them focus on the integration of region and boundary information. To overcome this lack, this paper discusses the most relevant segmentation techniques developed in recent years which integrate region and boundary information. Therefore, neither clustering methods nor texture segmentation have been included in this survey, stated they constitute a separated item. After analyzing more than 50 region-boundary cooperative algorithms, we have clearly identified 7 different strategies. First we distinguish between embedded and post-processing methods. Within the embedded methods we differentiate between those using boundary information for seed placement purposes, and those which use this information to establish an appropriate decision criterion. Within the post-processing methods, we differentiate three different approaches: over-segmentation, boundary refinement, and selection evaluation. After stating a classification, we discuss in depth each one of these approaches, emphasizing in some cases relevant aspects related to the implementation of the methods: for example, in the boundary refinement strategy, the use of snakes or multiresolution techniques. Finally, in order to compare the performance of the analyzed methods, we have implemented such algorithms, so a quantitative and qualitative evaluation of each strategy is given. Therefore, objective conclusions are provided. The remainder of this paper is structured as follows: Section 2 defines and classifies the different approaches to the embedded integration, while Section 3 analyses the proposals for the post-processing strategy. Section 4 discusses the methods analyzing the experimental results obtained by using synthetic and real data. Finally, the conclusions drawn from this study are summarized in Section 5.
2 Embedded Integration
The embedded integration strategy usually consists of using the edge information, previously extracted, within a region segmentation algorithm. It is well known that in most region-based segmentation algorithms, the manner in which initial regions are formed and the criteria for growing them are set a priori. Hence, the resulting segmentation will inevitably depend on the particular growth strategy chosen [7], as well as on the choice of the initial region growth points [8]. Some recent proposals try to use boundary information in order to avoid these problems. According to the manner in which this information is used, it is possible to distinguish two tendencies:

1. Control of decision criterion: edge information is included in the definition of the decision criterion which controls the growth of the region.
2. Guidance of seed placement: edge information is used as a guide in order to decide which is the most suitable position to place the seed (or seeds) of the region-growing process.

2.1 Control of Decision Criterion
Region growing and split and merge are the typical region-based segmentation algorithms. Although both share the essential concept of homogeneity, the decisions they take while carrying out the segmentation process are quite different. For this reason, and in order to facilitate the analysis of this approach, we have developed two different algorithms, named A1 and A2, based on split and merge and on region growing, respectively.

A1: Split and Merge. Typical split and merge techniques [9] consist of two basic steps. First, the whole image is considered as one region. If this region does not satisfy a homogeneity criterion, the region is split into four quadrants (subregions) and each quadrant is tested in the same way; this process is recursively repeated until every square region created in this way contains homogeneous pixels. Next, in the second step, all adjacent regions with similar attributes may be merged following other (or the same) criteria. The criterion of homogeneity is generally based on the analysis of the chromatic characteristics of the region: a region with a small standard deviation in the color of its members (pixels) is considered homogeneous. The integration of edge information allows an additional term to be added to this criterion, so that a region is considered homogeneous only when it is totally free of contours.

A1 is an algorithm based on the ideas of Bonnin and his colleagues, who proposed in [10] a region extraction based on a split and merge algorithm controlled by edge detection. The criterion used to decide the split of a region takes into account both edge and intensity characteristics. More specifically, if there is no edge point on the patch and if the intensity homogeneity constraints are satisfied, then the region is stored; otherwise, the patch is divided into four sub-patches, and the
process is recursively repeated. The intensity homogeneity criterion is rendered necessary by the possible failures of the edge detector. After the split phase, the contours are thinned and chained into edges relative to the boundaries of the initial regions. Later, a final merging process takes into account edge information in order to solve possible over-segmentation problems. In this last step, two adjacent initial regions are merged only if there are no edges on their common boundary.

A2: Region Growing. Region growing algorithms are based on the growth of a region whenever its interior is homogeneous according to certain features, such as intensity, color or texture. Region growing [11] is one of the simplest and most popular algorithms for region-based segmentation. The most traditional implementation starts by choosing a starting point, called the seed pixel. Then, the region grows by adding similar neighbouring pixels according to a certain homogeneity criterion, increasing step by step the size of the region. The homogeneity criterion thus has the function of deciding whether a pixel belongs to the growing region or not. The implemented algorithm follows this typical region growing strategy: a region is grown by adding similar neighbours. The decision of merging is generally taken based only on the contrast between the evaluated pixel and the region. However, it is not easy to decide when this difference is small (or large) enough to take a decision. The edge map provides an additional criterion in this respect, namely whether the evaluated pixel is a contour pixel: encountering a contour signifies that the growing process has reached the boundary of the region, so the pixel must not be aggregated and the growth of the region is finished.

The implemented algorithm, A2, is based on the work of Xiaohan et al. [12], who proposed a homogeneity criterion consisting of the weighted sum of the contrast between the region and the pixel, and the value of the modulus of the gradient of the pixel. A low value of this function indicates that it is convenient to aggregate the pixel to the region. A similar proposal was suggested by Kara et al. [6], where at each iteration only pixels having low gradient values (below a certain threshold) are aggregated to the growing region. On the other hand, Gambotto [13] proposed using edge information to stop the growing process. His proposal assumes that the gradient takes a high value over a large part of the region boundary. Thus, the iterative growing process is continued until the maximum of the average gradient computed over the region boundary is detected.
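To make this kind of decision criterion concrete, the following sketch grows a region from a seed and aggregates a neighbouring pixel only while a weighted sum of its contrast with the region and its gradient modulus stays low, in the spirit of the criterion of Xiaohan et al. [12] described above. The weight `alpha`, the threshold value and the function names are illustrative assumptions rather than the settings used in A2.

```python
import numpy as np
from collections import deque

def grow_region(image, gradient, seed, alpha=0.7, threshold=20.0):
    """Edge-constrained region growing (sketch).

    A 4-neighbour is aggregated when
    alpha * |contrast with the region mean| + (1 - alpha) * gradient modulus
    stays below `threshold`.
    """
    h, w = image.shape
    region = np.zeros((h, w), dtype=bool)
    region[seed] = True
    total, count = float(image[seed]), 1
    frontier = deque([seed])
    while frontier:
        y, x = frontier.popleft()
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and not region[ny, nx]:
                contrast = abs(float(image[ny, nx]) - total / count)
                cost = alpha * contrast + (1.0 - alpha) * float(gradient[ny, nx])
                if cost < threshold:   # homogeneous enough and not on an edge
                    region[ny, nx] = True
                    total += float(image[ny, nx])
                    count += 1
                    frontier.append((ny, nx))
    return region
```

In a complete implementation this loop would be repeated from several seeds and followed by a merging stage for the remaining unlabelled pixels.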
2.2 Guidance of Seed Placement
The placement of the initial seed points is a central issue for the results obtained by a region-based segmentation. Despite their importance, the traditional region growing algorithm chooses them randomly or according to an a priori fixed direction of image scan. In order to make a more reasonable decision, edge information can be used to decide what is the most appropriate position in which
to place the seed. It is generally accepted that the growth of a region has to start from inside it (see [14,15]). The interior of the region is a representative zone and allows a correct sample of the region's characteristics to be obtained. On the other hand, it is necessary to avoid the boundaries between regions when choosing the seeds, because they are unstable zones which are not adequate for obtaining information about the region. Therefore, this approach, named A3, uses the edge information to place the seeds in the interior of the regions. The seeds are placed in locations free of contours and, in some proposals, as far as possible from them. The algorithm proposed by Sinclair [15] has been taken as the basic reference for the implementation of A3. In this proposal, the Voronoi image generated from the edge image is used to derive the placement of the seeds. The intensity at each point in a Voronoi image is the distance to the closest edge. The peaks in the Voronoi image, reflecting the farthest points from the contours, are then used as seed points for region growing. Nevertheless, A3 avoids the need to extract the edge image, which involves the difficult step of binarization, by generating the Voronoi image directly from the gradient image.

On the other hand, edge information can also be used to establish a specific order for the growing processes. As is well known, one of the disadvantages of the region growing and merging processes is their inherently sequential nature. Hence, the final segmentation results depend on the order in which regions are grown or merged. Edge-based segmentation allows this order to be decided, in some cases simulating the order by which humans separate segments from each other in an image (from large to small) [16], or in other proposals giving the same opportunities of growing to all the regions [17].
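A rough sketch of the seed-placement idea behind A3 is given below. Unlike A3, which builds the Voronoi-like image directly from the gradient, the sketch binarizes the gradient map for simplicity; the percentile threshold, the peak window size and the helper names are our own illustrative choices.

```python
import numpy as np
from scipy import ndimage

def seed_points(gradient, edge_percentile=90, peak_size=15):
    """Place region-growing seeds far from likely boundaries (sketch).

    Thresholds the gradient to get an edge mask, computes the distance
    to the nearest edge pixel (a Voronoi-like image), and keeps the
    local maxima of that distance map as seeds.
    """
    edges = gradient >= np.percentile(gradient, edge_percentile)
    dist = ndimage.distance_transform_edt(~edges)      # distance to closest edge
    local_max = ndimage.maximum_filter(dist, size=peak_size)
    peaks = (dist == local_max) & (dist > 0)           # farthest points from contours
    ys, xs = np.nonzero(peaks)
    return list(zip(ys.tolist(), xs.tolist()))
```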
3 Post-processing Integration
In contrast to the works analysed up to this point, which follow an embedded strategy, the post-processing strategy carries out the integration a posteriori, after the image has been segmented by region-based and boundary-based algorithms. Region and edge information is extracted separately in a preliminary step, and then integrated. Post-processing integration is based on fusing the results of single segmentation methods, attempting to combine the map of regions (generally with thick and inaccurate boundaries) and the map of edge outputs (generally with fine and sharp lines, but dislocated) with the aim of providing an accurate and meaningful segmentation. We have identified three different approaches for performing these tasks:

1. Over-segmentation: this approach consists of using a segmentation method with parameters specifically fixed to obtain an over-segmented result. Then additional information from other segmentation techniques is used to eliminate false boundaries which do not correspond to regions.
2. Boundary refinement: this approach considers the region segmentation result as a first approximation, with well defined regions but inaccurate boundaries. Information from edge detection is used to refine the region boundaries and to obtain a more precise result.
3. Selection-evaluation: in this approach, edge information is used to evaluate the quality of different region-based segmentation results, with the aim of choosing the best one. This third set of techniques deals with the difficulty of establishing adequate stopping criteria and thresholds in region segmentation.

3.1 Over-Segmentation
This approach emerged due to the difficulty of establishing an adequate homogeneity criterion for region growing. As Pavlidis and Liow suggested [5], the major reason why region growing produces so many false boundaries is that the definition of region uniformity is too strict, for instance when it insists on approximately constant brightness while in reality brightness may vary linearly within a region. It is very difficult to find uniformity criteria which exactly match these requirements and do not generate false boundaries. Summarizing, they argued that the results can be significantly improved if all region boundaries are checked to determine whether they qualify as edges, rather than attempting to fine-tune the uniformity criteria.

The methodology of this approach starts by obtaining an over-segmented result, which is achieved by properly setting the parameters of the algorithm. This result is then compared with the result of the dual approach: each boundary is checked to see if it is coherent in both results. When this correspondence does not exist, the boundary is considered false and is removed. At the end, only real boundaries are preserved.

The implemented algorithm, A4, follows the most habitual technique, which consists of obtaining the over-segmented result using a region-based algorithm. Every initial boundary is checked by analysing its coherence with the edge map, where real boundaries must have high gradient values, while low values correspond to false contours. Concretely, A4 is based on the work of Gagalowicz and Monga [18], where two adjacent regions are merged if the average gradient on their common boundary is lower than a fixed threshold. A similar work was presented by Pavlidis and Liow [5], which includes a criterion in the merging decision in order to eliminate the false boundaries that result from the data structure used.

On the other hand, it is also possible to carry out this approach starting with the over-segmented result obtained from a boundary-based approach [19,20]. Then, region information allows differentiation between true and false contours: the boundaries are checked by analyzing the chromatic and textural characteristics on both sides of the contour. A real boundary separates two regions, so it has different characteristics on its two sides. An exemplary work is that of Philipp and Zamperoni [19], who proposed starting with a high-resolution edge extractor and then, according to the texture characteristics of the extracted regions, deciding whether to suppress or prolong a boundary.
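A minimal sketch of the boundary-coherence test used by this family of methods, in the spirit of Gagalowicz and Monga [18], is shown below: two adjacent labelled regions are merged when the average gradient along their common boundary falls below a threshold. The threshold value and the function names are illustrative assumptions, not the settings used in A4.

```python
import numpy as np

def average_boundary_gradient(labels, gradient, a, b):
    """Mean gradient on the common boundary between regions `a` and `b`."""
    mask_a = labels == a
    boundary = np.zeros_like(mask_a)
    # pixels of region a having a 4-neighbour in region b
    boundary[:-1, :] |= mask_a[:-1, :] & (labels[1:, :] == b)
    boundary[1:, :]  |= mask_a[1:, :]  & (labels[:-1, :] == b)
    boundary[:, :-1] |= mask_a[:, :-1] & (labels[:, 1:] == b)
    boundary[:, 1:]  |= mask_a[:, 1:]  & (labels[:, :-1] == b)
    return gradient[boundary].mean() if boundary.any() else np.inf

def merge_false_boundaries(labels, gradient, adjacent_pairs, threshold=15.0):
    """Merge every adjacent pair whose shared boundary has a low average gradient."""
    labels = labels.copy()
    for a, b in adjacent_pairs:
        if average_boundary_gradient(labels, gradient, a, b) < threshold:
            labels[labels == b] = a        # false boundary: merge the two regions
    return labels
```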
3.2 Boundary Refinement
As described above, region-based segmentation yields a good detection of true regions, although, as is well known, its sensitivity to noise causes the boundary of the extracted region to be highly irregular. This approach, which we have called boundary refinement, considers region-based segmentation as a first approximation to segmentation. Typically, a region-growing procedure is used to obtain an initial estimate of a target region, which is then combined with salient edge information to achieve a more accurate representation of the target boundary. As in the over-segmentation proposals, edge information here permits the refinement of an initial result. Examples of this strategy are the works of Haddon and Boyce [21], Chu and Aggarwal [22] or, more recently, Sato et al. [23]. Two basic techniques can be considered as common ways to refine the boundary of the regions:

1. Multiresolution: this technique is based on analysis at different scales. A coarse initial segmentation is refined by increasing the resolution.
2. Boundary refinement by snakes: another possibility is the integration of region information with dynamic contours, concretely snakes. The refinement of the region boundary is performed by the energy minimization of the snake.

A5: Multiresolution. The multiresolution approach is an interesting strategy to carry out the refinement. The analysis operates on the image at different scales, using a pyramid or quadtree structure. The algorithm consists of an upward and a downward path; the former has the effect of smoothing or increasing the resolution in class space, at the expense of a reduction in spatial resolution, while the latter attempts to regain the lost spatial resolution, preserving the newly won class resolution. The assumption underlying this procedure is invariance across scales: those nodes in an estimate considered as interior to a region are given the same class as their "fathers" at lower resolution. The A5 algorithm is based on the work of Spann and Wilson [24], where the strategy uses a quadtree method with classification at the top level of the tree, followed by boundary refinement. A non-parametric clustering algorithm is used to perform classification at the top level, yielding an initial boundary, followed by downward boundary estimation to refine the result. A recent work following the same strategy can be found in [25].

A6: Boundary Refinement by Snakes. The snake method is known to solve such problems by locating the object boundary from an initial plan. However, snakes do not try to solve the entire problem of finding salient image contours. The high grey-level gradient of the image may be due to object boundaries as well as to noise and object textures, and therefore the optimization functions may have many local optima. Consequently, in general, active contours are sensitive to initial conditions and are only really effective when the initial position of
the contour in the image is sufficiently close to the real boundary. For this reason, active contours rely on other mechanisms to place them somewhere near the desired contour. In early approaches to dynamic contours, an expert is responsible for putting the snake close to the intended contour; its energy minimization carries it the rest of the way. However, region segmentation can solve the initialization problem of snakes: proposals concerning integrated methods consist of using the region segmentation result as the initial contour of the snake. Hence, in the design of A6, the segmentation process is divided into two steps: first, region growing from a seed point in the target region is performed, and its output is used as the initial contour of the dynamic contour model; second, this initial contour is modified on the basis of energy minimization.

The A6 algorithm is implemented following the ideas of Chan et al. [26], where a greedy algorithm is used to find the minimum energy contour. This algorithm searches for the position of minimum energy by adjusting each point on the contour, at every iteration, to a lower energy position amongst its eight local neighbours. The result, although not always optimal, is comparable to that obtained by variational calculus methods and dynamic programming, while the method is faster. Similar proposals are the works of Vérard et al. [27] and Jang et al. [28]. Curiously, all these algorithms are tested on Magnetic Resonance Imaging (MRI) images, but this is not a mere coincidence: accurate segmentation is critical for diagnosis in medical images, yet it is very difficult to extract a contour which exactly matches the target region in MRI images. Integrated methods seem to be a valid solution to achieve an accurate and consistent detection.
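The greedy adjustment step described above can be sketched as follows: each contour point is moved to the lowest-energy position among its eight neighbours. The energy terms and their weights here are simplified assumptions and not the exact functional of Chan et al. [26].

```python
import numpy as np

def greedy_snake_step(contour, gradient, alpha=1.0, beta=1.0):
    """One greedy iteration: move each point to its cheapest 8-neighbour.

    contour: list of (row, col) points, assumed roughly ordered along the boundary.
    Energy (simplified): alpha * distance to the previous point
                         - beta * gradient magnitude at the candidate position.
    """
    h, w = gradient.shape
    new_contour = contour.copy()
    n = len(contour)
    for i, (y, x) in enumerate(contour):
        py, px = new_contour[(i - 1) % n]          # previous point on the contour
        best, best_e = (y, x), np.inf
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                cy, cx = y + dy, x + dx
                if 0 <= cy < h and 0 <= cx < w:
                    e = alpha * np.hypot(cy - py, cx - px) - beta * gradient[cy, cx]
                    if e < best_e:
                        best, best_e = (cy, cx), e
        new_contour[i] = best
    return new_contour
```

Repeated until the contour stops moving, this step plays the role of the energy minimization applied to the region-growing output.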
3.3 Selection-Evaluation
In the absence of object or scene models or ground truth data, it is critical to have a criterion which enables the quality of a segmentation to be evaluated. In this sense, a set of proposals have used edge information to define an evaluation function which qualifies the quality of a region-based segmentation. The purpose is to produce different results by changing the parameters and thresholds of a region segmentation algorithm, and then to use the evaluation function to choose the best result. This strategy permits solving the traditional problems of region segmentation, such as the definition of an adequate stopping criterion or the setting of appropriate thresholds. The evaluation function measures the quality of a region-based segmentation according to its coherence with the edge map: the best region segmentation is the one in which the region boundaries correspond most closely to the contours.

The A7 algorithm is based on the work of Siebert [29], where edge information is used to adjust the criterion function of a region-growing segmentation. For each seed, A7 creates a whole family of segmentation results (with different criterion functions) and then, based on the local quality of the region's contour, picks the best one. The contrast between both sides of the boundary is proposed as a measure of contour strength to evaluate the segmentation quality. More formally,
the contour strength is expressed as the sum of the absolute differences between each pixel on the contour of a region and those pixels in its 4-neighbourhood which are not part of the region. However, Siebert suggests that slightly improved results, at a higher computational cost, can be expected if the contour strength is based on the gradient at each contour pixel rather than on the intensity difference. Hence, this second option is the solution adopted in our implementation. Similar algorithms are proposed by Fua and Hanson [30] (a pioneering proposal), LeMoigne and Tilton [31], and Hojjatoleslami and Kittler [32].
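A compact sketch of this selection-evaluation idea is given below: each candidate region is scored by the mean gradient along its contour (the gradient-based variant adopted in our implementation), and the candidate with the strongest contour is kept. The function names and the crude contour extraction are illustrative assumptions.

```python
import numpy as np

def contour_strength(region_mask, gradient):
    """Mean gradient over the region's contour pixels (gradient-based variant)."""
    inner = np.zeros_like(region_mask)
    inner[1:-1, 1:-1] = (region_mask[1:-1, 1:-1] &
                         region_mask[:-2, 1:-1] & region_mask[2:, 1:-1] &
                         region_mask[1:-1, :-2] & region_mask[1:-1, 2:])
    contour = region_mask & ~inner            # region pixels with an outside 4-neighbour
    return gradient[contour].mean() if contour.any() else 0.0

def best_segmentation(candidate_masks, gradient):
    """Pick the candidate region whose contour has the highest strength."""
    return max(candidate_masks, key=lambda m: contour_strength(m, gradient))
```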
4 Experimental Results
The methods surveyed (A1-A7) have been programmed and their accuracy analyzed over synthetic and real test images such as the ones shown in Fig. 1. The numerous and indeterminate characteristics of real images make it very complicated to achieve an accurate comparison of the experimental results. As a rule, the segmentation results can only be judged either by using manually segmented images as reference, which is a tedious and subjective task [33], by visual comparison to the original images [4], or by applying quality measures corresponding to human intuition [34]. Hence, the use of carefully designed synthetic images appears to be a more suitable benchmark for an objective and quantitative evaluation of different segmentation algorithms [35]. Nevertheless, the use of real images is also highly advisable, as they provide useful results when realistic characteristics arise. Therefore, the evaluation of the algorithms has been performed jointly over real and synthetic images. The results obtained on the real images have been evaluated by comparing them with manual segmentations, due to the subjective nature of the segmentation of real images.

The set of synthetic images generated to test the algorithms follows the method proposed by Zhang [35], where the form of the objects in the images changes from a circle to an elongated ellipse. To make the synthetic images more realistic, a 5x5 average low-pass filter is applied to produce a smooth transition between objects and background. Then, zero-mean Gaussian white noise is added to simulate the noise effect; the noise samples have been generated with different variance parameters. The selected real images are well-known standard test images extracted from the USC-SIPI image database (University of Southern California - Signal and Image Processing Institute). All test images are of size 256 × 256 pixels.

4.1 The Evaluation Method
The evaluation of image segmentation is performed with several quantitative measures proposed by Huang and Dom [36]. Concretely, boundary-based and region-based performance evaluation schemes are proposed. The boundary-based approach evaluates segmentation in terms of both localization and shape accuracy of extracted regions, while the region-based approach assesses the segmentation quality in terms of both size and location of the segmented regions.
Fig. 1. A subset of the real and synthetic test images used in the trials.
Boundary-Based Evaluation. The boundary-based scheme is intended to evaluate segmentation quality in terms of the precision of the extracted region boundaries. Let B represent the boundary point set derived from the segmentation and $G^B$ the boundary ground truth. Two distance distribution signatures are used, one from the ground truth to the estimate, denoted by $D_G^B$, and the other from the estimate to the ground truth, denoted by $D_B^G$. A distance distribution signature from a set $B_1$ to a set $B_2$ of boundary points, denoted by $D_{B_1}^{B_2}$, is a discrete function whose distribution characterizes the discrepancy, measured in distance, from $B_1$ to $B_2$. Define the distance from an arbitrary point x in set $B_1$ to $B_2$ as the minimum absolute distance from x to all the points in $B_2$: $d(x, B_2) = \min\{d_E(x, y)\},\ \forall y \in B_2$, where $d_E$ denotes the Euclidean distance between points x and y. The discrepancy between $B_1$ and $B_2$ is described by the shape of the signature, which is commonly measured by its mean and standard deviation. As a rule, a $D_{B_1}^{B_2}$ with a near-zero mean and a small standard deviation indicates high quality of the image segmentation.
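Assuming the boundary point sets are available as coordinate arrays, the distance distribution signature can be computed with a nearest-neighbour query, as in the sketch below; only the mean and standard deviation are returned, since these are the summary statistics used here.

```python
import numpy as np
from scipy.spatial import cKDTree

def distance_signature(b1, b2):
    """Distances d(x, B2) for every boundary point x in B1.

    b1, b2: arrays of shape (n, 2) with (row, col) boundary coordinates.
    Returns (mean, std) of the distance distribution signature D_{B1}^{B2}.
    """
    dists, _ = cKDTree(np.asarray(b2, dtype=float)).query(np.asarray(b1, dtype=float))
    return float(dists.mean()), float(dists.std())

# Both directions, for ground truth boundary GB and estimated boundary B:
# mean_gb, std_gb = distance_signature(GB, B)   # D_G^B
# mean_bg, std_bg = distance_signature(B, GB)   # D_B^G
```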
Region-Based Evaluation. The region-based scheme evaluates the segmentation accuracy in terms of the number of regions, their locations and their sizes. Let the segmentation be S and the corresponding ground truth be $G^S$. The goal is to quantitatively describe the degree of mismatch between them. The measures are based on the concept of directional Hamming distance from one segmentation $S_1 = \{R_{11}, R_{12}, \ldots, R_{1n}\}$ to another segmentation $S_2 = \{R_{21}, R_{22}, \ldots, R_{2n}\}$, denoted by $D_H(S_1 \Rightarrow S_2)$. First, the correspondence between the labels of both segmentation results is established: each region $R_{2i}$ from $S_2$ is associated with a region $R_{1j}$ from $S_1$ such that $R_{2i} \cap R_{1j}$ is maximal. The directional Hamming distance from $S_1$ to $S_2$ is then defined as:

$$D_H(S_1 \Rightarrow S_2) = \sum_{R_{2i} \in S_2} \;\; \sum_{R_{1k} \neq R_{1j},\; R_{1k} \cap R_{2i} \neq \emptyset} |R_{2i} \cap R_{1k}| \qquad (1)$$
Fig. 2. Segmentation results over the peppers image. From top to bottom and left to right: original image and results when using the implemented algorithms (A1-A7).
where $|\cdot|$ denotes the size of a set. Therefore, $D_H(S_1 \Rightarrow S_2)$ is the total area under the intersections between all $R_{2i} \in S_2$ and their non-maximal intersected regions from $S_1$. A region-based performance measure based on the normalized Hamming distance is defined as

$$p = 1 - \frac{D_H(S \Rightarrow G^S) + D_H(G^S \Rightarrow S)}{2 \times |S|},$$

where $|S|$ is the image size and $p \in [0, 1]$. The smaller the degree of mismatch, the closer p is to one. Moreover, two types of errors are defined: the missing rate $e_R^m$ and the false alarm rate $e_R^f$. The former indicates the percentage of the points in $G^S$ being mistakenly segmented into the regions in S which are non-maximal with respect to the corresponding region of $G^S$, while the latter describes the percentage of points in S falling into the regions of $G^S$ which are non-maximal intersected with the region under consideration. We therefore have

$$e_R^m = \frac{D_H(S \Rightarrow G^S)}{|S|}, \qquad e_R^f = \frac{D_H(G^S \Rightarrow S)}{|S|} \qquad (2)$$
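Assuming the segmentation and the ground truth are available as integer label maps, the directional Hamming distance and the derived measures p, $e_R^m$ and $e_R^f$ can be computed as in the sketch below; the representation and the function names are our own choices.

```python
import numpy as np

def directional_hamming(s1, s2):
    """D_H(S1 => S2) for integer label maps s1, s2 of equal shape."""
    total = 0
    for r2 in np.unique(s2):
        in_r2 = s2 == r2
        _, counts = np.unique(s1[in_r2], return_counts=True)
        # the maximally overlapping region of S1 is the matched region;
        # every other intersection counts as mismatch area
        total += counts.sum() - counts.max()
    return total

def region_based_measures(seg, gt):
    """Return (p, missing rate e_m, false alarm rate e_f)."""
    size = seg.size
    d_sg = directional_hamming(seg, gt)    # D_H(S => G^S)
    d_gs = directional_hamming(gt, seg)    # D_H(G^S => S)
    p = 1.0 - (d_sg + d_gs) / (2.0 * size)
    return p, d_sg / size, d_gs / size
```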
4.2 The Results
The implemented algorithms A1 to A7 have been applied to a set of 22 test images, including real and synthetic ones. Due to limited space, we only show the detailed results for two images, while a summary of the results is provided for the remaining ones. A more complete report, including the code, description and full details of the behaviour of each algorithm over the whole set of test images, can be accessed at http://eia.udg.es/~xmunoz/seg.html. The results obtained with the 7 algorithms over the peppers image are shown in Fig. 2, while Tables 1 and 2 show a set of quantitative results expressed in terms of the
Table 1. Segmentation results over synthetic test images. Performance of the 7 algorithms over the image of Fig. 1 (2nd row, 1st column), and the average of the results of the algorithms over 12 synthetic images.

Circle Image Evaluation
Algorithm | e^m_R  | e^f_R  | p      | µD^B_G | σD^B_G | µD^G_B | σD^G_B | Time
A1        | 0,0519 | 0,0333 | 0,9574 | 1,6030 | 1,9592 | 1,1899 | 2,2242 | 10,2500
A2        | 0,1273 | 0,0286 | 0,9221 | 1,2717 | 1,7487 | 2,0803 | 1,4658 | 0,3000
A3        | 0,0329 | 0,0048 | 0,9812 | 1,3423 | 2,0876 | 1,7654 | 1,3234 | 0,5200
A4        | 0,0106 | 0,0079 | 0,9907 | 0,8789 | 1,0762 | 0,9443 | 1,3240 | 0,2500
A5        | 0,0175 | 0,0338 | 0,9744 | 0,4125 | 0,7562 | 0,5601 | 0,7977 | 0,0800
A6        | 0,0812 | 0,0502 | 0,9343 | 0,2529 | 0,5737 | 0,3314 | 0,5992 | 7,7000
A7        | 0,0267 | 0,0217 | 0,9758 | 0,6318 | 1,4629 | 0,6338 | 1,4882 | 29,0500

Summary of Synthetic Images Evaluation
Algorithm | e^m_R  | e^f_R  | p      | µD^B_G | σD^B_G | µD^G_B | σD^G_B | Time
A1        | 0,0453 | 0,0913 | 0,9317 | 1,1245 | 0,7015 | 1,0296 | 1,0449 | 41,2300
A2        | 0,2027 | 0,0111 | 0,8931 | 0,9547 | 0,6582 | 0,9654 | 0,8778 | 0,1800
A3        | 0,0590 | 0,0259 | 0,9576 | 1,0030 | 0,7083 | 0,9935 | 0,8435 | 0,3400
A4        | 0,0437 | 0,0203 | 0,9680 | 0,9601 | 1,0256 | 1,0054 | 1,1245 | 0,1100
A5        | 0,0343 | 0,0668 | 0,9495 | 0,4049 | 0,5067 | 0,7655 | 0,7713 | 0,0600
A6        | 0,0610 | 0,0602 | 0,9394 | 0,3345 | 0,5083 | 0,5459 | 0,6069 | 12,0200
A7        | 0,0587 | 0,0431 | 0,9491 | 0,9587 | 0,7632 | 1,1470 | 0,6103 | 25,0400
Table 2. Segmentation results over real test images. Performance of the 7 algorithms over the image of Fig. 1 (1st row, 1st column), and the average of the results of the algorithms over 10 real images.

Peppers Image Evaluation
Algorithm | e^m_R  | e^f_R  | p      | µD^B_G | σD^B_G | µD^G_B | σD^G_B | Time
A1        | 0,0232 | 0,0358 | 0,9705 | 1,1323 | 0,7800 | 1,1901 | 1,2118 | 34,2000
A2        | 0,1803 | 0,0138 | 0,9029 | 0,9745 | 0,8768 | 1,0399 | 1,0707 | 0,2100
A3        | 0,0266 | 0,0262 | 0,9736 | 0,9765 | 0,7721 | 1,8322 | 0,9057 | 0,2800
A4        | 0,0276 | 0,0262 | 0,9731 | 0,9357 | 1,0224 | 1,0099 | 1,0755 | 0,2100
A5        | 0,0175 | 0,0349 | 0,9738 | 0,8335 | 0,7605 | 0,8782 | 0,9114 | 0,0700
A6        | 0,1154 | 0,0419 | 0,9213 | 0,6515 | 0,7721 | 0,8322 | 0,9057 | 10,0300
A7        | 0,0884 | 0,0306 | 0,9405 | 0,8065 | 0,7862 | 1,0200 | 0,9057 | 24,0300

Summary of Real Images Evaluation
Algorithm | e^m_R  | e^f_R  | p      | µD^B_G | σD^B_G | µD^G_B | σD^G_B | Time
A1        | 0,2080 | 0,0536 | 0,8692 | 3,8169 | 7,0020 | 4,9236 | 7,9250 | 9,0300
A2        | 0,1740 | 0,0446 | 0,8907 | 1,8285 | 1,2978 | 1,5440 | 3,1251 | 0,3600
A3        | 0,0665 | 0,0338 | 0,9499 | 2,1218 | 3,0183 | 2,1606 | 3,3891 | 0,4300
A4        | 0,0918 | 0,0125 | 0,9478 | 1,7327 | 3,1896 | 2,1238 | 4,2362 | 0,1900
A5        | 0,0749 | 0,0885 | 0,9183 | 0,6989 | 0,9687 | 0,4120 | 0,9826 | 0,0760
A6        | 0,1579 | 0,0448 | 0,8987 | 0,4354 | 0,5641 | 0,3457 | 0,5687 | 7,4500
A7        | 0,1677 | 0,0087 | 0,9118 | 0,8806 | 0,9918 | 1,4320 | 2,7155 | 31,6000
region and boundary evaluation parameters (described in Section 4.1), as well as the execution time, which outlines the complexity of each algorithm.

Considering the quality of the results under the region-based evaluation scheme, it can be noticed that the best results are reached by the A3 and A4 algorithms. Moreover, the importance of an appropriate placement of the starting seed points has been proved, an aspect which is generally forgotten or given secondary priority in many region-based algorithms. On the other hand, the validity of the over-segmentation strategy has been proved by the good rates provided by the A4 algorithm. As a general rule, it can be noticed that the missing rate $e_R^m$ is bigger than the false alarm rate $e_R^f$. This is mainly due to the presence
of noise in the images, which causes the appearance of holes inside the regions of the segmentation result. This can be easily avoided either by pre-processing the image with a smoothing filter, or by post-processing the result by merging the smallest regions. The exception to this problem is the A5 algorithm (multiresolution strategy), where an initial coarse region segmentation is performed at a lower resolution, achieving the effect of smoothing.

The analysis of the segmentation results under the boundary-based evaluation scheme indicates that the boundary refinement strategy (A5 and A6 algorithms) is the best. In fact, these results corroborate what was expected, since obtaining a precise boundary is the main target of these methods. In this sense, the accuracy obtained by the A6 algorithm, which is based on the energy minimization of a snake, is remarkable.

The computational cost of each algorithm is another relevant feature to consider. Analyzing the experimental results, the high cost of the A1 and A7 algorithms stands out. The reason for the high cost of A1 can be found in the recursive nature of its split and merge based algorithm. The "slowness" of A7 is due to the necessity of generating different region-based segmentation results in order to choose the best one, so finding a balance between the computational cost and the final accuracy of the results is mandatory. Nevertheless, both algorithms could easily be ported to and executed on a parallel multiprocessor, which would considerably reduce the execution time. In contrast, A6 does not have an excessively high cost. This can be easily explained by considering that placing the snake from the region-based segmentation result allows the energy minimization to start very close to the final position. Hence, the suitable boundary is reached within few iterations.
5 Conclusions
The objective of this paper is a comparative survey of 7 of the most frequently used strategies to perform segmentation based on region and boundary information. The different methods have been programmed and their accuracy analyzed with real and synthetic images. Experimental results demonstrate that there is a strong similarity between the results obtained from synthetic and real images. The performance of the different algorithms over the set of synthetic images can be extrapolated to the results obtained over real ones, which seems to corroborate the remark made by Zhang [35], who noticed the convenience of using synthetic images in order to achieve an objective comparison of segmentation algorithms. Nevertheless, this statement can only be affirmed for the studied images and the studied algorithms. The experimental results point out that, in general, post-processing algorithms give better results than embedded ones. Concretely, based on the region evaluation parameters, the algorithms A4 and A5 (post-processing) and A3 (embedded) are the ones which produce better results, while based on the boundary evaluation parameters, the algorithms A5 and A6 are the best. In conclusion, the best results were obtained with the Multiresolution strategy (algorithm A5)
which provides the best performance considering both the simplicity of the algorithm and the accuracy of the results. Further work will validate these results over a wider set of different images, such as medical and satellite images.
References

1. Haralick, R., Shapiro, L.: Computer and Robot Vision. Volume 1 & 2. Addison-Wesley Inc, Reading, Massachusetts (1992 & 1993)
2. Fu, K., Mui, J.: A survey on image segmentation. Pattern Recognition 13 (1981) 3–16
3. Haralick, R., Shapiro, L.: Image segmentation techniques. Computer Vision, Graphics and Image Processing 29 (1985) 100–132
4. Pal, N., Pal, S.: A review on image segmentation techniques. Pattern Recognition 26 (1993) 1277–1294
5. Pavlidis, T., Liow, Y.: Integrating region growing and edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 12 (1990) 225–233
6. Falah, R., Bolon, P., Cocquerez, J.: A region-region and region-edge cooperative approach of image segmentation. In: International Conference on Image Processing. Volume 3., Austin, Texas (1994) 470–474
7. Kohler, R.: A segmentation system based on thresholding. Computer Vision, Graphics and Image Processing 15 (1981) 319–338
8. Kittler, J., Illingworth, J.: On threshold selection using clustering criterion. IEEE Transactions on Systems, Man, and Cybernetics 15 (1985) 652–655
9. Chen, P., Pavlidis, T.: Image segmentation as an estimation problem. Computer Graphics and Image Processing 12 (1980) 153–172
10. Bonnin, P., Blanc Talon, J., Hayot, J., Zavidovique, B.: A new edge point/region cooperative segmentation deduced from a 3d scene reconstruction application. In: SPIE Applications of Digital Image Processing XII. Volume 1153. (1989) 579–591
11. Zucker, S.: Region growing: Childhood and adolescence. Computer Graphics and Image Processing 5 (1976) 382–399
12. Xiaohan, Y., Yla-Jaaski, J., Huttunen, O., Vehkomaki, T., Sipild, O., Katila, T.: Image segmentation combining region growing and edge detection. In: International Conference on Pattern Recognition. Volume C., The Hague, Netherlands (1992) 481–484
13. Gambotto, J.: A new approach to combining region growing and edge detection. Pattern Recognition Letters 14 (1993) 869–875
14. Benois, J., Barba, D.: Image segmentation by region-contour cooperation for image coding. In: International Conference on Pattern Recognition. Volume C., The Hague, Netherlands (1992) 331–334
15. Sinclair, D.: Voronoi seeded colour image segmentation. Technical Report 3, AT&T Laboratories Cambridge (1999)
16. Moghaddamzadeh, A., Bourbakis, N.: A fuzzy region growing approach for segmentation of color images. Pattern Recognition 30 (1997) 867–881
17. Cufí, X., Muñoz, X., Freixenet, J., Martí, J.: A concurrent region growing algorithm guided by circumscribed contours. In: International Conference on Pattern Recognition. Volume I., Barcelona, Spain (2000) 432–435
18. Gagalowicz, A., Monga, O.: A new approach for image segmentation. In: International Conference on Pattern Recognition, Paris, France (1986) 265–267
19. Philipp, S., Zamperoni, P.: Segmentation and contour closing of textured and non-textured images using distances between textures. In: International Conference on Image Processing. Volume C., Lausanne, Switzerland (1996) 125–128
20. Fjørtoft, R., Cabada, J., Lopès, A., Marthon, P., Cubero-Castan, E.: Complementary edge detection and region growing for SAR image segmentation. In: Conference of the Norwegian Society for Image Processing and Pattern Recognition. Volume 1., Tromsø, Norway (1997) 70–72
21. Haddon, J., Boyce, J.: Image segmentation by unifying region and boundary information. IEEE Transactions on Pattern Analysis and Machine Intelligence 12 (1990) 929–948
22. Chu, C., Aggarwal, J.: The integration of image segmentation maps using region and edge information. IEEE Transactions on Pattern Analysis and Machine Intelligence 15 (1993) 1241–1252
23. Sato, M., Lakare, S., Wan, M., Kaufman, A., Nakajima, M.: A gradient magnitude based region growing algorithm for accurate segmentation. In: International Conference on Image Processing. Volume III., Vancouver, Canada (2000) 448–451
24. Wilson, R., Spann, M.: Finite prolate spheroidal sequences and their applications II: Image feature description and segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 10 (1988) 193–203
25. Hsu, T., Kuo, J., Wilson, R.: A multiresolution texture gradient method for unsupervised segmentation. Pattern Recognition 32 (2000) 1819–1833
26. Chan, F., Lam, F., Poon, P., Zhu, H., Chan, K.: Object boundary location by region and contour deformation. IEE Proceedings - Vision, Image and Signal Processing 143 (1996) 353–360
27. Vérard, L., Fadili, J., Ruan, S., Bloyet, D.: 3D MRI segmentation of brain structures. In: International Conference of the IEEE Engineering in Medicine and Biology Society, Amsterdam, Netherlands (1996) 1081–1082
28. Jang, D., Lee, D., Kim, S.: Contour detection of hippocampus using dynamic contour model and region growing. In: International Conference of the IEEE Engineering in Medicine and Biology Society, Chicago, Illinois (1997) 763–766
29. Siebert, A.: Dynamic region growing. In: Vision Interface, Kelowna, Canada (1997)
30. Fua, P., Hanson, A.: Using generic geometric models for intelligent shape extraction. In: National Conference on Artificial Intelligence, Seattle, Washington (1987) 706–711
31. Lemoigne, J., Tilton, J.: Refining image segmentation by integration of edge and region data. IEEE Transactions on Geoscience and Remote Sensing 33 (1995) 605–615
32. Hojjatoleslami, S., Kittler, J.: Region growing: A new approach. IEEE Transactions on Image Processing 7 (1998) 1079–1084
33. Vincken, K., Koster, A., Viergever, M.: Probabilistic multiscale image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (1997) 109–120
34. Sahoo, P., Soltani, S., Wong, A., Chen, Y.: A survey of thresholding techniques. Computer Vision, Graphics and Image Processing 41 (1988) 233–260
35. Zhang, Y.: Evaluation and comparison of different segmentation algorithms. Pattern Recognition Letters 18 (1997) 963–974
36. Huang, Q., Dom, B.: Quantitative methods of evaluating image segmentation. In: International Conference on Image Processing. Volume III., Washington DC (1995) 53–56
Perceptual Grouping from Motion Cues Using Tensor Voting in 4-D

Mircea Nicolescu and Gérard Medioni

Integrated Media Systems Center
University of Southern California
Los Angeles, CA 90089-0273
{mnicoles, medioni}@iris.usc.edu
Abstract. We present a novel approach for motion grouping from two frames, that recovers the dense velocity field, motion boundaries and regions, based on a 4-D Tensor Voting computational framework. Given two sparse sets of point tokens, we encode the image position and potential velocity for each token into a 4-D tensor. The voting process then enforces the motion smoothness while preserving motion discontinuities, thus selecting the correct velocity for each input point, as the most salient token. By performing an additional dense voting step we infer velocities at every pixel location, motion boundaries and regions. Using a 4-D space for this Tensor Voting approach is essential, since it allows for a spatial separation of the points according to both their velocities and image coordinates. Unlike other methods that optimize a specific objective function, our approach does not involve initialization or search in a parametric space, and therefore does not suffer from local optima or poor convergence problems. We demonstrate our method with synthetic and real images, by analyzing several difficult cases – opaque and transparent motion, rigid and non-rigid motion, curves and surfaces in motion.
1 Introduction

A traditional formulation of the motion analysis problem is the following: given two or more image frames, the goal is to determine three types of information – a dense velocity field, motion boundaries, and regions. Computationally, the problem can be decomposed in two processes – matching and motion capture. The matching process identifies the elements (tokens) in successive views that represent the same physical object, thus producing a (possibly sparse) velocity field. The motion capture process infers velocity vectors at every image location, thus producing a dense velocity field, and groups tokens into regions separated by motion boundaries.

Here we focus on the problem of matching and motion capture from sparse sets of point tokens in two frames. Two examples of such input are shown in Fig. 1 and Fig. 2. If the frames in each pair are presented in a properly timed succession, a certain motion of image regions is perceived from one frame to the other. However, while in one case the regions can be detected even without motion, only from monocular cues (here, different densities of points), in the other case no monocular information is available. This example shows that analysis is possible even from motion cues only.
Fig. 1. Translating circle
Fig. 2. Translating disk
Another interesting aspect is the fact that the human vision system not only establishes point correspondences, but also perceives regions in motion, although the input consists of sparse points only. This demonstrates that both processes of matching and motion capture are involved in motion analysis. Ullman presents an excellent analysis of the correspondence problem, from both a psychological and a computational perspective [1]. Here we are following his conclusion, that correspondence formation is a low-level process which expresses mutual token affinities, and takes place prior to any 3-D interpretation. Tokens involved in matching are non-complex elements, such as points, blobs, edge and line fragments.

Optical flow techniques [2] – such as differential methods, region-based matching, energy-based or phase-based methods – rely on local, raw estimates of the optical flow field to produce a partition of the image. However, the flow estimates are very poor at motion boundaries and cannot be obtained in uniform areas. Past approaches have investigated the use of Markov Random Fields (MRF) in handling discontinuities in the optical flow [3]. Another research direction uses regularization techniques, which preserve discontinuities by weakening the smoothing in areas that exhibit strong intensity gradients [4][5].

Significant improvements have been achieved more recently by using layered representations [6][7]. There are many advantages of this formalism – mainly because it represents a natural way to incorporate motion field discontinuities, and it allows for handling occlusion relationships between different regions in the image. Establishing an association between input tokens and layers is still difficult. Some methods perform an iterative fitting of data to parametric models [8][9][10]. The difficulties involved in this estimation process range from a severe restriction in motion representation (as rigid or planar), to over-fitting and instability due to high-order parameterizations. Other approaches employ the rigidity constraint [1] to solve the correspondence problem, but besides the inability to handle non-rigid motions, such methods require three or more frames, while we process input from two frames only.

Little et al. [11] developed a parallel algorithm for computing the optical flow by using a local voting scheme based on similarity of planar patches. However, their methodology cannot prevent motion boundary blurring due to over-smoothing and is restricted to short-range motion only.

From a computational point of view, one of the most powerful and most often used constraints is the smoothness of motion. Usually, previous techniques encounter traditional difficulties in image regions where motion is not smooth (i.e., around motion boundaries). To compute the velocity field, knowledge of the boundaries is
required so that the smoothness constraint can be relaxed around the discontinuities. But the boundaries cannot be computed without first having determined the pixel velocities. This "chicken-and-egg" problem has led to numerous inconsistent methods, with ad-hoc criteria introduced to account for motion discontinuities.

A computational framework that successfully enforces the smoothness constraint in a unified manner, while preserving smoothness discontinuities, is Tensor Voting [12]. This approach also benefits from the fact that it is non-iterative, it does not depend on critical thresholds, and it does not involve initialization or search in a parametric space, and therefore it does not suffer from local optima and poor convergence problems. The first to propose using Tensor Voting to determine the velocity field were Gaucher and Medioni [13]. They employ successive steps of voting, first to determine the boundary points as the tokens with maximal motion uncertainty, and then to locally refine velocities near the boundaries by allowing communication only between tokens placed on the same side of the boundary. However, in their approach the voting communication between tokens is essentially a 2-D process that does not inhibit neighboring elements with different velocities from influencing each other.

In this paper we propose a novel approach based on a layered 4-D representation of data, and a voting scheme for token communication. Our methodology is formulated as a 4-D Tensor Voting computational framework. The position (x,y) and potential velocity (vx,vy) of each token are encoded as a 4-D tuple. By letting the tokens propagate their information through voting, distinct moving regions emerge as smooth surface layers in this 4-D space of image coordinates and pixel velocities.

In the next section we examine the voting framework by first giving an overview of the Tensor Voting formalism; then we discuss how the voting concepts are generalized and extended to the 4-D case. In Section 3 we present our approach and show how this formalism is applied to the problem of matching and motion capture. In Section 4 we present our experimental results, while Section 5 summarizes our contribution and provides further research directions.
2 Voting Framework

2.1 Tensor Voting Overview

The use of a voting process for feature inference from sparse and noisy data was introduced by Guy and Medioni [14] and then formalized into a unified tensor framework [12]. This methodology is non-iterative and robust to considerable amounts of outlier noise. The only free parameter is the scale of analysis, which is indeed an inherent property of visual perception. The input data is encoded as tensors, then support information (including proximity and smoothness of continuity) is propagated by convolution-like voting within a neighborhood.

In the 2-D case, the salient features to be extracted are points and curves. Each token is encoded as a second order symmetric 2-D tensor, which is geometrically equivalent to an ellipse. This ellipse can be described by a 2×2 eigensystem, where the eigenvectors e1 and e2 give the ellipse orientation and the eigenvalues λ1 and λ2 (λ1 ≥ λ2) represent the ellipse size. The tensor is internally represented as a matrix S:
$$S = \lambda_1 \, e_1 e_1^T + \lambda_2 \, e_2 e_2^T \qquad (1)$$
An input token that represents a curve element is encoded as a stick tensor, where e2 represents the curve tangent and e1 the curve normal, while λ1=1 and λ2=0. An input token that represents a point element is encoded as a ball tensor, with no preferred orientation, while λ1=1 and λ2=1.

The communication between tokens is performed through a voting process, where each token casts a vote at each site in its neighborhood. The size and shape of this neighborhood, and the vote strength and orientation, are encapsulated in predefined voting fields (kernels), one for each feature type – there is a stick voting field and a ball voting field in the 2-D case. The fields are generated based on a single parameter – the scale factor σ. Vote orientation corresponds to the best (smoothest) possible local curve continuation from the voter to the recipient, while vote strength $VS(\vec{d})$ decays with the distance $|\vec{d}|$ from voter to recipient, and with the curvature ρ:

$$VS(\vec{d}) = e^{-\frac{|\vec{d}|^2 + \rho^2}{\sigma^2}} \qquad (2)$$
Fig. 3 shows how votes are generated to build the 2-D stick field. A tensor P where curve information is locally known (illustrated by the curve normal $\vec{N}_P$) casts a vote at its neighbor Q. The vote orientation is chosen so that it ensures a smooth curve continuation (through a circular arc) from voter P to recipient Q. The curve normal $\vec{N}$ thus obtained needs to be propagated by the vote from P to Q. The vote $V_{stick}(\vec{d})$ received from P at Q is encoded as a tensor according to (3), where $\vec{d} = Q - P$:

$$V_{stick}(\vec{d}) = VS(\vec{d}) \cdot \vec{N}\vec{N}^T \qquad (3)$$
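A sketch of a single stick vote, combining the decay of (2) with the arc-continuation normal of (3), is given below. The 45-degree angular cut-off is a common convention rather than a value stated in the text, and the decay follows the simplified strength of (2).

```python
import numpy as np

def stick_vote(voter_pos, voter_normal, receiver_pos, sigma=5.0):
    """2-D stick vote cast by the voter at the receiver (sketch of eqs. (2)-(3)).

    Returns the 2x2 tensor VS(d) * N N^T, where N is the normal of the smooth
    circular-arc continuation from voter to receiver.
    """
    n = np.asarray(voter_normal, dtype=float)
    n /= np.linalg.norm(n)
    t = np.array([-n[1], n[0]])                     # voter tangent
    d = np.asarray(receiver_pos, dtype=float) - np.asarray(voter_pos, dtype=float)
    dist = np.linalg.norm(d)
    if dist == 0.0:
        return np.outer(n, n)
    if np.dot(d, t) < 0:                            # field is symmetric about the voter
        t = -t
    phi = np.arctan2(np.dot(d, n), np.dot(d, t))    # angle of d w.r.t. the tangent
    if abs(phi) > np.pi / 4:                        # outside the field's angular extent
        return np.zeros((2, 2))
    curvature = 2.0 * np.sin(phi) / dist            # curvature of the circular arc
    c, s = np.cos(2.0 * phi), np.sin(2.0 * phi)
    receiver_normal = np.array([c * n[0] - s * n[1], s * n[0] + c * n[1]])
    strength = np.exp(-(dist ** 2 + curvature ** 2) / sigma ** 2)
    return strength * np.outer(receiver_normal, receiver_normal)
```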
Note that the vote strength at both Q' and Q'' is smaller than at Q – because Q' is farther, and Q'' requires a higher curvature than Q. Fig. 4 shows the 2-D stick field, with its color-coded strength. When the voter is a ball tensor, with no information known locally, the vote is generated by rotating a stick vote in the 2-D plane and integrating all contributions, according to equation (4). The corresponding 2-D ball field is shown in Fig. 5.
Fig. 3. Vote generation
Fig. 4. 2-D stick field
Fig. 5. 2-D ball field
$$V_{ball}(\vec{d}) = \int_0^{2\pi} R_\theta \, V_{stick}(R_\theta^{-1}\vec{d}) \, R_\theta^T \, d\theta \qquad (4)$$
At each receiving site, the collected votes are combined through simple tensor addition (sum of the matrices $V(\vec{d})$), thus producing generic 2-D tensors. During voting, tokens that lie on a smooth curve reinforce each other, and the tensors deform according to the prevailing orientation. Each such tensor encodes the local orientation of geometric features such as curves (given by the tensor orientation), and the confidence of this knowledge (also called saliency, given by the tensor shape and size). For a general tensor, its curve saliency is given by (λ1−λ2) and the curve normal orientation by e1, while its point saliency is given by λ2. After voting, each resulting tensor S in general form is decomposed into a stick and a ball component, each being weighted by the corresponding saliency:
$$S = (\lambda_1 - \lambda_2)\, e_1 e_1^T + \lambda_2\, (e_1 e_1^T + e_2 e_2^T) \qquad (5)$$
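Assuming votes have been accumulated by matrix addition, the decomposition in (5) reduces to an eigen-decomposition of the 2-D tensor, as in the sketch below; the saliencies are read directly from the eigenvalues.

```python
import numpy as np

def decompose_2d_tensor(tensor):
    """Split an accumulated 2-D tensor into stick and ball parts (eq. (5))."""
    evals, evecs = np.linalg.eigh(tensor)     # eigenvalues in ascending order
    l1, l2 = evals[1], evals[0]
    e1 = evecs[:, 1]                          # dominant orientation (curve normal)
    curve_saliency = l1 - l2                  # weight of the stick component
    point_saliency = l2                       # weight of the ball component
    return curve_saliency, point_saliency, e1

# Votes collected at a site are combined by simple matrix addition, e.g.
# accumulated = sum(stick_vote(p, n, site) for (p, n) in voters)
```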
Therefore, the voting process infers curves and junctions (points) simultaneously, while also identifying outlier noise (tokens that receive very little support).

In the 3-D case, the salient features are points, curves and surfaces. A point element corresponds to a ball tensor, with λ1=λ2=λ3=1 and no preferred orientation; a curve element is represented by a plate tensor, where two eigenvalues co-dominate (λ1=1, λ2=1, λ3=0) and the eigenvector e3 gives the curve tangent; and a surface element is represented by a stick tensor, where one eigenvalue dominates (λ1=1, λ2=0, λ3=0), the surface normal is given by e1, and e2 and e3 are surface tangents.

2.2 Tensor Voting in 4-D

The Tensor Voting framework is general enough to be extended to any dimension readily, except for some implementation changes, mainly for efficiency purposes [15]. The issues to be addressed are the tensorial representation of the features in the desired space, the generation of voting fields, and the data structures used for vote collection. Table 1 shows all the geometric features that appear in a 4-D space and their representation as elementary 4-D tensors, where n and t denote normal and tangent vectors, respectively.

Table 1. Elementary tensors in 4-D
Feature | λ1 λ2 λ3 λ4 | e1 e2 e3 e4 | Tensor
point   | 1 1 1 1     | any orthonormal basis | Ball
curve   | 1 1 1 0     | n1 n2 n3 t  | C-Plate
surface | 1 1 0 0     | n1 n2 t1 t2 | S-Plate
volume  | 1 0 0 0     | n t1 t2 t3  | Stick
Table 2. A generic tensor in 4-D
Feature | Saliency | Normals  | Tangents
point   | λ4       | none     | none
curve   | λ3 − λ4  | e1 e2 e3 | e4
surface | λ2 − λ3  | e1 e2    | e3 e4
volume  | λ1 − λ2  | e1       | e2 e3 e4
Note that a surface in the 4-D space can be characterized by two normal vectors, or by two tangent vectors. From a generic 4-D tensor that results after voting, the geometric features can be extracted as shown in Table 2.

The voting fields are a key part of the formalism – they are responsible for the size and shape of the neighborhood where the votes are cast, and they also control how the votes depend on distance and orientation. The 4-D voting fields are obtained as follows. First, the 4-D stick field is generated in a similar manner to the 2-D stick field, as explained in Section 2.1 and illustrated in Fig. 3. Then, the other three voting fields are built by integrating all the contributions obtained by rotating a 4-D stick field around appropriate axes. In particular, the 4-D ball field – the only one directly used here – is generated according to:

$$V_{ball}(\vec{d}) = \int_0^{2\pi}\!\!\int_0^{2\pi}\!\!\int_0^{2\pi} R_{\theta_{xy}\theta_{xu}\theta_{xv}} \, V_{stick}(R_{\theta_{xy}\theta_{xu}\theta_{xv}}^{-1}\vec{d}) \, R_{\theta_{xy}\theta_{xu}\theta_{xv}}^T \, d\theta_{xy}\, d\theta_{xu}\, d\theta_{xv} \qquad (6)$$
where x, y, u, v are the 4-D coordinate axes, θxy, θxu, θxv are rotation angles in the specified planes, and the stick field corresponds to the orientation (1 0 0 0).

In the 2-D or 3-D case, the data structure used to store the tensors during vote collection was a simple 2-D grid or a red-black tree. Because we need a data structure that scales gracefully to higher dimensions, the solution used in our approach is an approximate nearest neighbor (ANN) k-d tree [16]. Since we use efficient data structures to store the tensors, the space complexity of the algorithm is O(n), where n is the input size. The average time complexity of the voting process is O(mn), where m is the average number of tokens in the neighborhood. Therefore, in contrast to other voting techniques, such as the Hough Transform, both time and space complexities of the Tensor Voting methodology are independent of the dimensionality of the desired feature. The running time for an input of size 700 is about 20 seconds on a Pentium III (600 MHz) processor.
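A sketch of the neighbourhood queries needed for sparse voting over the 4-D tokens is given below; for simplicity it substitutes scipy's exact k-d tree for the approximate nearest neighbour (ANN) structure of [16].

```python
import numpy as np
from scipy.spatial import cKDTree

def voting_neighbourhoods(tokens, scale):
    """For each 4-D token (x, y, vx, vy), list the indices of the tokens
    that fall inside its voting neighbourhood of radius `scale`.

    An exact k-d tree is used here instead of the ANN library of [16].
    """
    pts = np.asarray(tokens, dtype=float)          # shape (n, 4)
    tree = cKDTree(pts)
    return tree.query_ball_point(pts, r=scale)     # one index list per token
```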
3 Our Approach

The main difficulties in visual motion analysis appear at motion boundaries, where velocity estimates are very poor. This happens because the problem is typically cast
as a two-dimensional process. As a result, along boundaries tokens have a strong mutual affinity because they are close in the image, despite the fact that they may belong to different regions, with different velocities. Accordingly, we believe that the desirable representation should be based on a layered description, where regions in motion are represented as smooth and possibly overlapping layers.

In any method that seeks to solve the motion analysis problem, each token is characterized by four attributes – its image coordinates (x,y) and its velocity with the components (vx,vy). We encapsulate each token into a (x,y,vx,vy) tuple in the 4-D space, this being a natural way of expressing the spatial separation of tokens according to both velocities and image coordinates. It is especially helpful for eliminating the problem of uncertainty along motion boundaries, where although tokens are close in image space, their interaction is now inhibited by their separation in velocity space. In general, there may be several candidate velocities for each point (x,y), so each tuple (x,y,vx,vy) represents a (possibly wrong) candidate match.

Both matching and motion capture are based on a process of communicating the affinity between tokens. In our representation, this affinity is expressed as the token preference for being incorporated into a smooth surface layer in the 4-D space. A necessary condition is to enforce strong support between tokens in the same layer, and weak support across layers, or at isolated tokens. A suitable computational framework that enforces the smoothness constraint while preserving discontinuities is Tensor Voting, here performed in the 4-D space. The affinities between tokens are embedded in the concept of surface saliency exhibited by the data. By letting the tokens propagate their information through voting, wrong matches are eliminated as they receive little support, and layers are extracted as salient smooth surfaces. Essentially, the matching problem is expressed as an outlier rejection process, while motion capture is performed mainly as a layer densification process.

We demonstrate the contribution of this work by addressing the problems of matching and motion capture. Given two sparse sets of point tokens, we first use 4-D voting to select the correct match for each input point, as the most salient token, thus producing a sparse velocity field. By using the same voting framework during the motion capture process, we then generate a dense layer representation in terms of motion boundaries and regions. We illustrate our method with both synthetic and real images, by analyzing several cases – opaque and transparent motion, rigid and non-rigid motion, curves and surfaces in motion.

3.1 Matching

We take as input two frames containing identical point tokens, in a sparse configuration. For illustration purposes, we give a step-by-step description of our approach by using a specific example – the point tokens represent an opaque translating disk (Fig. 2) against a static background. Later we also show how our method performs on several other examples. Before proceeding, we need to make a brief comment on how we display the intermediate results (i.e. those in 4-D). In order to allow for a three-dimensional display, the last component of each 4-D point has been dropped, so that the 3
dimensions shown are image coordinates x and y (in the horizontal plane), and the vx component of image velocity (the height). We do not assume any a priori information about the scene. Candidate matches are generated as follows: in a pre-processing step, for each token in the first frame we simply create a potential match with every point in the second frame that is located within a neighborhood (whose size is given by the scale factor) of the first token. The resulting candidates appear as a cloud of (x,y,vx,vy) points in the 4-D space. In our translation example we have 400 input points, and by using the procedure described above we generate an average of 5.3 candidate matches per point, among which at most one is correct. Fig. 6(a) shows the candidate matches. Note that the correct matches can already be visually perceived, as they are grouped in two parallel layers surrounded by noisy matches. Since no information is initially known, each potential match is encoded into a 4-D ball tensor, whose eigenvalues and eigenvectors are $\lambda_1 = \lambda_2 = \lambda_3 = \lambda_4 = 1$, with $e_1 = (0\;0\;0\;1)^T$, $e_2 = (0\;0\;1\;0)^T$, $e_3 = (0\;1\;0\;0)^T$, $e_4 = (1\;0\;0\;0)^T$. After encoding, each token casts votes in its neighborhood (given by the scale factor). This is a sparse voting process, in the sense that votes are cast only at input token locations. Votes are generated by using the 4-D ball voting field, where no particular orientation is preferred. During voting there is strong support between tokens that happen to lie on a smooth surface (layer), while communication between layers is reduced by the spatial separation in the 4-D space of both image coordinates and pixel velocities. Wrong matches appear as isolated points, which receive little or no support. A measure of this support is given by the surface saliency. The next step is to eliminate wrong matches. For each group of tokens that have common (x,y) coordinates but different (vx,vy) velocities, we retain the token with the strongest surface saliency (that is, with the maximum value of $\lambda_2 - \lambda_3$), while rejecting the others as outliers. For the translating disk example, a comparison with the ground truth shows that matching was 100% accurate – all 400 matches have been recovered correctly, despite the large amount (approximately 500%) of noise present. Fig. 6(b) shows the recovered sparse velocity field, while Fig. 6(c) shows a 3-D view of the recovered matches, where the height represents the vx velocity component.
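A minimal sketch of these two steps – candidate generation within a scale-sized neighborhood and saliency-based outlier rejection – is given below. It is not the authors' implementation: the 4-D ball-voting pass is assumed to have already produced one 4×4 second-order tensor per candidate (passed in here as `tensors`), and the function names and data layout are ours.

```python
import numpy as np

def candidate_matches(frame1, frame2, scale):
    """Pair every token of frame1 with each frame2 token lying within `scale`
    pixels, yielding (x, y, vx, vy) candidates in the 4-D space."""
    candidates = []
    for (x1, y1) in frame1:
        for (x2, y2) in frame2:
            if (x2 - x1) ** 2 + (y2 - y1) ** 2 <= scale ** 2:
                candidates.append((x1, y1, x2 - x1, y2 - y1))
    return candidates

def surface_saliency(tensor):
    """Surface saliency of a 4x4 second-order tensor: lambda2 - lambda3."""
    lam = np.sort(np.linalg.eigvalsh(tensor))[::-1]   # descending eigenvalues
    return lam[1] - lam[2]

def reject_outliers(candidates, tensors):
    """For each (x, y), keep only the candidate velocity whose accumulated
    tensor has the largest surface saliency; discard the rest as wrong matches."""
    best = {}
    for cand, tensor in zip(candidates, tensors):
        key = (cand[0], cand[1])
        s = surface_saliency(tensor)
        if key not in best or s > best[key][0]:
            best[key] = (s, cand)
    return [cand for _, cand in best.values()]
```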
Fig. 6. Translating disk: (a) candidate matches, (b) sparse velocity field, (c) recovered vx velocities, (d) refined velocity field (sparse), (e) dense velocity field, (f) regions, (g) boundaries.
Therefore, we perform here an orientation refinement through another sparse voting process, but this time with the correct matches only. For this purpose, every 4-D tuple is again encoded as a ball tensor. After voting, the desired orientations – as normals to layers – are found at each token as the first two eigenvectors e1 and e2. We remind the reader that a surface in 4-D is characterized by two normal vectors. In Fig. 6(d) we show a 3-D view of the tokens with refined layer orientations (only one of the normals is shown at each token). In order to attain the very goal of the motion capture problem – that is, to recover boundaries and regions as continuous curves and surfaces, respectively – it is necessary to first infer velocities and layer orientations at every image location. The previously developed Tensor Voting framework allows for a densification procedure that extracts surfaces or curves from sparse data. However, since the algorithm is based on a marching process that grows the surface around a seed, if the surfaces are not closed they will be over-extended. This happens because the growing process stops only when saliency drops below a certain level, due to the decay with distance from supporting tokens. Since in our case it is crucial to obtain accurate motion boundaries, we devised a different densification scheme. The key fact is that our 4-D space is not isotropic – we need to obtain a tensor value at every (x, y) image location, but certainly not at every (vx,vy) location in velocity space. In our approach, for each discrete point (x, y) in the image we try to find the best (vx,vy) location at which to place a newly generated token. The candidates considered
are all the discrete points (vx,vy) between the minimum and maximum velocities in the sparse token set, within a neighborhood of the (x, y) point. At each candidate position (x,y,vx,vy) we accumulate votes from the sparse tokens, according to the same Tensor Voting framework that we have used so far. The candidate token with the largest surface saliency after voting is retained, and its (vx,vy) coordinates represent the best velocity at (x, y). By following this procedure at every (x, y) image location we generate a dense velocity field. Note that in this process, along with velocities we simultaneously infer layer orientations, given by eigenvectors e1 and e2. In Fig. 6(e) we show a 3-D view of the dense set of tokens and their associated layer orientations. The next step is to group tokens into regions that correspond to distinct moving objects, by using again the smoothness constraint. We start from an arbitrary point in the image, assign a region label to it, and try to recursively propagate this label to all its image neighbors. In order to decide whether the label must be propagated, we use the smoothness of both velocity and layer orientation as a grouping criterion. Having both pieces of information available is especially helpful in situations where neighboring pixels have very similar velocities, and yet they must belong to different regions. Most methods that are based only on velocity discontinuities would fail in these cases. We will show such an example later. After assigning region labels to every token, for illustration purposes we perform a triangulation of each of the regions detected. The resulting surfaces are presented in Fig. 6(f). Finally, we have implemented a method to extract the motion boundary for each region as a “partially convex hull”. The process is controlled by only one parameter – the scale factor – that determines the perceived level of detail (that is, the departure from the actual convex hull). The resulting boundary curves are shown in Fig. 6(g).
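The densification scheme just described can be organized as in the following sketch. It is only illustrative: `accumulate_votes` stands in for the 4-D tensor-voting accumulation at a candidate position (not shown), integer velocity steps are assumed, and the full velocity range is scanned at every pixel whereas the paper restricts the scan to a neighborhood of the (x, y) point.

```python
import numpy as np

def densify(sparse_tokens, image_shape, accumulate_votes):
    """For every pixel, scan candidate velocities and keep the most salient one.

    sparse_tokens       : array of (x, y, vx, vy) rows produced by matching.
    image_shape         : (height, width) tuple.
    accumulate_votes(p) : returns the 4x4 tensor collected at the 4-D position p
                          from the sparse tokens (tensor-voting pass, not shown).
    """
    vx_range = np.arange(sparse_tokens[:, 2].min(), sparse_tokens[:, 2].max() + 1)
    vy_range = np.arange(sparse_tokens[:, 3].min(), sparse_tokens[:, 3].max() + 1)
    dense = np.zeros(image_shape + (2,))
    for y in range(image_shape[0]):
        for x in range(image_shape[1]):
            best_sal, best_v = -np.inf, (0.0, 0.0)
            for vx in vx_range:
                for vy in vy_range:
                    tensor = accumulate_votes((x, y, vx, vy))
                    lam = np.sort(np.linalg.eigvalsh(tensor))[::-1]
                    sal = lam[1] - lam[2]          # surface saliency in 4-D
                    if sal > best_sal:
                        best_sal, best_v = sal, (vx, vy)
            dense[y, x] = best_v                   # best velocity at (x, y)
    return dense
```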
4 Results
The case illustrated so far may be considered too simple since the only motion involved is translation. However, no assumption – such as translational, planar, or rigid motion – has been made. The only criterion used is the smoothness of image motion. To support this argument, we show next that our approach also performs very well for several other configurations. Expanding disk (Fig. 7). The input consists of two sets of 400 point tokens each, representing an opaque disk in expansion against a static background. The average number of candidate matches per point is 6.1. Comparing the resulting matches with the true motion shows that only 1 match among 400 is wrong. This example demonstrates that, without special handling, our framework can easily accommodate non-rigid image motion. Rotating disk – translating background (Fig. 8). The input consists of two sets of 400 point tokens each, representing an opaque rotating disk against a translating background. The average number of candidate matches per point is 5.8. After processing, only 2 matches among 400 are wrong. This is a very difficult case even for human vision, due to the fact that around the left extremity of the disk the two motions (of the disk and the background) are almost identical. In that part of the image there are points on different moving objects that are not separated, even in the
Fig. 7. Expanding disk
Fig. 8. Rotating disk – translating background
4-D space. In spite of this inherent ambiguity, our method is still able to accurately recover velocities, regions and boundaries. The key fact is that we rely not only on the 4-D positions, but also on the local layer orientations that are still different and therefore provide a good affinity measure. Rotating ellipse (Fig. 9). The input consists of two sets of 100 point tokens each, representing a rotating ellipse. The average number of candidate matches per point is 5.9. After processing, all 100 matches have been correctly recovered. Many methods would fail on this example (used in the literature to illustrate the aperture effect) – one clear difficulty is that at the points where the rotated ellipse “intersects” the original one the velocity could be wrongly estimated as zero. Transparent motion (Fig. 10). The input consists of two sets of 500 point tokens each, representing a transparent disk in translation against a static background. The average number of candidate matches per point is 8.9. After processing, all 500 matches have been correctly recovered. This example is extremely relevant to illustrate the power of our approach. If the analysis had been performed in a two-dimensional space, it would have failed because the two motion layers are superimposed in 2-D. In our framework, using the 4-D space provides a very natural separation between layers, a separation that is consistent with human perception. For
Fig. 9. Rotating ellipse
Fig. 10. Transparent motion
Fig. 11. Rotating square
Fig. 12. Translating circle
our matching method, the transparent motion does not appear as a special case and it does not create any more difficulties than the opaque motion. Rotating square (Fig. 11). The input consists of two sets of 100 point tokens each, representing a rotating square. The average number of candidate matches per point is 5.7. After processing, all 100 matches have been correctly recovered. This example is similar to the rotating ellipse and shows that the presence of non-smooth curves does not produce additional difficulty for our methodology. Translating circle (Fig. 12). The input consists of two sets of 400 point tokens each, representing a translating circle against a static background. The average number of candidate matches per point is 6. After processing, all 400 matches have been correctly recovered. This example shows that we can successfully handle both curves and surfaces in motion. So far we have presented only cases where no monocular information (such as intensity) is available, and the entire analysis has been performed based on motion cues only. Human vision is able to handle these cases remarkably well, and their study is fundamental for understanding the motion analysis process. Nevertheless they are very difficult from a computational perspective – most existing methods cannot handle such examples in a consistent and unified manner. To further validate our approach we have also analyzed several standard image sequences, where both monocular and motion cues are available. In order to incorporate monocular information into our framework, we only needed to change the pre-processing step where candidate matches are generated. We ran a simple intensity-based cross-correlation procedure, and we retained all peaks of correlation as candidate matches. The rest of our framework remains unchanged. Yosemite sequence (Fig. 13). We analyzed the motion from two frames of the Yosemite sequence (without the sky) to quantitatively estimate the performance of our approach. The average angular error obtained is 3.74° ± 4.3° for 100% field coverage, a result comparable with those in the literature [2]. This example also shows that our method successfully recovers non-planar motion layers.
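For real sequences, the correlation-based candidate generation mentioned above can be approximated by a brute-force normalized cross-correlation such as the sketch below. The patch size, search radius, and the choice of keeping the top few scores per pixel (rather than all correlation peaks, as in the paper) are simplifications of ours.

```python
import numpy as np

def correlation_candidates(img1, img2, radius=10, patch=3, keep=5):
    """For each pixel of img1, keep the `keep` best normalized-correlation
    displacements into img2 within `radius`, as (x, y, vx, vy) candidates."""
    h, w = img1.shape

    def patch_at(img, y, x):
        p = img[y - patch:y + patch + 1, x - patch:x + patch + 1].astype(float)
        p = p - p.mean()
        n = np.linalg.norm(p)
        return p / n if n > 0 else p

    candidates = []
    for y in range(patch, h - patch):
        for x in range(patch, w - patch):
            ref = patch_at(img1, y, x)
            scores = []
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if patch <= yy < h - patch and patch <= xx < w - patch:
                        scores.append(((ref * patch_at(img2, yy, xx)).sum(), dx, dy))
            scores.sort(reverse=True)
            for _, dx, dy in scores[:keep]:
                candidates.append((x, y, dx, dy))
    return candidates
```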
Fig. 13. Yosemite: (a) an input frame, (b) x-velocities, (c) y-velocities, (d) motion layer (x-velocities).
Fig. 14. Flower Garden: (a) an input frame, (b) x-velocities, (c) y-velocities, (d) regions.
Flower Garden sequence (Fig. 14). For a qualitative estimation, we also analyzed the motion from two frames of the Flower Garden sequence. It is worth mentioning that wrong candidates generated due to occlusion are corrected during the densification step. Scale sensitivity. Since the only parameter involved in our voting framework is the scale factor that defines the voting fields, we analyzed how it influences the quality of the analysis. We ran our algorithm on the expanding disk example for a large range of
Fig. 15. Scale factor influence
scale values and we found that the method is remarkably robust to varying scale factors. Fig. 15 shows the number of wrong matches (for an input of 400 points) obtained for different values of the voting field size. For comparison, the image size is 200 by 200 pixels. Note that when the field is too small, tokens no longer communicate.
5 Conclusions and Future Work
We have presented a novel approach to the problem of perceptual grouping from motion cues, based on a layered 4-D representation of data, and a voting scheme for token communication. Our methodology is formulated as a 4-D Tensor Voting computational framework. The moving regions are conceptually represented by smooth layers in the 4-D space of image coordinates and pixel velocities. Within this data representation, we employed a voting scheme for token affinity communication. Token affinities are expressed by their preference for being incorporated into smooth surfaces, as statistically salient features. Communication between sites is performed by tensor voting. From a possibly sparse input consisting of identical point tokens in two frames, without any a priori knowledge of the motion model, we determine a dense representation in terms of accurate velocities, motion boundaries and regions, by enforcing the smoothness constraint while preserving motion discontinuities. Using a 4-D space for our Tensor Voting approach is essential, since it allows for a spatial separation of the points according to both their velocities and image coordinates. Consequently, the proposed framework allows tokens from the same layer to strongly support each other, while inhibiting influence from other layers, or from isolated tokens. Despite the high dimensionality, our voting scheme is both time and space efficient. Its complexity depends on the input size only. Our approach does not involve initialization or search in a parametric space, and therefore does not suffer from local optima or poor convergence problems. The only free parameter is scale, which is an inherent characteristic of human vision, and its setting is not critical. We demonstrated the contributions of this work by analyzing several cases – opaque and transparent motion, rigid and non-rigid motion, curves and surfaces in motion. We showed that our method successfully addresses the difficult problem of
motion analysis from motion cues only, and is also able to incorporate monocular cues that are present in real images. We plan to extend our approach to real image sequences by using a more elaborate procedure for generating the initial candidates, rather than a simple cross-correlation technique. Other research directions include studying the occlusion relationships and incorporating information from multiple frames.
Acknowledgements. This research has been funded in part by the Integrated Media Systems Center, a National Science Foundation Engineering Research Center, Cooperative Agreement No. EEC-9529152, and by National Science Foundation Grant 9811883.
References
[1] S. Ullman, “The Interpretation of Visual Motion”, MIT Press, 1979.
[2] J. Barron, D. Fleet, S. Beauchemin, “Performance of Optical Flow Techniques”, IJCV, 1994, 12:1, pp. 43-77.
[3] F. Heitz, P. Bouthemy, “Multimodal Estimation of Discontinuous Optical Flow Using Markov Random Fields”, PAMI, December 1993, 15:12, pp. 1217-1232.
[4] S. Ghosal, “A Fast Scalable Algorithm for Discontinuous Optical Flow Estimation”, PAMI, 1996, 18:2, pp. 181-194.
[5] R. Deriche, P. Kornprobst, G. Aubert, “Optical Flow Estimation while Preserving its Discontinuities: A Variational Approach”, ACCV, 1995, pp. 290-295.
[6] S. Hsu, P. Anandan, S. Peleg, “Accurate Computation of Optical Flow by Using Layered Motion Representations”, ICPR, 1994, pp. 743-746.
[7] A. Jepson, M. Black, “Mixture Models for Optical Flow Computation”, CVPR, 1993, pp. 760-761.
[8] M. Irani, S. Peleg, “Image Sequence Enhancement Using Multiple Motions Analysis”, CVPR, 1992, pp. 216-221.
[9] J. Wang, E. Adelson, “Representing Moving Images with Layers”, IEEE Trans. on Image Processing, Special Issue: Image Sequence Compression, 1994, 3:5, pp. 625-638.
[10] Y. Weiss, “Smoothness in Layers: Motion Segmentation Using Nonparametric Mixture Estimation”, CVPR, 1997, pp. 520-526.
[11] J. Little, H. Bulthoff, T. Poggio, “Parallel Optical Flow Using Local Voting”, ICCV, 1988, pp. 454-459.
[12] G. Medioni, M.-S. Lee, C.-K. Tang, “A Computational Framework for Segmentation and Grouping”, Elsevier Science, 2000.
[13] L. Gaucher, G. Medioni, “Accurate Motion Flow Estimation with Discontinuities”, ICCV, 1999, pp. 695-702.
[14] G. Guy, G. Medioni, “Inference of Surfaces, 3-D Curves and Junctions from Sparse, Noisy 3-D Data”, IEEE Trans. PAMI, 1997, 19:11, pp. 1265-1277.
[15] C.-K. Tang, G. Medioni, M.-S. Lee, “Epipolar Geometry Estimation by Tensor Voting in 8D”, ICCV, 1999, pp. 502-509.
[16] S. Arya, D. Mount, N. Netanyahu, R. Silverman, A. Wu, “An optimal algorithm for approximate nearest neighbor searching in fixed dimensions”, Journal of the ACM, 1998, 45:6, pp. 891-923.
Deformable Model with Non-euclidean Metrics
Benjamin Taton and Jacques-Olivier Lachaud
Laboratoire Bordelais de Recherche en Informatique (LaBRI), 351, cours de la Libération, 33405 TALENCE, FRANCE
Abstract. Deformable models like snakes are a classical tool for image segmentation. Highly deformable models extend them with the ability to handle dynamic topological changes, and therefore to extract arbitrary complex shapes. However, the resolution of these models largely depends on the resolution of the image. As a consequence, their time and memory complexity increases at least as fast as the size of input data. In this paper we extend an existing highly deformable model, so that it is able to locally adapt its resolution with respect to its position. With this property, a significant precision is achieved in the interesting parts of the image, while a coarse resolution is maintained elsewhere. The general idea is to replace the Euclidean metric of the image space by a deformed non-Euclidean metric, which geometrically expands areas of interest. With this approach, we obtain a new model that follows the robust framework of classical deformable models, while offering a significant independence from both the size of input data and the geometric complexity of image components. Keywords: image segmentation, deformable model, non-Euclidean geometry, topology adaptation, optimization
1 Introduction
Deformable models were first introduced in the field of image segmentation by Kass, Witkin and Terzopoulos [5]. Their behavior is governed by the minimization of a functional (energy), which depends both on the matching of the curve to the contour and on the smoothness of the curve. This segmentation framework remains valid even with poor quality images with weak contours. Furthermore, the energy formulation is rather intuitive. This makes it easier to include user interaction as well as other kinds of constraints in the segmentation process. The deformable model framework has been extended with various tools that improve both convergence speed and segmentation quality (e.g. [13] and [2]). However, deformable models in their original formulation cannot dynamically change their topology. This means that they can only extract image components with the same topology as the initial model. As a consequence, the model initialization requires hypotheses about its final expected shape. In domains like biomedical image analysis, this kind of hypothesis is sometimes difficult to justify. Indeed, the objects in the images often have complex shapes with several connected components and holes. Furthermore they are likely to be abnormal and hence may have an unexpected topology.
Several methods have been proposed to overcome this limitation. These highly deformable models are able to achieve automated or supervised topology adaptations, based on their geometric evolution. Nevertheless, although they largely extend the application field of deformable models, these new methods generally come with a significant increase in computational cost. The following paragraphs describe some of these techniques and emphasize their time complexities. From now on we will assume that the size of a bi-dimensional image is $n^2$ and that the number of vertices necessary to describe the shape of the image components is O(n). The first approach keeps the energy formulation as well as the explicit shape representation. The idea is to check the model at each step of its evolution. Then, if its topology is no longer consistent, mesh reconfigurations are performed to solve the problems. Delingette and Montagnat [4] analyse the intersections of their model with a regular grid. Detecting special configurations allows them to detect and handle topological changes. McInerney and Terzopoulos [9] propose a similar method that uses a simplicial regular subdivision of the space to detect and solve topological problems. This algorithm is less efficient than Delingette's ($O(n^2)$ per iteration instead of O(n)), but has the advantage of being easy to extend to the three-dimensional case. The authors of [6] propose a different approach based on distance constraints on the edges of the deformable model. Although they formulate it for the three-dimensional case, the principle is valid for the bi-dimensional case too. With an appropriate data structure the time complexity of this algorithm is reduced to O(n log n) per iteration. Another approach [10] consists in formulating the problem in terms of front propagation instead of minimizing an energy. In this context the model is no longer represented explicitly: it is viewed as a particular level set of a scalar function f defined on the image space and which evolves with time. This level set propagates in the image space with respect to two main constraints: (i) the propagation slows down in the neighborhood of high image gradients, (ii) the level set propagates faster in places where its curvature is high (this is to preserve the contour smoothness, see [1] and [8] for details). These constraints are expressed as differential equations involving f. Iteratively solving these equations makes the level set approach image components. With this formalism, the topological changes are automatically embedded in the evolution of f. In addition it is very easy to extend this kind of model to higher dimensional spaces. Nevertheless these advantages come with heavy computational costs: theoretically, the new values of f have to be re-computed over the whole image at each step of the algorithm (each step would thus cost $O(n^2)$ operations). However optimizations based on quad-trees or on the narrow band algorithm may reduce this complexity to O(n log n) [11,12]. Note also that user interactions or other constraints are more difficult to embed in this kind of model compared to explicit ones. Finally, in a third approach, the energy minimization problem is formulated in a purely discrete context [7]. Image components are viewed as sets of pixels (or voxels). There is therefore no need to write specific algorithms to handle topology changes.
The energy of the components is computed according to the local configuration of each element of its boundary (called a bel). The model
evolution is then performed by adding or removing pixels (or voxels) so as to minimize the global energy. Complexity per iteration linearly depends on the size of the model boundary. Namely, for a bi-dimensional image the time cost is O(n) bel energy computations for each iteration. However, the time needed to compute one bel energy is constant but rather important. This argumentation shows that the computational costs of highly deformable models largely depend on the size of input data. The images produced by acquisition devices have higher and higher resolutions. As a consequence, the computational cost of these segmentation algorithms becomes prohibitive. In this paper we propose a highly deformable model which dynamically adapts its resolution depending on its position in the image, and automatically changes its topology according to the changes of its geometry. A fine resolution is thus achieved in the interesting parts of the image while a coarse one is kept elsewhere. With this technique, the complexity of the model is made significantly more independent of the image resolution. More precisely, the model we propose is an extension of the one presented in [6]. In this model, the topological consistency is maintained using distance estimations. However, to work properly, the model needs to have a regular density. To achieve adaptive resolution, our idea is to replace the Euclidean metric with a locally deformed metric that geometrically expands the interesting parts of the image. The original model was designed to work with three-dimensional images. In this paper, while we propose to extend a bi-dimensional version of this model, our real goal is to later extend it to segment three-dimensional images. Particular attention is thus paid to using only methods that remain available in three-dimensional spaces. In the first section of this paper, we describe in more detail the model we propose to extend. In the second section we define a more general notion of distance that is used in the third section to define our new model. The last section deals with the choice of the metric with respect to the image.
2 Initial Model
In this section we first recall the classical snake model, as it was introduced by Kass, Witkin and Terzopoulos [5]. Then we describe the highly deformable model we propose to extend. The last part of the section explains how changing the metric of the image space allows us to locally change the resolution of this model.
2.1 Classical Snake Model Formulation
Snakes are defined as curves $C : [0, 1] \rightarrow \mathbb{R}^2$ that evolve in the image space so as to minimize an energy functional expressed as
$$E(C) = \int_0^1 \big( E_{image}(C(u)) + E_{internal}(C, u) \big)\, du . \qquad (1)$$
The energy functional is composed of two terms: the first term Eimage ensures that the curve matches the contours of the image, the second term Einternal
forces the curve to remain rather smooth. The contour matching is ensured by choosing $E_{image}$ so as to make it low in the neighborhood of image gradients. The smoothness of the curve is obtained by penalizing both its length and curvature. These remarks lead to the following choice for these energies:
$$E_{image}(x, y) = -\|(\nabla G * I)(x, y)\|^2 , \qquad E_{internal}(C, u) = \frac{\alpha}{2}\Big\|\frac{dC}{du}\Big\|^2 + \frac{\beta}{2}\Big\|\frac{d^2C}{du^2}\Big\|^2 ,$$
where I and G respectively denote the image function and a Gaussian smoothing filter. To find the curve that minimizes (1), the most common method consists in discretizing the curve. Once this has been done, each discretized point is considered as a particle having its own energy, and therefore evolving under forces deriving from this energy. Namely these forces are
– an image-based force which attracts the particle towards image gradients,
– an elastic force which attracts each particle towards its neighbors,
– a curvature minimizing force which attracts each particle toward the straight line joining its neighbors,
– a friction force that makes the system more stable.
This point of view has a significant advantage over the first formulation: it makes it possible to introduce new forces that speed up the convergence of the model or improve the segmentation (for examples see [13] and [2]). Moreover, such a mechanical system is easy to extend to three dimensions [3,6]. To determine the displacement of each particle at a given step, many authors directly apply Newton's laws of motion and then solve the resulting differential equations (e.g. [13,2,4]). However, we propose to use the Lagrangian approach to express the dynamics of each particle, since it can be formulated in non-Euclidean spaces. Now, consider a particle of mass m, its position being described by $x = (x_1, \ldots, x_n)$, moving under the action of the force F. Let T denote its kinetic energy ($T = \frac{1}{2} m\, \dot{x} \cdot \dot{x}$). Let $\delta x$ be a small variation of the particle trajectory ($\delta x(t_1) = \delta x(t_2) = 0$). Then, the dynamics of the particle is described by the least action principle, which postulates (for all possible $\delta x$):
$$\int_{t_1}^{t_2} (\delta T + F \cdot \delta x)\, dt = 0 . \qquad (2)$$
In particular, it has to be true for $\delta x = (0, \ldots, 0, \delta x_i, 0, \ldots, 0)$. Equation (2) may then be written as
$$\forall i, \quad \int_{t_1}^{t_2} \Big( \frac{\partial T}{\partial x_i}\,\delta x_i - \frac{d}{dt}\frac{\partial T}{\partial \dot{x}_i}\,\delta x_i + F \cdot \delta x \Big)\, dt = 0 . \qquad (3)$$
In a Euclidean space with the usual dot product, $F \cdot \delta x = F_i\, \delta x_i$. Hence equation (3) leads to
$$\forall i, \quad \frac{\partial T}{\partial x_i} - \frac{d}{dt}\frac{\partial T}{\partial \dot{x}_i} + F_i = 0 . \qquad (4)$$
Replacing T with $\frac{1}{2} m \sum_{i=1}^{n} \dot{x}_i^2$ (which is true in Euclidean spaces only) and writing the equations for each possible value of i results in the well-known Newton's laws of motion: $m\ddot{x} = F$. The Lagrangian formalism may thus be considered as a variational principle equivalent to Newton's laws, but with the advantage of being independent of the Euclidean structure of the space.
2.2 Highly Deformable Model
As said in the introduction, classical snake models are unable to handle topological changes. Highly deformable models are extensions to the basic snake model that overcome this limitation. This section describes the bi-dimensional version of the highly deformable model we propose to extend [6], and how it is able to dynamically adapt its topology to keep it consistent with its geometry.
General Description. Our deformable model is an oriented (not necessarily connected) closed polygonal curve that evolves in the image space. To preserve a consistent orientation of the model, it is necessary to avoid self-intersections. The main idea is to detect them before they occur. When a problem is expected to arise, it is solved using appropriate local topological operators. In this way, we ensure that the orientation of the model remains consistent at each step of its evolution. The following subsections detail each step of this algorithm.
Collision Detection. Detection is made possible by imposing distance constraints between the vertices of the model. Let δ and ζ be real numbers satisfying 0 < δ and 2 ≤ ζ. Then, suppose that for each edge (a, b) of the model, the Euclidean distance $d_E(a, b)$ between a and b satisfies
$$\delta \le d_E(a, b) \le \zeta\delta . \qquad (5)$$
That means that all vertices are rather evenly spaced on the model mesh (provided that ζ is close to 2). Now let us show how this is helpful to detect self-intersections of the model. Let (u, v) be an edge of the deformable model. Suppose that a vertex p crosses over (u, v). Then it is easy to verify that there is a time interval during which the distance from p to the edge (u, v) is less than a particular threshold. It follows that, during this time interval, the distance from p to either u or v is less than $\lambda_E \zeta\delta$, where $\lambda_E$ is an appropriate real number (see Fig. 1a-b). Theoretically, without additional hypothesis, $\lambda_E$ should be greater than 1 to ensure that no self-intersection may occur without being detected. However, if the initial model configuration is consistent, and if the motion of the particles is small enough, smaller values of $\lambda_E$ may be chosen. In particular, if the vertex speed is assumed to be less than $v_{max}$, $\lambda_E = 1/2 + v_{max}/\zeta\delta$ is a suitable choice. With this property, if the inequality (5) is satisfied at each step of the evolution of the model, then self-intersections are detected by checking the inequality
$$\lambda_E \zeta\delta \le d_E(u, v) \qquad (6)$$
Fig. 1. Collision detection and handling: a point p cannot cross over (u, v) without entering circle Cu or Cv (a) and (b). When a collision is detected (b), a local reconfiguration is performed (c): new vertices are created around v and p. These vertices are connected in the appropriate way. Then the vertices v and p are deleted.
for each pair of non-neighbor vertices of the model. Each time this inequality is not verified, a local reconfiguration of the model is performed in order to restore the constraint. Let us describe these reconfigurations.
Local Reconfigurations. As said in the previous paragraph, two kinds of constraints have to be maintained:
– constraints between neighbor vertices, given by (5),
– constraints between non-neighbor vertices, given by (6).
Consider an edge of the deformable model that does not respect the constraint (5). If it is shorter than the δ threshold, then the constraint is recovered by merging the two vertices at the extremities of the edge. Symmetrically, if the edge is longer than the ζδ threshold, then it is split into two equal edges by inserting a new vertex in the middle of the original edge. Note that this kind of transformation affects the local resolution of the model, but not its topology. If we now consider a pair of vertices that does not respect (6), then the transformations described in Fig. 1c have to be performed. The topology of the model is thus changed. Detecting all these pairs of vertices with a naive method would have a time complexity of $O(n^2)$ for a model with O(n) vertices. Nevertheless a hierarchical subdivision of space with quad-trees reduces the costs to an average time complexity of O(n log n). It is important to note that this approach to topology adaptation also works for three-dimensional image segmentation. In this context, the deformable model is a triangulated surface, and with an adapted value of $\lambda_E$, all the statements made before remain true.
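One possible organization of this maintenance step is sketched below. It is illustrative only: the quad-tree acceleration and the topological surgery of Fig. 1c are omitted, and the vertex representation, parameter names, and return convention are our own choices.

```python
import math

def maintain_constraints(model, delta, zeta, lam):
    """One maintenance pass over a closed polygonal model (list of (x, y) vertices).

    Edge-length constraint (5): delta <= d(a, b) <= zeta * delta.
    Collision constraint  (6): lam * zeta * delta <= d(u, v) for non-neighbors.
    Returns the non-neighbor pairs violating (6), i.e. those that would require
    the reconfiguration of Fig. 1c (not implemented here)."""

    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    # Split edges that became too long, merge edges that became too short.
    i = 0
    while i < len(model):
        a, b = model[i], model[(i + 1) % len(model)]
        d = dist(a, b)
        if d > zeta * delta:                        # split: insert the midpoint
            model.insert(i + 1, ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0))
        elif d < delta and len(model) > 3:          # merge the two endpoints
            model[i] = ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)
            del model[(i + 1) % len(model)]
        else:
            i += 1

    # Naive O(n^2) collision scan between non-neighbor vertices.
    collisions = []
    n = len(model)
    for i in range(n):
        for j in range(i + 2, n):
            if i == 0 and j == n - 1:               # neighbors on a closed curve
                continue
            if dist(model[i], model[j]) < lam * zeta * delta:
                collisions.append((i, j))
    return collisions
```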
2.3 Influence of Distance Estimations
As said before, the evolution of the model, and especially its resolution, largely depends on distance estimations. When the length of an edge is too small, vertices are merged, and when the length of an edge is too large, new vertices are inserted
in the mesh. Therefore, the parameter δ determines the resolution of the model: the smaller it is, the greater the resolution becomes. Since (5) has to be true for each edge of the model, increasing the resolution of the deformable model only in a restricted interesting part of the image cannot be achieved without increasing the resolution everywhere, that is, even where it is not needed. This implies a significant increase of the time and memory costs of the segmentation algorithm. Suppose now that we change the way distances are measured instead of changing the value of δ. Namely, suppose that distances are overestimated in the interesting parts of the image, while they are underestimated elsewhere. In all areas where distances are overestimated, the lengths of the edges seem to be greater so that they tend to exceed the ζδ threshold. Therefore, new vertices are inserted to split the edges. As a consequence, the model resolution increases. Symmetrically, in places where distances are underestimated, the lengths of the edges seem to be smaller. To prevent the edge length from falling under δ, neighbor vertices are merged, so that the resolution decreases. The extension we propose is based on these remarks. Our idea is to replace the Euclidean metric with a deformed one, in order to geometrically expand the areas of interest. With this new definition of distances, it becomes possible to perceptibly increase the resolution of the model in restricted parts of the image without significantly increasing the cost of the segmentation algorithm. Nevertheless, changing the distance notion raises several problems:
– first we have to find a more general definition of distances and to design algorithms to measure these distances between arbitrary points,
– then the motion equations of the deformable model particles have to be rewritten in this context,
– lastly, it is important to discuss how distances are defined with respect to the image.
The following paragraphs successively deal with each of these problems.
3 Riemannian Geometry
The new definition of distances is based on Riemannian geometry. Giving all the background on this mathematical theory is beyond the scope of this article, therefore we only describe the notions which are used to define our new deformable model. Moreover, to clarify the presentation, everything will be described in $\mathbb{R}^2$. The given statements however remain true in higher dimensional spaces.
3.1 Metrics
In a Euclidean space, the length of an elementary displacement ds = (dx, dy) is given by the relation $\|ds\| = \sqrt{dx^2 + dy^2}$.
It only depends on the values of dx and dy. In particular, the origin of the displacement has no influence on its length. In a non-Euclidean space this is no longer true. Suppose for example that the rectangle $[-\pi, \pi[ \times ]-\frac{\pi}{2}, \frac{\pi}{2}[$ is the map of a sphere, obtained with the usual parameterization using the longitude θ and latitude φ. Then the length of an elementary displacement ds = (dθ, dφ) changes whether the displacement occurs near one pole or near the equator of the sphere (actually it may be written $\|ds\| = \sqrt{R^2 \cos^2\varphi\, d\theta^2 + R^2 d\varphi^2}$, where R denotes the sphere radius). To measure the length of arcs in a non-Euclidean space, a mapping which measures the lengths of elementary displacements anywhere in this space must be given. This mapping is derived from the notion of metric. We call a metric, or a Riemannian metric, a mapping g that associates a dot product $g_{(x,y)}$ with each point (x, y) of the space, provided that g is of class $C^1$. At a given point of the space, the dot product $g_{(x,y)}$ induces a norm which may be used to measure the lengths of elementary displacements starting from (x, y): $\|ds\| = g_{(x,y)}(ds, ds)^{\frac{1}{2}}$. Then, let $c : [0, 1] \rightarrow \mathbb{R}^2$ be an arc. Its length L(c) may be expressed as
$$L(c) = \int_0^1 g_{c(u)}(c'(u), c'(u))^{\frac{1}{2}}\, du . \qquad (7)$$
Metrics are conveniently defined in matrix form. Indeed, $g_{(x,y)}$ is a dot product, and hence it is a symmetric positive definite bilinear form. In the bi-dimensional case, it may therefore be expressed as
$$(u, v) \mapsto u^T \times \begin{pmatrix} g_{11}(x, y) & g_{12}(x, y) \\ g_{12}(x, y) & g_{22}(x, y) \end{pmatrix} \times v , \qquad (8)$$
where the $g_{ij}$ are functions of class $C^1$. From now on, in order to make the equations simpler and since no confusion may arise, $g_{ij}$ will denote $g_{ij}(x, y)$. Note that if we choose the metric given by the identity matrix, an elementary displacement ds = (dx, dy) has a length given by $\|ds\| = \sqrt{dx^2 + dy^2}$. Namely, the Euclidean metric may be interpreted as a particular Riemannian metric.
3.2 New Definition of Distances
Given this metric definition, it becomes easy to build a new definition of the distances between two points: the Riemannian distance between two points A and B is denoted dR (A, B) and its value is given by
$$d_R(A, B) = \inf \{\, L(c) \mid c \in C^1([0, 1], \mathbb{R}^2),\ c(0) = A,\ c(1) = B \,\}$$
where $C^1([0, 1], \mathbb{R}^2)$ denotes the set containing all the arcs of class $C^1$ in the $\mathbb{R}^2$ space. Informally speaking, the Riemannian distance between two points is the length of one shortest path between these points. When such a path exists, which is always true in a compact space, it is called a geodesic.
3.3 Effective Distance Estimations
Although this definition complies with intuition, its major drawback is that computing exact distances is difficult. Indeed, contrary to what occurs in the Euclidean space, shortest paths are no longer straight lines. Finding a minimal path between two given points may be efficiently achieved by using a relaxation method to minimize the discretized version of (7). The length of the discretized path is then easy to compute. However, one may verify that in an area of the space where g remains constant, geodesics are straight lines. Since g is of class $C^1$, g may be considered as constant on small neighborhoods of space. Therefore, the geodesic connecting two given points may be viewed as a straight line, provided these two points are close enough to each other (which is the case for neighbor vertices of our model).
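Under this locally-constant assumption, the Riemannian length of a short edge reduces to a single matrix evaluation, as in the sketch below. It is a sketch only: `metric(p)` is assumed to return the 2×2 matrix of (8) at point p, and evaluating it at the edge midpoint is our choice.

```python
import numpy as np

def riemannian_edge_length(a, b, metric):
    """Approximate d_R(a, b) for two close points: with a locally constant
    metric the geodesic is close to the straight segment, whose length is
    sqrt((b - a)^T G (b - a)) with G evaluated at the midpoint."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    g = metric((a + b) / 2.0)      # 2x2 symmetric positive definite matrix
    d = b - a
    return float(np.sqrt(d @ g @ d))

def path_length(points, metric):
    """Length of a discretized path, i.e. the discretized version of Eq. (7)."""
    return sum(riemannian_edge_length(p, q, metric)
               for p, q in zip(points[:-1], points[1:]))
```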
4 New Formulation of the Deformable Model
Changing the space metric has two consequences, each of which is described in the following subsections:
– the motion equations are modified, which affects the model dynamics,
– the collision detection method has to take the metric change into account.
4.1 New Laws of Motion
To define the new dynamics of our deformable model, we have to write the differential equations which describe the motion of each particle. Therefore, we rewrite the least action principle (2) with the new dot product defined by the metric. With this dot product, we have:
$$T = \frac{1}{2} m\, \dot{x} \cdot \dot{x} = \frac{1}{2} m \sum_{i,j=1}^{n} g_{ij}\, \dot{x}_i \dot{x}_j , \qquad F \cdot \delta x = \sum_{i,j=1}^{n} g_{ij}\, F_i\, \delta x_j .$$
Choosing $\delta x = (0, \ldots, 0, \delta x_i, 0, \ldots, 0)$ leads to
$$\forall i, \quad \int_{t_1}^{t_2} \Big( \frac{\partial T}{\partial x_i}\,\delta x_i - \frac{d}{dt}\frac{\partial T}{\partial \dot{x}_i}\,\delta x_i + \sum_{k} g_{ki} F_k\, \delta x_i \Big)\, dt = 0 . \qquad (9)$$
Once $\delta x_i$ has been factored, one may deduce
$$\forall i, \quad \frac{\partial T}{\partial x_i} - \frac{d}{dt}\frac{\partial T}{\partial \dot{x}_i} + \sum_{k=1}^{n} g_{ki} F_k = 0 . \qquad (10)$$
Using now the new expression of T leads to the motion equations
$$\forall k, \quad m\ddot{x}_k + m \sum_{i,j=1}^{n} \Gamma_{ij}^{k}\, \dot{x}_i \dot{x}_j = F_k , \qquad (11)$$
where the $\Gamma_{ij}^{k}$ coefficients are known as the Christoffel symbols, and are given by
$$\Gamma_{ij}^{k} = \frac{1}{2} \sum_{l=1}^{n} g^{kl} \Big( \frac{\partial g_{il}}{\partial x_j} + \frac{\partial g_{lj}}{\partial x_i} - \frac{\partial g_{ij}}{\partial x_l} \Big) \qquad (12)$$
($g^{kl}$ denotes the coefficient at position (k, l) in the inverse matrix of $(g_{ij})_{1 \le i,j \le n}$). Note that Newton's laws are obtained from (11) by choosing the identity matrix for $g_{ij}$ (i.e. by making the space Euclidean). From these equations our new model can be defined as a set of moving particles, each of which follows (11). Its behavior is thus totally determined by the choice of F. As usual, F is the sum of the image force, the elastic force, the curvature force, and the friction force. These forces do not need to be adapted to the Riemannian framework, since the metric is already embedded into the motion equations. The only necessary change is to replace the Euclidean distance measurement with the Riemannian one in the computations of the elastic forces. Of course, new forces such as those proposed in [13] or [2] are likely to be introduced to enhance the behavior of the model.
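For illustration, the Christoffel symbols of (12) can be estimated by finite differences and plugged into an explicit integration of (11), as sketched below. This is not the integration scheme used by the authors; the central-difference approximation, the step sizes, and the explicit Euler update are assumptions made only to make the equations concrete.

```python
import numpy as np

def christoffel(metric, x, h=1e-3):
    """Finite-difference estimate of the Christoffel symbols of Eq. (12) at x.

    metric(x) returns the n x n matrix (g_ij) at position x.
    Returns an array gamma[k, i, j]."""
    x = np.asarray(x, float)
    n = x.size
    g_inv = np.linalg.inv(metric(x))
    # dg[m, i, j] = d g_ij / d x_m, by central differences
    dg = np.empty((n, n, n))
    for m in range(n):
        e = np.zeros(n)
        e[m] = h
        dg[m] = (metric(x + e) - metric(x - e)) / (2.0 * h)
    gamma = np.empty((n, n, n))
    for k in range(n):
        for i in range(n):
            for j in range(n):
                gamma[k, i, j] = 0.5 * sum(
                    g_inv[k, l] * (dg[j, i, l] + dg[i, l, j] - dg[l, i, j])
                    for l in range(n))
    return gamma

def euler_step(x, v, force, mass, metric, dt):
    """One explicit Euler step of Eq. (11):
       m x_k'' + m sum_ij Gamma^k_ij x_i' x_j' = F_k.
    `force(x, v)` is assumed to return the total force as a NumPy vector."""
    v = np.asarray(v, float)
    gamma = christoffel(metric, x)
    accel = force(x, v) / mass - np.einsum('kij,i,j->k', gamma, v, v)
    return x + dt * v, v + dt * accel
```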
4.2 Topology Constraints
In this section we show that the constraint (6) remains a good way of detecting where and when topology changes have to be performed. We will thus assume that the regularity constraint (5) written for a Riemannian metric holds:
$$\delta \le d_R(u, v) \le \zeta\delta , \qquad (13)$$
for each edge of the deformable model. Now, let us prove that self-intersections may be detected in the same way as before, namely, by comparing the distances between non-neighbor vertices with a particular threshold. To do so, we use a relation that compares the Riemannian metrics to the Euclidean metric. We need to consider the lower and upper bounds of the eigenvalues of the metric matrix used in (8). They will respectively be denoted $\mu_{min}$ and $\mu_{max}$. With these notations, the relation between the Euclidean and the Riemannian distances is expressed as
$$\mu_{min}\, d_E(M, N)^2 \le d_R(M, N)^2 \le \mu_{max}\, d_E(M, N)^2 . \qquad (14)$$
This inequality is very useful since we can deduce from it that the constraint (6) can still be used to detect topology changes even in non-Euclidean spaces. Furthermore, it allows us to check efficiently whether this constraint holds. Let (u, v) be an edge of our deformable model, and p be a vertex that moves towards (u, v). Let m denote the point of (u, v) which minimizes $d_R(m, p)$. Without loss of generality, we may assume that $d_R(m, u) \le d_R(m, v)$. The triangle inequality is expressed as $d_R(p, u) \le d_R(p, m) + d_R(m, u)$. From (13) and (14), and from the fact that $d_E(u, m) \le \frac{1}{2} d_E(u, v)$, we obtain
$$d_R(p, u) \le d_R(p, m) + \frac{\mu_{max}}{2\mu_{min}}\, d_R(u, v) \le d_R(p, m) + \frac{\mu_{max}}{2\mu_{min}}\, \zeta\delta .$$
If p crosses over (u, v), there is a time interval during which $d_R(p, m) \le \varepsilon$. Therefore, comparing $d_R(p, u)$ with $\varepsilon + \frac{\mu_{max}}{2\mu_{min}} \zeta\delta$ detects self-intersections. If $v_{max}$ is the upper bound of the particle displacement between two steps of the model evolution, a good choice for $\varepsilon$ is $v_{max}$. The constraint between non-neighbor vertices is hence expressed as
$$\frac{\mu_{max}}{2\mu_{min}}\, \zeta\delta + v_{max} \le d_R(u, v) . \qquad (15)$$
With $\lambda_R = \frac{\mu_{max}}{2\mu_{min}} + \frac{v_{max}}{\zeta\delta}$ this is written as
$$\lambda_R\, \zeta\delta \le d_R(u, v) . \qquad (16)$$
Note that with the Euclidean metric, µmin = µmax , so that we get back the value λE that was used for the original model.
5 Metric Choice
In this section, we explain how the knowledge of the local metric eigenvalues and eigenvectors is useful to properly choose the metric according to the image.
5.1 Metrics Properties
In this paragraph we adopt a local point of view, and it is assumed that metrics are smooth enough to be considered as locally constant. At a given point of the image space, the metric is given as a symmetric positive definite matrix G with two eigenvectors $v_1$ and $v_2$ and their associated eigenvalues $\mu_1$ and $\mu_2$. If the vectors $v_1$ and $v_2$ are chosen with a unitary Euclidean length, then the Riemannian length of $x = x_1 v_1 + x_2 v_2$ is given as
$$\|x\|_R = \sqrt{\mu_1 x_1^2 + \mu_2 x_2^2} . \qquad (17)$$
Then, as shown in Fig. 2, the local unit ball is an ellipse with an orientation and a size that depend on the local spectral decomposition of G. One may easily verify that the unit balls of a metric with two equal eigenvalues are Euclidean unit balls up to a scale factor. These metrics are thus called isotropic metrics, while metrics with two different eigenvalues are called anisotropic metrics. Reciprocally, it is easy to get the symmetric positive definite matrix corresponding to given eigenvectors and eigenvalues. The eigenvectors are used to define locally the main directions of the metric, and the eigenvalues determine to what extent the space is expanded along these directions. In the next section we explain how we use these statements to choose the image space metrics.
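The reciprocal construction mentioned above – rebuilding G from given eigenvectors and eigenvalues – and the norm of (17) can be written compactly in matrix form. The sketch below is illustrative only and assumes the 2-D case with unit eigenvectors.

```python
import numpy as np

def metric_from_spectrum(v1, mu1, mu2):
    """Rebuild the symmetric positive definite matrix G whose unit eigenvectors
    are v1 and its 90-degree rotation, with eigenvalues mu1 and mu2."""
    v1 = np.asarray(v1, float) / np.linalg.norm(v1)
    v2 = np.array([-v1[1], v1[0]])
    return mu1 * np.outer(v1, v1) + mu2 * np.outer(v2, v2)

def riemannian_norm(x, G):
    """||x||_R as in Eq. (17), expressed directly with the matrix G."""
    x = np.asarray(x, float)
    return float(np.sqrt(x @ G @ x))
```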
5.2 Metric Definition over the Image
Isotropic Metrics. Isotropic metrics are the easiest to use and define (they may indeed be written $g(x, y) \times I$, where I denotes the identity matrix). Therefore,
Fig. 2. Local behavior of metrics. The unit ball induced by the norm G is an ellipse with its semi-axes oriented according to $v_1$ and $v_2$, of lengths $1/\sqrt{\mu_1}$ and $1/\sqrt{\mu_2}$. The length of a vector depends on its direction: $\|v_M\|_R = k\|v_1\|_R = k\sqrt{\mu_1}$, $\|v_m\|_R = k\|v_2\|_R = k\sqrt{\mu_2}$, $\|v_\theta\|_R = k\sqrt{\mu_1 \cos^2\theta + \mu_2 \sin^2\theta}$.
Fig. 3. Results with a user-defined metric. The images (a), (b) and (c) show different evolution steps of the segmentation algorithm. The image (d) shows details of the upper left and lower right parts of the image (c), and emphasizes the resolution difference between the two parts of the image. The metric is isotropic and has high values in the upper right corner of the image. In the lower left corner, it is equivalent to the Euclidean metric. This is shown in picture (e), where the circles represent the local unit ball of the metric. Note that this experiment also validates the automated topological changes of the model in a space with a deformed metric.
in a first approach, we will only use metrics of this kind. Two methods to define metrics are then considered. The first method is to let the user himself choose the interesting parts of the image. The metric g is set to high values in these areas, and remains low elsewhere (namely equal to 1, to get back the classical Euclidean behavior of the model). An example of segmentation using this method is shown in Fig. 3. The second method is to automatically determine the areas of interest. In the deformable contours framework, these places are typically characterized by high values of the gradient norm. Therefore, a good choice for the function g is $g(x, y) = 1 + \|\nabla I * G_\sigma\|$, where $G_\sigma$ denotes the usual Gaussian smoothing filter. This ensures that the model resolution increases near strong image contours, while remaining coarse elsewhere.
Anisotropic Metrics. The metrics previously described are simple and intuitive. However, they do not take advantage of all the properties of Riemannian metrics. With anisotropic metrics, space deformations depend on the direction.
The following paragraph shows how that kind of metric improves the segmentation. Consider the metric g defined as the Euclidean metric in the image places where the gradient norm is low or null. Suppose that elsewhere, g is the metric with the spectral decomposition $(v_1, \mu_1)$, $(v_2, \mu_2)$, with $v_1 = \nabla I$, $\mu_1 = \|\nabla I\|$, $v_2 = v_1^{\perp}$ and $\mu_2 = 1$. In places where there are no strong contours, the deformable model behaves as if the metric were Euclidean. Therefore, no new vertices are created, and the model resolution remains coarse. In the neighborhood of a contour, the deformable model mesh may either cross over or follow the contour. Suppose that it crosses over the contour. In a small neighborhood the mesh is roughly parallel to $v_1$. With our choice for $\mu_1$, edge lengths are overestimated. Therefore the model locally increases its resolution, which provides it with more degrees of freedom. In contrast, suppose that the model mesh follows the contour. Then the mesh is roughly parallel to $v_2$ (i.e. orthogonal to $\nabla I$), and with our choice for $\mu_2$, distances are estimated as in the Euclidean case. Consequently, the model resolution remains coarse, and the number of vertices used to describe the image component remains low.
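The two metric definitions of this section can be turned into per-pixel fields as in the sketch below. It is illustrative only: SciPy's Gaussian filter is assumed to be available, the smoothing scale and the gradient threshold standing for "low or null" are arbitrary choices of ours, and the matrices are rebuilt with the spectral composition shown earlier.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def isotropic_metric_field(image, sigma=2.0):
    """Scalar field g(x, y) = 1 + ||grad(I * G_sigma)||; the metric matrix at a
    pixel is g(x, y) times the 2x2 identity."""
    gy, gx = np.gradient(gaussian_filter(image.astype(float), sigma))
    return 1.0 + np.hypot(gx, gy)

def anisotropic_metric_field(image, sigma=2.0, eps=1e-2):
    """Per-pixel 2x2 metric with v1 = grad I (unit), mu1 = ||grad I||,
    v2 = v1 rotated by 90 degrees, mu2 = 1; identity where the gradient is weak."""
    gy, gx = np.gradient(gaussian_filter(image.astype(float), sigma))
    norm = np.hypot(gx, gy)
    h, w = image.shape
    G = np.tile(np.eye(2), (h, w, 1, 1))        # Euclidean metric by default
    for y, x in zip(*np.nonzero(norm > eps)):
        v1 = np.array([gx[y, x], gy[y, x]]) / norm[y, x]
        v2 = np.array([-v1[1], v1[0]])
        G[y, x] = norm[y, x] * np.outer(v1, v1) + 1.0 * np.outer(v2, v2)
    return G
```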
Fig. 4. Segmentation with different automatically computed metrics: the model is initialized around the figure. Image (a) shows the result obtained with the Euclidean metric with the lowest model resolution that allows the two connected components of the image to be distinguished. Image (c) shows the result obtained with the isotropic Riemannian metric described in image (b). Picture (e) shows the segmentation obtained with the anisotropic Riemannian metric described in (d).
The different methods considered before are tested and compared on computer-generated images in Fig. 4 and 5. Additional results with an MR brain image are shown in Fig. 6.
6 Conclusion
We have proposed a bi-dimensional highly deformable model that dynamically and locally adapts its resolution. Hence, a fine resolution is achieved in the interesting parts of the image while a coarse resolution is kept elsewhere. In this way, the complexity of the segmentation algorithm is made significantly more independent of the size of input data. In a bi-dimensional image, halving the model resolution divides the number of vertices by two. For three-dimensional
Metric          Time (s)   Complexity
(a) Euclidean       3.88      186,567
(b) Isotropic      49.72      152,860
(c) Anisotropic     8.33       55,258
Fig. 5. Statistics: the graph represents the evolution of the number of vertices of the model with respect to the iteration number, and for different metrics. The segmented image is the same as in Fig. 4. This emphasizes that our new model is able to extract image components with significantly fewer vertices than the initial model, especially if anisotropic metrics are used (curve c on the graph). It also shows that the use of deformed metrics speeds up the model convergence. Computation times remain higher than with a Euclidean metric; however, many optimizations are likely to be performed to reduce the computational cost per vertex.
1400
a
1200
1000
800
600
b
400
200
a
b
c
0
250
500
750
1000
1250
1500
1750
2000
2250
2500
Fig. 6. Results with a biomedical image. These images show the segmentation of image (a) using a Euclidean metric (b) and an automatically computed anisotropic Riemannian metric (c). The segmentation quality is rather equivalent with the two metrics, whereas the number of vertices used to describe the shape is far smaller with the anisotropic metric (only 596 vertices for image (c), instead of 1410 for image (b)). The graph shows the evolution of the number of vertices with the iteration number, and illustrates that using deformed metrics speeds up the model convergence.
image, the same loss of resolution divides the number of vertices by four. Consequently, highly deformable models with adaptive resolution would be even more useful for three-dimensional image segmentation. The model we have extended was initially designed for this purpose. Moreover, all the tools that are used to define our new deformable model remain available in three-dimensional spaces. We are thus currently working on a three-dimensional version of this model.
References
1. V. Caselles, R. Kimmel, and G. Sapiro. Geodesic active contours. In ICCV95, pages 694–699, 1995.
2. L. D. Cohen. On active contour models and balloons. CVGIP: Image Understanding, 53(2):211–218, March 1991.
3. H. Delingette. General object reconstruction based on simplex meshes. Research Report 3111, INRIA, Sophia Antipolis, France, February 1997.
4. H. Delingette and J. Montagnat. New algorithms for controlling active contours shape and topology. In D. Vernon, editor, European Conference on Computer Vision (ECCV'2000), number 1843 in LNCS, pages 381–395, Dublin, Ireland, June 2000. Springer.
5. M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. International Journal of Computer Vision, 1(4):321–331, 1987.
6. J.-O. Lachaud and A. Montanvert. Deformable meshes with automated topology changes for coarse-to-fine 3D surface extraction. Medical Image Analysis, 3(2):187–207, 1999.
7. J.-O. Lachaud and A. Vialard. Discrete deformable boundaries for the segmentation of multidimensional images. In C. Arcelli, L. P. Cordella, and G. Sanniti di Baja, editors, Proc. 4th Int. Workshop on Visual Form (IWVF4), Capri, Italy, volume 2059 of Lecture Notes in Computer Science, pages 542–551. Springer-Verlag, Berlin, 2001.
8. R. Malladi, J. A. Sethian, and B. C. Vemuri. Shape Modelling with Front Propagation: A Level Set Approach. IEEE Trans. on Pattern Analysis and Machine Intelligence, 17(2):158–174, February 1995.
9. T. McInerney and D. Terzopoulos. Medical Image Segmentation Using Topologically Adaptable Snakes. In Proc. of Computer Vision, Virtual Reality and Robotics in Medicine, pages 92–101, Nice, France, 1995. Springer-Verlag.
10. S. Osher and J. A. Sethian. Fronts propagating with curvature dependent speed: algorithms based on Hamilton-Jacobi formulation. Journal of Computational Physics, 79:12–49, 1988.
11. J. A. Sethian. Level Set Methods, volume 3 of Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, 1996.
12. J. Strain. Tree methods for moving interfaces. Journal of Computational Physics, 15(2):616–648, May 1999.
13. C. Xu and J. L. Prince. Snakes, shapes, and gradient vector flow. IEEE Trans. on Image Processing, 7(3):359–369, March 1998.
Finding Deformable Shapes Using Loopy Belief Propagation
James M. Coughlan¹ and Sabino J. Ferreira²
¹ Smith-Kettlewell Institute, 2318 Fillmore St., San Francisco, CA 94115, USA, [email protected]
² Department of Statistics, Federal University of Minas Gerais UFMG, Av. Antonio Carlos 6627, Belo Horizonte MG, 30.123-970 Brasil, [email protected]
Abstract. A novel deformable template is presented which detects and localizes shapes in grayscale images. The template is formulated as a Bayesian graphical model of a two-dimensional shape contour, and it is matched to the image using a variant of the belief propagation (BP) algorithm used for inference on graphical models. The algorithm can localize a target shape contour in a cluttered image and can accommodate arbitrary global translation and rotation of the target as well as significant shape deformations, without requiring the template to be initialized in any special way (e.g. near the target). The use of BP removes a serious restriction imposed in related earlier work, in which the matching is performed by dynamic programming and thus requires the graphical model to be tree-shaped (i.e. without loops). Although BP is not guaranteed to converge when applied to inference on non-tree-shaped graphs, we find empirically that it does converge even for deformable template models with one or more loops. To speed up the BP algorithm, we augment it by a pruning procedure and a novel technique, inspired by the 20 Questions (divide-and-conquer) search strategy, called “focused message updating.” These modifications boost the speed of convergence by over an order of magnitude, resulting in an algorithm that detects and localizes shapes in grayscale images in as little as several seconds on an 850 MHz AMD processor.
1
Introduction
A promising approach to the detection and recognition of flexible objects involves representing them by deformable template models, for example, [7,15,17]. These models specify the shape and intensity properties of the objects. They are defined probabilistically so as to take into account the variability of the shapes and their intensity properties. The flexibility of such models means that we have a formidable computational problem to determine if the object is present in the image and to find where it is located. In simple images, standard edge detection techniques may be sufficient to segment the objects from the background, though we are still faced with the
difficult task of determining how edge segments should be grouped to form an object. In more realistic images, however, large segments of object boundaries will not be detected by standard edge detectors. It often seems impossible to do segmentation without using high level models like deformable templates [7]; an important challenge that remains is to find efficient algorithms to match deformable templates to images. We have devised a Bayesian deformable template for finding shapes in cluttered grayscale images, and we propose an efficient procedure for matching the template to an image based on the belief propagation (BP) algorithm [11] used for inference on graphical models. The template is formulated as a Bayesian graphical model of a two-dimensional shape contour that is invariant to global translations and rotations, and which is designed to accommodate significant shape deformations. BP can be applied directly to our deformable template model, yielding an iterative procedure for matching each part of the deformable template to a location in the image. Like the dynamic programming algorithm used in earlier related work [3], BP is guaranteed to converge to the optimal solution if the graphical model is tree-shaped (i.e. without loops). However, an important advantage of BP over dynamic programming is that it may also be applied to graphical models with one or more loops, which arise naturally in deformable template models. Although convergence is no longer guaranteed when loops are present, researchers have found empirically that BP does converge for a variety of graphical models with loops [10], and our experiments with deformable templates corroborate these findings. Although BP can be applied straightforwardly to the deformable template model, an important additional contribution of our work is to augment BP by two procedures which speed it up by over an order of magnitude. The first procedure, called belief pruning, removes matching hypotheses that are deemed extremely unlikely from further consideration by BP. (This is very similar to the “beam search” technique used to prune states in hidden Markov models (HMM’s) in speech recognition [9].) The speed-up of belief pruning is enhanced greatly by the second procedure, called ”focused message updating.” This procedure, inspired by the 20 questions (divide-and-conquer) search strategy [6], uses a carefully chosen sequence of ”questions” to guide the operation of BP. The first stage of BP is devoted to matching a ”key feature” of the template – corners or T-junctions which are relatively rare in the background clutter. A second key feature is chosen for BP to process next, in such a way that the conjunction of the two features is expected to be an even rarer occurrence in the background. By the time BP has processed these first two key features, substantial pruning has occurred – even though BP may still be far from convergence – which significantly speeds up the subsequent iterations of BP. The result is a deformable template algorithm that detects and localizes shapes in grayscale images in as little as several seconds on an 850 MHz AMD processor. We demonstrate our algorithm on three deformable shape models – the letter “A,” a stick-figure shape, and an open hand contour – for a variety of
images, showing the ability of our algorithm to cope with a wide range of shape deformations, extensive background clutter, and low contrast between target and background. Our work and related work on algorithms for deformable template matching [21], which emphasize the use of a generative Bayesian model that uses separate models to account for shape variability and for variations in appearance, may be compared with other recent work on object detection and shape matching. A variety of object detection techniques have been developed which find instances of deformable shapes by applying sophisticated tests to decide whether each local region in an image contains a target or belongs to the background. In [5, 13] these tests are based on a strategy of posing a series of questions designed to reject background regions as quickly as possible (similar in spirit to the 20 Questions strategy), while [12] relies on powerful statistical likelihood models of the appearance of target and background in order to discriminate between them. Although these object detection algorithms are fast and effective, they lack the explicit models of shape and appearance variability used in Bayesian deformable template models, and it seems difficult to extend these algorithms to highly articulated shapes. Finally, we cite recent work on point set matching [2] and matching using shape context [1], which are very effective matching algorithms for use with point sample targets. These algorithms are designed to perform highly robust matching in limited clutter, while our algorithm is intended to address the problem of visual search in more highly cluttered grayscale images.
2
Deformable Template
Our deformable template is designed to detect and localize a given shape amidst clutter in a grayscale image. The template is defined as a Bayesian graphical model consisting of a shape prior and an imaging model, and an inference algorithm based on BP is used to find the best match between the template and the image. More specifically, the shape prior is a graphical model that describes probabilistically what configurations (shapes) the template is likely to assume. The imaging model describes probabilistically how any particular configuration will appear in a grayscale image. Given an image, the Bayesian model assigns a posterior probability to each possible shape configuration. A BP-based algorithm is used to find the most likely configuration that optimally fits the image data.
2.1 The Shape Prior
The variability of the template shape is modelled by the shape prior, which assigns a probability to each possible deformation of the shape. The shape is represented by a set of points x1 , x2 , . . . , xN in the plane which trace the contours of the shape, and by an associated chain θ1 , θ2 , · · · , θN of normal orientations which describe the direction of outward-pointing normal vectors at the points (see Figure (1)). Each point xi has two components xi and yi . For brevity we
define q_i = (x_i, θ_i), which we also refer to as "node" i. The configuration Q, defined as Q = (q_1, q_2, · · · , q_N), completely defines the shape.
Fig. 1. Left, “letter A” template reference shape, with points drawn as circles and line segments indicating normal directions. Right, the associated connectivity graph of the template, with lines joining interacting points (the dashed line denotes a long-distance connection) and numbers labeling the nodes. Note that the connectivity graph has loops.
The shape prior is defined relative to a reference shape so as to assign high probability to configurations Q which are similar to the reference configuration Q̃ = (q̃_1, q̃_2, · · · , q̃_N) and low probability to configurations that are not. This is achieved using a graphical model (Markov random field) which penalizes the amount of deviation in shape between Q and Q̃ in a way that is invariant to global rotation and translation. (The scale of the shape prior is fixed and we assume knowledge of this scale when we execute our algorithm.)
Deviations in shape are measured by the geometric relationship of pairs of points q_i and q_j on the template, and are expressed in terms of interaction energies U_ij(q_i, q_j). Low interaction energies occur for highly probable shape configurations, for which the geometric relationships of pairs of points tend to be faithful to the reference shape Q̃, and high interaction energies are obtained for improbable configurations (the precise connection to probabilities is formulated in Equation (2)).
Two kinds of shape similarities are used to calculate U_ij(q_i, q_j). First, the relative orientation of θ_i and θ_{i+1} should be similar to that of θ̃_i and θ̃_{i+1}, meaning that we typically expect θ_j − θ_i ≈ θ̃_j − θ̃_i. This motivates the inclusion in the interaction energy U_ij(q_i, q_j) of the following term:

U^C_ij(q_i, q_j) = sin²((θ_j − θ_i − C_ij)/2)

where C_ij = θ̃_j − θ̃_i. This energy attains a minimum when θ_j − θ_i = C_ij (and a maximum when θ_j − θ_i = C_ij + π).
Second, we note that the location of point x_j relative to q_i is also invariant to global translation and rotation. As a result, the location x_j relative to q_i should typically be similar to the location x̃_j relative to q̃_i (and in fact if q_i is known then it is possible to predict the approximate location of x_j). In other words, x̃_i and θ̃_i define a local coordinate system, and the coordinates of x̃_j in that coordinate system are invariant to global translation and rotation. If we
define the unit normal vectors n_i = (cos θ_i, sin θ_i) and ñ_i = (cos θ̃_i, sin θ̃_i) and the vectors perpendicular to them, n⊥_i = (−sin θ_i, cos θ_i) and ñ⊥_i = (−sin θ̃_i, cos θ̃_i), then the dot products of x_j − x_i with n_i and n⊥_i should have values similar to the corresponding values for the reference shape: (x_j − x_i) · n_i ≈ (x̃_j − x̃_i) · ñ_i and (x_j − x_i) · n⊥_i ≈ (x̃_j − x̃_i) · ñ⊥_i. Now we can define the remaining two terms in U_ij(q_i, q_j), the energies

U^A_ij(q_i, q_j) = [(x_j − x_i) · n_i − A_ij]²

and

U^B_ij(q_i, q_j) = [(x_j − x_i) · n⊥_i − B_ij]²

where A_ij = (x̃_j − x̃_i) · ñ_i and B_ij = (x̃_j − x̃_i) · ñ⊥_i. The full interaction energy is then given as:

U_ij(q_i, q_j) = (1/2) {K^A_ij U^A_ij(q_i, q_j) + K^B_ij U^B_ij(q_i, q_j) + K^C_ij U^C_ij(q_i, q_j)}    (1)

where the non-negative coefficients K^A_ij, K^B_ij and K^C_ij define the strengths of the interactions and are set to 0 for those pairs i and j with no direct interactions (the majority of pairs). Higher values of K^A_ij, K^B_ij and K^C_ij produce a stiffer (less deformable) template.
Noting that in general U_ij(q_i, q_j) ≠ U_ji(q_j, q_i), we symmetrize the interaction energy as follows: U^sym_ij(q_i, q_j) = U_ij(q_i, q_j) + U_ji(q_j, q_i). We use the symmetrized energy to define the shape prior:
P(Q) = (1/Z) ∏_{i<j} e^{−U^sym_ij(q_i, q_j)}    (2)
where Z is a normalization constant and the product is over all pairs i and j, with the restriction i < j to eliminate double-counting and self-interactions. Note that the prior is a Markov random field or graphical model that has pairwise connections between all pairs i and j whose coefficients are non-zero. The values of the coefficients were chosen experimentally by stochastically sampling the prior using a Metropolis MCMC sampler, i.e. generating samples from the prior distribution to illustrate what shapes have high probability (see Figures (2,4) for examples). Learning techniques such as maximum likelihood estimation could be employed to determine more accurate values.
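As a concrete illustration (not part of the original paper), the pairwise energies above translate directly into code. The sketch below, in Python with our own naming conventions, evaluates the symmetrized interaction energy U^sym_ij for one pair of nodes given the reference shape; exp(−U^sym_ij) then serves as the pairwise compatibility used later in Equation (6).

```python
import numpy as np

def interaction_energy(q_i, q_j, q_ref_i, q_ref_j, K_A, K_B, K_C):
    """Symmetrized energy U^sym_ij = U_ij + U_ji of Eqs. (1)-(2); q = (x, y, theta)."""
    def one_direction(qa, qb, ra, rb):
        xa, ya, ta = qa; xb, yb, tb = qb
        rxa, rya, rta = ra; rxb, ryb, rtb = rb
        n = np.array([np.cos(ta), np.sin(ta)])          # normal at node a
        n_perp = np.array([-np.sin(ta), np.cos(ta)])    # perpendicular to the normal
        rn = np.array([np.cos(rta), np.sin(rta)])
        rn_perp = np.array([-np.sin(rta), np.cos(rta)])
        d = np.array([xb - xa, yb - ya])                # offset of node b from node a
        rd = np.array([rxb - rxa, ryb - rya])           # same offset on the reference shape
        A = rd @ rn                                     # A_ij, B_ij, C_ij from the reference
        B = rd @ rn_perp
        C = rtb - rta
        U_A = (d @ n - A) ** 2
        U_B = (d @ n_perp - B) ** 2
        U_C = np.sin((tb - ta - C) / 2.0) ** 2
        return 0.5 * (K_A * U_A + K_B * U_B + K_C * U_C)
    return one_direction(q_i, q_j, q_ref_i, q_ref_j) + one_direction(q_j, q_i, q_ref_j, q_ref_i)
```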
2.2 The Imaging Model
The imaging (likelihood) model explains what image data may be expected given a specific configuration. Two forms of image data are derived from the raw grayscale image I(x): an edge map Ie (x) to provide local evidence for the presence or absence of an object boundary at each pixel, and an orientation map φ(x) to provide estimates of the orientation of edges throughout the image. The filter that defines Ie (x) was chosen appropriate to the problem domain,
Fig. 2. Stochastic samples from shape prior.
but in this section we will assume it is the magnitude of the image gradient: I_e(x) = |G ∗ ∇I(x)|, where G(·) is a smoothing Gaussian. The orientation map φ(x) is calculated as arctan(g_y/g_x) where (g_x, g_y) = G ∗ ∇I(x). (Both filters can be evaluated very rapidly on an entire image.) First consider the behavior of the edge map. The likelihood model quantifies the tendency for edge strength values to be high on edge and low off edge. Following the work of Geman and Jedynak [6] and Konishi et al. [8], we define two conditional distributions of edge strength that are empirically obtained: P_on(I_e(x)) = P(I_e(x) | x ON edge) and P_off(I_e(x)) = P(I_e(x) | x OFF edge). See Figure (3) for samples of P_on and P_off in a particular domain. Note that P_off is a simple model of the background which expresses the fact that edge pixels are rare off the target shape. (A more realistic background model, which would capture the fact that background clutter is structured, seems unnecessary for our application.)
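The two conditional distributions can be tabulated by histogramming the edge-strength response on and off hand-labelled object boundaries; the fragment below is only a minimal sketch of that procedure (the bin count, the labelled masks and the function name are our own assumptions, not specifics from the paper).

```python
import numpy as np

def learn_edge_histograms(edge_maps, boundary_masks, n_bins=32):
    """Histogram edge strength on / off labelled boundaries to estimate P_on and P_off."""
    on_vals, off_vals = [], []
    for Ie, mask in zip(edge_maps, boundary_masks):   # mask is True on boundary pixels
        on_vals.append(Ie[mask])
        off_vals.append(Ie[~mask])
    on_vals = np.concatenate(on_vals)
    off_vals = np.concatenate(off_vals)
    bins = np.linspace(0.0, max(on_vals.max(), off_vals.max()), n_bins + 1)
    P_on, _ = np.histogram(on_vals, bins=bins, density=True)
    P_off, _ = np.histogram(off_vals, bins=bins, density=True)
    return P_on, P_off, bins
```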
Fig. 3. Empirical distributions that make up the imaging model. Left panel shows Pon (Ie ) = P (Ie | ON edge), dashed line, and Pof f (Ie ) = P (Ie | OF F edge), solid line, as a function of Ie (x) = |G ∇I(x)|. Note that Pon (.) peaks at a higher edge strength than does Pof f (.), since true edge boundaries tend to have higher image gradient magnitude. Right panel shows Pang (φ|θ), the distribution of measured edge direction φ (obtained from the G ∇I(x) direction) given the true edge normal direction θ, plotted as a function of the angular error φ − θ (in degrees). Note the sharp fall-off about 0 (the plot is periodic with period 180◦ ), showing that the angular error tends to be small.
Next we turn to the orientation map. Since φ(x) estimates the angle that ∇I makes with the x-axis, we can expect that on a true object boundary the value
of φ(x) should point roughly perpendicular to the tangent of the boundary. (For instance, if the intensity is uniformly high on one side of a straight boundary and uniformly low on the other, then φ(x) should describe the vector perpendicular to the boundary that points from the darker to the lighter region.) Denoting the true normal orientation of the boundary as θ, we expect that φ(x) is approximately equal to either θ or θ + π. This relationship between θ and φ(x) may also be quantified as a conditional distribution P_ang(φ|θ), which was measured in previous work [3], see Figure (3). If x is not on an edge then we may assume that the distribution of φ(x) is uniform in all directions: U(φ) = 1/2π.
The imaging model is a distribution of the edge map and image orientation data across the entire image, conditioned on the template shape configuration Q. Defining d(x) = (I_e(x), φ(x)) and letting D denote the values of d(x) across the entire image, we can express the likelihood model as a distribution that factors over every pixel:

P(D|Q) = ∏_{all pixels x} P(d(x) | q_1 · · · q_N)    (3)
where P(d(x) | q_1 · · · q_N) is set to P_on(I_e(x)) P_ang(φ(x) − θ_i) if there exists an i such that x = x_i – i.e. if x lies somewhere on the configuration shape – or to P_off(I_e(x)) U(φ(x)) otherwise. In other words, pixels on the target contour have edge strengths distributed according to P_on(I_e) and orientations drawn from P_ang(φ|θ); pixels off the target contour have edge strengths distributed according to P_off(I_e) and orientations drawn from U(φ).
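For reference, both image maps are cheap to compute; a sketch along the following lines (using SciPy's Gaussian filter, with a smoothing scale chosen by us purely for illustration) produces the edge map I_e(x) and orientation map φ(x) used above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def edge_and_orientation_maps(I, sigma=1.0):
    """I_e(x) = |G * grad I(x)| and phi(x) = arctan(g_y / g_x) from a grayscale image I."""
    I = I.astype(float)
    gy, gx = np.gradient(I)              # image gradient (axis 0 = rows = y)
    gx = gaussian_filter(gx, sigma)      # smooth each gradient component with a Gaussian G
    gy = gaussian_filter(gy, sigma)
    Ie = np.hypot(gx, gy)                # edge strength map
    phi = np.arctan2(gy, gx)             # gradient orientation at every pixel
    return Ie, phi
```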
2.3 The Posterior Distribution
The shape configuration Q is determined by the posterior distribution P(Q|D) = P(Q) P(D|Q)/P(D). As an approximation we can assume that no points in the template q_i and q_j map to the same point x in the image (a reasonable approximation since the prior would make such a deformation unlikely), so we can re-express the likelihood function in a more convenient form:

P(D|Q) = F(D) ∏_{i=1}^{N} [P_on(I_e(x_i)) P_ang(φ(x_i) − θ_i)] / [P_off(I_e(x_i)) U(φ(x_i))]    (4)

where F(D) = ∏_{all pixels x} P_off(I_e(x)) U(φ(x)) is a function independent of Q. Thus the posterior can be written as:

P(Q|D) ∝ P(Q) ∏_{i=1}^{N} [P_on(I_e(x_i)) / P_off(I_e(x_i))] [P_ang(φ(x_i) − θ_i) / U(φ(x_i))]    (5)
Note that the ratio Pon (Ie (xi ))/Pof f (Ie (xi )) will be high for pixels with high image gradients, and that Pang (φ(xi ) − θi )/U (φ(xi )) will be high where the image gradient is approximately aligned parallel with (or 180◦ opposite to) the direction specified by θi .
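In an implementation the product in Equation (5) becomes a sum of per-state log ratios. The helper below is a hypothetical sketch (our own naming; P_on/P_off are assumed to be stored as histograms and P_ang as a lookup over angular error) that evaluates the local evidence for a single candidate state q_i = (x_i, y_i, θ_i).

```python
import numpy as np

def log_unary_evidence(q, Ie, phi, P_on, P_off, bins, P_ang, U_phi=1.0 / (2 * np.pi)):
    """log[P_on(I_e)/P_off(I_e)] + log[P_ang(phi - theta)/U(phi)] for state q = (x, y, theta)."""
    x, y, theta = q
    b = np.clip(np.digitize(Ie[y, x], bins) - 1, 0, len(P_on) - 1)  # edge-strength histogram bin
    edge_term = np.log(P_on[b] + 1e-12) - np.log(P_off[b] + 1e-12)
    err = (phi[y, x] - theta) % np.pi                               # angular error, 180-degree periodic
    ang_term = np.log(P_ang(err) + 1e-12) - np.log(U_phi)           # P_ang assumed to be a lookup/callable
    return edge_term + ang_term
```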
Finally, we can express the posterior as a Markov random field with pairwise interactions (i.e. a graphical model):

P(q_1, · · · , q_N | D) = (1/Z) ∏_i ψ_i(q_i) ∏_{i<j} ψ_ij(q_i, q_j)    (6)
where ψi (qi ) is the local evidence for qi (from the likelihood function, Equation (4)) and ψij (qi , qj ) is the compatibility (or binary potential) between qi and qj (from the shape prior, Equation (2)).
3
Inference by Efficient Belief Propagation
The posterior distribution expresses our knowledge of what shape configurations are most likely given the image data. One way to estimate the most likely shape configuration is to calculate the MAP (maximum a posteriori) estimate, i.e. Q* = arg max_Q P(Q|D). However, we find empirically in our application that the MAP marginal estimate suffices, which is the MAP estimate applied separately to each variable q_i:

q*_i = arg max_{q_i} P(q_i | D)    (7)

for all i, where P(q_i|D) = ∫ dq_1 dq_2 . . . dq_{i−1} dq_{i+1} . . . dq_N P(Q|D) is the marginal posterior distribution of q_i. (If the posterior is sufficiently peaked about its mode then the MAP marginal estimate will approach the MAP estimate, and we speculate that this is the case for our application.) We can estimate the MAP marginals using BP (or alternate algorithms such as CCCP [20]). (An alternate version of BP, the max-product belief propagation algorithm, may also be used to estimate the MAP directly.) We note that the MAP marginal estimate may not suffice for multiple targets (which produce multiple modes in the posterior), but techniques similar to those used to find multiple targets in [3] may be adapted to solve this problem in future research.
If the connectivity of the Markov random field contains no loops (cycles) – i.e. the associated connectivity graph forms a tree – then algorithms such as BP are guaranteed [11] to find the exact marginals of the posterior in an amount of time linear in N. These algorithms are not guaranteed to find a correct solution when the connections have loops, which will naturally arise for the shapes we will consider in this paper. However, recent work shows that, empirically, BP works successfully in certain domains even when loops are present [10,16], and our experimental results corroborate this result.
In this section we describe how the BP algorithm is used to compute the MAP marginals of the posterior distribution. We note that BP requires the use of discrete variables, so it is first necessary to quantize the configuration variable Q by restricting it to assume only a discrete set of values (referred to as the "state space"). In principle this quantization can be made as fine as necessary to approximate the original posterior arbitrarily well, but for computational reasons we will use as coarse a quantization as possible.
We will require that q_i ∈ S, where S is a finite, discrete state space, for all i = 1, . . . , N. In general we will define S according to an adaptive quantization scheme to be the set of all triples (x, y, θ) such that (x, y) are the coordinates of pixels on the image lattice (or only those pixels having sufficiently high edge values), and θ belongs to a sampling of orientations in the range [0, 2π). See Section (4) for details.
3.1 The Standard BP Algorithm
The Markov Random Field framework represents the template as a set of nodes with a particular pairwise connectivity structure and a posterior probability distribution given by Equation (6). In order to estimate the marginal distributions the BP algorithm introduces a set of message variables, where m_ij(q_j) corresponds to the "message" that node i sends to node j about the degree of its belief that node j is in state q_j. The BP algorithm to update the messages is given by:

m^(t+1)_ij(q_j) = (1/Z_ij) Σ_{q_i} ψ_ij(q_i, q_j) ψ_i(q_i) ∏_{k ∈ N(i)\j} m^(t)_ki(q_i)    (8)

where Z_ij = Σ_{q_j} m^(t+1)_ij(q_j) is a normalization factor, the neighborhood N(i) denotes the set of nodes directly coupled to i (excluding i itself), and the superscripts (t + 1) and t are discrete time indices. The product is over all nodes k directly coupled to i except for j. The messages m^(0)_ij(·) may be initialized to uniform values (see next subsection for details). The updates can be done in parallel for all connected pairs of nodes i and j, or they can be applied to different subsets of pairs of nodes at different times, without changing the fixed point of Equation (8). We will exploit this freedom in our focused message updating scheme, described in section (3.3), in order to boost the speed of convergence. The belief variable b_i(q_i) is an estimate of the marginal distribution P(q_i|D) derived from the message variables as follows (omitting the time superscripts for simplicity):

b_i(q_i) = (1/Z_i) ψ_i(q_i) ∏_{k ∈ N(i)} m_ki(q_i)    (9)

where Z_i is a normalization factor chosen so that Σ_{q_i} b_i(q_i) = 1. If the connectivity graph is a tree, then the belief variables b_i(q_i) are guaranteed to equal the true marginals P(q_i|D) when the update equations have converged to a fixed point [11].
In general the computational cost of performing a message update is proportional to |S|², where |S| is the size of the set S, since the sum over q_i in Equation (8) must be evaluated for all q_j on the right-hand side of the equation. However, we exploit the fact that the compatibility function ψ_ij(q_i, q_j) is vanishingly small for most combinations of q_i and q_j. We have experimented with exploiting this "sparsity" constraint in two ways. One way is to consider only
all (x_j, y_j) that are sufficiently close to (x_i, y_i) (no more than a few times the distance between (x_i, y_i) and (x_j, y_j) on the reference shape), i.e. all q_j within a ball B centered on q_i. The other is to use a smaller ball B centered on a prediction of q_j given q_i, using the reference shape to make the prediction. Both of these schemes reduce the computational cost of performing a message update from |S|² to |S||B|.
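A direct, unoptimized transcription of Equations (8) and (9) for a discrete state space might look as follows (dense NumPy arrays and dictionary keying by node pairs are our own choices); the MAP-marginal estimate of Equation (7) is then just the per-node argmax of the beliefs.

```python
import numpy as np

def bp_update(messages, psi_unary, psi_pair, neighbors):
    """One synchronous round of Eq. (8). messages[(i, j)] is a vector over states of node j;
    psi_unary[i] is a vector over states of i; psi_pair[(i, j)] is a |S_i| x |S_j| matrix."""
    new = {}
    for (i, j) in messages:
        prod = psi_unary[i].copy()
        for k in neighbors[i]:
            if k != j:
                prod *= messages[(k, i)]                 # product over incoming messages
        m = psi_pair[(i, j)].T @ prod                    # sum over q_i
        new[(i, j)] = m / m.sum()                        # normalization Z_ij
    return new

def beliefs(messages, psi_unary, neighbors):
    """Eq. (9): b_i(q_i) proportional to psi_i(q_i) * prod_k m_ki(q_i); MAP-marginal is the argmax."""
    b = {}
    for i in psi_unary:
        bi = psi_unary[i].copy()
        for k in neighbors[i]:
            bi *= messages[(k, i)]
        b[i] = bi / bi.sum()
    return b, {i: int(np.argmax(bi)) for i, bi in b.items()}
```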
3.2 Belief Pruning
We initialize the messages to uniform values: m^(0)_ij(q_i) = 1/|S|, which corresponds to all of the beliefs being initialized to uniform distributions across S. We then perform message updates followed by calculations of the beliefs, and iterate this process. If at any point the (normalized) belief of a state q_i associated with node i falls below a threshold value ε (which we typically set to ε = 10^{−8}/|S|), we remove this state from further consideration for node i. More precisely, if b^(t)_i(q_i) is the belief of node i in state q_i at time t, then we define a pruned state space S^(t)_i for node i:

S^(t)_i = {q ∈ S | b^(t)_i(q) > ε}

Using this notation we can write the following message update equation:

m^(t+1)_ij(q_j) = (1/Z_ij) Σ_{q_i ∈ S^(t)_i} ψ_ij(q_i, q_j) ψ_i(q_i) ∏_{k ∈ N(i)\j} m^(t)_ki(q_i)    (10)

for all q_j ∈ S^(t)_j. The expression for the beliefs is unchanged from Equation (9) except that b^(t)_i(q_i) is only evaluated for q_i ∈ S^(t)_i, and the normalization factor Z_i is chosen so that Σ_{q_i ∈ S^(t)_i} b^(t)_i(q_i) = 1.
3.3 Focused Message Updating
The BP message update equation can be executed either by updating all the messages mij (qi ) (we are omitting time superscripts in this section for convenience) in parallel or by updating certain messages before others. The idea of focused message updating is to update messages in a particular sequence so as to speed up BP. The sequence of message updates corresponds to a particular sequence of nodes i1 , i2 , . . . iM . The first messages to be updated are mi1 ,i2 (qi2 ), followed by a calculation of the belief bi2 (qi2 ). This causes the state space Si2 , which was initialized to the full space S, to be pruned. Then messages mi2 ,i3 (qi3 ) are updated, and since Si2 has just been pruned this update will be faster than the first update. Additional pruning will occur when bi3 (qi3 ) is calculated, and the process repeats until miM −1 ,iM (qiM ) is updated. Subsequent sequences of updates are performed until messages mij (qj ) have been updated for all neighboring pairs i and j at least once and the updates have converged. In practice, we have obtained good solutions for our deformable template by calculating the
beliefs after a sufficiently long fixed sequence of updates, without needing to check for convergence to decide when to halt the updates. For our deformable template the initial sequence of nodes has been chosen consistent with the 20 Questions (divide-and-conquer) search strategy [5,13]. Nodes corresponding to “key features” of the target shape – portions of the contour, such as T-junctions and corners, that are less likely than other portions to appear in the background clutter – are visited in the message updates first. This can be illustrated in the case of the letter “A” template, whose connectivity graph is shown in Figure (1). We begin by passing messages from node 3 to 16, which is a pair of neighbors chosen as defining a portion of a T-junction on the left side of the letter. The normals corresponding to these neighboring nodes are roughly perpendicular to each other, and we expect this configuration to be somewhat rarer in background clutter than the straight-line configurations that predominate in the structure of the letter “A” (e.g. nodes 0 through 3 lie along a roughly straight section). Continuing the 20 Questions strategy, the best portion of the template to search for next should be the other T-junction (on the right side of the letter), because we expect that the conjunction of the two T-junction features should be quite rare in the background. Accordingly we next pass messages from node 16 to 19 (via the long-distance connection) and then from 19 to 11, and by the time messages are passed to node 11 a significant amount of pruning has taken place. Messages are then passed in the opposite direction, from nodes 11 to 19 to 16 to 3, so that the state space at all four of those nodes has been substantially reduced, and we can treat these four nodes as a foundation from which we can send out messages to the rest of the nodes in the template. There are many possible ways to continue the updates, and we refer to the technical report [4] for details of the procedure we use. Briefly, the bridge of the letter is traversed, followed by the two legs, followed by the main loop at the top of the letter, and then the entire process is repeated. We experimented with removing the long distance connection from the letter “A” prior (which required slight modifications of the message update sequence) and found that, although BP still converged to the correct solution, it took about 50 percent longer to do so. This shows that the long-distance connection causes substantial pruning, and since the bulk of the computations take place in the message updates among the first several pairs of nodes, increased pruning early on can speed up the overall convergence significantly. Empirically we found that the combined use of belief pruning and focused message updating speeds up BP by a factor of about 30.
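The focused schedule itself is nothing more than an ordered list of directed edges, with beliefs and pruning refreshed after every message. A possible driver loop is sketched below, reusing the helpers above; the example schedule in the comment follows the node numbering of Figure (1) for the letter "A", but the exact sequence is only illustrative.

```python
def focused_bp(schedule, messages, psi_unary, psi_pair, neighbors, active, eps, sweeps=3):
    """Update messages along a hand-chosen sequence of directed edges (key features first),
    recomputing beliefs and pruning after each step so that later updates touch fewer states."""
    for _ in range(sweeps):
        for (i, j) in schedule:
            messages[(i, j)] = pruned_message(i, j, messages, psi_unary,
                                              psi_pair, neighbors, active)
            b, _ = beliefs(messages, psi_unary, neighbors)
            active = prune_states(b, active, eps)
    return messages, active

# Illustrative schedule for the letter "A" of Figure (1): visit the two T-junctions first
# (nodes 3 -> 16 -> 19 -> 11), sweep back, then spread out to the remaining nodes, e.g.
# schedule = [(3, 16), (16, 19), (19, 11), (11, 19), (19, 16), (16, 3), ...]
```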
4
Experimental Results
We tested our deformable template on three shape models: the letter "A," a stick-figure shape and the contour of an open hand. For each shape, the shape prior was created by fixing a suitable reference shape and a sparse connectivity structure. The coefficients K^A_ij, K^B_ij and K^C_ij from Equation (1) were then chosen
so that the stochastic samples from the shape prior represented the desired shape variability reasonably well (see Figures (1,2,4) to see the reference shapes and stochastic samples for the letter “A,” stick-figure and hand shapes).
Fig. 4. Stick-figure shape on top row, hand shape on bottom row. Reference shapes are shown in leftmost column with lines showing the associated connectivity graph, followed by stochastic shape samples in other columns. In the hand shape sample figures, the thumb is shown on the bottom, and long-distance connections are shown between adjacent fingertips.
The letter “A” and stick-figure templates were tested on grayscale images of a whiteboard with handwritten characters and shapes (see Figures (5,6,7) for sample results), and the hand template was tested on grayscale indoor images taken of the hand. The images, which were taken by a digital camera at a resolution of 1280 x 1000 pixels, were decimated by an appropriate amount (typically by a factor of 4 in each dimension) to reduce the size of the state space S. The edge map Ie (x) and orientation map φ(x) were then computed on the decimated images, which took less than a second of CPU time. To reduce the state space S further, we pre-pruned the images by removing pixels that were unlikely to be edges: any pixel location x such that the ratio Pon (Ie (x))/Pof f (Ie (x)) < 0.5 was removed from consideration in the state space S. Finally, we experimented with the simplest possible quantization of template angles: for each pixel location (x, y) remaining after pre-pruning, the triples (x, y, φ(x, y)) and (x, y, φ(x, y)+π) were included in S. A finer sampling of angles in the range [0, 2π), perhaps including several angles focused in the neighborhood of φ(x, y) and φ(x, y) + π, may be more desirable, though this has not yet been tested. Finally, the template was scaled manually to match the scale of the target shape in each image before running BP (it sufficed to set the scale to a precision of about ±10%.) For the whiteboard images, the usual edge map Ie (x), based on the magnitude of the image gradient (see section (2.2)) was replaced by a Laplacian operator in order to minimize the number of edge responses: the image gradient is high on both sides of a handwritten stroke, whereas the magnitude of the Laplacian attains a high value only in the center of the stroke. Since the Laplacian map had high contrast on the whiteboard images, we were able to safely quantize its
values into two bins, so that the P_on(I_e(x))/P_off(I_e(x)) ratio assumed either a low or a high value (see [4] for details).
Results for the "A" template are shown in Figures (5) and (6). Figure (5) illustrates the deformable template's ability to handle significant shape deformations (left panel) and global rotations (right panel). Figure (6) demonstrates that matching is successful even with substantial amounts of clutter. Figure (7) shows successful matches for the stick-figure template, which demonstrates the high amount of variability that is allowed (in this case in the arms of the stick-figure). Finally, Figure (8) shows that the deformable template can detect shapes even when the target has relatively low contrast relative to the background.
The CPU time needed to perform the deformable template matching by BP is roughly proportional to the number of nodes on the template and the number of pixels that survive pre-pruning. However, how much pruning occurs during BP, as well as when it occurs, also affects the speed significantly, and these factors vary from image to image. On an 850 MHz AMD processor, the images in Figure (5) took 2.8 and 21 seconds of CPU time, respectively; the total CPU time divided by the number of initial states |S| for both was on the order of 1 msec per state. The times for the stick-figure and hand results were not measured precisely but ranged from several seconds to half a minute for the stick figures to a few minutes for the hands.
Fig. 5. Two sample letter “A” results, with black-and-white crosses drawn to indicate the location of the “A” shape as estimated by the deformable template algorithm. In the example on the left, note that the template locates the upper-case “A” and not the lower-case version (a separate shape prior would need to be used to detect lower-case “A”’s). On the right, note that the template is rotation-invariant and thus is able to detect the upside-down “A.”
5
Conclusion
We have described a promising algorithm for detecting flexible objects in real images. The algorithm can handle substantial shape deformations, large amounts of background clutter and low contrast between target and background. It is desirable to speed up the algorithm further and to deal with certain limitations of the model. Alternatives to BP for estimating posterior marginals, such as CCCP [20] and tree reparameterization [14], enjoy certain advantages
Fig. 6. Letter "A" located by deformable template algorithm in images with considerable background clutter.
Fig. 7. Stick-figure matching results. Note the high variability of the shape, especially in the arms.
over BP (for instance, convergence to a locally optimal solution is guaranteed for CCCP even on a loopy graph) and may prove faster than BP if techniques analogous to belief pruning and focused message updating can be adapted to them. We expect that augmenting the existing edge and orientation maps by other low-level cues such as corner, T-junction and endstop detectors would significantly speed up template matching by providing a quick method for zeroing in on key features.
Fig. 8. Top row, results of matching hand template in images. Matching succeeds even for fairly low contrast between the target and the background, as shown for the second result in the bottom row (original image on left, P_on(I_e(x))/P_off(I_e(x)) ratio on right).
The biggest limitation of the current model seems to be the lack of scale invariance. One way to overcome this limitation is to use a scale-invariant shape representation. Another solution may be to use a hierarchical representation that detects parts of the target shape – such as straight-line segments and fingertips in the case of a hand shape – at a range of scales before grouping them together. Acknowledgements. We would like to thank Alan Yuille for his useful comments and suggestions. Both authors were supported by the National Institute of Health (NEI) grant number RO1-EY 12691-01. Sabino Ferreira was also supported by the CNPq- Brazilian National Council of Research.
References 1. S. Belongie, J. Malik and J. Puzicha, “Shape Matching and Object Recognition Using Shape Contexts,” accepted for publication in PAMI. 2. H. Chui and A. Rangarajan, “A new algorithm for non-rigid point matching,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2000. 3. J. Coughlan, A.L. Yuille, C. English, D. Snow. ”Efficient Deformable Template Detection and Localization without User Initialization.” Computer Vision and Image Understanding, Vol. 78, No. 3, pp. 303-319. June 2000.
4. J. Coughlan and S. Ferreira. “Finding Deformable Shapes using Loopy Belief Propagation.” Technical report in preparation. 5. F. Fleuret and D. Geman, “Coarse-to-fine face detection,” Inter. Journal of Computer Vision, 41, 85-107, 2001. 6. D. Geman. and B. Jedynak. “An active testing model for tracking roads in satellite images”. IEEE Trans. Patt. Anal. and Machine Intel. Vol. 18. No. 1, pp 1-14. January. 1996. 7. U. Grenander, Y. Chow and D. M. Keenan, Hands: a Pattern Theoretic Study of Biological Shapes, Springer-Verlag, 1991. 8. S. Konishi, A. L. Yuille, J. M. Coughlan, and S. C. Zhu. “Fundamental Bounds on Edge Detection: An Information Theoretic Evaluation of Different Edge Cues.” Proc. Int’l conf. on Computer Vision and Pattern Recognition, 1999. 9. B. Lowerre and R. Reddy, “The Harpy Speech Understanding System.” In Trends in Speech Recognition. Ed. W. Lea. Prentice Hall. 1980. 10. K.P. Murphy,Y. Weiss and M.I. Jordan. ”Loopy belief propagation for approximate inference: an empirical study”. In Proceedings of Uncertainty in AI. 1999. 11. J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufman. 1988. 12. H. Schneiderman and T. Kanade. ”A Statistical Method for 3D Object Detection Applied to Faces and Cars”. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2000). 13. P. Viola and M. Jones. “Rapid Object Detection using a Boosted Cascade of Simple Features,” CVPR 2001. 14. M. Wainwright, T. Jaakkola and A. Willsky. “Tree-based reparameterization for approximate estimation on loopy graphs.” To appear in NIPS 2001. 15. L. Wiscott and C. von der Marlsburg, “A Neural System for the Recognition of Partially Occluded Objects in Cluttered Scenes”. Neural Computation. 7(4):935948. 1993. 16. J.S. Yedidia, W.T. Freeman, Y. Weiss. “Bethe Free Energies, Kikuchi Approximations, and Belief Propagation Algorithms”. 2001. MERL Cambridge Research Technical Report TR 2001-16. 17. A. L. Yuille, “Deformable Templates for Face Recognition”. Journal of Cognitive Neuroscience. Vol 3, Number 1. 1991. 18. A.L. Yuille and J. Coughlan. “Twenty Questions, Focus of Attention, and A*: A theoretical comparison of optimization strategies.” In Proceedings EMMCVPR’ 97. Venice. Italy. 1997. 19. A.L. Yuille and J. Coughlan. ”Visual Search: Fundamental Bounds, Order Parameters, and Phase Transitions.” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), Vol. 22, No. 2, pp. 1-14. February 2000. 20. A. L. Yuille, “CCCP Algorithms to Minimize the Bethe and Kikuchi Free Energies: Convergent Alternatives to Belief Propagation.” Neural Computation. In press. 21. S. C. Zhu, R. Zhang, and Z. W. Tu, ”Integrating Top-down/Bottom-up for Object Recognition by Data Driven Markov Chain Monte Carlo”, Proc. of Int’l Conf. on Computer Vision and Pattern Recognition, Hilton Head, SC, 2000.
Probabilistic and Voting Approaches to Cue Integration for Figure-Ground Segmentation Eric Hayman and Jan-Olof Eklundh Computational Vision and Active Perception Laboratory (CVAP) Dept. of Numerical Analysis and Computer Science KTH, SE-100 44 Stockholm, Sweden {hayman, joe}@nada.kth.se http://www.nada.kth.se/˜hayman/ECCV2002 Abstract. This paper describes techniques for fusing the output of multiple cues to robustly and accurately segment foreground objects from the background in image sequences. Two different methods for cue integration are presented and tested. The first is a probabilistic approach which at each pixel computes the likelihood of observations over all cues before assigning pixels to foreground or background layers using Bayes Rule. The second method allows each cue to make a decision independent of the other cues before fusing their outputs with a weighted sum. A further important contribution of our work concerns demonstrating how models for some cues can be learnt and subsequently adapted online. In particular, regions of coherent motion are used to train distributions for colour and for a simple texture descriptor. An additional aspect of our framework is in providing mechanisms for suppressing cues when they are believed to be unreliable, for instance during training or when they disagree with the general consensus. Results on extended video sequences are presented.
1
Introduction
A key capability for active observers is to segment foreground objects out from the background. This enables higher level processes such as object recognition, object grasping and vehicle navigation to take place. It is also a topic of significant interest in video coding where the MPEG-4 format requires different objects to be represented separately for efficient transmission of data and for interactive editing of sequences. To achieve this goal robustly, systems can make use of multiple cues, the cues chosen in this work are motion, colour, contrast and prediction, although cues based on stereo or more advanced models of texture could readily be incorporated. The use of multiple cues not only provides more measurements when estimating a state, but, crucially, different cues may be complementary in that one may succeed when another fails. For example segmentation based on motion or stereo tends to suffer in untextured regions, but it is precisely in such patches that colour analysis is at its most effective. A second potential cause of failure is outlying measurements, which plague many computer vision algorithms. Overall system performance may be improved in two main ways, either by enhancing each individual cue, or by improving the scheme for integrating the information from
Visit the website to see the movies accompanying this paper.
the different modalities. The contributions of this paper concern the latter, we do not attempt to improve on existing techniques for the individual components. We present and compare two alternative integration techniques. The first is a probabilistic approach where the likelihood of observing the data given a model of each layer is computed before Bayes rule is applied to classify the pixels. The second scheme is a voting method, the key difference being that each cue makes an independent decision regarding layer membership before these decisions are combined using a weighted sum. The advantage of voting in data fusion is that measurements drawn from very different spaces can easily be combined [6]. With probabilistic methods more care must be taken in designing the model of each so that the different cues combine in the desired manner. However, therein also lies their strength since it forces the user to be explicit in terms of designing the model and specifying what parameters are used and what assumptions are made. As in [2,21,23] the final output from both algorithms is a measure of belief of membership to a layer and is a continuous number between 0 and 1. The sum of these over all layers must be equal to 1 at all pixels. Layers are allowed to compete for pixels rather than assigning a pixel to the first layer which can describe the observations. A contaminated mixture of Gaussians (described in Section 3) is used to model both the foreground and background layers in each modality. The use of mixture models permits a single layer to encode an entire object consisting of a number of distinct, homogeneous, regions. Any additional cue which can also be modelled in this manner can easily be inserted into the framework. A key contribution of our work is that our system not only integrates information over different modalities, but also over time by learning models for cues based on the output of the overall algorithm in previous frames. During initialization, the motion cue provides an estimate of the segmentation, and this classification of pixels is then used to train models for other modalities via the Expectation-Maximization (EM) algorithm [8, 4]. This builds on the notion that figure-ground separation is simpler if there is a solution from previous time-steps, and it ensures temporal stability, which cannot be guaranteed by repeatedly applying traditional single-frame segmentation methods based on colour, texture and proximity. Thus the approach adopted in this paper strongly resembles the tracking paradigm where new images are searched for evidence supporting a particular hypothesis. However, our purely data driven approach is in contrast to most work with multiple cues in the tracking community which relies on prior models of at least some cues in terms of colour distributions or shape or appearance models, although [25] deals with learning non-parametric distributions for skin colour and head orientation for face-tracking. Our work is motivated largely by applications in autonomous mobile robots which may encounter a large variety of objects and where assistance by a human in bootstrapping is at best inconvenient and at worst impossible. This scenario further constrains our approach to being sequential and feasible for real-time implementation. The models are adapted over time, initially so that more data is used to train the models, and subsequently to handle changes such as illumination variations. 
In terms of Clark and Yuille’s classification scheme for data fusion algorithms [7], our techniques are recurrent strongly-coupled fusion since the outputs of modalities affect other modalities via the training mechanism. Both schemes incorporate reliability measures. In the Bayesian approach these are modelled via hyper-priors on the model parameters, whereas in the voting algorithm this capability is provided by the weights in the final integration stage. The reliabili-
ties are used to suppress cues while their models are being learnt, and to incorporate self-evaluation outputs from the different modalities. For instance, if little independent motion is detected, motion computation, and subsequently motion segmentation, could be unreliable. A further use of reliabilities is based on history; if a cue has been unreliable in recent frames, then it should be given less weight than the more reliable modalities also in the current frame [26]. In the future we also envisage using reliabilities to express belief concerning pixels being occluded in motion and stereo algorithms. Previously a number of different techniques have been used to assign pixels to layers according to some coherent property. Particularly well studied is motion segmentation both in 2D [11,5,27,2], and in 2.5 and 3D [10,17] whereas [23] used depth information. With regard to combing cues, in independent work Khan and Shah [12] proposed a framework with many similarities to ours. They use optic flow, colour and a spatial cue (similar to our prediction cue) and train colour distributions. Each region is only assigned a single colour. Thus the foreground and background risk being split into several regions rather than being assigned to a single foreground or background layer. They weight cues depending on reliability measures for optic flow. Other work closely related to ours is that of Tao et al. [21] who computed posterior estimates of pixel assignment as well as model parameters for real-time vehicle tracking and segmentation from airborne cameras using motion, appearance and shape information. Altunbasak et al. [1] iteratively enhanced the output of a motion segmentation algorithm with results from a colour or intensity based segmentation step. After motion segmentation, entire regions from the colour or intensity segmentation are assigned to layers, and then the motion parameters are recomputed. In [22] colour segmentation was used in a similar manner to improve stereo depth maps. Nordlund and Eklundh [15] combine motion and stereo by detecting and tracking peaks in histograms in a combined motion-stereo space. Global techniques such as normalized cuts [19] can incorporate any kind of information available by setting the edge weights in a graph, but as yet this and related algorithms are slow. For single images Malik et al. [14] provide an illuminating discussion of the integration of texture and contour information whereas Belongie et al. [3] considers fusion of texture and colour cues. A number of these techniques are based on the EM algorithm, or a generalization of it, to provide mathematically rigorous algorithms [2,3,22,23]. The segmentation problem is closely related to tracking, and our work is also inspired by literature in that field. Whereas in tracking the goal is usually to recover the image of a fixed point (x, y) on the tracked object, in segmentation a dense map of membership to each layer is required. Impressive real-time tracking has been demonstrated by Triesch and von der Malsburg [26], Kragic [13] and Spengler and Schiele [20] who recover saliency maps from each independent cue and combine these using a weighted sum. The coordinates of the fixation point are then given by the peak in the combined saliency map. The weights are continuously updated according to the general consensus. The voting technique presented in the current paper is based on this integration scheme. 
In figure-ground segmentation the problem may be posed in much the same manner, except that in the final stage the weighted sum of saliency maps is itself the output, or alternatively a binary mask derived from it by a simple thresholding technique. Furthermore, while in tracking it is frequently just the foreground object which is modelled, segmentation is most naturally posed as finding which of two or more layers best explains the measurements at each pixel. In this paper we demonstrate how reweighting cues in voting also applies to segmentation, and we present an adaptation suitable for the
Bayesian integration scheme. For probabilistic tracking with multiple cues, Toyama and Horvitz [24] and Sherrah and Gong [18] use Bayesian networks to incorporate reliability measures. The former work uses discrete variables whereas the latter generalizes the technique to continuous variables while maintaining computational tractability by using Gaussian mixture models. Also of note is the co-inference approach of Wu and Huang [28] where shape and colour distributions are adapted online with impressive results. The rest of the paper is organized as follows. The next two sections discuss the voting and probabilistic fusion schemes and the choice of probability density functions before Section 4 presents the inclusion of reliability measures. Section 5 describes the particular cues used in this work while Section 6 outlines how their models are automatically initialized and subsequently updated. Experimental results are presented in Section 7. Conclusions are drawn and avenues for future research discussed in Section 8.
2
The Integration Schemes
2.1 Weighted Voting
The key feature of the voting method is that each cue makes a decision regarding layer membership independent of all other cues before the results are combined. Let the likelihood of the observations Z_{i,k} from cue k at pixel i given the model M_{j,k} of layer j be denoted by p_{i,k}(Z_{i,k} | M_{j,k}). Then for each cue, the posterior probability of layer membership is found using Bayes' Rule

p_{i,k}(j | Z_{i,k}) = p_{i,k}(Z_{i,k} | M_{j,k}) p(j) / Σ_{j'} p_{i,k}(Z_{i,k} | M_{j',k}) p(j')    (1)

where p(j) is the prior probability of layer j. These priors may be used to express belief concerning the size of the foreground relative to the background. The cues are combined to obtain an overall value for degree of membership to a given layer using a weighted sum of the responses from each cue

score_i(j) = Σ_k w_k p_{i,k}(j | Z_{i,k})
where these scores are normalized to sum to one over all layers. The weights are used to express the confidence in each cue. This data fusion scheme is also known as a linear opinion pool. An important property of voting is that the influence of a single cue is bounded. Considering equation (1) it matters little if the ratio p_{i,k}(Z_{i,k} | M_{j,k})/p_{i,k}(Z_{i,k} | M_{j',k}) is 10 or infinite for all layers j' ≠ j since the posterior degree of membership from that cue alone subsequently varies only between 0.9 and 1. Regardless how strongly the evidence from a single cue supports a given layer, this layer assignment may still be defeated if two or more other cues vote in favour of an alternative hypothesis, even if they have significantly lower ratios p_{i,k}(Z_{i,k} | M_{j,k})/p_{i,k}(Z_{i,k} | M_{j',k}). This can be a desirable property since it provides some robustness to outliers. It is worthwhile commenting on how the final result is influenced by uncertain cues, that is cues with equal p_{i,k}(j | Z_{i,k}) for all layers j at a certain pixel. This is best illustrated
with an example. Assume that there are two cues, k and k', and two layers j and j', and that the measurements yield p_{i,k}(j | Z_{i,k}) = 0.8, p_{i,k}(j' | Z_{i,k}) = 0.2, p_{i,k'}(j | Z_{i,k'}) = 0.5, and p_{i,k'}(j' | Z_{i,k'}) = 0.5, implying that cue k' is completely uncommitted to either layer. Combining the cues assuming equal weights gives score_i(j) = 0.65 and score_i(j') = 0.35. The uncertain cue k' "blurs" the final response, although a binary mask derived as the arg max over j remains unchanged. Similarly one may verify that when all cues provide the same output, e.g. p_{i,k}(j | Z_{i,k}) = 0.8, p_{i,k}(j' | Z_{i,k}) = 0.2, p_{i,k'}(j | Z_{i,k'}) = 0.8, and p_{i,k'}(j' | Z_{i,k'}) = 0.2, the final scores also take these values, score_i(j) = 0.8 and score_i(j') = 0.2.
2.2 A Bayesian Approach to Segmentation
From a probabilistic point of view, the voting algorithm described above is invalid, as will be highlighted below while presenting our second approach to cue integration. When deriving the probabilistic integration scheme we make the assumption that the observations from the cues are independent, and so the total likelihood of the observations given the combined model M_j = {M_{j,1} . . . M_{j,k}} over all cues k for layer j at pixel i is

p_i(Z_i | M_j) = ∏_k p_{i,k}(Z_{i,k} | M_{j,k}).

A posterior estimate of layer membership is then given by Bayes' Rule as

p_i(j | Z_i) = ∏_k p_{i,k}(Z_{i,k} | M_{j,k}) p(j) / Σ_{j'} ∏_k p_{i,k}(Z_{i,k} | M_{j',k}) p(j')    (2)

This is also known as an independent opinion pool. The important difference between this and the voting scheme is that the observations from different cues are now combined before layer membership is evaluated as opposed to computing layer membership for each cue and only then combining the results. Upon consideration of equation (2) it is easily seen that completely uncertain cues have no effect on the posterior estimate, in contrast with voting which blurred the layer assignment scores. Repeated measurements reinforce each other: with uniform priors and p_{i,k}(Z_{i,k} | M_{j,k}) = 0.8, p_{i,k}(Z_{i,k} | M_{j',k}) = 0.2, p_{i,k'}(Z_{i,k'} | M_{j,k'}) = 0.8, and p_{i,k'}(Z_{i,k'} | M_{j',k'}) = 0.2, the posterior layer assignments are p_i(j | Z_i) ≈ 0.941 and p_i(j' | Z_i) ≈ 0.059. If a measurement coincides exactly with a sharp peak in one cue's probability density function (pdf), the likelihood ratio p_{i,k}(Z_{i,k} | M_{j,k})/p_{i,k}(Z_{i,k} | M_{j',k}) can be very large, and equation (2) shows that this cue can completely overpower the remaining cues, unless they too have measurements coinciding with equally sharp peaks. Sharply peaked distributions can arise when a cue is trained from a compact cluster of data. Hence different cues can have very different looking pdf's, and it is by no means trivial to ensure that the resulting interaction between the cues is the desired one. Voting avoids this problem by to a great extent only permitting interplay between competing layer models, and not so much between the models for different cues. The flip side of the coin is that it is difficult to ascertain precisely what that interaction between cues is.
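The numerical examples above are easy to verify; the short script below (uniform priors and equal weights assumed, as in the text) implements both the linear and the independent opinion pool for a two-cue, two-layer case.

```python
import numpy as np

def linear_opinion_pool(likelihoods, weights, priors):
    """Voting: per-cue posteriors via Bayes' rule, then a weighted sum over cues."""
    post = likelihoods * priors                      # shape (n_cues, n_layers)
    post /= post.sum(axis=1, keepdims=True)
    score = weights @ post
    return score / score.sum()

def independent_opinion_pool(likelihoods, priors):
    """Bayesian fusion: multiply likelihoods over cues, then apply Bayes' rule once."""
    joint = likelihoods.prod(axis=0) * priors
    return joint / joint.sum()

L = np.array([[0.8, 0.2],     # cue k: favours layer j
              [0.5, 0.5]])    # cue k': completely uncommitted
print(linear_opinion_pool(L, np.array([0.5, 0.5]), np.array([0.5, 0.5])))  # -> [0.65, 0.35]
print(independent_opinion_pool(L, np.array([0.5, 0.5])))                   # -> [0.8, 0.2]
```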
3
Choosing Probability Distribution Functions
In our work we assume that the likelihood probability density functions may be described by a contaminated mixture of H Gaussians

p_{i,k}(Z_{i,k} | M_{j,k}) = p_0/A + (1 − p_0) Σ_{h=1}^{H} α_h N(µ_h, Σ_h),

N(µ_h, Σ_h) = (1 / ((2π)^{d/2} |Σ_h|^{1/2})) e^{−(1/2)(Z_{i,k} − µ_h)^T Σ_h^{−1} (Z_{i,k} − µ_h)}    (3)
where d is the dimensionality of the measurement space and each Gaussian is described by its mean µ and covariance matrix Σ. The Gaussians are weighted by factors α_h where Σ_h α_h = 1. | · | denotes the matrix determinant. The mixture of Gaussians is augmented by a uniform distribution which integrates to p_0 over the hyper-volume A of the domain of the d-dimensional measurements Z_{i,k}. Note that the probability density function integrates to unity as required. The contamination with a uniform distribution is crucial. Without it, if no layer fits a measurement well such that p_{i,k}(Z_{i,k} | M_{j,k}) is very small for all j, the ratio of likelihoods for that cue for layers j and j', p_{i,k}(Z_{i,k} | M_{j,k})/p_{i,k}(Z_{i,k} | M_{j',k}), can still be several orders of magnitude. Since it is unreasonable that the posterior probabilities should be biased so much by an unreliable measurement (an outlier) we require probability density functions with heavy tails, and adding the uniform distribution has indeed the desired effect. Thus p_0 represents the expected proportion of data poorly described by the Gaussians. In our experiments p_0 was set to 0.1. Contaminations were also used in [23,21].
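Equation (3) translates directly into code; the function below evaluates the contaminated mixture for a batch of d-dimensional measurements (the array shapes and naming are our own choices, not the authors').

```python
import numpy as np

def contaminated_mixture_pdf(Z, alphas, mus, Sigmas, p0, A):
    """Eq. (3): p(Z) = p0/A + (1 - p0) * sum_h alpha_h N(Z; mu_h, Sigma_h). Z has shape (n, d)."""
    n, d = Z.shape
    p = np.full(n, p0 / A)                                # uniform contamination over hyper-volume A
    for alpha, mu, Sigma in zip(alphas, mus, Sigmas):
        diff = Z - mu
        inv = np.linalg.inv(Sigma)
        norm = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))
        maha = np.einsum('nd,de,ne->n', diff, inv, diff)  # (Z - mu)^T Sigma^-1 (Z - mu)
        p += (1 - p0) * alpha * norm * np.exp(-0.5 * maha)
    return p
```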
4 Incorporating Reliabilities
One of the main advantages of voting is that the weights provide a simple way of expressing confidence in the output of each cue. Below we review how the weights may be continuously updated before discussing how such features may be included in probabilistic cue integration.

4.1 Self-Organized Adaptation in Weighted Voting
In their tracking paper, Triesch and von der Malsburg [26] proposed a scheme, Democratic integration, for adapting the weights online by investigating which cues were in agreement with the final result. A score q_k is computed for each cue based on the rms values, Ē, of the difference between the membership maps computed by that cue alone and the membership maps obtained by fusing the outputs from all cues¹. In our implementation the scores are computed as q_k = e^{−aĒ}, with a = 2, since this gives highest scores for zero error and tails off suitably towards the worst rms error of 1. These q_k are
¹ This is one of the error measures suggested by Triesch and von der Malsburg, but their smaller state space permits them to use a simpler scheme in their tracking implementation.
then normalized to sum to one over all k, Σ_k q_k = 1. Assuming that also the weights w_k used in the integration sum to one, the weights are adapted according to

τ ẇ_k = q_k − w_k        (4)
where τ is a parameter which determines the rate of change of the weights. The normalization of w_k and q_k assures that the updated weights also sum to one.
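For concreteness, a minimal sketch of one discrete update of equation (4) is shown below. Only a = 2 comes from the text; the time constant tau and the step dt are hypothetical values chosen for illustration.

```python
import numpy as np

def update_weights(weights, errors, a=2.0, tau=5.0, dt=1.0):
    """One Euler step of democratic integration, equation (4)."""
    q = np.exp(-a * np.asarray(errors))    # scores q_k = exp(-a * E_k)
    q = q / q.sum()                        # normalise scores to sum to one
    weights = weights + (dt / tau) * (q - weights)   # tau * dw/dt = q - w
    return weights / weights.sum()         # numerical safeguard; already sums to one
```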
4.2 Incorporating Reliabilities into the Probabilistic Fusion Scheme
A rigorous scheme for expressing belief in the validity of cues is through hyper-priors on the model parameters. We choose to apply hyper-priors to the means μ in the Gaussian mixture models in equation (3). Dropping the subscript h, let one such component be N(μ, Σ). The mean μ of this distribution is itself assumed to be a Gaussian random variable centred on its current estimate μ* and with covariance matrix Σ*, in summary p(μ | μ*, Σ*) = N(μ*, Σ*). The parameter μ can be eliminated by marginalization,

p(Z | μ*, Σ, Σ*) = ∫ p(Z | μ*, Σ, Σ*, μ) p(μ | μ*, Σ, Σ*) dμ,

using the property that a convolution of two Gaussians N(μ_1, Σ_1) and N(μ_2, Σ_2) is itself a Gaussian N(μ_1 + μ_2, Σ_1 + Σ_2). Computationally the hyper-priors are thus applied simply by replacing each Gaussian N(μ, Σ) in equation (3) by a flatter distribution N(μ, Σ + Σ*). What remains to be established in the model is what Σ* should be; assuming that we have a measure of reliability R varying from 0 to 1 for each cue, we require a mapping Σ*(R) such that Σ* grows as R tends to 0, and such that Σ* is small when R is close to 1. A difficulty in defining this mapping is that for those cues which are trained online we lack precise advance knowledge of the pdf’s covariance, so Σ*(R) must be suitable for both relatively flat and sharp distributions. Although a number of functions Σ*(R) could be suitable, our current implementation uses

Σ*(R) = a(1 − R)Σ + b(1/R − 1)I        (5)

where I is the identity matrix and Σ is the covariance of the original Gaussian, as before. The first term is linear in Σ, expressing the belief that the flatter a peak is, the less certain is its location. The second term is independent of Σ and ensures that also extremely sharp distributions are blurred. Suitable values for the constants a and b are chosen in advance for each cue and are the same in all sequences in our experiments. The values are also set to express prior belief in the reliability of each cue. For instance, we believe colour to be more trustworthy than the simple texture descriptor, and we can ensure that this is reflected in experiments by setting a and b appropriately. The values of these parameters for each cue are given in Section 5. In fact we used a = 5 for all cues and only varied b. In the preceding discussion we neglected the uniform component in the contaminated Gaussian mixture models. However, convolving a uniform distribution with a Gaussian gives a good enough approximation to a uniform distribution that the pdf
p(Z | μ*, Σ, Σ*) is still a contaminated Gaussian. This is described in more detail in the long version of this paper [9]. There we also consider applying hyper-priors not to the means but to the covariances in the likelihood pdf’s. With colour, for instance, it is not unreasonable to assume that the uncertainty does indeed lie in the location of the peak in hue-saturation space, but with other cues we in fact know the mean exactly (e.g. as 0 or 1) and it would be more appropriate to assign hyper-priors to the covariance matrices in the Gaussians. The pdf after marginalization takes a much less simple form, but can often be well approximated by a contaminated Gaussian. The long version of this paper also contains a comparison between hyper-priors and logarithmic opinion pooling as adopted by Khan and Shah [12] by investigating how this procedure alters the pdf’s in the contaminated Gaussians. We have implemented logarithmic opinion pooling and found that it works well and has the added benefit of not requiring further parameters. The results will not be reported in this paper.
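To make the mapping of equation (5) concrete, the sketch below (ours, with hypothetical default values of a and b) flattens a Gaussian component according to the cue reliability R; the clipping of R away from zero is an implementation assumption to avoid division by zero.

```python
import numpy as np

def flattened_covariance(cov, R, a=5.0, b=0.001):
    """Apply the hyper-prior of equation (5): Sigma* = a(1-R) Sigma + b(1/R - 1) I,
    and return the blurred covariance Sigma + Sigma* used in N(mu, Sigma + Sigma*)."""
    R = np.clip(R, 1e-3, 1.0)
    sigma_star = a * (1.0 - R) * cov + b * (1.0 / R - 1.0) * np.eye(cov.shape[0])
    return cov + sigma_star
```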
4.3 Self-Organized Weighting of Cues in the Probabilistic Technique
A simple adaptation of Triesch and von der Malsburg’s self-organized reweighting scheme has proved effective for the Bayesian integration approach. The rms error Ē is computed for each cue k and converted to a score q_k between 0 and 1 as in Section 4.1. New values for the reliabilities R_k are then given by equation (4) with the R_k replacing the weights w_k, but now the scores and reliabilities are not normalized to sum to one since this would cause a different amount of blurring of the Gaussian mixture models depending on how many cues are present². When computing the scores we use the contaminated Gaussian mixture models without the blurring caused by applying the hyper-priors. This places an extra computational burden on the system as all likelihoods must be computed twice, but it cannot be avoided since computing a score for a completely blurred model is meaningless.
5 Implementation Details: The Individual Cues
Analogous to Clark and Yuille’s classification scheme for weak and strong coupling of cues in data fusion [7], we distinguish between cues which operate independently of all other cues, and those which are coupled to others through the training scheme.

5.1 Motion
The only independent cue in our system is motion. Currently we use Black and Anandan’s algorithm (and code) [5] to fit planar affine motion models between temporal pairs of images. In this direct and robust approach the dominant motion is first computed and pixels which satisfy this model are removed. The process may be repeated for further layers by fitting affine motion models to the remaining data. While Black and Anandan stop here and report binary segmentation masks, we instead allow layers to compete for the pixels. This is done by defining a likelihood function for each layer j as
² Even so, there is still some reliance on the number of cues since the error Ē will in general increase with the number of cues.
p_{i,k}(Z_{i,k} | M_{j,k}) = p_0/A + (1 − p_0) N(I_i^t − I_{i,j}^{t−1}, σ_m²),
where I_i^t − I_{i,j}^{t−1} is the greyscale difference between pixels i from the current image and those from the image from the previous time-step aligned with the current image according to the recovered motion model. σ_m is a non-adaptive parameter. The dominant motion is assumed to belong to the background layer. For the background layer we also perform a temporal integration following Irani et al. [11] by comparing the current image with an internal representation of the layer based on the previous images and recovered motion models. This scheme reduces the effect of random noise in the greyscale images, and blurs out independently moving objects. We found that this scheme was not always appropriate for foreground layers since our sequences do not contain rigid motion, in which case it is somewhat arbitrary what motion model is computed. As a result the appearance model becomes blurred and harms rather than improves the motion computation in subsequent frames. In addition to letting the general consensus dictate the reliability R of motion we multiply R by a number r between 0 and 1 which is evaluated by counting the fraction f of pixels which were used to compute motion models for the foreground layers [18]. r is given by min(1.0, f/β) where β is a constant of 0.05 in our experiments. Having normalized the greyvalues to lie between 0 and 1, the parameters in equation (5) for applying hyper-priors are a = 5, b = 0.1 for this cue.
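A minimal sketch of the two ingredients just described follows (ours, not from the paper); sigma_m is left as a free parameter since its value is not given in this section, and the warping of the previous frame by the recovered motion model is assumed to have been done elsewhere.

```python
import numpy as np

def motion_likelihood(curr, warped_prev, sigma_m, p0=0.1, A=1.0):
    """Per-pixel motion-layer likelihood: contaminated Gaussian on the greyscale
    difference between the current frame and the motion-compensated previous frame
    (greyvalues normalised to [0, 1])."""
    diff = curr - warped_prev
    gauss = np.exp(-0.5 * (diff / sigma_m) ** 2) / (np.sqrt(2.0 * np.pi) * sigma_m)
    return p0 / A + (1.0 - p0) * gauss

def motion_reliability_factor(num_foreground_pixels, num_pixels, beta=0.05):
    """Down-weight the motion cue when few pixels support the foreground motion
    model: r = min(1, f / beta), with f the supporting fraction."""
    f = num_foreground_pixels / float(num_pixels)
    return min(1.0, f / beta)
```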
5.2 Colour
Our colour segmentation cue is based on the work of Raja et al. [16] who model colour distributions with adaptive Gaussian mixture models in hue-saturation space. Although that paper was mainly motivated by tracking, they do also present some segmentation results. The key difference in our system is that models are initialized online rather than offline, as will be discussed in Section 6. Pixels lacking colour, that is, pixels whose saturation falls below a threshold, are assigned zero reliability. Similarly, saturation of the CCD causes inaccurate estimation of colour components, and so such pixels are also removed. The constants in equation (5) are a = 5, b = 0.001.

5.3 Image Contrast
As a very simple texture cue, we compute the standard deviation of greyscale values within 7×7 patches at two scales and form a feature vector by concatenating the contrast response over the scales. A contaminated mixture of two Gaussians is initialized and subsequently adapted online. This cue was also used in [26,13], though only at a single scale, and the model was given in advance. Responses from larger filter banks could be used to replace or complement the contrast measure. However, as feature vectors grow in size the computational cost increases rapidly unless filter responses can be considered independent of each other. The constants in equation (5) are a = 5, b = 0.005 for this cue.
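The contrast descriptor can be sketched as below (our illustration). The paper specifies 7×7 patches at two scales but not how the second scale is obtained; approximating it with a larger window is our assumption.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_std(grey, size):
    """Standard deviation of greyscale values in size x size patches."""
    mean = uniform_filter(grey, size)
    mean_sq = uniform_filter(grey * grey, size)
    return np.sqrt(np.maximum(mean_sq - mean * mean, 0.0))

def contrast_feature(grey):
    """Two-scale contrast descriptor: 7x7 local std plus a coarser-scale response
    (here a 14x14 window stands in for the second scale)."""
    return np.stack([local_std(grey, 7), local_std(grey, 14)], axis=-1)
```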
5.4 Prediction
Akin to what has been done in tracking [26,13], the segmentation masks from the previous frame are used to predict the layer assignment in the current frame. The background mask is warped according to the affine motion model recovered in the previous frame. We do not warp the foreground layer since we have less faith in the foreground motion model due to non-rigid movement. This mask is then blurred by a 9 × 9 Gaussian mask with σ = 4, giving a value Z between 0 and 1 at each pixel. The likelihood of a pixel belonging to layer j is modelled with a contaminated normal distribution on the prediction value Z,

p_{i,k}(Z_{i,k} | M_{j,k}) = p_0/A + (1 − p_0) N(Z − 1, σ_p²),

where A = 1 and the parameter σ_p = 1/π is chosen such that the Gaussian tails off suitably towards 0 as Z goes to 0. We acknowledge that the assumption that the observations from this cue are independent of the others is somewhat bold. The spatial cue of Khan and Shah [12] is, in fact, very similar to this³. For this cue we used a = 5, b = 0.5 in equation (5).
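A sketch of this cue is given below (ours). Warping by the recovered affine motion is assumed to have been applied already; gaussian_filter with sigma = 4 approximates the 9 × 9 Gaussian mask of the text, and evaluating the complementary layer with N(Z − 0, σ_p²) is our assumption.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def prediction_likelihoods(prev_background_mask, sigma_p=1.0 / np.pi, p0=0.1):
    """Blur the (already warped) previous background mask and evaluate the
    contaminated normal likelihoods for the two layers (A = 1)."""
    Z = gaussian_filter(prev_background_mask.astype(float), sigma=4.0)
    def contaminated(x):
        return p0 + (1.0 - p0) * np.exp(-0.5 * (x / sigma_p) ** 2) / \
               (np.sqrt(2.0 * np.pi) * sigma_p)
    return contaminated(Z - 1.0), contaminated(Z)   # background, foreground (assumed)
```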
6 Model Initialization and Adaptation
The two cues which require learning and subsequent adaptation are colour and contrast. When starting the algorithm we first run the motion segmentation algorithm for five frames so that the appearance model for the background layer settles. Then, the layer classification from motion segmentation is used to train the colour model using the Expectation Maximization (EM) algorithm [8,4]. Prediction is introduced after an additional ten frames, and does not require training. After a further ten frames a contrast model is trained, also using EM. The reason for the wait is that the contrast measure risks converging to a poor solution if the training data straddles object boundaries, as is often the case with a segmentation obtained purely by motion. The colour and contrast mixture models are trained only using data which are classified as foreground or background with a degree of certainty over a threshold of 0.6. The update equations for EM can be found in [4,9] and are of closed form for Gaussian mixture models. We have also tried using all data weighted by the current segmentation mask. In effect we then have two sets of missing variables: the first concerns layer membership while the second set describes the assignment to the different components in each Gaussian mixture model. Although this is appealing from a probabilistic viewpoint, we found that EM frequently converges more slowly or fails to converge at all if too many pixels are uncertain during training. Therefore we do not use this weighting in the experiments reported in this paper. Similarly we prefer not to include the uniform component in the pdf when training mixture models because too much data tends to get assigned to the uniform distribution. Furthermore, the contamination was really introduced to provide robustness to outlying measurements; well-spread data should instead be assigned to a Gaussian with high covariance.
³ Interestingly, the manner in which they pose the cue as the probability of measuring the spatial location given the previous mask makes it possible to justify the independence of observations from this cue and those from colour and optic flow.
Due to EM’s well-documented tendency to converge to local minima, it is important to provide the algorithm with a good starting point. This we achieve with k-means clustering, which in turn is initialized by placing cluster centres as follows: with colour, initial estimates are evenly spaced on the 0.6 saturation circle in hue-saturation space, whereas the contrast distributions are initialized at regular intervals along the diagonal direction (1, 1) of its measurement space. The diagonal is chosen because most data lies close to this line; the contrast measures at two scales are highly correlated. Currently we make no attempt to automatically determine the best number of components in the mixture models, although standard techniques such as cross-validation or minimum description length could be used. However, we note that having too many components does not cause problems since components not supported by the data shrink to zero weight α_h and do not have any further effect. For each layer we currently use a mixture of four Gaussians for colour distributions and two for the contrast cue. The approach taken in this paper for updating the colour and contrast distributions follows that of Raja et al. [16]. This procedure is performed separately for colour and contrast. The current cue model, image data and layer assignments are used to compute a new estimate of the model by effectively performing an iteration of the EM algorithm. The degree of membership (i.e. the hidden variables) for each component is computed in an E step, and these values are then used to recompute the model parameters. The updated model is computed as a weighted sum of this and previous models over a fixed time window. The relevant equations may be found in [16,9]. Again, we have extended this adaptation scheme to incorporate the uniform distribution, but do not use it here since we find that better results are obtained when neglecting this component. During training and adaptation we place a lower bound on the eigenvalues of the covariance matrices to avoid numerical instabilities caused by components shrinking to zero covariance.
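As an illustration of the training step only (not the authors’ implementation), the sketch below fits a colour mixture to confidently classified pixels; sklearn’s built-in k-means initialisation stands in for the cluster-centre placement described above, and reg_covar is a crude, hypothetical stand-in for the lower bound on covariance eigenvalues.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_colour_model(hue_sat, certainty, threshold=0.6, n_components=4):
    """Fit a Gaussian mixture in hue-saturation space to samples whose current
    layer assignment exceeds the certainty threshold (0.6 and 4 components as
    stated in the text)."""
    data = hue_sat[certainty > threshold]
    gmm = GaussianMixture(n_components=n_components, covariance_type="full",
                          init_params="kmeans", reg_covar=1e-4)
    return gmm.fit(data)
```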
7 Experimental Results
As yet we only attempt to find a single foreground layer in addition to the background. The complete algorithm runs at approximately 1/3 Hz on a 1.5 GHz Pentium 4 for 164 × 132 images. Of this, 2 seconds are used for motion computation while the remaining cues and the fusion process require 0.5 seconds in total. A further half second is needed for self-organized reweighting of cues in the Bayesian method since all likelihoods must be recomputed. The one-off training of the initial colour and contrast models took 0.5–1 and 0.2–0.5 seconds respectively. The “tennis” sequence ran at 1/13 Hz due to its larger 348 × 276 images. There is significant scope for speeding up this code. Videos of all results can be viewed from the webpage http://www.nada.kth.se/˜hayman/ECCV2002. Experiment 1: The “tennis” sequence. Both the probabilistic and voting integration techniques were tested on 850 frames of the “tennis” sequence in which a player hits some shots while the camera pans. The original sequence and the segmented foreground object are shown in Figure 1. Here the Bayesian integration technique was used. Reliabilities were incorporated via hyper-priors, as described in Section 4.2, and the method
of Section 4.3 was used to recompute cue reliabilities in the probabilistic algorithm according to general consensus. Every 100th frame is shown, and we do not show results from the first frames when the models are being trained. Figure 2 shows the contributions of the different cues. Motion does not provide information in the large textureless regions of the image. Colour successfully distinguishes between the skin of the tennis player and the green background and red playing surface. The colour cue provides no information, unfortunately, for the player’s white T-shirt since these pixels are rejected for colour analysis due to their lack of saturation. However, the contrast cue does a surprisingly good job of latching onto the T-shirt and also her shoes, and as a result the final segmentation is very satisfactory. The pixels on the court which are incorrectly assigned to the foreground straddle the red surface and white line, and the resulting colour is indistinguishable from flesh colour. Results are good despite the motion of the foreground object being far from rigid. Experiment 2: The “mother and daughter” sequence. A second experiment illustrates the self-organization of cues with both Bayesian and voting integration schemes on 240 frames of the “mother and daughter” MPEG-4 test sequence. Figure 3 shows the original sequence and segmentation results. During initialization the mother’s head and hand move. At first the segmentation is poor while only motion is used, but results improve as models for the remaining cues are trained and their associated reliabilities increase. The mother is successfully segmented out from the background throughout the sequence. Also the face of the daughter is included since it shares the same colour characteristics as her mother’s face. The segmentation of the daughter’s hair and clothing is more erratic, largely because it is unclear whether these regions were used for the initial training of colour and contrast models. After a while the hair is assigned to the background, apart from one small region on the top of her head which bears a strong resemblance both in colour and texture to the mother’s hair. Rather than showing the final masks, Figure 3 illustrates the layer assignment, where bright regions indicate pixels which are assigned to the foreground with a high degree of certainty. Results are good with both integration techniques, but note that voting gives a graded solution while the output from the Bayesian method is almost binary, as expected. The sharp results from the Bayesian scheme are more reliable for generating thresholded masks, whereas the voting scheme in some sense gives more meaningful results by clearly indicating regions which are assigned with less certainty due to disagreement of the cues. Figures 4a and b show plots of the evolution of the reliabilities, and 4c and d the outputs from each cue for a particular frame. The Bayesian scheme in particular successfully suppresses the contrast cue, which is in disagreement with the general consensus. The erratic motion in this sequence caused the motion cue to frequently give poor results, but this did not have much effect on the final segmentation since the cue’s associated reliability was low. In an unreported experiment we confirmed that failing to lower the reliability of the motion cue gave an unsatisfactory overall segmentation. Experiment 3: The “silent” sequence.
A final demonstration of our self-organized probabilistic system is given on the 450 frame “silent” MPEG-4 test sequence in Figure 5. The motion is very non-rigid, and an additional challenge is that the background is more cluttered, though static. Even so, the colour and contrast models converge correctly, although a few areas in the background are misclassified since they share the same colour
Fig. 1. (Panels: Original image; Foreground object.) Every 100th frame of the original “tennis” sequence and segmentation results from the Bayesian integration method. A threshold mask from the foreground layer is applied to the original images.
Fig. 2. (Panels: (a) Motion; (b) Colour; (c) Contrast; (d) Prediction; (e) Combined.) Results from the different cues at a frame in the “tennis” sequence. The bright regions correspond to pixels which are classified as foreground with high certainty. The different cues complement each other. For instance, the colour cue is completely uncertain in the player’s white T-shirt, but the contrast cue correctly assigns this region to the foreground.
characteristics as the lady’s top. The self-organized voting scheme also performed well, but space limitations prevent us from presenting those results here.
8 Conclusions
In this paper we have demonstrated how the use of multiple cues improves segmentation results on video sequences. Even using fairly basic techniques for each individual cue, the combined segmentation masks are very accurate, and the output is robust to one
Fig. 3. Segmentation results from the “mother and daughter” MPEG-4 test sequence. The top rows show the original sequence whereas the middle and bottom rows show the belief of membership to the foreground for the Bayesian and voting integration schemes respectively. Bright areas indicate a higher degree of certainty. After the first few frames results improve as additional cues are trained and their reliabilities increase. Both integration schemes used self-organized weighting of cues according to general consensus.
or more of the modalities failing or being completely uncertain. Voting and Bayesian integration techniques were implemented and evaluated. Both performed well. An important aspect of this work in comparison with much previous research in tracking is that colour and contrast distributions were learnt online rather than offline; thus no human interaction is required. During initialization, regions which are classified from the motion cue are used to train Gaussian mixture models for the remaining cues. These models are then used for segmentation in subsequent frames. Thus the approach is entirely data driven.
Fig. 4. Results from the Bayesian and voting schemes for cue integration for the “mother and daughter” MPEG-4 test sequence. (Panels: (a) the evolution of the cue reliabilities for Bayesian cue integration and (b) cue reliabilities in the voting method, both plotted as reliability against frame number for the motion, colour, prediction and contrast cues; (c) cue combination by Bayesian integration and (d) cue combination by voting, each showing Motion, Colour, Contrast, Prediction and Combined outputs.) (a) and (b) show that the contrast cue is assigned low reliability, particularly so in the probabilistic method. The lack of influence of this cue on the final result is demonstrated in (c) and (d) for the final frame of the sequence. The mother and child’s faces are still accepted as foreground. The uncertain cue is completely suppressed in the Bayesian technique and blurs the response in the voting method.
In our future work we intend to incorporate stereo into the framework. Having more cues will allow us to better investigate the outlier rejection properties of voting. Furthermore, since objects close to the camera are likely to be foreground, stereo provides an additional mechanism for the initialization of other cues. Although our work has demonstrated that automatic bootstrapping of cues is certainly possible, we find that it is difficult to recover if the initialization is poor. Therefore reinitialization needs to be addressed, also because it is desirable for additional Gaussians to be included in the mixture models if a new colour or texture becomes visible. Reinitialization could
Fig. 5. Segmentation results from the “silent” MPEG-4 test sequence. (Panels: (a) an original still image; (b) the evolution of the cue reliabilities, plotted as reliability against frame number for the motion, colour, prediction and contrast cues; (c) the foreground person segmented out from the background.) As before, results are poor in the first few frames before models are trained for cues other than motion.
also prevent colour and texture models from getting trapped; a property of bottom-up adaptive systems like ours is that if one cue dominates due to a sharply peaked pdf, other cues adapt also to that data, which is fine if the strong cue is correct, but dangerous otherwise since parts of the foreground, say, may become permanently attached to the background. Currently, motion is the only cue which can prevent such problems, but interestingly this cue was assigned low reliability in the experiments presented here due to non-rigid motion. Additional extensions concern incorporating a more detailed texture model and permitting additional foreground layers to be introduced while the system is running. Acknowledgments. The authors would like to express their gratitude to Michael Black for the use of his code for parametric motion computation [5]. EH was funded by a Marie Curie Fellowship of the European Community programme “Improving Human Research Potential” (Contract no. HPMFCT-2000-00650).
References
1. Y. Altunbasak, P.E. Eren, and A.M. Tekalp. Region-based parametric motion segmentation using color information. Graphical Models and Image Processing, 60(1):13–23, Jan 1998.
2. S. Ayer and H. Sawhney. Layered representation of motion video using robust maximum-likelihood estimation of mixture models and MDL encoding. In Proc. Int. Conf. on Computer Vision, pages 777–784, 1995.
3. S. Belongie, C. Carson, H. Greenspan, and J. Malik. Color- and texture-based image segmentation using the Expectation-Maximization algorithm and its application to content-based image retrieval. In Proc. Int. Conf. on Computer Vision, pages 675–682, 1998.
4. J. Bilmes. A gentle tutorial on the EM algorithm and application to Gaussian mixtures and Baum-Welch. Technical Report TR-97-021, International Computer Science Institute, Berkeley, CA, April 1997.
5. M.J. Black and P. Anandan. The robust estimation of multiple motions: Parametric and piecewise-smooth flow-fields. CVIU, 63(1):75–104, January 1996.
6. C. Bräutigam, J.-O. Eklundh, and H.I. Christensen. A model-free voting approach for integrating multiple cues. In Proc. European Conf. on Computer Vision, 1998.
7. J. J. Clark and A. L. Yuille. Data Fusion for Sensory Information Processing Systems. Kluwer Academic Press, 1990.
8. A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc., 39 B:1–38, 1977.
9. E. Hayman and J.-O. Eklundh. Figure-ground segmentation of image sequences from multiple cues, 2002. The long version of this conference paper is available at http://www.nada.kth.se/˜hayman/ECCV2002.
10. M. Irani and P. Anandan. A unified approach to moving object detection in 2D and 3D scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(6), 1998.
11. M. Irani, B. Rousso, and S. Peleg. Computing occluding and transparent motions. International Journal of Computer Vision, 12(1):5–16, 1994.
12. S. Khan and M. Shah. Object based segmentation of video using color, motion and spatial information. In Proc. Computer Vision and Pattern Recognition, pages II:746–751, 2001.
13. D. Kragić. Visual Servoing for Manipulation: Robustness and Integration Issues. PhD thesis, Royal Institute of Technology (KTH), Stockholm, Sweden, 2001.
14. J. Malik, S. Belongie, J. Shi, and T. Leung. Textons, contours and regions: Cue integration in image segmentation. In Proc. Int. Conf. on Computer Vision, 1999.
15. P. Nordlund and J.-O. Eklundh. Towards a seeing agent. In First International Workshop on Cooperative Distributed Vision, Kyoto, Japan, pages 93–123, 1997.
16. Y. Raja, S.J. McKenna, and S. Gong. Colour model selection and adaptation in dynamic scenes. In Proc. European Conf. on Computer Vision, pages 460–474, 1998.
17. H.S. Sawhney, Y. Guo, and R. Kumar. Independent motion detection in 3D scenes. IEEE Trans. on Patt. Analysis and Machine Intelligence, 22(10):1191–1199, October 2000.
18. J. Sherrah and S. Gong. Continuous global evidence-based Bayesian modality fusion for simultaneous tracking of multiple objects. In Proc. Int. Conf. on Computer Vision, pages II:42–49, 2001.
19. J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. on Patt. Analysis and Machine Intelligence, 22(8), Aug 2000.
20. M. Spengler and B. Schiele. Towards robust multi-cue integration for visual tracking. In Computer Vision Systems, Vancouver, BC, July 2001.
21. H. Tao, H.S. Sawhney, and R. Kumar. Dynamic layer representation with applications to tracking. In Proc. Computer Vision and Pattern Recognition, pages II:134–141, 2000.
22. H. Tao, H.S. Sawhney, and R. Kumar. A global matching framework for stereo computation. In Proc. Int. Conf. on Computer Vision, pages I:532–539, 2001.
23. P.H.S. Torr, R. Szeliski, and P. Anandan. An integrated Bayesian approach to layer extraction from image sequences. IEEE Trans. on Patt. Analysis and Machine Intelligence, 23(3):297–303, March 2001.
24. K. Toyama and E. Horvitz. Bayesian modality fusion: Probabilistic integration of multiple vision cues for head tracking. In Proc. Asian Conference on Computer Vision, 2000.
25. K. Toyama and Y. Wu. Bootstrap initialization of nonparametric texture models for tracking. In Proc. 6th European Conf. on Computer Vision, Dublin, 2000.
26. J. Triesch and C. von der Malsburg. Self-organized integration of adaptive visual cues for face tracking. In Proc. Int. Conf. on Automatic Face and Gesture Recognition, Grenoble, France, 2000.
27. J.Y.A. Wang and E.H. Adelson. Spatio-temporal segmentation of video data. In SPIE: Image and Video Processing II, San Jose, Feb 1994.
28. Y. Wu and T.S. Huang. A co-inference approach to robust visual tracking. In Proc. Int. Conf. on Computer Vision, pages II:26–33, 2001.
Bayesian Estimation of Layers from Multiple Images
Y. Wexler, A. Fitzgibbon and A. Zisserman
Robotics Research Group, Department of Engineering Science, University of Oxford, Oxford, OX1 3PJ
{wexler, awf,
[email protected]
Abstract. When estimating foreground and background layers (or equivalently an alpha matte), it is often the case that pixel measurements contain mixed colours which are a combination of foreground and background. Object boundaries, especially at thin sub-pixel structures like hair, pose a serious problem. In this paper we present a multiple view algorithm for computing the alpha matte. Using a Bayesian framework, we model each pixel as a combined sample from the foreground and background and compute a MAP estimate to factor the two. The novelties in this work include the incorporation of three different types of priors for enhancing the results in problematic scenes. The priors used are inequality constraints on colour and alpha values, spatial continuity, and the probability distribution of alpha values. The combination of these priors results in accurate and visually satisfying estimates. We demonstrate the method on real image sequences with varying degrees of geometric and photometric complexity. The output enables virtual objects to be added between the foreground and background layers, and we give examples of this augmentation to the original sequences.
1 Introduction
An important requirement for the film and broadcast industry is the insertion of virtual objects into real footage, and in turn this requires generating the correct occlusions of inserted objects by real objects in the sequence. To make an accurate composite, foreground and background layers must be estimated, and in particular, pixels which contain mixed foreground and background colour (such as at object boundaries) must be accurately modeled. This problem is difficult, as the boundary may be very complex (it is commonly required to deal with hair, for example). This paper is concerned with the automatic extraction of occlusion masks (or “alpha mattes”) which record at each pixel the proportions in which foreground and background combine to generate the image. The contribution is in the use of relative motion of the foreground and background to estimate a transparency value (or ‘alpha’) at each pixel. As the camera moves, a given foreground pixel will sweep over several background pixels, and so a constraint is available on the foreground’s colour. The problem is poorly constrained, as foreground and background colours may be similar, so it is important to incorporate prior knowledge in order to regularize the solution. We develop a Bayesian framework (Section 3.2) which allows priors to be progressively added to
Fig. 1. Input sequences. (a) Three frames from sequence “Railings”. We wish to extract the foreground railings and an associated alpha mask for each image in the sequence in order to replace the background building. (b) Three out of ten images used in the “Monkey” sequence. (c) shows the composition of the monkey between the railing and the background building.
the information obtained from the relative motion. These priors include bounds on the image and alpha values, learned distributions of alpha values, and spatial consistency taking account of image edges (Sections 3.4 – 3.6). We demonstrate that this combination of priors facilitates the extraction of accurate alpha mattes from sequences where the foreground and background objects are approximately planar, and compare with ground truth (Section 5).
2 Background
The use of mattes to selectively obscure elements in film predates the digital computer [1]. In the 40’s, complex shots combining real actors and virtual backgrounds were executed by painting the virtual components on glass. The digital equivalent is to
associate a value α with every pixel, which represents the opacity of the pixel. These values are stored in the image’s alpha channel. Foreground and background are combined via the compositing equation

C = αF + (1 − α)B,

where F is the RGB vector at the foreground pixel, and α is the opacity associated with that foreground value, B is the background RGB and the composite image pixel is C. To recover alpha mattes for real objects, two main strategies exist in the literature, one based on segmentation, the other on motion. Segmentation based techniques use a property of the background and foreground regions—for example statistics on their colour distributions—to assign each pixel a likelihood of being foreground or background. The simplest such schemes are based on shooting against a blue or green background. Then the colour distribution for the background is very well specified, and the problem is simplified. However, it is not always possible to shoot against a blue screen, so newer techniques use more sophisticated models to segment against real, a priori unknown, backgrounds—with impressive results [5, 10]. However, computing such segmentations on the hundreds of frames required in a typical cinema special effects shot requires significant human effort. In a moving shot, on the other hand, more constraints are available from the relative motion of foreground and background, so it is in principle possible to automatically extract the matte. We now review previous work on motion-based methods. In mosaicking papers which estimate the background layer by the dominant motion (e.g. see [7]) it is usual to register the background layers and superimpose the foreground (e.g. for a synopsis mosaic). Here it will be more convenient to stabilize the foreground, and have the background sweep past. The key constraint which motion adds is that foreground pixels are seen against different background pixels. Indeed we can see motion as being analogous to having several images of the same (foreground) object against different coloured backgrounds, a situation which was shown by Smith and Blinn [13] to be required for successful matte extraction. Szeliski et al [14] considered the automatic extraction of foreground and background layers combined with a single (rather than per-pixel) alpha value. Their approach allows the simultaneous estimation of the source images and the image motion, for a pair of translucent planes. Szeliski and Golland [15] compute 3D reconstructions which include alpha values at each point in a 3D voxel grid, but the difficulty of computing such a dense sampling means that their results are not yet suitable for compositing.
3 The Problem The problem this paper solves is the following. We assume that we are given a sequence of images comprising the composite image C and the motion parameters of two objects B and F . In this work, we will assume that the motion is modeled by a plane projective transformation, so that the motion parameters are given by 3 þ 3 homogeneous transformation matrices H b and H f . We may assume that the sequence has been registered
using H f , so that the foreground object is fixed in the image, and the background is sweeping past it. Further we assume that the background has been extracted by registering the background plane so that the background image B is known. Image formation for the composite is assumed to be the following. The measured value at each pixel is some linear combination of background and (unknown) foreground: b C = ÿF + (1 ÿ ÿ)B (H ) (1) In turn, F and B are integrals over the (unknown) area of some CCD sensor. Given that the background is known, this gives an equation for the two unknowns ÿ and F for a monochrome image. In the case of colour images the ÿ value is assumed to apply equally to each of the RGB channels. This results in three equations in four unknowns at each pixel. As will be seen these equations are often not independent. Clearly, in a single image, the values of ÿ and F are conflated. A foreground pixel can be mixed with a background for two main reasons. Either the foreground object is transparent or the foreground object does not occupy a whole pixel and so the pixel samples both foreground and background colours. In this case, the alpha value is related to the proportion of the pixel’s cone of view occupied by the foreground object. It will be useful to keep in mind the qualitative range of possibilities for ÿ and F . For example: suppose the background is known to be white and in one image we see a pixel which is a 50% combination of white and red. Then there are three general possibilities for the foreground object: 1. The object is opaque pink 2. The object is red, but 50% transparent 3. The object is opaque red, but does not cover the whole background. In order to resolve this ambiguity we can use multiple views of the scene, so that each foreground pixel is seen against different backgrounds. figure 2 illustrates the set of constraints that three views provides for a single foreground pixel, which will now be formally defined. 3.1 Problem Definition We are given N images þ1 : : : þN , and wish to compute the alpha mask for a reference view, say þ1 . We use this reference view to induce the geometrical parametrization that will be described in section 4. We have dense motion correspondence for the background between the views, and the images are registered so that there is no motion of the foreground object throughout the sequence. The background motion is such that for each pixel x in the reference view we know the corresponding pixels xi in the ith image that corresponds to a 3D foreground point (noted F in figure 2). We can rewrite (1) for each pixel in each view to give the generative model: þ
C_i(x) = α(x) · F(x) + (1 − α(x)) · B_i(x)        (2)

where B_i(x) = B(H_i^b x) is the colour measurement of the point of intersection of the ray from the i-th camera centre to F(x) (see figure 2), as measured in image C_i. Thus at
Fig. 2. Multiview camera setting. Each possible foreground point F is projected onto the ith image, combined with a different background colour Bi . When the measured Bi s are different the foreground colour and transparency can be estimated. Note, this situation is equivalent to that of using different colour background screens [13].
each pixel we have N equations in 2 unknowns for monochrome, or 3N equations in 4 unknowns for colour pixels. Note that this model assumes that the transparency and colours are independent of the viewing parameters, which is true for a Lambertian scene and when the occluding layer’s depth is small compared to the distance from the cameras. In practice, these approximations are often good for real world sequences.

3.2 Bayesian Framework
Although the compositing equation (2) is overconstrained for three or more views, the estimation of α is still poorly conditioned. If a foreground pixel is moving over a uniform section of background, the colours B_i and the final composite values C_i will all be similar, so that the N constraints reduce to just one. Therefore, the solution must be regularized, for which priors on the system parameters must be introduced. These priors are easily treated in a Bayesian framework, where the probability of the ensemble of observed images C is the product of the probability of the observations given the parameters {α, F}, and the prior probabilities of α and F

p(C_1 .. C_N) = p(C_1 .. C_N | α, F) p(α) p(F)
The likelihood (which is the first term in the product) is obtained from the error in the model (2), and modeling the noise process as a zero-mean Gaussian, gives the energy function (after taking logs)
L(α, F) = Σ_{i=1}^N ∫∫_{x∈R²} ‖ C_i(x) − ( α(x) · F(x) + (1 − α(x)) · B_i(x) ) ‖² dx        (3)
To this (log) likelihood L are added the cost functions for the priors R_α and R_F, yielding the combined cost

E(α, F) = L(α, F) + R_α(α) + R_F(F)        (4)
If there are p pixels in the image, then the cost is a function of 4p variables; p for the α matte, and 3p for the foreground colour. In the following sections we first derive the maximum likelihood estimate which minimizes (3) alone, and then derive and demonstrate three different priors that can be used to regularize the result, giving the MAP result as each is incorporated.

3.3 Maximum Likelihood
Assuming isotropic and homogeneous image noise with zero mean and standard deviation σ, the maximum likelihood solution is given by the least squares solution of (2). Because the pixels do not interact at this stage, the solution may be computed independently at each pixel, and this allows an efficient linear solution. Equation (2) is bilinear in the two unknowns α(x) and F(x) and so we have four times as many unknowns as pixels in the reference image. A reliable estimate for the transparency α results in a good estimate for the foreground colour F(x) and so this equation can be solved by introducing a new variable u(x) = α(x) F(x) and factoring it as a second step. The equation then becomes:
C_i(x) = u(x) + v(x) · B_i(x)        (5)
For simplicity we will write the equations for the monochrome case. Let B = [b_1 ... b_N]^T and C = [c_1 ... c_N]^T be the vectors of background and composite measurements—at a single pixel—and let A_{N×2} = [1 B] be a matrix whose left column contains ones. Writing (5) in matrix form gives

\begin{bmatrix} 1 & b_1 \\ \vdots & \vdots \\ 1 & b_N \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} c_1 \\ \vdots \\ c_N \end{bmatrix}        (6)

which we shall write as

A [u v]^T = C        (7)
This overconstrained linear system can be solved in the least squares sense using the pseudo-inverse, giving [u v]^T = A⁺C. We then substitute back to get the transparency α = 1 − v and foreground colour F = u/α, which is valid only for visible places (i.e. α > 0). In the case of colour images, A is 3N × 4. This solution assumes that there is no noise in the recovery of the background B. As the background is recovered from the images, we know that it does have the same noise characteristics (even if dampened by the mosaic) and so an alternative is to use total least squares (which is equivalent to line fitting in the foreground/composite space). The solution then minimizes the distance to both C and B. To this end, we solve the following system where the mean is first subtracted from c_i and b_i
\begin{bmatrix} -\tilde{c}_1 & 1 & \tilde{b}_1 \\ \vdots & \vdots & \vdots \\ -\tilde{c}_N & 1 & \tilde{b}_N \end{bmatrix} \begin{bmatrix} u \\ v \\ w \end{bmatrix} = 0        (8)
Fig. 3. Maximum Likelihood solution. (Panels: (a) 1 − α; (b) zoomed; (c) zoomed α; (d) composite; (e) zoomed.) The alpha mask is computed for each pixel independently by solving (8) and out-of-range values of α are clipped to the interval [0, 1]. Some residuals are still apparent where the background is uniform, as seen especially in the transparent region. Colour spill from the background can be observed, i.e. light brown colour from the background is showing on the perimeter of the foreground.
The solution is then obtained using singular value decomposition by taking the singular vector belonging to the smallest singular value. In both methods weights can be incorporated into the system by premultiplying by a diagonal matrix W, and we use these to form the spatial prior explained below. Figure 3 shows the results of this computation for the railings sequence. Section 4 describes how the images are registered and the background extracted, but here we shall just assume that these are given. The alpha and foreground values computed are clamped to the range [0, 1] and then the compositing equation is used to replace the background with a checkerboard pattern. We note that while the computation is largely accurate, the alpha values are noisy, which induces a ghostly image of the original background. The reason for this instability can be understood as follows: suppose the background colour is the same over N images (e.g. if the rays in figure 2 happen to intersect a locally homogeneous colour region of the background), then the resulting N × 2 matrix A in (7) will have rank one instead of rank two. Consequently, there is a one parameter family of solutions for α and F for that pixel. It is exactly this situation which the priors, described in the subsequent sections, resolve.
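As an illustration of the per-pixel maximum likelihood step (ours, not the authors’ code), the sketch below solves equations (6)–(7) by ordinary least squares and adds a total-least-squares variant posed as an orthogonal line fit of the composite against the background; the exact normalisation of the homogeneous solution of (8) is our assumption.

```python
import numpy as np

def solve_pixel_ls(c, b):
    """Least-squares solution of C_i = u + v * B_i at one pixel, then
    alpha = 1 - v and F = u / alpha."""
    c, b = np.asarray(c, float), np.asarray(b, float)
    A = np.column_stack([np.ones_like(b), b])
    (u, v), *_ = np.linalg.lstsq(A, c, rcond=None)
    alpha = np.clip(1.0 - v, 0.0, 1.0)
    return alpha, (u / alpha if alpha > 0 else 0.0)

def solve_pixel_tls(c, b):
    """Total-least-squares variant (cf. equation (8)): orthogonal fit of the
    line C = u + v * B, treating both C and B as noisy."""
    c, b = np.asarray(c, float), np.asarray(b, float)
    X = np.column_stack([b - b.mean(), c - c.mean()])
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    db, dc = Vt[0]                         # principal direction of the point cloud
    v = dc / db if abs(db) > 1e-12 else 0.0
    u = c.mean() - v * b.mean()
    alpha = np.clip(1.0 - v, 0.0, 1.0)
    return alpha, (u / alpha if alpha > 0 else 0.0)
```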
Fig. 4. Sample image with ground truth α-mask. (Panels: (a) sample image; (b) alpha mask; (c) alpha distribution.) The image is a zoomed portion of the first image of figure 1. The image and the ground truth are discussed in Section 5. The distribution of α for this figure is plotted in 4(c) with a super-imposed graph of a beta distribution.
3.4 Prior 1: Bounded Reconstruction
The first prior we apply is that the domain of α and F is bounded [5]. Our image measurements are normalized for the unit range and so is α. We thus add the following to the above constraints:

0 ≤ α ≤ 1,    0 ≤ F ≤ 1
Here F can be either gray-scale or a vector of image measurements (normally red, green and blue components). The optimization is no longer closed form, but as a quadratic problem subject to inequality constraints is relatively easily solved using quadratic programming [4].

3.5 Prior 2: Measured α Distribution
In a typical image the transparency is fairly constrained. Objects occupy continuous regions in the image and the background fills the rest. For the most part, only pixels on the boundary will have fractional transparency. As the count of boundary pixels is proportional to the square root of the foreground area on “average” images we expect to have far fewer of these. On images where the foreground object has more complex boundaries such as hair or fur, the boundary pixel count as a function of area will go as the fractal dimension of the curve, but will retain the tendency for most α values to be near 0 or 1. In order to incorporate this constraint, we obtained ground truth mattes for a variety of images, and computed the histogram of alpha values. Figure 4 shows a typical measured histogram. Examining the figure, we observe that the distribution of alpha values is approximated by a beta distribution and so we can incorporate this knowledge as a prior in our system by defining the following (per-pixel) error function:

E_d = (1/n) Σ_i (αF + (1 − α)B − C_i)² + λ β(a, b)⁻¹ α^{a−1} (1 − α)^{b−1}        (9)
Where the sum is over all samples that contain this information (some images may not contain this pixel) and a, b are constants that depend on the fractal dimension of the foreground shape. The coefficient β(a, b)⁻¹ equals Γ(a + b)/(Γ(a)Γ(b)), where Γ is the gamma function, and normalizes the integral of the prior to unity. In this paper we use a = b = 1/4, which produces the graph in figure 4. The parameter λ is not learned from data, but is specified in order to balance the prior and error terms. Figure 5 shows the result of optimization of the new regularized error for several values of λ. The noisy alpha values over the background have been removed and the transitional values at the foreground-background boundary are sharpened. However the boundary remains relatively noisy, as the straight vertical edges of the railings show.

3.6 Prior 3: Spatial Consistency
The spatial dependency in image measurements (pixels) is an important prior in many computer vision algorithms. However, describing the relationship between the colour values is fraught with difficulties. Often in Markov random field models a prior encouraging spatial continuity of the image is added to the energy function [12], i.e. a prior of the form R_F(x) = Σ_{y near x} ‖F(x) − F(y)‖². However, this leads to smoothing in textured areas, and reduces the definition at boundaries. On the other hand, applying such a prior to the alpha channel is far more reasonable: a piecewise constant image model is sensible for alpha providing we do not smooth over boundaries. In order to implement this, we impose a directional prior on α smoothness, using the image gradient as a guide. It is most likely that the actual alpha map agrees with some of the contours in the reference image and this is the prior that we want to use. The gradients of the intensity image C are computed by convolving with horizontal and vertical difference operators. At each pixel the local gradient is the vector g = (∂C/∂x, ∂C/∂y)^T. The weight of a contribution to pixel x from a neighbouring pixel x + p is

W_p = exp( −(p^T G p) / (p^T p) )

where G = g g^T. This may be incorporated in the MAP (4) as a prior on α

Σ_x Σ_{‖p‖ ≤ 1} ( W_p (α(x + p) − α(x)) )²

but an efficient approximation is obtained by modifying the total least squares system (8). The system is augmented to include the neighbouring pixels as additional foreground and background estimates, forming a 9N × 3 system (assuming a 3 × 3 neighbourhood), with the contributions weighted as above. Solving the augmented system yields α values which have been smoothed along the edges, but not across [11]. A similar mechanism is used in optic flow computation [3, 8]. Figure 6 shows the result of applying this prior to the test sequence, and illustrates the improvement in edge consistency along the straight sections, without typical smoothing artefacts such as blunting of the spikes.
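A small sketch of the directional weighting (ours, not the paper’s implementation) is given below; grad_x and grad_y are assumed to be precomputed image derivatives, and only the 3 × 3 neighbourhood of a single pixel is handled.

```python
import numpy as np

def directional_weights(grad_x, grad_y, x, y):
    """Directional smoothing weights W_p = exp(-p^T G p / p^T p), with G = g g^T.
    Neighbours lying across a strong edge (p parallel to g) receive small weight;
    neighbours along the edge keep weight close to one."""
    g = np.array([grad_x[y, x], grad_y[y, x]], float)
    G = np.outer(g, g)
    weights = {}
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dx == 0 and dy == 0:
                continue
            p = np.array([dx, dy], float)
            weights[(dx, dy)] = np.exp(-(p @ G @ p) / (p @ p))
    return weights
```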
Fig. 5. Prior 2, on the distribution of α. (Rows shown for λ = 0, λ = 0.1, λ = 0.5 and λ = 1.) As the prior on the distribution of α increases, the mask becomes more binary. The figure is organized in columns: the left column shows the alpha mask, where white marks transparent pixels and black marks opaque ones. Next is a zoomed portion of the mask on the left. The third column demonstrates the use of the mask for composition on a checkerboard pattern to emphasize both ‘overshooting’ and ‘undershooting’ of the values. The fourth column shows the alpha map as a height field. The reconstruction is shown for four values of the prior weight λ. λ = 0 is the MLE solution for which transparency is evident. As λ increases, the matte is driven towards a binary image where pixels are labelled as purely foreground or background. Note that for λ = 1, the mask is binary and a halo is visible around the spearhead. This is evident in color images.
Fig. 6. Prior 3, directional. (Panels: (a) directional; (b) zoomed; (c) composite; (d) zoomed.) The reconstruction is regularized by adding a prior on α consistency, guided by the image gradient. The resulting mask is cleaner in the uniform areas and is not smoothed.
4 Implementation
In order to supply the data for (2) we need to relate the image measurements spatially, i.e. identify corresponding pixels across the images. If both the foreground and the background are planar the scene structure can be described by two homographies per input image, even if the objects are independently moving. When the foreground object has general shape we represent it as a depth map for the reference image as described later. The input to the algorithm is a set of views. We choose one as a reference view, performing all computations in a world coordinate system which is attached to this view. In particular, the map is defined with respect to this view. This confers a number of practical advantages: the resolution of the 3D representation is automatically matched to the problem, and degradation due to resampling is minimized. We compute the projective transformations that align the foreground and background objects from each view onto the reference view. This has the advantage that accurate algorithms are available which can perform such registration (see below). Let H_i^f : C_r ↦ C_i be the 3 × 3 homographies mapping pixels in the reference image onto pixels in the i-th image through the foreground object. Similarly, let H_i^b : C_r ↦ C_i be the homography mapping pixels from the reference view onto the i-th view through the background plane. Assuming that the background image B is available, and is aligned with the reference image, we can now supply (2) with data measurements as follows:

C_i(H_i^f x) = α · F + (1 − α) · B(H_i^b (H_i^f)⁻¹ x)        (10)
This is solved using the different priors as described above.

4.1 Recovering the Background
Sometimes the background can be obtained using a photograph where the foreground object has been physically removed, and a “clean plate” has been captured. In such a case the background plate must be registered to the reference frame using, say, point correspondences between the reference and the background plate. When a clean plate is not available, as in our test sequences, it must be recovered from the data. Given the background homographies H_i^b and a rough segmentation of the scene we can compute the background image B via a mosaic [7, 9]. Setting mosaic pixels by taking the median colour value gives the results shown in figure 7(b). We compute and store the background in the reference coordinate frame (i.e. C_r − B = 0 on background regions).
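The median mosaic can be sketched as follows (our illustration). The per-frame warp into reference coordinates (via the recovered background homographies) is assumed to be provided by hypothetical warp functions, with pixels outside the frame or masked as foreground set to NaN.

```python
import numpy as np

def median_background(frames, warps):
    """Background image as the per-pixel median of the frames registered to the
    reference view; NaN entries (missing / foreground pixels) are ignored."""
    stack = np.stack([warp(frame) for frame, warp in zip(frames, warps)])
    return np.nanmedian(stack, axis=0)
```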
Fig. 7. The “Railings” sequence. 7(a) is the reference frame and the background image 7(b) is a mosaic of the background regions in all seven images.
4.2 Aligning the Images
For the “Railings” images in figure 7, we used interest-point matching to get an initial 3D reconstruction [2]. This gives initial camera matrices and 3D points. From this reconstruction we extracted two dominant planes—one including the railings and the other the background building. These planes induce two homographies which are used for a rough background/foreground segmentation. The background plane homography
is further refined using a direct minimization over image intensities [7, 9] with a robust kernel [6] on the intensity comparison. In this case the direct method produces a very good alignment of the background portion of the various images. We use these homographies to compute an image mosaic containing only background pixels.
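As an illustration of what such a direct, robustified photometric criterion can look like, the sketch below evaluates a Huber-weighted intensity cost for a candidate homography. It is our own simplification, not the authors' implementation: nearest-neighbour sampling stands in for proper interpolation, the Huber kernel stands in for whatever robust kernel [6] was actually used, and the optimisation over H is not shown.

import numpy as np

def huber(r, k=10.0):
    """Huber robust kernel: quadratic for small residuals, linear beyond k."""
    a = np.abs(r)
    return np.where(a <= k, 0.5 * r**2, k * (a - 0.5 * k))

def robust_alignment_cost(ref, img, H):
    """Robust photometric cost of warping `img` onto `ref` by the 3x3 homography H
    (grayscale images; H maps reference pixels to pixels of `img`)."""
    h, w = ref.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs, ys, np.ones_like(xs)]).reshape(3, -1).astype(np.float64)
    q = H @ pts
    u = np.rint(q[0] / q[2]).astype(int)
    v = np.rint(q[1] / q[2]).astype(int)
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    r = img[v[valid], u[valid]] - ref.ravel()[valid]
    return huber(r).sum()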
5 Comparison with Ground Truth

In this section we compare the proposed methods to a ground-truth image. In order to obtain a sequence for which ground truth could be extracted, we locked off the camera and foreground object (a furry animal), and moved a textured background plane behind. This sequence emulates the more complex case of the previous example, but allows us to compute ground truth. The set of test images is shown in figure 1. In addition, two further images were obtained of the object against two different backgrounds of near-constant colour. Combining these images using a technique based on [13] allowed a ground-truth mask to be obtained with relative ease. Figure 8(b) illustrates the recovered ground-truth mask. Comparison with figure 8(d) shows that simply requiring a bounded reconstruction as in Section 3.4 yields an alpha matte which is visibly far from the ground truth. The subsequent figures illustrate that the beta-distribution prior allows for somewhat more veridical reconstructions. On the other hand, the directional prior has been confused by the wealth of interior edges in the animal's fur, and simply weakens the constraints on the reconstruction, yielding a visibly poorer mask. The next section discusses these points.
6 Conclusions

We have demonstrated the automatic extraction of alpha mattes from multiple images of moving planes, or of a static 3D scene. By expressing the problem as maximum a posteriori (MAP) estimation, the poorly constrained problem has been successfully regularized, allowing accurate estimation of colour and transparency. Although the paper shows that adding appropriate priors can allow a clean matte to be successfully pulled from a video sequence, offering a significant reduction of effort over single-frame methods, there remains the issue of how to set the priors. As shown in the examples, different combinations of weighting between smoothness and directional priors are required for success on different sequences. Some rules of thumb can be discerned. For example, prior 2, on the distribution of α, controls the sharpness of the resulting mask, and according to our experiments it performs well when β is in the range 0.2 to 0.5. Finally, however, the quality of the pulled matte can only be evaluated by the operator. This situation should not be a cause for dismay, however. The goal of our algorithm is to reduce effort on the part of the matte puller, and in that it succeeds well. Current and future work is concerned with two issues. Integration of the multiple-view constraint with the successful modern single-view methods [5, 10] is expected to provide valuable complementary information, as single-view methods work best on background images with well constrained image statistics, for example constant backgrounds, which are the source of ill conditioning in our method. Secondly, although
Fig. 8. Ground truth for the Monkey sequence: (a) blue screen, (b) ground truth, (c) input image, (d) Bounded, (e) Directional, (f) Beta: β = 0.1, (g) Beta: β = 0.5, (h) Beta: β = 1.0; compositions: (i) Beta: β = 0.1, (j) Beta: β = 0.5, (k) Beta: β = 1.0. See section 5 for details.
the theory described here applies to general scenes (i.e. nonplanar), and techniques are available to compute the pixel correspondences in these cases, much work remains before the general problem can be considered solved.
Acknowledgements Funding for this work was provided by a DTI/EPSRC Link grant. AF would also like to thank the Royal Society for its generous support.
References
1. C. Barron. Matte painting in the digital age. In Proc. SIGGRAPH (sketches), 1998. http://www.matteworld.com/projects/siggraph01.html.
2. P. Beardsley, P. Torr, and A. Zisserman. 3D model acquisition from extended image sequences. In Proc. ECCV, LNCS 1064/1065, pages 683–695. Springer-Verlag, 1996.
3. M. J. Black and P. Anandan. A framework for the robust estimation of optical flow. In Proc. ICCV, pages 231–236, 1993.
4. P. T. Boggs and J. W. Tolle. Sequential quadratic programming. In Acta Numerica, pages 1–51, 1995.
5. Y.-Y. Chuang, B. Curless, D. Salesin, and R. Szeliski. A Bayesian approach to digital matting. In Proc. CVPR, 2001.
6. P. J. Huber. Robust Statistics. John Wiley and Sons, 1981.
7. M. Irani, B. Rousso, and S. Peleg. Detecting and tracking multiple moving objects using temporal integration. In G. Sandini, editor, Proc. ECCV, pages 282–287. Springer-Verlag, 1992.
8. H. H. Nagel. Extending the 'oriented smoothness constraint' into the temporal domain and the estimation of derivatives of optical flow. In O. D. Faugeras, editor, Proc. ECCV, pages 139–148. Springer-Verlag, 1990.
9. S. Peleg and J. Herman. Panoramic mosaics by manifold projection. In Proc. CVPR, 1997.
10. P. Perez, A. Blake, and M. Gangnet. JetStream: Probabilistic contour extraction with particles. In Proc. ICCV, pages 424–531, 2001.
11. P. Perona and J. Malik. Scale-space and edge detection using anisotropic diffusion. IEEE PAMI, 12(7):629–639, July 1990.
12. M. Shizawa and K. Mase. A unified computational theory for motion transparency and motion boundaries based on eigenenergy analysis. In Proc. CVPR, pages 289–295, 1991.
13. A. R. Smith and J. F. Blinn. Blue screen matting. In Proc. SIGGRAPH, pages 259–268, 1996.
14. R. Szeliski, S. Avidan, and P. Anandan. Layer extraction from multiple images containing reflections and transparency. In Proc. CVPR, volume 1, pages 246–253, 2000.
15. R. Szeliski and P. Golland. Stereo matching with transparency and matting. In Proc. ICCV, pages 517–524, 1998.
A Stochastic Algorithm for 3D Scene Segmentation and Reconstruction Feng Han, Zhouwen Tu, and Song-Chun Zhu Dept. of Comp. and Info. Sci., Ohio State Univ., Columbus, OH 43210, USA {hanf, ztu, szhu}@cis.ohio-state.edu,
Abstract. In this paper, we present a stochastic algorithm by effective Markov chain Monte Carlo (MCMC) for segmenting and reconstructing 3D scenes. The objective is to segment a range image and its associated reflectance map into a number of surfaces which fit various 3D surface models and have homogeneous reflectance (material) properties. In comparison to previous work on range image segmentation, the paper makes the following contributions. Firstly, it is aimed at generic natural scenes, indoor and outdoor, which are often much more complex than most of the existing experiments in the "polyhedra world". Natural scenes require the algorithm to automatically deal with multiple types (families) of surface models which compete to explain the data. Secondly, it integrates the range image with the reflectance map. The latter provides material properties and is especially useful for surfaces of high specularity, such as glass, metal, and ceramics. Thirdly, the algorithm is designed by reversible jump and diffusion Markov chain dynamics and thus achieves globally optimal solutions under the Bayesian statistical framework. Thus it realizes cue integration and multiple model switching. Fourthly, it adopts two techniques to improve the speed of the Markov chain search: one is a coarse-to-fine strategy and the other is data-driven techniques such as edge detection and clustering. The data-driven methods provide important information for narrowing the search spaces in a probabilistic fashion. We apply the algorithm to two datasets and the experiments demonstrate robust and satisfactory results on both. Based on the segmentation results, we extend the reconstruction of surfaces behind occlusions to fill in the occluded parts.
1 Introduction
Recently there has been renewed and growing interest in computer vision research for parsing and reconstructing 3D scenes from range images, driven by new developments in sensor technologies and new demands in applications. Firstly, high precision laser range cameras are becoming accessible to many users, which makes it possible to acquire complex real world scenes like those displayed in Fig. 1. There are also high precision 3D Lidar images for terrain maps and city scenes with up to centimeter accuracy. Secondly, there are new applications in graphics and visualization, such as image based rendering and augmented reality, and in spatial information management, such as constructing spatial temporal
databases of 3D urban and suburban maps. All these require the reconstruction of complex 3D scenes, for which range data are much more accurate than other depth cues such as shading and stereo. Thirdly, range data are also needed in studying the statistics of natural scenes [8] for the purposes of learning realistic prior models for real world imagery as well as for understanding the ecological influences of the environment on biological vision systems. For example, a prior model of 3D scenes is useful for many 3D reconstruction methods, such as multiview stereo, space carving, shape recovery from occlusion, and so on.
A): a reflectance map of office A; B): a reflectance map of office B; C): a reflectance map of a street scene; D): a reflectance map of a cemetery
Fig. 1. Examples of indoor and outdoor scenes from the Brown dataset. A, B, C and D are reflectance images I_Λ on a rectangular lattice Λ. The laser range finder scans the scene in cylindrical coordinates and produces panoramic views of the scenes.
In contrast to the new developments and applications, current range image segmentation algorithms are mostly motivated by traditional applications in recognizing industrial parts on an assembly line. Therefore these algorithms only deal with polyhedral scenes. In the literature, algorithms for segmenting intensity images have been introduced or extended to range image segmentation, e.g., edge detection [9], region growing methods [5,1], and clustering [6,4]. An empirical comparison study was reported jointly by a few groups in [7]. Generally speaking, algorithms for range segmentation are not as advanced as those for intensity image segmentation, perhaps due to the previous lack of complex range datasets. For example, there is no existing algorithm in the literature which can automatically segment complex scenes such as those displayed in Fig. 1. In this paper, we present a stochastic algorithm by effective Markov chain Monte Carlo (MCMC) for segmenting and reconstructing 3D scenes from laser range images and their associated reflectance maps. In comparison to previous work on range image segmentation, the paper makes the following contributions. Firstly, to deal with the variety of objects in real world scenes, the algorithm introduces multiple types of surface models, such as planes and conics for man-made objects, and splines for free-form objects. These surface models compete to explain the range data under some model complexity constraints.
The algorithm also introduces various prior models on surfaces, boundaries, and vertices (corners) to achieve robust solutions from noisy data. Secondly, the algorithm integrates the range data with the associated reflectance map. The reflectance measures the proportion of laser energy returned from the surface, in [0, 1], and therefore carries material properties. It is especially useful for surfaces of high specularity, for example glass, metal, and ceramics, and crucial for surfaces at infinity, such as the sky, where no laser ray returns. The range data and reflectance map are tightly coupled and integrated under the Bayes framework. Thirdly, the algorithm achieves globally optimal solutions in the sense of maximizing a Bayesian posterior probability. As the posterior probability is distributed over subspaces of various dimensions due to the unknown number of objects and their types of surface models, ergodic Markov chains are designed to explore the solution space. The Markov chain consists of reversible jumps and stochastic diffusions. The jumps realize split and merge and model switching, and the diffusions realize boundary evolution and competition and model adaptation. Fourthly, it adopts two techniques to improve the speed of the Markov chain search. One is a coarse-to-fine strategy which starts with segmenting large surfaces, such as sky, walls, and ground, then proceeds to objects of medium size, such as furniture, people, etc., and then to small objects such as cups, books, etc. The other is data-driven techniques such as edge detection and clustering. The data-driven methods provide important heuristic information, expressed as importance proposal probabilities [13], on the surfaces and boundaries for narrowing the search spaces in a probabilistic fashion. We apply the algorithm to two datasets. The first is the standard USF polyhedra dataset, used for comparison, and the second is from Brown University and contains real world scenes. The experiments demonstrate robust and satisfactory results. Based on the segmentation results, we extend the reconstruction of surfaces behind occlusions to fill in the occluded parts. The paper is organized as follows. We start with a Bayesian formulation in Section (2). Then we discuss the algorithm in Section (3). Section (4) shows the experiments, and Section (5) discusses some problems and future work.
2 Bayes Framework: Integrating Cues, Models, and Priors

In this section, we formulate the problem under the Bayes framework, integrating two cues, five families of surface models, and various prior models.

2.1 Problem Formulation
We denote an image lattice by Λ = {(i, j) : 0 ≤ i ≤ M, 0 ≤ j ≤ N}. Then two cues are available. One is the 3D range data, which is a mapping from the lattice Λ to a 3D point, D : Λ → R³,
D(i, j) = (x(i, j), y(i, j), z(i, j)).
(i, j) indexes a laser ray that is reflected from a surface point (x, y, z). Associated with the range data is a reflectance map I : Λ → {0, 1, ..., G − 1}. I(i, j) is the portion of laser energy returned from point D(i, j), and G is the total number of grey levels in the discretization. Thus it measures some material properties. For example, surfaces of high specularity, such as glass, ceramics, and metals, appear dark in I. I(i, j) = 0 for mirrors and surfaces at infinity, such as the sky. D(i, j) is generally very noisy and thus unreliable when I(i, j) is low, and is considered a missing point if I(i, j) = 0. In other words, I and D are coupled at such places. The objective is to partition the image lattice into an unknown number K of disjoint regions,

Λ = ∪_{n=1}^{K} R_n,    R_n ∩ R_m = ∅, ∀ m ≠ n.
At each region R, the range data D_R fit some surface model with parameter Θ^D and the reflectance I_R fits some reflectance model with parameter Θ^I. Let W denote a solution; then

W = (K, {R_n, (l_n^D, l_n^I), (Θ_n^D, Θ_n^I) : n = 1, 2, ..., K}).

l^D and l^I index the types of surface models and reflectance models. In the Bayesian framework, an optimal solution is searched for by maximizing a posterior probability over a solution space Ω_W,

W* = arg max_{W ∈ Ω_W} p((D, I)|W) p(W).

In practice, two regions R_i, R_j may share the same surface model, i.e. Θ_i^D = Θ_j^D and Θ_i^I ≠ Θ_j^I. For example, a painting or a piece of cloth hung on a wall, or a thin book or paper on a desk, may fit the same surface as the wall or desk, but with a different reflectance. It is also possible that Θ_i^D ≠ Θ_j^D and Θ_i^I = Θ_j^I. To minimize the coding length and to pool information from pixels over more regions, we allow adjacent regions to share either depth or reflectance parameters. Thus a boundary between two regions can be labelled as a reflectance boundary, a depth boundary, or both. In the following, we briefly summarize the models for p((D, I)|W) and p(W).
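For concreteness, the solution W can be carried around as a small container type. The sketch below is our own illustration, not code from the paper; it simply mirrors the tuple above.

from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Region:
    pixels: np.ndarray        # (m, 2) array of (i, j) lattice coordinates in R_n
    surface_type: int         # l^D in {1..5}: plane, 4-pt spline, 9-pt spline, conic, histogram
    reflectance_type: int     # l^I in {1..3}: constant, histogram, smooth spline
    theta_D: np.ndarray       # surface parameters Theta^D
    theta_I: np.ndarray       # reflectance parameters Theta^I

@dataclass
class Solution:
    regions: List[Region]     # the K regions; K = len(regions)

    @property
    def K(self) -> int:
        return len(self.regions)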
2.2 Likelihood with Multiple Surface and Reflectance Models
In the literature, there are many ways of representing a surface, such as implicit polynomials [1,5], superquadrics [12], and other deformable models. In this paper, we choose five types of surface models to account for the various shapes in natural scenes.
1. Family D1: planar surfaces specified by three parameters Θ = (a, b, d). Let (a, b, c) be a unit surface normal with a² + b² + c² = 1, and let d be the perpendicular distance from the origin to the plane. We denote by Ω_1^D the space of all planes.
2. Family D2: B-spline surfaces with 4 control points. Each B-spline surface has a rectangular grid on a reference plane ρ with its two dimensions indexed by (u, v). A grid of h × w control points is chosen on the ρ plane. Then a spline surface is

s(u, v) = Σ_{s=1}^{h} Σ_{t=1}^{w} p_{s,t} B_s(u) B_t(v),

where p_{s,t} = (η_{s,t}, ζ_{s,t}, ξ_{s,t}) is a control point, with (η_{s,t}, ζ_{s,t}) being coordinates on ρ and ξ_{s,t} the degree of freedom at the point. By choosing h = w = 2, a surface in D2 is specified by 9 parameters Θ = (a, b, d, δ, φ, ξ_{0,0}, ξ_{0,1}, ξ_{1,0}, ξ_{1,1}). We denote by Ω_2^D the space of family D2. (A schematic evaluation of this tensor-product form is sketched after this list.)
3. Family D3: similarly, we denote by Ω_3^D the space of spline surfaces with 9 control points. A surface in D3 is specified by 14 parameters Θ = (a, b, d, δ, φ, ξ_{0,0}, ..., ξ_{2,2}).
4. Family D4: this is a surface model taken from [11] to fit spheres, cylinders, cones, and tori. A surface in D4 is specified by 7 parameters Θ = (&, ϕ, ϑ, k, s, σ, τ). We denote by Ω_4^D the space of family D4.
5. Family D5: this is a non-parametric 3D histogram of the 3D point positions. It is specified by Θ = (h_1^u, ..., h_{L_u}^u, h_1^v, ..., h_{L_v}^v, h_1^w, ..., h_{L_w}^w), where L_u, L_v and L_w are the numbers of bins in the u, v, w directions respectively. This model is used to represent cluttered regions, such as leaves of trees. We denote by Ω_5^D the space of family D5.
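The tensor-product form in family D2 can be evaluated directly once basis functions are fixed. The sketch below is our own illustration, using linear basis functions for the h = w = 2 case, so it reduces to bilinear interpolation of the four control heights; the paper's actual basis and the extra plane parameters (a, b, d, δ, φ) are not modelled here.

import numpy as np

def spline_height(u, v, xi):
    """Evaluate s(u, v) = sum_{s,t} xi[s, t] * B_s(u) * B_t(v) for a 2x2 grid of
    control heights `xi`, with linear bases B_1(x) = 1 - x, B_2(x) = x, u, v in [0, 1]."""
    Bu = np.array([1.0 - u, u])
    Bv = np.array([1.0 - v, v])
    return float(Bu @ xi @ Bv)

# Example: the surface interpolates the four control heights at the corners.
xi = np.array([[0.0, 1.0],
               [2.0, 3.0]])
assert spline_height(0, 0, xi) == 0.0 and spline_height(1, 1, xi) == 3.0
print(spline_height(0.5, 0.5, xi))   # 1.5, the average of the four corners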
Fig. 2. One typical planar surface in family D1 , two typical B-spline surfaces in family D2 and D3 and one typical surface in family D4 respectively.
Fig. 2 displays four typical surfaces, one for each of the first four families. For the reflectance image I, we use three families of models, denoted by Ω_i^I, i = 1, 2, 3, respectively.
1. Family I1: a uniform region with constant reflectance Θ = µ.
2. Family I2: a cluttered region with a non-parametric histogram Θ = (h_1, h_2, ..., h_L) for its intensity, with L being the number of bins.
3. Family I3: a region with smooth variation of reflectance, modeled by a B-spline model as in family D3.
For the surface and reflectance models above, the likelihood model for a solution W assumes the fitting residues to be Gaussian noise, subject to some robust statistics treatment, so we have

p((D, I)|W) ∝ ∏_{n=1}^{K} ∏_{(i,j)∈R_n} exp{ −φ(D(i,j) − S(i,j; l_n^D, Θ_n^D)) δ(I(i,j) ≥ τ) − φ(I(i,j) − J(i,j; l_n^I, Θ_n^I)) }.

In the above formula, φ(x) is a quadratic function with two flat tails, used to account for the outliers [2] present in the data. S(i, j; l_n^D, Θ_n^D) and J(i, j; l_n^I, Θ_n^I) are respectively the fitted surface and reflectance according to models (l_n^D, Θ_n^D) and (l_n^I, Θ_n^I). The depth datum D(i, j) is not accounted for if the reflectance I(i, j) is lower than a threshold τ, i.e. δ(I(i, j) ≤ τ) = 0. To fit the models robustly, we adopt a two-step procedure: firstly, truncate points that are less than 25% of the maximum error; secondly, truncate points at a trough or plateau. Furthermore, the least median of squares method based on orthogonal distance in [16] has been adopted for the parameter estimation.
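A minimal sketch of the truncated-quadratic penalty and the per-pixel log-likelihood term is given below. This is our own reading of the formula above; the truncation constant c and the threshold tau are placeholders, not values from the paper.

import numpy as np

def phi(x, c=3.0):
    """Truncated quadratic: 0.5*x^2 for small residuals, flat at 0.5*c^2 beyond |x| = c,
    so outliers stop adding cost."""
    return 0.5 * np.minimum(np.asarray(x, dtype=float)**2, c**2)

def pixel_log_likelihood(d_resid, i_resid, reflectance, tau=0.05):
    """Log-likelihood contribution of one pixel (up to a constant);
    the depth residual is gated out when the reflectance is below tau."""
    gate = 1.0 if reflectance >= tau else 0.0
    return -(phi(d_resid) * gate + phi(i_resid))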
2.3 Priors on Surfaces, Boundaries, and Corners
Generally speaking, the prior model p(W) should penalize model complexity, enforce stiffness of surfaces, enhance smoothness of the boundaries, and form canonical corners. The prior model for the solution W is

p(W) = p(K) p(π_K) ∏_{n=1}^{K} p(l_n^D) p(Θ_n^D | l_n^D) p(l_n^I) p(Θ_n^I | l_n^I),
where π_K = (R_1, ..., R_K) is a K-partition of the lattice. Equivalently, a partition π_K is represented by a planar graph with K faces for the regions, a number of edges for the boundaries, and vertices for the corners,

π_K = (R_k, k = 1, ..., K; Γ_m, m = 1, ..., M; V_n, n = 1, ..., N).

Therefore, p(π_K) = ∏_{k=1}^{K} p(R_k) ∏_{m=1}^{M} p(Γ_m) ∏_{n=1}^{N} p(V_n). We find that a previous prior model used by Leclerc and Fischler [10] for computing 3D wireframes from line drawings is very relevant to our prior model. (A schematic tally of the resulting log-prior is sketched after this list.)
1. Model complexity is penalized by three factors. One is p(K) ∝ e^{−λ_0 K}, to penalize the number of regions; the second includes p(l_n^D) and p(l_n^I), which prefer simple models. In general, for a model of type l, p(l) is proportional to the inverse of the space volume, p(l) ∝ 1/|Ω_l|. The third is p(R_n) ∝ e^{−α|R_n|^c}, with |R_n| being the area (size) of R_n. The third term forces small regions to merge.
2. Surface stiffness is enforced by p(Θ_n^D | l_n^D), n = 1, 2, ..., K. Every three adjacent control points in the B-spline form a plane, and adjacent planes are forced to have similar normals in p(Θ_n^D | l_n^D).
3. Boundary smoothness is enhanced by p(Γ_m), m = 1, 2, ..., M, as in the SNAKE model: p(Γ(s)) ∝ e^{−∫ φ(Γ̇(s)) + φ(Γ̈(s)) ds}.
4. Canonical corners are imposed by p(V_n), n = 1, 2, ..., N. As in [10], the angles at a corner should be more or less equal.
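A minimal tally of how such a log-prior could be evaluated is sketched below. This is our own illustration: the weights λ0, α, the exponent c, and the per-family volumes are placeholders, and the stiffness, boundary, and corner terms are omitted (they would be added analogously).

import math

def log_prior(num_regions, region_areas, family_volumes,
              lambda0=1.0, alpha=1.0, c=0.9):
    """Schematic log p(W), complexity terms only."""
    lp = -lambda0 * num_regions                               # p(K) ~ exp(-lambda0 * K)
    lp += sum(-alpha * area**c for area in region_areas)      # p(R_n) ~ exp(-alpha |R_n|^c)
    lp += sum(-math.log(v) for v in family_volumes)           # p(l) ~ 1 / |Omega_l|
    return lp

print(log_prior(3, [400, 250, 90], [10.0, 10.0, 25.0]))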
3 Computing the Global Optimal Solution by DDMCMC
Based on the previous formulation, we see that the solution space Ω_W contains many subspaces of varying dimensions. The space structure is typical, and will remain valid even when other classes of models are used in the future. In the literature of range segmentation, many methods have been applied, such as edge detection [9], region growing methods [5,1], clustering [6,4], and energy minimization methods such as generalized Hough transforms. But none of these methods can search in such complex spaces. To compute a globally optimal solution, we design ergodic Markov chains with reversible jumps and stochastic diffusions, following the successful work on intensity segmentation by a scheme called data driven Markov chain Monte Carlo [13]. The five types of MCMC dynamics used are: diffusion of a region boundary, splitting of a region into two, merging two regions into one, switching the family of models, and model adaptation for a region. In a diffusion step, we run region competition [17] for a contour segment Γ_ij between two regions R_i and R_j. The statistical force sums the log-likelihood ratios of the two cues:

dΓ_ij(s)/dt = c κ(s) n(s) + log [ p(D(x(s), y(s)); (l_i^D, Θ_i^D)) / p(D(x(s), y(s)); (l_j^D, Θ_j^D)) ] + log [ p(I(x(s), y(s)); (l_i^I, Θ_i^I)) / p(I(x(s), y(s)); (l_j^I, Θ_j^I)) ].

Since two cues are used here, each region differs from its neighboring regions by either the range cue or the reflectance cue or both. Thus, in the splitting process, we can make the proposal about how to split one surface based on either the range cue or the reflectance one. This adds an extra step before splitting to select which cue will be used. The remaining dynamics run in almost the same way as in [13], with minor changes. Now we come to the two techniques used to speed up the Markov chain search.
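The jump moves are accepted or rejected with the usual Metropolis-Hastings test; the sketch below shows only that generic accept step (our own illustration). The actual proposal probabilities for the split, merge, and switch moves are what the data-driven techniques of the following subsections supply.

import math, random

def mh_accept(log_post_current, log_post_proposed,
              log_q_forward, log_q_backward):
    """Generic Metropolis-Hastings acceptance:
    accept with probability min(1, [p(W') q(W'->W)] / [p(W) q(W->W')])."""
    log_ratio = (log_post_proposed - log_post_current) + (log_q_backward - log_q_forward)
    return random.random() < math.exp(min(0.0, log_ratio))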
3.1 Coarse-to-Fine Boundary Detection
In this section, we detect potential edges based on local window information of the range cue, and trace the edges to form a partition of the lattice. We organize the edge maps at three scales according to their significance. For example, Fig. 3 shows the edge maps for office B displayed in Fig. 1. These edge maps are used at random to suggest possible boundaries Γ for the split and merge moves. Indeed, the edge maps encode importance proposal probabilities for the jumps. Not using such edge maps is equivalent to using a uniform distribution for the edges, which is obviously inefficient. Refer to [13] for the details of how edge maps are used in designing the jumps. Here we deliberate on how the edge maps are computed. At each local window, say a 5 × 5 patch ω ⊂ Λ, we have a set of 3D points p_i = D(m, n) for (m, n) ∈ ω. One can estimate the normal of this patch by computing a 3 × 3 scatter matrix S [4].
Fig. 3. Computed edge maps based on the range cue at three scales for the office scene B in Fig. 1.
S = Σ_i (p_i − p̄)(p_i − p̄)^T.

Then the eigenvector n = (a, b, c) that corresponds to the smallest eigenvalue λ_min of S is the normal of this patch ω. λ_min is a measure of how smooth the patch is: a small λ_min means a planar patch. The distance of the origin to this patch is the inner product between the unit normal and the center point p̄, that is, d = n · p̄. An edge is then detected by discontinuity of (a, b, d) in adjacent pixels using standard techniques, and each point is associated with an edge strength measure. We threshold the edge strength at three levels to generate edge maps, which are traced with heuristic local information. We also apply standard edge detection to the reflectance image and obtain edge maps at three scales. These edge maps from the two cues are not reliable segmentation results by themselves, but they provide important heuristic information for driving the jumps of the Markov chain.
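A minimal sketch of this per-patch estimate (our own code, not the paper's):

import numpy as np

def patch_plane(points):
    """Fit a local plane to the 3D points of a small patch.
    Returns (normal, d, lambda_min): the unit normal (a, b, c), the distance
    d = n . p_bar of the patch plane from the origin, and the smallest
    eigenvalue of the scatter matrix as a planarity / edge-strength cue."""
    p = np.asarray(points, dtype=np.float64)          # (m, 3) array
    p_bar = p.mean(axis=0)
    centered = p - p_bar
    S = centered.T @ centered                         # 3x3 scatter matrix
    eigvals, eigvecs = np.linalg.eigh(S)              # ascending eigenvalues
    n = eigvecs[:, 0]                                 # eigenvector of lambda_min
    return n, float(n @ p_bar), float(eigvals[0])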
3.2 Coarse-to-Fine Surface Clustering
As the edge maps encode importance proposal probabilities for boundaries, we compute importance proposal probabilities on the parameter spaces Ω_1^D, Ω_2^D, Ω_3^D, Ω_4^D and Ω_5^D respectively. In edge detection, each small patch ω ⊂ Λ is characterized by (a, b, d, p̄, λ_min). Therefore, we collect a set of patches, Q = {ω_j : j = 1, 2, ..., J}. In practice, we can discard patches which have a relatively large λ_min, i.e. patches that are likely on a boundary. We also use adaptive patch sizes. We cluster the patches in the set Q into a set of C hypothetic surfaces C = {Θ_i : Θ_i ∈ Ω_1^D ∪ Ω_2^D ∪ Ω_3^D ∪ Ω_4^D ∪ Ω_5^D, i = 1, ..., C}. The number of hypothetic clusters in each space is chosen to be a conservative number and the clusters are computed in a coarse-to-fine strategy.
Fig. 4. Computed saliency maps for six clusters of office B at a coarse scale (ceiling, floor, and walls 1–4), which fit surfaces of big areas.
That is, we extract clusters of large "populations", which usually correspond to large objects. Then we compute clusters for smaller objects. It is straightforward to run an EM algorithm that classifies the patches in Q and also computes the clusters in C. Thus we obtain a probability for how likely a patch ω_j belongs to a surface with parameter Θ_i; we denote it by q(Θ_i | ω_j), with Σ_i q(Θ_i | ω_j) = 1, ∀ j = 1, ..., J. We call q(Θ_i | ω_j) for all j = 1, ..., J the saliency map, a term often used by psychophysicists. A saliency map of a hypothetic surface Θ_i tells how well pixels on the lattice fit (or belong) to the surface. For example, Fig. 4 shows saliency maps for the six most prominent clusters in the office view B of Fig. 1. A bright pixel means high probability. The total sum of the probability over the lattice is a measure of how prominent a cluster is. In our experiments, large surfaces are clustered first; as indicated in Fig. 4, the six most prominent clusters represent the ceiling, floor, and four walls of the office scene. Other small objects, such as tables and books, do not fit well. As Fig. 4 shows, many small objects do not light up in the saliency map. Thus, in a coarse-to-fine strategy, we fit such patches at a finer scale. Fig. 5 shows the saliency maps of five additional clusters for a small part of the scene Λ_o ⊂ Λ. This patch is taken from the office B scene in Fig. 1.
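One way to read q(Θ_i | ω_j) is as a soft assignment of patches to candidate surfaces. The sketch below is our own simplification of one E-step-like computation under a Gaussian residual assumption; the fit errors are assumed to have been computed elsewhere for each (surface, patch) pair.

import numpy as np

def saliency(fit_errors, sigma=1.0):
    """fit_errors: (C, J) array, entry (i, j) = residual of patch omega_j
    against candidate surface Theta_i.  Returns q of the same shape with
    columns summing to 1, i.e. q[i, j] ~ q(Theta_i | omega_j)."""
    logw = -0.5 * (np.asarray(fit_errors, dtype=float) / sigma) ** 2
    logw -= logw.max(axis=0, keepdims=True)       # stabilise the softmax
    w = np.exp(logw)
    return w / w.sum(axis=0, keepdims=True)

# Example with 2 candidate surfaces and 3 patches:
print(saliency(np.array([[0.1, 2.0, 0.3],
                         [1.5, 0.2, 0.4]])))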
4 Experiments

4.1 The Datasets and Preprocessing
We test the algorithm on two datasets. The first one is the standard Perceptron LADAR camera images in the USF dataset. These images contain polyhedral objects, and the objects have almost uniform sizes. The second one is the Brown dataset, where images are collected with a 3D imaging sensor LMS-Z210 by Riegl. The field of view is 80° vertically and 259° horizontally. Each image contains 444 × 1440 measurements with an angular separation of 0.18 degree.
Fig. 5. Computed saliency maps for five clusters at a finer scale (window, background, a PC box on a desk, chair backs, books on a desk), for some smaller surfaces which do not light up at the coarse scale (above). The saliency maps are shown in a zoomed-in view of a patch of the office B picture.
In general, range data are contaminated by heavy noise. Effective preprocessing must be used to deal with all types of errors present in the data acquisition, while preserving the true discontinuities. In our experiments, we adopt the least median of squares (LMedS) and anisotropic diffusion [14] to preprocess the range data. LMedS is related to the median filter used in image processing to remove impulsive noise from images, and can be used to remove strong outliers in range data. After that, anisotropic diffusion is adopted to handle the general noise while avoiding the side effects caused by simple Gaussian or diffusion smoothing, such as decreasing the absolute value of curvature and smoothing orientation discontinuities into spurious curved patches. Fig. 6 shows a surface rendered before and after the preprocessing.
Fig. 6. Surface rendered for a range scene by OpenGL: a) before and b) after preprocessing.
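A minimal sketch of this kind of preprocessing is given below (our own code): a median filter stands in for the impulsive-outlier removal and a few Perona-Malik-style diffusion steps stand in for the discontinuity-preserving smoothing; the LMedS step itself is not reproduced, and the constants are placeholders.

import numpy as np
from scipy.ndimage import median_filter

def perona_malik_step(z, kappa=5.0, lam=0.2):
    """One explicit anisotropic-diffusion step on a depth map z (H x W):
    strong depth gradients diffuse less, so discontinuities are preserved.
    (np.roll gives periodic borders, acceptable for a sketch.)"""
    n = np.roll(z, -1, 0) - z
    s = np.roll(z,  1, 0) - z
    e = np.roll(z, -1, 1) - z
    w = np.roll(z,  1, 1) - z
    g = lambda d: np.exp(-(d / kappa) ** 2)
    return z + lam * (g(n) * n + g(s) * s + g(e) * e + g(w) * w)

def preprocess_range(z, diffusion_iters=10):
    z = median_filter(z, size=3)          # knock out impulsive outliers
    for _ in range(diffusion_iters):
        z = perona_malik_step(z)
    return z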
4.2 Results and Evaluation
Our results on the Florida dataset are shown in Fig. 7. We also show the 3D scenes rendered by OpenGL. These 3D scenes are reconstructed by completing the surfaces behind the occluding objects. For comparison, we also show in Fig. 7 the manual segmentation used in [7]. It is no surprise that the algorithm can parse such scenes very well, because the image models are sufficient to account for the surfaces in the dataset.
Fig. 7. Our segmentation and reconstruction results and the corresponding manual segmentation from [7] on the Florida dataset (columns: segmentation result, reconstruction result, manual segmentation).
The results on the Brown dataset are shown in Fig. 8 and Fig. 9 respectively. Fig. 8 shows the segmentation results of two parts from office scene A and two parts from the outdoor scenes C and D in Fig. 1 respectively. Fig. 9 shows the
Fig. 8. Segmentation results for two parts of scene A and one part of scene C, D respectively in Fig. 1 and the corresponding manual segmentation.
segmentation result of the most complicated part of office scene B in Fig. 1. Since there is no segmentation benchmark available for the kind of complex scenes used here, we asked several people to manually segment these scenes and show the averaged results in the above figures for comparison. The reconstructed scene based on the segmentation result shown in Fig. 9 is presented in Fig. 10. Range images are often incomplete due to partial occlusion or poor surface reflectance. This can be clearly seen from Fig. 6, in which the floor and the two walls have many missing points. Analysis and reconstruction of range images usually focuses on complex objects completely contained in the field of view; little attention has been devoted so far to the reconstruction of simply-shaped wide areas, like the parts of the wall hidden behind furniture and facility pieces in the indoor scene shown in Fig. 6 [3]. In the reconstruction process, how to fill in the missing data points of surfaces behind occlusions is a challenging question. Completing this depth information requires higher-level understanding of the
Fig. 9. Segmentation result of the most complicated part of scene B in Fig. 1 and the corresponding manual segmentation.
Fig. 10. The reconstructed scene based on the segmentation result shown in Fig. 9. To give the user a better view, we remove some furniture around the table in the reconstructed scene. We also insert a polyhedral object extracted from the Florida dataset to show that our segmentation result can be easily used in scene editing.
3D models, for example expressed in terms of meaningful parts. To solve this problem, an algorithm shall also make inferences about two things: 1) the types of boundaries, such as crease, occluding, and so on; 2) the ownership of each boundary to a surface. In our reconstruction procedure, we only use a simple prior model to recover the missing parts of the backgrounds (like the walls and the floor) by assuming they are rectangles. Since we can obtain the parameters needed to represent these rectangles from the segmentation result, it is not difficult to fill in the missing points. Although this procedure is pretty simple, it illustrates that our segmentation result provides a solid foundation for continuing work in this direction. Moreover, our segmentation result can also be directly applied to scene editing, which is illustrated in Fig. 10.
5 Future Work
The experiments reveal the difficulty of segmenting degenerate regions which are essentially one-dimensional, such as the cables and the rail in the doorway (Fig. 8). Thus, more sophisticated models will be integrated in future work. We should also study a more principled and systematic way of grouping surfaces into objects and completing surfaces behind the occluding objects. Acknowledgements. This work is supported partially by two NSF grants IIS 98-77-127 and IIS-00-92-664, and an ONR grant N000140-110-535.
References
1. P. J. Besl and R. C. Jain, "Segmentation through variable order surface fitting", IEEE Trans. on PAMI, vol. 10, no. 2, pp. 167–192, 1988.
2. M. J. Black and A. Rangarajan, "On the unification of line process, outlier rejection, and robust statistics with applications in early vision", Int'l J. of Comp. Vis., vol. 19, no. 1, pp. 57–91, 1996.
3. F. Dell'Acqua and R. Fisher, "Reconstruction of planar surfaces behind occlusions in range images", to appear in IEEE Trans. on PAMI, 2001.
4. P. J. Flynn and A. K. Jain, "Surface classification: hypothesis testing and parameter estimation", Proc. of CVPR, 1988.
5. A. Gupta, A. Leonardis, and R. Bajcsy, "Segmentation of range images as the search for geometric parametric models", Int'l J. of Computer Vision, 14(3), pp. 253–277, 1995.
6. R. L. Hoffman and A. K. Jain, "Segmentation and classification of range images", IEEE Trans. on PAMI, vol. 9, no. 5, pp. 608–620, 1987.
7. A. Hoover, et al., "An experimental comparison of range image segmentation algorithms", IEEE Trans. on PAMI, vol. 18, no. 7, pp. 673–689, 1996.
8. J. G. Huang, A. B. Lee, and D. B. Mumford, "Statistics of range images", Proc. of CVPR, Hilton Head, South Carolina, 2000.
9. R. Krishnapuram and S. Gupta, "Morphological methods for detection and classification of edges in range images", Mathematical Imaging and Vision, vol. 2, pp. 351–375, 1992.
10. Y. G. Leclerc and M. A. Fischler, "An optimization-based approach to the interpretation of single line drawings as 3D wire frames", Int'l J. of Comp. Vis., 9:2, 113–136, 1992.
11. D. Marshall, G. Lukacs, and R. Martin, "Robust segmentation of primitives from range data in the presence of geometric degeneracy", IEEE Trans. on PAMI, vol. 23, no. 3, pp. 304–314, 2001.
12. A. P. Pentland, "Perceptual organization and the representation of natural form", Artificial Intelligence, 28, 293–331, 1986.
13. Z. W. Tu, S. C. Zhu, and H. Y. Shum, "Image segmentation by data driven Markov chain Monte Carlo", Proc. of ICCV, Vancouver, 2001.
14. M. Umasuthan and A. M. Wallace, "Outlier removal and discontinuity preserving smoothing of range data", IEE Proc.-Vis. Image Signal Process, vol. 143, no. 3, 1996.
15. A. L. Yuille and J. J. Clark, "Bayesian models, deformable templates and competitive priors", in Spatial Vision in Humans and Robots, L. Harris and M. Jenkin (eds.), Cambridge Univ. Press, 1993.
16. Z. Y. Zhang, "Parameter estimation techniques: A tutorial with application to conic fitting", Technical Report of INRIA, 1995.
17. S. C. Zhu and A. L. Yuille, "Region competition: unifying snakes, region growing, and Bayes/MDL for multiband image segmentation", IEEE Trans. PAMI, vol. 18, no. 9, pp. 884–900, 1996.
Normalized Gradient Vector Diffusion and Image Segmentation Zeyun Yu and Chandrajit Bajaj Department of Computer Science, University of Texas at Austin, Austin, Texas 78712, USA
[email protected]
Abstract. In this paper, we present an approach for image segmentation, based on the existing Active Snake Model and Watershed-based Region Merging. Our algorithm includes initial segmentation using Normalized Gradient Vector Diffusion (NGVD) and region merging based on a Region Adjacency Graph (RAG). We use a set of heat diffusion equations to generate a vector field over the image domain, which provides us with a natural way to define seeds as well as an external force to attract the active snakes. An initial segmentation of the original image can then be obtained by an idea similar to that of the active snake model. Finally, an RAG-based region merging technique is used to find the desired true segmentation. The experimental results show that our NGVD-based region merging algorithm overcomes some problems seen in the classic active snake model. We will also see that our NGVD has several advantages over traditional gradient vector diffusion. Keywords. Image segmentation, Gradient vector diffusion, Heat diffusion equation, Active snake model, Watershed method, Region merging.
1 Introduction
Since it was proposed by M. Kass et al. [1], the active snake model has drawn a lot of attention from researchers in image-related fields. Due to its efficiency in converging to the desired features within an image by simply defining an energy function, it has found many applications, including edge detection [1], shape modeling [14], segmentation [15], and motion tracking [15,16]. The traditional snake model is defined by an energy functional:

E = ∫_0^1 (1/2)[α|x′(s)|² + β|x″(s)|²] + E_ext(x(s)) ds,    (1)

where x(s) = [x(s), y(s)] and s ∈ [0, 1]. The first two terms within the above integral stand for the internal force, which is determined by the geometric properties of the snake, while the third term is thought of as the external force, which is the issue mostly discussed in active snake models. Generally there are several difficulties with this traditional model. First, the initial guess of the contour must be carefully chosen to be close to the true
boundaries. This is because the snake moves partially in the direction of the external force E_ext(x), which is based on the image gradient. Therefore, the external force exists only at points close to the boundaries. Some approaches have been proposed to solve this problem, e.g., the multi-scale method [8], where different Gaussian factors σ are used to create a scalable potential force that attracts the snake to the desired boundaries, and the distance potential force [5], in which a distance map is produced first and then, based on the gradient of the distance map, a potential force is obtained to generate a large attraction range. Recently, Xu et al. proposed another method [2], called gradient vector flow (GVF), to find the external force based on a group of heat diffusion equations. The second problem with the traditional model is that it is difficult for the snake to move into boundary concavities. If we consider the energy functional (1) again, we can see that the first two terms in the integral always make the snake as straight as possible. Thus, if the external force E_ext(x) is not large enough to push the snake into a boundary concavity, the snake will always stop near the "entrance" of the concavity. Although many ways have been tried to solve this problem (see [9,10,11,12]), most of them did not give satisfying results. Xu's GVF method and its generalized version [3] were originally proposed to rectify this problem but still did not work very well in the case of long and thin boundary concavities (see the next section for analysis). Furthermore, boundary "gaps" or weak boundaries are always overwhelmed by nearby strong boundaries. The third problem with the traditional model is how to select the initial snakes. A good guess of the initial snakes makes a great impact on the final segmentation. Many of the existing methods, including Xu's methods [2,3,4], choose the initial snakes by hand. Some other approaches find the initial snakes by "balloons" or by the locus of the zero-crossings of the Laplacian of the smoothed images (see [18] for a summary). In the present paper we will see a very natural way to find the initial snakes (or seeds). To overcome these problems we propose a new type of diffusion equation to generate the gradient vector flow. Our new diffusion equations are similar to those in [2] except that ours are based on normalized gradient vector diffusion (NGVD). This slightly modified scheme tremendously improves the vector diffusion behavior, compared to other vector diffusion methods [2,3,4]. First, it is very easy to deal with boundary concavities, so that the snakes can easily move into them. Second, weaker but obvious boundaries can be preserved even when they are close to much stronger boundaries. Third, the snakes can correctly stop near boundary "gaps". Based on the obtained vector field, we then propose an approach to generate the image segmentation in the following way. First, all the source points are identified over the image domain; then an initial segmentation can easily be obtained from the previously produced vector field. Finally, an RAG-based approach is used to find the true segmentation of the original image. We will also see an interesting relationship between our NGVD-based approach and the classic watershed method, and how our approach is better in some respects than the watershed method.
We organize the rest of this paper as follows. In the next section we review the work by Xu et al. [2,3]. In section 3, our approach to generating the vector field is given, and we will see the difference between our NGVD approach and Xu's method. Section 4 discusses how the gradient vector flow generated by our method can be applied to image segmentation, including initial segmentation, region merging, and the comparison between our approach and the watershed method. We then present some segmentation results on different types of images in section 5. Finally, in section 6, we give our conclusion.
2 Review of Gradient Vector Flow
In this section we give a brief description of this approach. We refer the readers to [2,3,4] for more details.

2.1 Edge Map
This method begins by defining an edge map f(x, y) derived from the original image I(x, y). The edge map should have the property that f(x, y) is large near the image boundaries and small within the homogeneous regions. There are many ways to define an edge map. Commonly used are

f(x, y) = −|∇I(x, y)|²,    (2)

or

f(x, y) = −|∇[G_σ(x, y) ∗ I(x, y)]|².    (3)
Then take the gradient of the edge map as the initial values of the vector flow, that is, u^(0) = ∂f/∂x, v^(0) = ∂f/∂y. In the following we will denote ∂f/∂x and ∂f/∂y as f_x and f_y, respectively. From the definition, we know that the initial vectors determined by ∇f(x, y) always point toward the image boundary and have large magnitudes near the boundaries and low magnitudes in the homogeneous regions.
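A minimal sketch of this initialization (our own code; scipy's Gaussian filter stands in for the smoothing G_σ ∗ I, and the value of sigma is a placeholder):

import numpy as np
from scipy.ndimage import gaussian_filter

def edge_map_and_init(I, sigma=2.0):
    """Edge map f = -|grad(G_sigma * I)|^2 and the initial vector field
    (u0, v0) = (df/dx, df/dy) used to seed the diffusion."""
    Is = gaussian_filter(I.astype(np.float64), sigma)
    gy, gx = np.gradient(Is)                 # rows vary with y, columns with x
    f = -(gx**2 + gy**2)
    fy, fx = np.gradient(f)
    return f, fx, fy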
2.2 Gradient Vector Flow
In the work of Xu et al. [2,3,4], the following energy functional was used to propagate the vectors from the boundaries into the inner parts of homogeneous regions:

ε = ∫∫ μ(u_x² + u_y² + v_x² + v_y²) + |∇f|² |v − ∇f|² dx dy,    (4)

where v(x, y) = [u(x, y), v(x, y)] and, as mentioned above, the initial value of v(x, y) is determined by ∇f(x, y). μ is a regularization parameter to be set on the basis of the noise present in the image [2]. This variational formula consists of two terms. The first term, the sum of the squares of the partial derivatives of the vector field, makes the resulting
vector flow v(x, y) vary smoothly. The second term stands for the difference between the vector flow and its initial status. Thus minimizing this energy will force v(x, y) to be nearly equal to the gradient of the edge map where |∇f(x, y)| is large. It can be shown that equivalent diffusion equations can be derived from the minimization of the energy functional defined above (treating u and v as functions of time):

∂u/∂t = μ ∇²u − (u − f_x)(f_x² + f_y²),
∂v/∂t = μ ∇²v − (v − f_y)(f_x² + f_y²).    (5)

These two equations can be solved separately by numerical methods, resulting in a vector flow over the entire image domain, mostly with non-zero magnitudes, if a sufficient number of iterations is executed. It was shown that an appropriate choice of time step Δt guarantees the convergence of GVF.
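A minimal sketch of one explicit iteration of (5) (our own code, not the reference implementation): the Laplacian uses the standard 5-point stencil with periodic borders for brevity, and the step size assumes the edge map has been scaled so the reaction term stays small.

import numpy as np

def laplacian(a):
    return (np.roll(a, 1, 0) + np.roll(a, -1, 0) +
            np.roll(a, 1, 1) + np.roll(a, -1, 1) - 4.0 * a)

def gvf_step(u, v, fx, fy, mu=0.2, dt=1.0):
    """One explicit update of the GVF equations (5)."""
    mag2 = fx**2 + fy**2
    u = u + dt * (mu * laplacian(u) - (u - fx) * mag2)
    v = v + dt * (mu * laplacian(v) - (v - fy) * mag2)
    return u, v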
2.3 Generalized Gradient Vector Flow
As pointed out in [3], the GVF method described above has some difficulties when the boundary concavities are long and thin. To remedy this problem, Xu et al. proposed a method called generalized gradient vector flow (GGVF). The essential idea is to introduce two weighting functions for (5):

∂u/∂t = g(|∇f|) ∇²u − h(|∇f|)(u − f_x),
∂v/∂t = g(|∇f|) ∇²v − h(|∇f|)(v − f_y),    (6)

where g(|∇f|) = e^{−(|∇f|/K)} and h(|∇f|) = 1 − g(|∇f|).
As claimed in [3], the problem with GVF [2] is caused by excessive smoothing of the field near the boundaries. Hence, the weighting functions g(|∇f|) and h(|∇f|) are introduced to decrease the smoothing effect near strong gradients. Consider a boundary concavity (see fig.1(a), ACB). As we know, the force coming from C decreases from left to right. Therefore, if D is far from C (assuming the concavity is arbitrarily long), the force from C will decrease to zero. On the other hand, the forces from A and B are relatively much stronger, resulting in vectors pointing either up or down at the "entrance" of the concavity (see fig.1(b)). This problem prevents the snake from moving into the concavity. Although in the generalized GVF method the weighting functions g(|∇f|) and h(|∇f|) make the force coming from C stronger than that computed by the GVF method, the diffusion equations (5) and (6) are isotropic, so the forces from A and B also become stronger. The resulting vectors near the "entrance" of the concavity still point either up or down. Remember that here we ignore the forces coming
from the right (where R is shown in fig.1(b)) to the concavity, since the analysis is the same. Therefore the true problem does not occur at the "entrance" of the concavity, but in the central area of the concavity. The reader will see this in the next section. Another problem with GGVF (and GVF as well) is how to stop the curves from moving out of a boundary "gap". If the boundary gap is isolated (like the examples in [2,3]), these algorithms work correctly on boundaries with gaps. However, if there is another boundary near this gap (as seen in fig.1(a), where G is shown), all the vectors around G resulting from equation (5) or (6) will point to F. This makes the curves move out of G and causes a problem. A similar problem happens when weak boundaries are close to relatively much stronger boundaries, such that the weaker boundaries are overwhelmed after diffusion.
(a) Problems with traditional GVF
(b) Vector diffusion by GVF
Fig. 1. Illustration of the problems seen in traditional GVF method
3 Normalized Gradient Vector Diffusion
In this section we will describe our normalized gradient vector diffusion, which overcomes the problems seen in traditional gradient vector diffusion. For simplicity we will call this method NGVD in the following.

3.1 Analysis: The Ideas
The traditional GVF and even the generalized GVF methods cannot efficiently solve the problems described above. The key observation behind these problems is that a weak vector (with almost zero magnitude) makes very little impact on its neighbors that have much stronger magnitudes. However, if we normalize the vectors before we apply the diffusion equations, the results are quite different. Fig. 2 gives an example of how these two different ways diffuse two vectors and produce totally different results. For the normalized vector diffusion method, a very weak vector can tremendously affect a strong vector, both in its magnitude and in its orientation. This is the essential idea behind our new diffusion approach.
(a) Traditional GVF; (b) Normalized GVD
Fig. 2. Two ways to diffuse vectors: the method on the left is the traditional one, which results in a vector almost pointing up, while the method on the right is our normalized GVD, in which the resulting vector is equally influenced by both V1 and V2 regardless of their magnitudes.
3.2 Heat Equations and Algorithms
There are two different ways to implement the normalized vector diffusion: one is isotropic diffusion, similar to Xu's method; the other is anisotropic diffusion. Both ways produce almost the same vector field, although their implementations are slightly different.

Isotropic Diffusion. The heat equations of the isotropic vector diffusion are very similar to the equations used in the traditional GVF method, as we saw in the previous section. The only difference is that in our NGVD method we normalize all the vectors before we apply the diffusion operations. The heat equations can be written as:

∂u*/∂t = μ ∇²u* − (u* − f_x)(f_x² + f_y²),
∂v*/∂t = μ ∇²v* − (v* − f_y)(f_x² + f_y²),    (7)

where (u*, v*) is the normalized vector at a pixel over the image domain and (f_x, f_y) stands for the initial vector at the same pixel. This equation can be further converted into an iterative numerical version by using a simple central difference scheme (more details can be found in [2]). The overall algorithm is described as follows:

Initialize: Calculate f(x, y) using (2) or (3). Let u^(0) = f_x, v^(0) = f_y.
Repeat (for each iteration):
(a) Normalize all the non-zero vectors to unit vectors.
(b) Calculate the new value for u using the first equation in (7).
(c) Calculate the new value for v using the second equation in (7).

The number of iterations can be set manually or automatically. The latter is achieved by checking the maximal change of the vectors
between two adjacent iterations: if the maximal change is smaller than a fixed threshold, the diffusion terminates.

Anisotropic Diffusion. The normalized gradient vector diffusion can also be simulated with anisotropic diffusion (see [13] for its definition). The heat equations can be written in anisotropic form as follows:

∂u/∂t = μ ∇·( ∇u / √(u² + v²) ) − (u − f_x)(f_x² + f_y²),
∂v/∂t = μ ∇·( ∇v / √(u² + v²) ) − (v − f_y)(f_x² + f_y²),    (8)

where (u, v) is the vector being diffused at a pixel over the image domain and (f_x, f_y) is the initial vector at the same pixel. Using a scheme similar to that in [13], we can rewrite the above equation as follows:

∂u_{i,j}/∂t = (μ/4) ( Σ_{(m,n)∈N(i,j)} u_{m,n} / √(u_{m,n}² + v_{m,n}²) − 4 u_{i,j} ) − (u_{i,j} − f_x)(f_x² + f_y²),
∂v_{i,j}/∂t = (μ/4) ( Σ_{(m,n)∈N(i,j)} v_{m,n} / √(u_{m,n}² + v_{m,n}²) − 4 v_{i,j} ) − (v_{i,j} − f_y)(f_x² + f_y²),    (9)
where (i, j) is the pixel being considered and (m, n) ranges over the neighbors of (i, j) (either a 4-neighbor or an 8-neighbor scheme can be used here). Note that equation (9) is not exactly equivalent to equation (8) from the perspective of anisotropic diffusion as defined in [13], but the experiments showed that the slightly modified scheme in (9) gives more desirable results. The overall algorithm for the anisotropic diffusion version can be described as follows:

Initialize: Calculate f(x, y) using (2) or (3). Let u^(0) = f_x, v^(0) = f_y.
Repeat (for each iteration):
(a) Calculate the new value for u using the first equation in (9).
(b) Calculate the new value for v using the second equation in (9).
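A minimal sketch of the isotropic NGVD iteration of Section 3.2 (our own code, not the authors'): vectors are normalized to unit length before each GVF-style update, which is the only change relative to plain GVF; the step size, threshold, and border handling are placeholder choices.

import numpy as np

def laplacian(a):
    return (np.roll(a, 1, 0) + np.roll(a, -1, 0) +
            np.roll(a, 1, 1) + np.roll(a, -1, 1) - 4.0 * a)

def ngvd_isotropic(fx, fy, mu=0.2, dt=1.0, iters=200, tol=1e-3):
    """Isotropic normalized gradient vector diffusion, following eq. (7)."""
    u, v = fx.copy(), fy.copy()
    mag2 = fx**2 + fy**2
    for _ in range(iters):
        norm = np.sqrt(u**2 + v**2)
        norm[norm == 0] = 1.0                  # leave zero vectors untouched
        us, vs = u / norm, v / norm            # step (a): normalize
        u_new = us + dt * (mu * laplacian(us) - (us - fx) * mag2)
        v_new = vs + dt * (mu * laplacian(vs) - (vs - fy) * mag2)
        change = max(np.abs(u_new - u).max(), np.abs(v_new - v).max())
        u, v = u_new, v_new
        if change < tol:                       # automatic stopping criterion
            break
    return u, v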
3.3 Comparison between Different Diffusion Schemes
In the following we compare our approach with the traditional GVF/GGVF methods [2,3,4,12] on their ability to deal with boundary concavities and boundary gaps. Fig. 3 shows the testing image we will work on. This image contains one boundary concavity and one boundary gap. The regions indicated by the two dashed rectangles are enlarged in fig.4 and fig.5 so that the vector field can be displayed more clearly.
Fig. 3. The testing image used for the comparison: the boundary concavity and gap are indicated by dashed rectangles
did before the diffusion. Therefore, a snake will stop near the entrance of the concavity and cannot reach the bottom. However, by our normalized vector diffusion, the vectors in the concavity obviously changed their magnitudes and orientations and thus the snake can easily move into the bottom of the concavity (see fig.4(b) and (c)). The results by our normalized method are almost the same both for isotropic diffusion and for anisotropic diffusion.
(a) By traditional GVF; (b) By isotropic NGVD; (c) By anisotropic NGVD
Fig. 4. The comparison between the traditional gradient vector diffusion and our normalized gradient vector diffusion for their behaviors on the boundary concavities
Fig.5 shows the results by our approach and traditional GVF near the boundary gap. As we can see from fig.5(a) traditional GVF has difficulty preventing the vectors on the boundary from being significantly influenced by the nearby boundaries and thus causes a problem such that the snake may move out of the boundary gap. However, our approach can avoid this problem. From fig.5(b) and (c) we can see that the vectors around the boundary gap point to the boundary from both sides. The price we pay for the improvement of the performance in the above cases is that extra time is required to normalize the gradient vectors before or during each diffusion iteration. Furthermore, the traditional GVF method usually behaves better in noisy images than our NGVD. Based on these two considerations, a combined version of gradient vector diffusion can be used in the implementations to reduce time consumption as well as sensitivity to noises. The strategy is to
(a) By traditional GVF; (b) By isotropic NGVD; (c) By anisotropic NGVD
Fig. 5. The comparison between the traditional gradient vector diffusion and our normalized gradient vector diffusion for their behaviors on the boundary gap, which is located around the point (141, 100) in the left part of each figure
utilize the traditional GVF method in the first half (or more) of the entire diffusion process, while the normalized GVD method is used only in the second half (or less) of the diffusion process.
4 Image Segmentation
In this section we briefly describe how to apply the gradient vector field to image segmentation. As we have said before, the gradient vector field provides the active snakes with an external force. However, in the active snake model, we must answer the following questions: how do we initialize the snakes, and what is the stopping criterion?
Seed Points
There are several ways to choose the initial snakes (or seed points): by hand, by “balloons”, or by the locus of the zero-crossing of the Laplacian of the smoothed images (see [18] for a summary). In our case, we use source points (defined below) as our seeds. The source points can be automatically generated from the gradient vector field obtained from the diffusion process described in previous section. Definition: A point is called source if none of its neighbors points to it. In other words, a point A is a source if and only if, for any neighbor B of A: v B · BA ≤ 0, where v B is the gradient vector at B and BA is the vector from B to A. A source point looks very like the origin of a spring where the water comes out. In fig.6 we show an example of how a source looks like in our case of gradient vector flows. Some source points clearly appear in five different sites in the vector field shown in fig.6(a). Usually there are more than one source even
in a homogeneous region, and the number of sources depends on two factors: (1) the geometric shape of the region's boundaries; (2) the number of iterations executed in the generation of the vector field. Fig.6(b) illustrates the locus (the small white circles) of the source points distributed in a figure (consider only the central region).
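As a concrete illustration, the sketch below (Python, not from the paper) marks source points of a discrete 2-D vector field directly from the definition above; the 8-neighborhood and the array layout are assumptions of this sketch.

import numpy as np

def find_source_points(vx, vy):
    """Mark 'source' pixels of a 2-D gradient vector field (vx, vy):
    a pixel A is a source if no neighbor B's vector points toward it,
    i.e. v_B . BA <= 0 for every neighbor B of A."""
    h, w = vx.shape
    src = np.ones((h, w), dtype=bool)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    for y in range(h):
        for x in range(w):
            for dy, dx in offsets:
                by, bx = y + dy, x + dx
                if 0 <= by < h and 0 <= bx < w:
                    # BA is the vector from neighbor B back to A, i.e. (-dx, -dy).
                    if vx[by, bx] * (-dx) + vy[by, bx] * (-dy) > 0:
                        src[y, x] = False
                        break
    return src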
[Fig. 6 panels: (a) Example of source points; (b) Seed points]
Fig. 6. Take source points as our initial snakes (or seeds): the five source points shown on the left part of (b) correspond to the five sources of the vector field shown in (a)
4.2 Initial Segmentation
After we identify all seed points, initial snakes are generated in the following way: each isolated seed point is taken as an initial snake, and each set of connected seed points corresponds to a single snake. Then we let the initial snakes start to move. Their movements are directed by an external force that is determined by the previously generated vector field. The snakes stop moving in a very natural way: if all the vectors on a snake point in the direction opposite to its movement, then the snake stops. Finally, when all the snakes have stopped, we obtain a partition of the original image such that each region corresponds to a snake. This is what we call the initial segmentation, and the next section will show some examples of initial segmentation on several different types of images. However, the initial segmentation does not give a satisfying result because too many small regions are left. So we need to further refine the segmentation by merging regions.
4.3 Region Merging
Various approaches to region merging have already been proposed [19,20,21], most of which are based on the initial segmentation produced by the well-known Watershed method. Although our approach to generating the initial segmentation is different from the Watershed method (the difference is discussed in the following subsection), the basic idea for the region merging is the same: first, generate a Region Adjacency Graph (RAG) from the initial segmentation, and then
merge the most similar regions step by step until some criterion is satisfied. Different merging methods may have different ways to define the similarity. In our approach we will consider two factors to determine the similarity: one is the intensity difference between two adjacent regions, and the other is the average image gradient on the common boundaries between two adjacent regions. Two regions are said to be more similar if their intensity difference is smaller and/or the average image gradient on their common boundaries is higher. For these purposes we organize our data structures as follows:

struct Region {
    int    number;      /* number of points in this region */
    double intensity;   /* average intensity of this region */
    BOUND *boundary;    /* list of boundary points of this region */
    RGN   *neighbor;    /* list of neighboring regions */
};
where the structure BOUND is defined by the x-coordinate and y-coordinate of the boundary point and a pointer to the next boundary point. The structure RGN contains the ID number of the neighboring region, the number of points on the common boundaries between the neighboring region and the region under consideration, the average image gradient on the common boundaries, the overall similarity between the neighboring region and the region under consideration, and finally a pointer to the next neighboring region.
4.4 Relationship with Watershed Method
As mentioned above, Watershed is one of the well-known approaches to generating an initial segmentation (see [22]). This method usually begins with the image gradient map and takes the minima of this map as the seeds. From this point the Watershed method is very similar to our approach: both methods start from the image gradient map, and the minima in the Watershed method closely resemble the source points of our approach. Furthermore, the geodesic distance transformation used in the Watershed method can be thought of as a propagating process, which is analogous to the propagating process of gradient vectors in our approach. However, the geodesic distance transformation, like the distance potential force [5] mentioned in the first section, limits the traditional Watershed method when dealing with boundary concavities. Another advantage of our approach is the simplicity of generating the initial segmentation given a vector field over the image domain, because the vectors provide snakes with the information of where to start or stop and how to move. In the Watershed method, by contrast, it is not so straightforward to obtain the initial segmentation from a set of seeds.
5 Results
In the following we will give some examples of image segmentation by our region merging approach based on the normalized gradient vector diffusion (NGVD). Fig.7 shows an example of microscopy images, where the original image is shown
in (a). After the initial segmentation, a total of 105 regions are left (as seen in (b)). Finally, 14 regions, including the background region, remain after region merging. Although the original image is quite blurred, the cells are still correctly segmented. Fig.8 shows an NMR image of a brain. The initial segmentation generates 335 regions, and after region merging 25 regions are left. In fig.9 we show the results of segmentation on the same image using traditional GVF and our normalized gradient vector diffusion. The different performance of these two methods on the boundary concavities is clearly visible in the left-top part of the images.
[Fig. 7 panels: (a) The original image; (b) Initial segmentation; (c) After region merging]
Fig. 7. Illustration of image segmentation by our NGVD-based region merging: there are 105 regions left after initial segmentation, and 14 regions left after merging
[Fig. 8 panels: (a) The original image; (b) Initial segmentation; (c) After region merging]
Fig. 8. Illustration of image segmentation by our NGVD-based region merging: there are 335 regions left after initial segmentation, and 25 regions left after merging
[Fig. 9 panels: (a) By traditional GVF; (b) By our NGVD]
Fig. 9. The comparison between the traditional gradient vector diffusion and our normalized gradient vector diffusion for their behaviors on the image segmentation. Look closely at the difference between (a) and (b) in the left-top part of the images.
6 Conclusion
In this paper we present a new type of diffusion equation to generate gradient vector fields. We normalize the vectors over the image domain before or during each diffusion iteration. The experiments show that our new approach not only rectifies the problems encountered in traditional snake models, but also efficiently handles the cases of long, thin boundary concavities and boundary gaps that trouble the traditional GVF method. Our approach can be applied to image segmentation with the help of a region-merging technique. The discussion also reveals an interesting relationship between our approach and the well-known Watershed method. Finally, it is quite straightforward to extend our method to 3D gradient vector diffusion and image segmentation.
References 1. Kass, M., Witkin, A., Terzopoulos, D.: Snake: active contour model. Int. J. Computer Vision. 1(1987) 321-331 2. Xu, C., Prince, J.L.: Snakes, shapes, and gradient vector flow. IEEE Trans. Image Processing. 7(3) (1998) 359-369 3. Xu, C., Prince, J.L.: Generalized gradient vector flow external force for active contour. Signal Processing - An International Journal. 71(2) (1998) 131-139 4. Xu, C., Prince, J.L.: Gradient Vector Flow Deformable Models. Handbook of Medical Imaging, edited by Isaac Bankman, Academic Press. September, 2000 5. Cohen, L.D., Cohen, I.: Finite-element method for active contour models and balloons for 2D and 3D images. IEEE Trans. on Pattern Analysis and Machine Intelligence. 15(11) (1993) 1131-1147 6. Terzopoulos, D., Witkin, A., Kass, M.: Constraints on deformable models: recovering 3D shape and non-rigid motion. Artificial Intelligence. 36(1) (1988) 91-123 7. Malladi, R., Sethian, J.A., Vemuri, B.C.: Shape modeling with front propagation: A level set approach. IEEE Trans. on Pattern Analysis and Machine Intelligence. 17 (1995) 158-175
8. Leroy, B., Herliin, I., Cohen, L.D.: Multi-resolution algorithm for active contour models. In 12th Int. Conf. Analysis and Optimization of System. (1996) 58-65 9. Cohen, L.D.: On active contour models and balloons. CVGIP: Image Understanding. 53 (1991) 211-218 10. Davatzikos, C., Prince, J.L.: An active contour model for mapping the cortex. IEEE Trans. Medical Image. 14 (1995) 65-80 11. Abrantes, A.J., Marques, J.S.: A class of constrained clustering algorithm for object boundary extraction. IEEE Trans. Image Processing. 5 (1996) 1507-1521 12. Prince, J.L., Xu, C.: A new external force model for snakes. Proc. 1996 Image and Multidimensional Signal Processing Workshop. (1996) 30-31 13. Perona, P., Malik, J.: Scale-space and edge detection using anisotropic diffusion. IEEE Trans. on Pattern Analysis and Machine Intelligence. 12(7) (1990) 629-639 14. McInerney, T., Terzopoulos, D.: A dynamic finite element surface model for segmentation and tracking in multidimensional medical images with application to cardiac 4D image analysis. Comput. Med. Imag. Graph.. 19 (1995) 69-83 15. Leymarie, F., Levine, M.D.: Tracking deformable objects in the plane using an active contour model. IEEE Trans. on Pattern Analysis and Machine Intelligence. 15 (1993) 617-634 16. Terzopoulos, D., Szeliski, R.: Tracking with Kalman snakes. in Active Vision, A.Blake and A.Yuille, Eds. Cambridge, MA: MIT Press, (1992) 3-20 17. Malladi, R., Sethian, J.A.: A real-time algorithm for medical shape recovery. in Proceedings of International Conference on Computer Vision. Mumbai, India. (1998) 304-310 18. Shah, J.: A common framework for curve evolution, segmentation and anisotropic diffusion. in Proceedings of International Conference on Computer Vision and Pattern Recognition: CVPR’96. June (1996) 136-142 19. Weickert, J.: Fast segmentation methods based on partial differential equations and the watershed transformation. P. Levi, R.-J.Ahlers, F. May, M. Schanz (Eds.), Mustererkennung 1998, Springer, Berlin. (1998) 93-100 20. Haris, K., Efstratiadis, S.N., Maglaveras, N., Katsaggelos, A.K.: Hybrid image segmentation using watersheds and fast region merging. IEEE Trans. on Image Processing. 7(12) (1998) 1684-1699 21. Moscheni, F., Bhattacharjee, S., Kunt, M.: Spatiotemporal segmentation based on region merging. IEEE Trans. on Pattern Analysis and Machine Intelligence. 20(9) (1998) 897-915 22. Vincent, L., Soille, P.: Watersheds in digital spaces: an efficient algorithm based on immersion simulations. IEEE Trans. on Pattern Analysis and Machine Intelligence. 13(6) (1991) 583-598
Spectral Partitioning with Indefinite Kernels Using the Nyström Extension
Serge Belongie¹, Charless Fowlkes², Fan Chung¹, and Jitendra Malik²
¹ University of California, San Diego, La Jolla, CA 92093, USA, {sjb,fan}@cs.ucsd.edu
² University of California, Berkeley, Berkeley, CA 94720, USA, {fowlkes,malik}@cs.berkeley.edu
Abstract. Fowlkes et al. [7] recently introduced an approximation to the Normalized Cut (NCut) grouping algorithm [18] based on random subsampling and the Nyström extension. As presented, their method is restricted to the case where W, the weighted adjacency matrix, is positive definite. Although many common measures of image similarity (i.e. kernels) are positive definite, a popular example being Gaussian-weighted distance, there are important cases that are not. In this work, we present a modification to Nyström-NCut that does not require W to be positive definite. The modification only affects the orthogonalization step, and in doing so it necessitates one additional O(m³) operation, where m is the number of random samples used in the approximation. As such it is of interest to know which kernels are positive definite and which are indefinite. In addressing this issue, we further develop connections between NCut and related methods in the kernel machines literature. We provide a proof that the Gaussian-weighted chi-squared kernel is positive definite, which has thus far only been conjectured. We also explore the performance of the approximation algorithm on a variety of grouping cues including contour, color and texture.
1 Introduction
Among the methods for image segmentation developed in recent years, those based on pairwise grouping arguably show the most promise. By the term "pairwise" we mean that the grouping operation is based on measures of similarity or dissimilarity between pairs of pixels. In contrast, "central" grouping methods proceed by comparing all the pixels to a small number of prototypes or cluster centers; examples include k-means and EM clustering with Gaussian mixture models. Central grouping methods tend to be computationally cheaper, but have difficulty dealing with irregularly-shaped clusters and gradual variation within groups. Moreover, they are sensitive to initialization and require model-selection (i.e. specification of the number of groups). Generally speaking, pairwise grouping methods either eliminate or simplify these problems. Some of the approaches that have been proposed for grouping pairwise data include spectral graph partitioning [18,19,13], deterministic annealing [15], and stochastic clustering [8].
The drawback, of course, is that approaches based on pairwise data in principle require measurements between all possible pairs of pixels. Consequently, the number of pairs considered is often restricted by placing a threshold on the number of connections per pixel, e.g. by specifying a cutoff radius. This discourages the use of long-range connections and this can result in over-segmentation of homogeneous regions. A promising solution to this problem for the case of spectral graph theoretic methods was recently proposed by Fowlkes et al. [7]. Their method, based on the Nystr¨ om approximation for the integral eigenvalue problem, works by solving a grouping problem on a small set of m randomly sampled pixels and then extending the solution to the complete set of pixels. Using this approach, they produced high-quality segmentations of image sequences in a fraction of the time required to compute the exact solution. Though not explicitly stated, Fowlkes et al. assume that the function used for computing the simlarity between pairs of pixels is positive definite, i.e. that the weight matrix comprised of all the pairwise similarities is a Gram matrix. While this assumption is generally taken for granted in kernel based methods (e.g. [17]), the same cannot necessarily be said for similarity measures used in the computer vision literature. In the present work, we show that this restriction can be lifted by modifying the orthogonalization step used in [7], which requires positive definiteness. This proposed change necessitates an additional O(m3 ) operation; as such it is desirable to know when this alternative is necessary. To this end we discuss the application of the Nystr¨ om method to a number of commonly used similarity functions, both positive definite and indefinite (i.e. neither positive definite nor negative definite). The organization of this paper is as follows. We begin by reviewing in Section 2 the spectral graph theoretic pairwise grouping algorithm used in this work, namely Normalized Cuts (NCut) [18]. Next we review the Nystr¨ om extension in Section 3, focusing on its application to NCut. In Section 4 we discuss the issues of definiteness and indefiniteness of commonly used kernels used for measuring pairwise similarity and provide our modification to the method of [7]. Experimental results and discussion are provided in Sections 5. Some properties of the approximation are discussed in Section 6 and finally we conclude in Section 7.
2 Review of Normalized Cuts
Let the symmetric matrix $W \in \mathbb{R}^{N \times N}$ denote the weighted adjacency matrix for a graph G = (V, E) with nodes V and edges E. We will refer to the function used to compute $W_{ij}$ as a kernel; examples of kernels and their properties are discussed in Section 5. Let A and B represent a bipartition of V, i.e. $A \cup B = V$ and $A \cap B = \emptyset$. Let cut(A, B) denote the sum of the weights between A and B: $\mathrm{cut}(A,B) = \sum_{i \in A, j \in B} W_{ij}$. The degree of the i-th node is defined as $d_i = \sum_j W_{ij}$, and the volume of a set as the sum of the degrees within that set: $\mathrm{vol}(A) = \sum_{i \in A} d_i$ and $\mathrm{vol}(B) = \sum_{i \in B} d_i$. The Normalized Cut between sets A and B is then given as follows:
$$\mathrm{NCut}(A,B) = \mathrm{cut}(A,B)\left(\frac{1}{\mathrm{vol}(A)} + \frac{1}{\mathrm{vol}(B)}\right) = \frac{2 \cdot \mathrm{cut}(A,B)}{\mathrm{vol}(A)\,\|\,\mathrm{vol}(B)}$$
where $\|$ denotes the harmonic mean. We wish to find A and B such that NCut(A, B) is minimized. Appealing to spectral graph theory [4], Shi and Malik [18] showed that an approximate solution may be obtained by thresholding the eigenvector corresponding to the second smallest eigenvalue of the normalized Laplacian L, which is defined as
$$L = D^{-1/2}(D - W)D^{-1/2} = I - D^{-1/2} W D^{-1/2}$$
where D is the diagonal matrix with entries $D_{ii} = d_i$. The matrix L is positive semidefinite, even when W is indefinite. Its eigenvalues lie on the interval [0, 2], so the eigenvalues of $D^{-1/2} W D^{-1/2}$ are confined to lie inside [−1, 1] (see [4]). Finally, extensions to multiple groups are possible via recursive bipartitioning or through the use of multiple eigenvectors.
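For concreteness, a small numpy sketch of these quantities (not from the paper) is given below: it evaluates NCut(A, B) for a given bipartition and extracts the eigenvector of the normalized Laplacian with the second smallest eigenvalue, which is then thresholded to obtain the approximate partition.

import numpy as np

def ncut_value(W, mask):
    """Normalized Cut value for the bipartition given by boolean `mask`
    (True = node in A, False = node in B), for a symmetric weight matrix W."""
    d = W.sum(axis=1)                          # degrees d_i
    cut = W[np.ix_(mask, ~mask)].sum()         # weights crossing the cut
    volA, volB = d[mask].sum(), d[~mask].sum()
    return cut * (1.0 / volA + 1.0 / volB)

def second_smallest_laplacian_eigvec(W):
    """Eigenvector of L = I - D^{-1/2} W D^{-1/2} with the second smallest
    eigenvalue; thresholding it approximately minimizes NCut."""
    d = W.sum(axis=1)
    Dis = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(d)) - Dis @ W @ Dis
    vals, vecs = np.linalg.eigh(L)             # eigenvalues in ascending order
    return vecs[:, 1]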
3 Review of the Nyström Approximation to NCut
Since N is quite large for typical images (e.g. 256²), finding the eigenvectors of L is computationally intensive. One approach to dealing with this difficulty is to connect only to those pixels that are nearby in the image. This makes L sparse and permits the use of an efficient eigensolver (e.g. Lanczos). However, this discourages the use of long-range connections and the approximation properties are not easily understood. The Nyström approximation provides an alternative approach based on random sampling. The application of the Nyström approximation to NCut proceeds as follows. First, choose m samples at random from the full set of N pixels. For simplicity in notation, reorder the samples so that these m come first and the remaining n = N − m samples come next. Now partition the weight matrix W as
$$W = \begin{bmatrix} A & B \\ B^T & C \end{bmatrix} \qquad (1)$$
with $A \in \mathbb{R}^{m \times m}$, $B \in \mathbb{R}^{m \times n}$, $C \in \mathbb{R}^{n \times n}$, and N = m + n, with m ≪ n. Here A represents the subblock of weights amongst the random samples, B contains the weights from the random samples to the rest of the samples, and C contains the weights between all of the remaining samples. Assuming m ≪ n, C is huge. The Nyström extension implicitly approximates C using $B^T A^{-1} B$. The quality of the approximation of the full weight matrix
$$\hat{W} = \begin{bmatrix} A & B \\ B^T & B^T A^{-1} B \end{bmatrix} \qquad (2)$$
can be quantified as the norm of the Schur complement $C - B^T A^{-1} B$. The size of this norm is governed by the extent to which C is spanned by the rows of
B. Thus, rather than set the majority of entries in W to zero to produce a sparse approximation, the Nyström method provides (implicitly) an approximation to the entire weight matrix based on a subset of rows/columns.
Fowlkes et al. [7] show that $\hat{W}$ can be diagonalized in an efficient manner. Let $A^{1/2}$ denote the symmetric positive definite square root of A, define $S = A + A^{-1/2} B B^T A^{-1/2}$, and diagonalize it as $S = U \Lambda U^T$. If the matrix V is defined as
$$V = \begin{bmatrix} A \\ B^T \end{bmatrix} A^{-1/2} U \Lambda^{-1/2} \qquad (3)$$
then one can show that $\hat{W}$ is diagonalized by V and Λ, i.e. $\hat{W} = V \Lambda V^T$ and $V^T V = I$. We assume that pseudoinverses are used in place of inverses as necessary when there is redundancy in the random samples.
To apply this approximation to NCut, it is necessary to compute the row sums of $\hat{W}$. This is possible without explicitly evaluating the $B^T A^{-1} B$ block, since
$$\hat{d} = \hat{W}\mathbf{1} = \begin{bmatrix} A\mathbf{1}_m + B\mathbf{1}_n \\ B^T\mathbf{1}_m + B^T A^{-1} B \mathbf{1}_n \end{bmatrix} = \begin{bmatrix} a_r + b_r \\ b_c + B^T A^{-1} b_r \end{bmatrix} \qquad (4)$$
where $a_r, b_r \in \mathbb{R}^m$ denote the row sums of A and B, respectively, and $b_c \in \mathbb{R}^n$ denotes the column sum of B.
With $\hat{d}$ in hand, the blocks of $\hat{D}^{-1/2} \hat{W} \hat{D}^{-1/2}$ that are needed to approximate its leading eigenvectors are given as
$$A_{ij} \leftarrow \frac{A_{ij}}{\sqrt{\hat{d}_i \hat{d}_j}}, \quad i,j = 1,\ldots,m \qquad \text{and} \qquad B_{ij} \leftarrow \frac{B_{ij}}{\sqrt{\hat{d}_i \hat{d}_{j+m}}}, \quad i = 1,\ldots,m,\ j = 1,\ldots,n$$
to which we can apply equation (3) as before.
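A compact numpy sketch of this one-shot procedure is given below. It assumes A is positive definite (so that the symmetric square root exists) and that all approximate degrees are positive; variable names and the use of eigh/solve are implementation choices of this sketch, not part of [7].

import numpy as np

def one_shot_nystrom_ncut(A, B):
    """One-shot Nystrom approximation to the leading NCut eigenvectors.
    A: m x m similarities among the samples (assumed positive definite).
    B: m x n similarities from the samples to the remaining pixels."""
    m, n = B.shape
    # Approximate row sums of the full weight matrix, Eq. (4).
    ar, br, bc = A.sum(axis=1), B.sum(axis=1), B.sum(axis=0)
    d = np.concatenate([ar + br, bc + B.T @ np.linalg.solve(A, br)])
    # Normalize the blocks as in D^{-1/2} W D^{-1/2} (requires d > 0).
    dsq = 1.0 / np.sqrt(d)
    A_n = A * np.outer(dsq[:m], dsq[:m])
    B_n = B * np.outer(dsq[:m], dsq[m:])
    # Symmetric inverse square root of the normalized A block.
    w, U = np.linalg.eigh(A_n)
    Asi = U @ np.diag(1.0 / np.sqrt(w)) @ U.T
    S = A_n + Asi @ B_n @ B_n.T @ Asi
    lam, Us = np.linalg.eigh(S)
    # Orthonormal approximate eigenvectors, Eq. (3); columns ordered by
    # ascending eigenvalue, so the leading ones are the last columns.
    V = np.vstack([A_n, B_n.T]) @ Asi @ Us @ np.diag(1.0 / np.sqrt(lam))
    return V, lam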
4 Nyström-NCut for Indefinite Kernels
In diagonalizing $\hat{D}^{-1/2} \hat{W} \hat{D}^{-1/2}$, Fowlkes et al. assume that A is positive semidefinite in order to compute $A^{1/2}$, the positive semidefinite square root of A. Positive definite kernels correspond to dot products in a "feature space" that is potentially of much higher dimensionality than the input space. This geometric intuition is the essence of the kernel trick [16], which serves as the basis for
kernel-based methods such as support vector machines (SVM) and kernel principal components analysis (KPCA). As such, in the kernel-machines literature, the term "kernel" is often used synonymously with "positive definite kernel." The same assumption cannot be made in general for similarity functions used in grouping in the computer vision literature. To be sure, Gaussian-weighted Mahalanobis distance between feature vectors, one of the most common similarity measures used in grouping, is positive definite, as are several other popular choices. However, there are a number of similarity measures one can use that only satisfy the fairly weak requirements that W is symmetric and $W_{ij}$ is "big" if pixels i and j are similar and "small" if they are not. It is therefore important that both cases be properly addressed.
Equation (3) can be thought of as a "one-shot" combined Nyström eigenvector approximation and orthogonalization operation. By keeping these two steps separate, we will show that the positive definiteness requirement can be circumvented.¹ Starting from the approximation of $\hat{W}$ in Equation (2), let $A = U \Lambda U^T$ denote the diagonalization of A. We may then write $\hat{W}$ as
$$\hat{W} = \begin{bmatrix} U \\ B^T U \Lambda^{-1} \end{bmatrix} \Lambda \begin{bmatrix} U \\ B^T U \Lambda^{-1} \end{bmatrix}^T$$
where the block $B^T U \Lambda^{-1}$ represents the Nyström extension. As Williams and Seeger [20] noted, this is equivalent to the expression for the projection of a test point onto the feature-space eigenvectors in Kernel PCA. Although this extension appears to give us an approximate diagonalization, the extended eigenvectors are not orthogonal.
We carry out the orthogonalization step as follows. Let $\bar{U}^T = [\,U^T \;\; \Lambda^{-1} U^T B\,]$ and define $Z = \bar{U} \Lambda^{1/2}$ so that $\hat{W} = Z Z^T$. Let $F \Sigma F^T$ denote the diagonalization of $Z^T Z$. Then the matrix $V = Z F \Sigma^{-1/2}$ contains the leading orthonormalized eigenvectors of $\hat{W}$, i.e. $\hat{W} = V \Sigma V^T$ with $V^T V = I$. As before, a pseudoinverse can be used in place of a regular inverse when A has linearly dependent columns.
Thus the approximate eigenvectors are produced in two steps: first we use the Nyström extension to produce $\bar{U}$ and Λ, and then we orthogonalize $\bar{U}$ to produce V and Σ. Although this "two-step" approach is applicable in general, the additional O(m³) step it requires for the orthogonalization takes extra time and leads to an increased loss of significant figures. Therefore it is expedient to know when the one-shot method can be applied, i.e. when a given kernel is positive definite.
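The numpy sketch below illustrates the two-step idea. Note one deliberate substitution: instead of forming $Z = \bar{U}\Lambda^{1/2}$, which would require complex arithmetic when Λ has negative entries, the sketch orthogonalizes $\bar{U}$ with a QR factorization and diagonalizes the small matrix $R \Lambda R^T$; this keeps all arithmetic real for indefinite A while still returning V, Σ with $\hat{W} = V\Sigma V^T$ and $V^TV = I$, at the same additional O(m³) cost.

import numpy as np

def two_step_nystrom(A, B):
    """Nystrom extension followed by a separate orthogonalization step,
    usable when A (and hence W) is indefinite. QR-based variant; see the
    note above. Pseudoinverses could replace 1/lam for near-singular A."""
    lam, U = np.linalg.eigh(A)                     # A = U diag(lam) U^T, lam may be negative
    # Step 1: Nystrom extension -> approximate, non-orthogonal eigenvectors.
    U_bar = np.vstack([U, B.T @ U @ np.diag(1.0 / lam)])
    # Step 2: orthogonalize. W_hat = U_bar diag(lam) U_bar^T
    #                              = Q (R diag(lam) R^T) Q^T with U_bar = Q R.
    Q, R = np.linalg.qr(U_bar)
    sigma, F = np.linalg.eigh(R @ np.diag(lam) @ R.T)
    V = Q @ F                                      # orthonormal: V^T V = I
    return V, sigma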
5 Experiments and Discussion
In this section we discuss a number of kernels, both positive definite and indefinite, and show examples of their use.

¹ Since the normalized Laplacian is positive semidefinite even when W is not, it is tempting to try to apply Nyström to L instead of $D^{-1/2} W D^{-1/2}$. Unfortunately, the Nyström method finds the leading eigenvectors, and the eigenvectors of L we need are the trailing ones.
Gaussian weighted distance. Perhaps the most commonly used measure of similarity between pixels is Gaussian-weighted Mahalanobis distance between feature vectors $x_i$ and $x_j$:
$$W_{ij} = e^{-\frac{1}{2}(x_i - x_j)^T \Sigma^{-1} (x_i - x_j)}$$
This kernel is positive definite and therefore admits the use of the one-shot Nyström method. Fowlkes et al. [7] used this kernel on feature vectors containing position, color, and optical flow. Most of the works cited in the introduction use this kernel (among others); as such, additional experimental results will not be provided here.
Histogram comparison using the χ² test. The χ² test is a simple and effective means of comparing two histograms. It has been shown to be a very robust measure for color and texture discrimination [15]. Given two normalized histograms $h_i(k)$ and $h_j(k)$, define
$$\chi^2_{ij} = \frac{1}{2}\sum_{k=1}^{K} \frac{(h_i(k) - h_j(k))^2}{h_i(k) + h_j(k)}$$
where it is understood that any term in the sum for which $h_i(k) = 0$ and $h_j(k) = 0$ is replaced by zero. We can then define the similarity between a pair of histograms as $W_{ij} = e^{-\chi^2_{ij}/\alpha}$. This kernel is widely conjectured to be positive definite (see e.g. [3]) but to our knowledge no proof of this has been published. Appendix A contains our proof that Gaussian-weighted χ² is positive definite.
An example of Nyström-NCut on a color image of a tiger is shown in Figure 1. In this example, we computed a local color histogram inside a 5×5 box around each pixel using the color quantization scheme of [14]. Finally, since the weight matrix is positive definite, we used the one-shot Nyström method. We note that the same technique can be applied to texture using the "textons" of Malik et al. [11], i.e. vector-quantized filter responses.
Intervening contour. To integrate contour information into a pairwise region-based grouping framework, it is convenient to construct a kernel that indicates that points are dissimilar if they lie on opposite sides of an intervening contour [10]. We consider a distance between each pair of pixels that takes into account all possible paths across the image. Each path between a pair is assigned a distance equal to the maximum contour energy encountered along the path. The distance $r_{ij}$ between the pair is then taken to be the minimum energy over all paths, and the similarity between two pixels is $e^{-r_{ij}^2/\alpha}$. It is not known whether this kernel is positive definite.
Fig. 1. Segmentation of tiger image based on Gaussian-weighted χ²-distance between local color histograms. The image size is 128 × 192 and the histogram window size is 5 × 5. Color quantization was performed as in [14] with 8 bins. Since the $e^{-\chi^2_{ij}}$ kernel is positive definite, we can use the one-shot method of [7]. (a) Original image shown with m = 100 random samples used in approximation. (b) Nyström-NCut eigenvectors 2 through 7, sorted in ascending order by eigenvalue. (c) Segment-label image obtained via k-means clustering on the eigenvectors as described in [7].
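A minimal sketch (not from the paper) of the Gaussian-weighted χ² similarity between two normalized histograms, as defined earlier in this section; bins with a zero denominator are simply dropped, matching the stated convention.

import numpy as np

def chi2_similarity(hi, hj, alpha=1.0):
    """W_ij = exp(-chi2_ij / alpha) for two normalized histograms hi, hj."""
    hi, hj = np.asarray(hi, float), np.asarray(hj, float)
    denom = hi + hj
    valid = denom > 0                       # terms with hi(k) = hj(k) = 0 are zero
    chi2 = 0.5 * np.sum((hi[valid] - hj[valid]) ** 2 / denom[valid])
    return np.exp(-chi2 / alpha)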
This cue captures the Gestalt notion of closure. If two points are separated by a closed contour then they will have low similarity, while if there is a path connecting two points that doesn't cross an edge then they will have high similarity. The problem of finding this minimum over all paths has the same structure as the classic shortest-path problem and is easily solved by application of Dijkstra's algorithm [5]. Since the problem is sparse it is possible to achieve a running time of O(m · (N log N)) where m is the number of samples and N is the number of pixels. An illustration of this method applied to a sample image is shown in Figure 2.

[Fig. 2 panels: V contour energy; H contour energy; image; segmentation]
Fig. 2. Segmentation using intervening contour. Original image of synthetic shapes with noise is shown at lower left. At top, the horizontal and vertical boundary energy is shown; this is computed by squaring the x and y components of the smoothed gradient. The connection weight between a pair of pixels is based on the contour energy encountered along all paths between the pair of pixels; see text for details. The segmentation label map, obtained via k-means on the Nyström-NCut eigenvectors, is shown at lower right.

One Minus Squared Distance. A simple choice of kernel for expressing similarity between pixels is the following:
$$W_{ij} = 1 - \frac{r_{ij}^2}{\alpha}$$
where $r_{ij}^2$ represents the squared distance between feature vectors at i and j. This kernel is in general indefinite²; moreover, it takes on negative values.³ Nevertheless, this kernel makes intuitive sense and, empirically, NCut works well with it. An example using the two-step Nyström method with this kernel on color and proximity is shown in Figure 3.

² In Multidimensional Scaling (MDS) [6] one applies a "centering operation" to a squared distance matrix to isolate the positive semidefinite component corresponding to the inner products between the embedded coordinates. This centering operation (not repeated here) is more complicated than the one-minus transformation used in this kernel, and though interesting in its own right, is beyond the scope of the current discussion.
³ In principle this means the degree could be negative, viz. if enough negative entries conspire in a single row of W to dominate the positive entries. In such cases, one could do clipping; however, in our experiments we found that this was unnecessary.
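Returning to the intervening-contour cue: the sketch below (not from the paper) computes, for one sample pixel, the minimax-path distance to every other pixel with a Dijkstra-style sweep, as suggested by the reference to [5] above. The 4-connected grid, the zero initialization at the seed, and the sampling of contour energy at visited pixels are assumptions of this sketch.

import heapq
import numpy as np

def minimax_contour_distances(energy, seed_yx):
    """dist[p] = min over paths from seed to p of the maximum contour energy
    encountered along the path (the r_ij used by the intervening-contour kernel)."""
    h, w = energy.shape
    dist = np.full((h, w), np.inf)
    sy, sx = seed_yx
    dist[sy, sx] = 0.0
    heap = [(0.0, sy, sx)]
    while heap:
        d, y, x = heapq.heappop(heap)
        if d > dist[y, x]:
            continue                             # stale heap entry
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                nd = max(d, energy[ny, nx])      # path cost = max energy seen
                if nd < dist[ny, nx]:
                    dist[ny, nx] = nd
                    heapq.heappush(heap, (nd, ny, nx))
    return dist

# Similarities of every pixel to the seed would then be exp(-dist**2 / alpha).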
Fig. 3. Segmentation of Firetruck image using 2nd degree polynomial kernel. The feature vector for each pixel contains RGB color values and (x, y) coordinates. The form of the kernel is $W_{ij} = 1 - r_{ij}^2/\alpha$, where $r_{ij}$ represents the Mahalanobis distance between feature vectors at pixels i and j. Since this kernel is indefinite, we applied Nyström-NCut using the proposed two-step method. (a) Original image. (b–h) Segments obtained via k-means clustering on Nyström-NCut eigenvectors as in Figure 1.
6 Properties of the Approximation
Since the Nyström method only requires us to diagonalize an m×m matrix to find the leading eigenvectors of $\hat{D}^{-1/2} \hat{W} \hat{D}^{-1/2}$, this approach can be very efficient. A key question is how well a given set of samples allows us to approximate these eigenvectors. Fowlkes et al. [7] provide empirical results on a large set of natural images to show that roughly 100 samples do a good job when using color and proximity. In this section we wish to shed some light on the geometric interpretation of the use of $B^T A^{-1} B$ as an approximation to C. As we saw in Section 3, the quality of the approximation depends on the extent to which the rows of C are spanned by the rows of B. This is true for W in general. When W is positive semidefinite, we can say more. In particular, we can express the blocks A and B as $A = X^T X$ and $B = X^T Y$ where $X \in \mathbb{R}^{(m+n) \times m}$ and $Y \in \mathbb{R}^{(m+n) \times n}$; in the parlance of the kernel-machines literature [16], the columns of X and Y represent the empirical feature mapping. Let $X = QR$, with $Q \in \mathbb{R}^{(m+n) \times m}$, $Q^T Q = I$, and upper-triangular $R \in \mathbb{R}^{m \times m}$, denote the QR decomposition of X. In other words, Q represents an orthonormal basis for the space spanned by the columns of X. Then the matrix $B^T A^{-1} B$ simplifies as follows:
$$B^T A^{-1} B = Y^T X (X^T X)^{-1} X^T Y = Y^T Q R (R^T Q^T Q R)^{-1} R^T Q^T Y = Y^T Q R R^{-1} R^{-T} R^T Q^T Y = Y^T Q Q^T Y = (Q^T Y)^T (Q^T Y)$$
Recall that the exact values of C are given by $Y^T Y$, i.e. the inner products between the columns of Y. The quantity $(Q^T Y)^T (Q^T Y)$ represents the inner products of the columns of Y after projecting them onto the subspace spanned by X. Thus if Y is spanned well by X then $Y^T Q Q^T Y$ will be a good approximation to $Y^T Y$.
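The identity derived above is easy to check numerically; the toy dimensions below are arbitrary.

import numpy as np

# Verify B^T A^{-1} B = (Q^T Y)^T (Q^T Y) for a positive semidefinite W = [X Y]^T [X Y].
rng = np.random.default_rng(0)
m, n, d = 5, 8, 13
X = rng.standard_normal((d, m))
Y = rng.standard_normal((d, n))
A, B = X.T @ X, X.T @ Y
Q, _ = np.linalg.qr(X)                     # orthonormal basis for span(X)
lhs = B.T @ np.linalg.solve(A, B)
rhs = (Q.T @ Y).T @ (Q.T @ Y)
print(np.allclose(lhs, rhs))               # True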
7 Conclusion
In this paper we have introduced a modification to the Nyström approximation to Normalized Cuts (Nyström-NCut) that does not require the measure of similarity between pairs of pixels to be a positive definite function. The proposed change involves separating the steps of the Nyström extension and orthogonalization. As this necessitates an additional O(m³) operation, where m is the number of samples used in the approximation, it is important to know whether a kernel is positive definite in order not to waste computation and sacrifice numerical precision unnecessarily. In light of this, we examined a number of kernels, both positive definite and indefinite, and showed image segmentation results using both versions of Nyström-NCut. In the process we have provided what we believe is the first proof that the Gaussian-weighted χ² kernel is positive definite. Finally, we provided some geometrical insight into the nature of the approximation for the case of positive definite kernels.
Acknowledgments. We wish to thank Olivier Chapelle and Gianluca Donato for helpful discussions. This research was supported in part by NSF Grant no. DMS 0100472 and Office of Naval Research grant no. N00014-01-1-0890, as part of the MURI program.
A  Proof of Positive Definiteness of $e^{-\chi^2_{ij}}$
We now prove that $e^{-\chi^2_{ij}}$ is positive definite. We begin by considering the $\chi^2_{ij}$ term by itself. Noting that $(h_i(k) - h_j(k))^2 = (h_i(k) + h_j(k))^2 - 4h_i(k)h_j(k)$, we can rewrite $\chi^2_{ij}$ as
$$\chi^2_{ij} = 1 - 2\sum_{k=1}^{K} \frac{h_i(k)h_j(k)}{h_i(k) + h_j(k)}$$
We wish to show that the matrix Q with entries given by
$$Q_{ij} = 2\sum_{k=1}^{K} \frac{h_i(k)h_j(k)}{h_i(k) + h_j(k)}$$
is positive definite. Consider the quadratic form $c^T Q c$ for an arbitrary finite nonzero vector c:
$$\begin{aligned}
c^T Q c &= \sum_{i,j=1}^{n} c_i c_j Q_{ij} \\
&= 2\sum_{k=1}^{K}\sum_{i,j=1}^{n} c_i c_j \frac{h_i(k)h_j(k)}{h_i(k)+h_j(k)} \\
&= 2\sum_{k=1}^{K}\sum_{i,j=1}^{n} c_i c_j h_i(k)h_j(k) \int_0^1 x^{h_i(k)+h_j(k)-1}\,dx \\
&= 2\sum_{k=1}^{K}\int_0^1 \sum_{i=1}^{n} c_i h_i(k)\, x^{h_i(k)-\frac{1}{2}} \sum_{j=1}^{n} c_j h_j(k)\, x^{h_j(k)-\frac{1}{2}}\,dx \\
&= 2\sum_{k=1}^{K}\int_0^1 \left(\sum_{i=1}^{n} c_i h_i(k)\, x^{h_i(k)-\frac{1}{2}}\right)^{\!2} dx \\
&> 0
\end{aligned}$$
Thus Q is positive definite. (Alternatively, one can show the positive definiteness of Q using properties of Hadamard products as follows. We begin by noting that Q can be written as a sum of K matrices of the form
$$\left[\frac{2x_i x_j}{x_i + x_j}\right]$$
where $x_i > 0$. That is to say, Q is a sum of matrices of harmonic means between all pairs of entries in $h_i(k)$ over all k. Using $\circ$ to denote the Hadamard (componentwise) product [2], this matrix can be rewritten as
$$\left[\frac{2x_i x_j}{x_i + x_j}\right] = [\,2x_i x_j\,] \circ \left[\frac{1}{x_i + x_j}\right]$$
The first matrix is positive definite since it is simply a constant times the outer product of x with itself. The second matrix is also positive definite since it is a Hilbert matrix [12]. By Schur's theorem [2], the Hadamard product of two positive definite matrices is also positive definite. Finally, since the sum of positive definite matrices is also positive definite [9], this establishes that Q is positive definite.)
Returning now to $e^{-\chi^2_{ij}}$, we note that it can be written as a positive constant times $e^{Q_{ij}}$. Since the exponential of a positive definite function is also positive definite [1], we have established that $e^{-\chi^2_{ij}}$ is positive definite.
References 1. C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer-Verlag, New York, 1984. 2. R. Bhatia. Matrix Analysis. Springer Verlag, 1997. 3. O. Chapelle, P. Haffner, and V. Vapnik. SVMs for histogram based image classification. IEEE Trans. Neural Networks, 10(5):1055–1064, September 1999. 4. F. R. K. Chung. Spectral Graph Theory. Number 92 in CBMS Regional Conference Series in Mathematics. AMS, 1997. 5. T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, 1991. 6. J. de Leeuw. Multidimensional scaling. UCLA Dept. of Statistics, Preprint no. 274, 2000. 7. C. Fowlkes, S. Belongie, and J. Malik. Efficient spatiotemporal grouping using the Nystr¨ om method. In Proc. IEEE Conf. Comput. Vision and Pattern Recognition, December 2001. 8. Yoram Gdalyahu, Daphna Weinshall, and Michael Werman. Stochastic image segmentation by typical cuts. In CVPR, 1999. 9. R.A. Horn and C.R. Johnson. Matrix Analysis. Cambridge Univ Press, 1985. 10. T. Leung and J. Malik. Contour continuity in region-based image segmentation. In H. Burkhardt and B. Neumann, editors, Proc. Euro. Conf. Computer Vision, volume 1, pages 544–59, Freiburg, Germany, June 1998. Springer-Verlag. 11. J. Malik, S. Belongie, T. Leung, and J. Shi. Contour and texture analysis for image segmentation. Int’l. Journal of Computer Vision, 43(1):7–27, June 2001. 12. R. Mathias. An arithmetic-geometric-harmonic mean inequality involving Hadamard products. Linear Algebra and its Applications, 184:71–78, 1993. 13. P. Perona and W. T. Freeman. A factorization approach to grouping. In Proc. 5th Europ. Conf. Comput. Vision, 1998. 14. J. Puzicha and S. Belongie. Model-based halftoning for color image segmentation. In ICPR, volume 3, pages 629–632, 2000. 15. J. Puzicha, T. Hofmann, and J. Buhmann. Non-parametric similarity measures for unsupervised texture segmentation and image retrieval. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 267–72, San Juan, Puerto Rico, Jun. 1997. 16. B. Sch¨ olkopf and A. Smola. Learning with Kernels. Cambridge, MA: MIT Press, 2001. in preparation. 17. B. Sch¨ olkopf, A. Smola, and K.-R. M¨ uller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998. 18. J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(8):888–905, August 2000. 19. Y. Weiss. Segmentation using eigenvectors: a unifying view. In Proc. 7th Int’l. Conf. Computer Vision, pages 975–982, 1999. 20. C. Williams and M. Seeger. Using the Nystr¨ om method to speed up kernel machines. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13: Proceedings of the 2000 Conference, pages 682–688, 2001.
A Framework for High-Level Feedback to Adaptive, Per-Pixel, Mixture-of-Gaussian Background Models
Michael Harville
Hewlett-Packard Laboratories, 1501 Page Mill Rd. ms 3U-4, Palo Alto, CA 94304, United States
[email protected]
Abstract. Time-Adaptive, Per-Pixel Mixtures Of Gaussians (TAPPMOGs) have recently become a popular choice for robust modeling and removal of complex and changing backgrounds at the pixel level. However, TAPPMOG-based methods cannot easily be made to model dynamic backgrounds with highly complex appearance, or to adapt promptly to sudden “uninteresting” scene changes such as the repositioning of a static object or the turning on of a light, without further undermining their ability to segment foreground objects, such as people, where they occlude the background for too long. To alleviate tradeoffs such as these, and, more broadly, to allow TAPPMOG segmentation results to be tailored to the specific needs of an application, we introduce a general framework for guiding pixel-level TAPPMOG evolution with feedback from “high-level” modules. Each such module can use pixel-wise maps of positive and negative feedback to attempt to impress upon the TAPPMOG some definition of foreground that is best expressed through “higher-level” primitives such as image region properties or semantics of objects and events. By pooling the foreground error corrections of many high-level modules into a shared, pixel-level TAPPMOG model in this way, we improve the quality of the foreground segmentation and the performance of all modules that make use of it. We show an example of using this framework with a TAPPMOG method and high-level modules that all rely on dense depth data from a stereo camera.
1 Introduction and Motivation
Many computer vision and video processing applications, in domains ranging from surveillance to human-computer interface to video compression, rely heavily on an early step, often referred to as "foreground segmentation" or "background removal", that attempts to separate novel or dynamic objects in the scene ("foreground") from what is normally observed ("background"). Recently, methods employing Time-Adaptive, Per-Pixel Mixtures Of Gaussians (TAPPMOGs) have become a popular choice for modeling scene backgrounds at the pixel level [4,5,6,8,10,12,15,16,20]. In these methods, the time series of observations at a given image pixel is treated as independent of that for all other pixels,
and is modeled using a mixture of Gaussians. The per-pixel models are updated as new observations are obtained, with older observations losing influence over time. At each time step, a subset of the Gaussians in each per-pixel model is selected as representative of the scene background, and current observations that are not well-modeled by those Gaussians are designated as foreground. We refer to a TAPPMOG as a “pixel-level” foreground segmentation method because it classifies the current scene image as foreground or background at each pixel using only image information that has been collected at or near that pixel over time. Many other pixel-level foreground segmentation techniques exist, but TAPPMOG methods have gained favor because, unlike most others, they possess the following combination of attributes: • TAPPMOGs slowly adapt their background model to persistent scene appearance modifications such as the insertion or removal of an object, the moving of an object such as a rolling swivel chair from one static position to another, and the change in global illumination that occurs when a light is turned on or when the sun passes behind clouds. While such changes may be of temporary interest, in some applications and hence may be an acceptable part of the foreground for some period of time, it is usually not desirable to pay special attention to them indefinitely. • TAPPMOGs are relatively proficient at modeling the largely repetitive scene appearance changes associated with dynamic objects such as moving foliage, ocean waves, or rotating fans. Although many applications are interested in dynamic objects, most would also like to disregard those whose behavior, when observed over long periods of time, does not seem to vary much. • TAPPMOGs are suitable for real-time software implementation. For applications such as interactive human-computer interfaces and long-running activitymonitoring systems, it is critical that the foreground segmentation process does not require excessive computational resources or time. Other proposed methods of pixel-level background modeling, such as ones based on Wiener prediction filtering [17] and Bayesian decision-making with normalized histogram probability representations [11], can be argued to also possess the above attributes. Little work has been done in comparing the pixellevel results of such methods with those of TAPPMOGs for the same input video sequences, but informal comparison of published results suggests that none of the alternatives perform significantly better than the best TAPPMOG variants. Nevertheless, the foreground produced by TAPPMOGs still falls far short of the ideal for many applications. TAPPMOGs struggle with performance tradeoffs in two areas in particular: Learning rate: The speed with which a TAPPMOG’s Gaussian mixtures begin to reflect more recent observations in favor of older ones is controlled by a parameter called the “learning rate”. A high learning rate allows the TAPPMOG to adapt its background model more quickly to sudden, persistent scene appearance changes, such as those caused by the turning on of a light or the
re-positioning of a chair, that are foreground distractions to applications concerned with analyzing the activities of people, automobiles, animals, or other dynamic, purposeful objects. Such distractions can severely compromise overall system performance if they are not accurately detected and compensated for by further - and often computationally expensive - processing. Unfortunately, a high learning rate also causes people who do not move enough to fade more quickly into the background. When a person is expected to remain in roughly the same location relative to the camera for extended time periods, as is the case in many vision-based human-computer interface applications, background adaptation is an especially challenging problem. Also, at a “high-traffic” scene location, such as at a busy retail store doorway, where many different foreground objects (in this case, people) frequently pass, a high learning rate will cause the background model to degrade more quickly into a reflection of some average properties of the passing foreground objects. Background model inclusivity: Although we may wish to treat as background those dynamic objects, such as foliage in wind, that exhibit somewhat recurrent patterns of motion, it becomes increasingly difficult to model these objects as their motion patterns include a greater number of significant frequency components, or as their visual textures, even when the objects are static, grow more complex. In such cases, TAPPMOGs usually require more Gaussians to adequately model the distribution of observations produced by the object at a single pixel over time. As more Gaussians are included in the background model, however, it becomes more likely that some representing the desired foreground will also be selected, thereby resulting in erroneous foreground omissions. For instance, it is difficult to determine, using only local image statistics, whether a small number of observations, modeled by one Gaussian, correspond to a foreground person who passed by quickly or to an occasionally recurring motion dynamic of a background object. One might attempt to address these tradeoffs through modifications of the basic TAPPMOG methods, or by choosing an entirely different framework for pixel-level background modeling. It seems likely, however, that TAPPMOGs already approach the limit of what can reasonably be accomplished through consideration of local information alone. Each of a TAPPMOG’s per-pixel Gaussian mixtures can, with sufficient components, accurately describe the appearance statistics of arbitrarily complex backgrounds, and can be modified very quickly or very slowly, as needed, by simple, well-principled processes. What is lacking is more intelligent guidance of the mixtures’ evolution, and it would seem wise to drive this guidance with analysis that operates on “higher” levels of information, such as image regions or entire frames, or object and event semantics. Put another way, in applications for which the definition of the ideal foreground segmentation depends on concepts such as region cohesiveness, frame-wide illumination changes, or object classification, we should not expect to produce this segmentation by relying solely on methods that consider each pixel in isolation. In this paper, therefore, we extend TAPPMOGs in a general way to incorporate corrective guidance from analysis concerned with higher level primitives such
as image regions, image frames, or object semantics. This “high-level feedback” allows for many significant improvements in background modeling, including: 1. The rapid incorporation into the background model of appearance changes associated with events that are uninteresting to the application. The definition of “uninteresting” can be application-dependent: some may want to ignore rapid, global illumination changes, while others may want to ignore an automobile after its driver has stopped and exited it. Almost any such definition can be accomodated through feedback from an appropriate high-level module. 2. Good segmentation of foreground objects of interest, even when they stop moving or occlude the background for indefinitely long times. Again, the definition of what foreground is “interesting” may be application-dependent, and most definitions can be serviced with a proper choice of high-level feedback. 3. Much better exclusion from the foreground of dynamic objects that move in repetitive ways. Higher levels of processing can detect the basic TAPPMOG methods’ failures to model such objects, and then feed back information on how to adjust the pixel-level model to prevent re-occurrence of the problems. More broadly, feedback enables higher levels of processing to tailor the general-purpose pixel-level segmentation produced by TAPPMOGs to the specific definition of “foreground” pertinent to an application. This, in turn, should improve the performance of all high-level modules that depend on the TAPPMOG segmentation, whether or not those modules are providing feedback. A number of other methods, including those of [2,10,13,17], can be interpreted as using high-level feedback to guide pixel-level foreground segmentation. All of these methods, however, modify their pixel-level background modeling only in response to specific types of high-level processing, such as the detection of people or illumination changes. None describe how to generalize the feedback influence to arbitrary forms of high-level analysis. Furthermore, with the exception of [10,17], they use pixel-level statistical models that are too simple to represent complex, dynamic backgrounds. We believe our method represents a novel and appealing combination of powerful, per-pixel background modeling with a flexible, general-purpose mechanism for enhancement via high-level input. In Section 2, we describe a particular variant of the basic TAPPMOG method in greater detail, and provide examples of its successes and shortcomings on a challenging test sequence. In Section 3, we discuss our framework for augmenting basic TAPPMOG methods with corrective feedback from high-level processing modules. Finally, in Section 4, we illustrate how several specific types of feedback can dramatically improve pixel-level foreground segmentation results.
2 Background Removal at the Pixel Level
Harville et al. [8] have recently described a TAPPMOG-based method that, under the assumption of a static camera, seeks to segment novel objects in a scene, as well as non-novel objects that begin behaving in novel ways. It attempts to
ignore phenomena such as illumination changes, shadows, inter-reflections, persistent scene modifications, and dynamic backgrounds. This definition of "foreground" is consistent with a wide variety of vision applications, and hence the method is of broad utility. By modeling the scene in the combined feature space of per-pixel depth and luminance-normalized color, the method better adheres to this definition than most other choices for pixel-level background removal, and it better avoids foreground omissions due to color "camouflage" (similarity of appearance between a novel object and the background behind it). The method's reliance on dense (per-pixel) depth has become more practical in recent years as hardware and software for stereo cameras has become much faster and cheaper [9,14,18]. We briefly review the method of [8] here, and show some example results. The results contain many successes, but also the types of failures discussed in Section 1 that motivate our use of high-level feedback.
2.1 Online Clustering of Observations
The pixel-level background modeling method of [8] regards the time series of observations at each pixel as an independent statistical process. Each pixel observation consists of a color and a depth measurement. Color is represented in the YUV space, which allows for separation of luminance and chroma. Depth measurements, denoted as D, are produced by a real-time stereo camera implementation, but could also be computed by methods based on active illumination, lidar, or other means. The observation at pixel i at time t can be written as $X_{i,t} = [\,Y_{i,t}\;\, U_{i,t}\;\, V_{i,t}\;\, D_{i,t}\,]$. The history of observations at a given pixel, $[X_{i,1}, \ldots, X_{i,t-1}]$, is modeled by a mixture of K Gaussian distributions. K is the same for all pixels, typically in the range of 3 to 5. The probability of the current observation at pixel i, given the model built from observations until the prior time step, can be estimated as
$$P(X_{i,t} \mid X_{i,1}, \ldots, X_{i,t-1}) = \sum_{k=1}^{K} w_{i,t-1,k}\, \eta(X_{i,t}, \mu_{i,t-1,k}, \Sigma_{i,t-1,k}) \qquad (1)$$
where η is a Gaussian probability density function, where wi,t−1,k is the weight associated with the k th Gaussian in the mixture at time t − 1, , and where µi,t−1,k and Σi,t−1,k are the mean Y U V D vector and covariance matrix of this k th Gaussian. The weights wi,t−1,k indicate the relative proportions of past observations modeled by each Gaussian. A diagonal covariance matrix is used. For notational simplicity, we will denote the kth Gaussian of a mixture as ηk . To update a pixel’s mixture model as new observations are obtained over time, an on-line K-means approximation similar to that of [16] is used. When a new observation X i,t at a given pixel is received, an attempt is made to find a match between it and one of the Gaussians ηk for that pixel. If a matching ηk is found, its parameters are adapted using the current observation; if not, one of the Gaussians is replaced with a new one that represents the current observation. The matching process is carried out by first sorting the Gaussians
in a mixture in order of decreasing weight/variance, and then selecting as a match the first ηk whose mean is sufficiently near X i,t . A match between X i,t and ηk is allowed if each squared difference between corresponding components of X i,t and the mean µi,t−1,k of ηk is less than some small multiple β of the corresponding ηk component variance. The parameter β is typically chosen to be about 2.5, so that the boundary of the matching zone in YUVD-space for ηk encompasses over 95% of the data points that would be drawn from the true Gaussian probability density. This basic matching method is modified, however, to account for the possibility of unreliable, or “missing”, chroma or depth data. At low luminance, the chroma components (U and V) of the color representation become unstable, so chroma information is not used in comparing the current observation X i,t with the mean of Gaussian ηk when the luminance of either falls below a threshold Ymin . Similarly, because stereo depth computation relies on finding small area correspondences between image pairs, it does not produce reliable measurements in regions of little visual texture and in regions, often near depth discontinuities in the scene, that are visible in one image but not the other. Most stereo depth implementations attempt to detect such cases and label them with one or more special values, which can be denoted collectively as invalid. If the depth of X i,t is invalid, or if too many (more than some fraction ρ) of the observations that contributed to the building of ηk , as described below, had invalid depth, depth is omitted in comparing X i,t and ηk . If no mixture Gaussian is found to match X i,t , the Gaussian ranked last in weight/variance is replaced by a new one with mean equal to X i,t , a high variance, and a low weight. Otherwise, the parameters of the matching Gaussian ηk are recursively adapted toward X i,t . The mean is updated as follows: µi,t,k = (1 − α)µi,t−1,k + αX i,t
(2)
The variances are updated analogously, using the squared differences between corresponding components of X i,t and µi,t,k . Variances are not allowed to fall below some minimum value, so that matching does not become unstable in scene regions that are static for long time periods. The parameter α can be interpreted as the learning rate discussed in Section 1: as α is made smaller, the parameters of ηk will be perturbed toward new observations in smaller incremental steps. The weights for all Gaussians are updated according to wi,t,k = (1 − α)wi,t−1,k + αMi,t,k
(3)
Mi,t,k is 1 (true) for the ηk that matched the observation and 0 (false) for all others, so (3) causes the weight of the matched ηk to increase and all other weights to decay. Shadows and other illumination effects are specifically addressed in two ways. First, the variances for the luminance components of the Gaussians are not allowed to decrease below a substantial floor level. Since the luminance component of an object’s apparent color is typically more impacted than its chroma by shadows and other illumination changes, representation of this luminance with a higher variance allows it to be matched to current observations of the object
under a wider variety of lighting conditions. Second, where the current depth observation and the depth statistics of some Gaussian $\eta_k$ are both reliable and are a match, the color matching criterion is relaxed by increasing the matching tolerance β. This helps in ignoring phenomena such as strong shadows, for which the shape of the background matches the current observations, but for which the color is significantly different. This second technique also helps model dynamic background objects, like video displays and moving foliage, whose color is highly variable but whose shape, at least at a coarse level, remains relatively constant. Finally, the learning rate α is greatly reduced at all pixels at which a scene "activity" level A is above a threshold H. A is computed independently at each pixel as a temporally-smoothed inter-frame luminance difference:
$$A_{i,t,k} = (1 - \lambda)A_{i,t-1,k} + \lambda\,|Y_{i,t} - Y_{i,t-1}| \qquad (4)$$
This helps reduce the influence of dynamic foreground objects such as people on the model of the observation history, without slowing adaptation of the model to persistent changes such as a moved chair or increased lighting.
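The update rules of this subsection can be summarized for a single pixel as in the sketch below (Python, not from [8]). The chroma/depth validity tests, the shadow-related relaxations, and the activity-dependent reduction of α (Eq. 4) are omitted; the initial variance and weight given to a replacement Gaussian are assumed values, since the text only calls them "high" and "low".

import numpy as np

def update_pixel_mixture(x, means, variances, weights, alpha=0.0006, beta=2.5):
    """One time step of the per-pixel mixture update (Eqs. 1-3), simplified.
    means, variances: (K, 4) arrays over [Y, U, V, D]; weights: (K,)."""
    K = len(weights)
    # Rank Gaussians by weight/variance, as in the matching step.
    order = np.argsort(-weights / variances.mean(axis=1))
    matched = None
    for k in order:
        # Match if every squared component difference is within beta variances.
        if np.all((x - means[k]) ** 2 < beta * variances[k]):
            matched = k
            break
    M = np.zeros(K)
    if matched is None:
        # Replace the lowest-ranked Gaussian with one centred on x.
        k = order[-1]
        means[k] = x
        variances[k] = np.full(x.shape, 100.0)   # "high" initial variance (assumed value)
        weights[k] = 0.05                        # "low" initial weight (assumed value)
    else:
        M[matched] = 1.0
        means[matched] += alpha * (x - means[matched])                            # Eq. (2)
        variances[matched] += alpha * ((x - means[matched]) ** 2 - variances[matched])
    weights[:] = (1.0 - alpha) * weights + alpha * M                              # Eq. (3)
    return means, variances, weights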
2.2 Background Model Estimation and Foreground Segmentation
At each time step, one or more of the Gaussians in each per-pixel mixture are selected as the background model, while any others are taken to represent foreground. The current observation at a pixel is designated as foreground if it was not matched to any of the ηk in the pixel's current background model. Background Gaussians are selected at each pixel according to two criteria. First, among the Gaussians with reliable depth statistics (those for which the fraction of observations modeled that have valid depth exceeds a threshold ρ) and whose normalized weight w̄k = wk / Σk wk exceeds a low threshold TD , the ηk with the largest depth mean is selected. This criterion is based on the idea that, in general, one does not expect to be able to see through the background. The threshold TD discourages the selection of a background Gaussian ηk based on spurious or transient observations. TD is typically set around 0.1 to 0.2, so that an ηk representing the true background may be selected even when this background is usually occluded or has not been visible for an extended time. At pixels where the true background usually does not produce reliable depth data, no Gaussian will satisfy the above criterion. For correlation-based stereo, this commonly occurs for walls and floors in the scene, which tend to have low visual texture. Furthermore, for dynamic background objects, the background is often best represented by multiple Gaussians, not just the one selected by the above criterion. The second criterion for selecting background Gaussians addresses both of these issues by adding ηk , in order of decreasing weight/variance, to the background model until the total weight of all selected Gaussians exceeds a second threshold T. This second criterion is typically the sole criterion used by TAPPMOG methods that have no access to depth data, and it favors the selection of Gaussians representing consistent, frequently observed modes of scene appearance. The parameter T helps determine the level of “background model inclusivity” discussed in Section 1: as T is increased, more of a pixel's observation history Gaussians are likely to be selected as background.
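A compact sketch of the two selection criteria, under the assumption that the per-Gaussian fraction of valid-depth observations is tracked alongside the mixture (the names and the array layout, with depth as the fourth component of each mean, are illustrative):

```python
import numpy as np

def select_background(means, variances, weights, valid_depth_frac,
                      rho=0.2, T_D=0.2, T=0.4):
    """Pick the set of background Gaussians for one pixel (Section 2.2).

    means[:, 3] holds the depth component; valid_depth_frac[k] is the fraction of
    observations behind Gaussian k that carried valid depth."""
    w = weights / weights.sum()                       # normalized weights
    background = set()
    # Criterion 1: deepest reliable Gaussian whose normalized weight exceeds T_D.
    reliable = [k for k in range(len(w)) if valid_depth_frac[k] > rho and w[k] > T_D]
    if reliable:
        background.add(max(reliable, key=lambda k: means[k, 3]))
    # Criterion 2: add Gaussians in decreasing weight/variance order until total weight > T.
    order = np.argsort(-weights / variances.mean(axis=1))
    total = sum(w[k] for k in background)
    for k in order:
        if total > T:
            break
        if k not in background:
            background.add(k)
            total += w[k]
    return background
```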
Fig. 1. Comparison of results for the pixel-level segmentation method of [8] (bottom row) to those for a common alternative (middle row) that uses an RGB input space rather than YUVD. Input color (top row) and depth (not shown) were captured using the Tyzx real-time stereo camera head [18]. Locations of several of the test sequence challenges are indicated with text in middle row images. Parameter values were K = 4, α = 0.0006, β = 2.5, ρ = 0.2, Ymin = 16, T = 0.4, TD = 0.2, λ = 0.09, H = 5. Small, isolated foreground regions, and small holes within foreground, were removed by applying an area threshold to the results of connected-components analysis.
2.3 Experimental Results
The performance of the pixel-level foreground segmentation method of [8] was evaluated on a challenging color-and-depth test sequence captured by the Tyzx stereo camera head [18]. The camera head makes use of special-purpose hardware, based on [19], for computing depth by correlation of Census-transformed images [21]. The camera can produce spatially-registered, 320x240-resolution color and depth images at rates up to 60Hz; the test imagery was saved to files at a rate of 15Hz. The test sequence is 10 minutes long, with no image devoid of “foreground” people. It contains several dynamic background objects, namely several video displays (toward the upper left of the images) and a sign rotating about a vertical axis at about 0.5Hz (upper middle of images, sitting on oval-shaped table). During the first half of the sequence, two displays (“display1” and “display2”) are active and one (“display3”) is off, while two people walk around the room. Near the sequence midpoint, the chair in the lower left of the image is moved to a new floor position, “display2” is switched off, “display3” is switched on, and several more people enter the scene. One of these people stands in the middle back part of the room for the rest of the sequence, sometimes shifting his weight or moving his arms. The other new people walk around continuously in the lower right part of the images, creating a “high-traffic” area.
Figure 1 compares results for the method of [8] with those for a more standard TAPPMOG technique that uses an RGB input space rather than YUVD. The depth-aided method better eliminates shadows, and better excludes the video displays and rotating sign from the foreground. Segmentation of people in the high-traffic area is improved because Gaussians representing the floor are usually selected as background, on the basis of depth, despite the more frequently observed, but closer, people. Finally, depth allows better segmentation of people colored like the background, resulting in fewer foreground holes due to color camouflage. Still, the results of this pixel-level method are far from perfect. Although the video displays and rotating sign do not appear in the result frames in Figure 1, they fail to be excluded from the foreground in a significant fraction of other frames. The relatively static person at the back of the room contains substantial foreground holes after he has been there for about 3 minutes. It is difficult to extend this time without further compromising the modeling of the dynamic background objects in the scene. Adaptation to the moving of the chair requires about 2 minutes, and cannot be shortened without causing all of the above problems to worsen. A rapid illumination change would cause almost all of the scene to appear as foreground until adaptation occurs. Hence, there remains great motivation for improvement of the method, and we believe this is best achieved by incorporating, as presented in the following sections, analysis that is concerned with more than just spatially local, per-pixel statistics.
3 Feedback for Background Correction
We extend the TAPPMOG modeling framework to make use of a wide variety of feedback computed by modules that consider image regions or frames, classify objects, or analyze other scene properties above the per-pixel level. Each module computing this feedback must satisfy two requirements. First, it must have some ability to detect foreground segmentation successes and/or failures according to some definition in terms of “higher” concepts than pixel-level statistics. The module's detection ability does not have to be perfect, and the module's definition of foreground does not need to encompass all of the application's demands. Second, the module must be able to provide maps that designate which pixels in a given input frame are responsible for cases that satisfy this definition. We make use of two types of feedback: 1) positive feedback, which serves to enhance correct foreground segmentations, and 2) negative feedback, which aims to adjust the pixel-level background model in order to prevent the recurrence of detected foreground mistakes. The feedback interface between the TAPPMOG background model and the higher levels consists of two bitmaps, representing pixels where positive and negative feedback, respectively, should be applied. We denote these maps as P and N , respectively. In a system that uses multiple high-level modules to generate feedback maps, we will need some mechanism for combining these maps and resolving conflicts between them in order to produce the two bitmaps.
Fig. 2. Overview of framework for guiding TAPPMOG with high-level feedback.
We would like to take an approach that allows strong positive evidence to override negative evidence (or vice versa), permits individual modules to generate both positive and negative feedback, enables multiple forms of relatively weak feedback to support each other when none are convincing in isolation, and allows us to refine the feedback bitmaps by cancelling out portions of them where conflicting information exists. We achieve this by implementing each high-level module so that it generates not a feedback bitmap, but rather a map of real numbers, where positive numbers reflect confidence that the segmented pixel should be part of the foreground, and negative numbers reflect the opposite. We then pixel-wise add together the feedback maps generated by all high-level modules, and threshold the summation map twice, at 1 and -1, to produce P and N . This method allows us to factor the relative confidences associated with various high-level decisions into the final choice of corrective action to take at each pixel. The methods by which positive and negative feedback influence pixel-level background modeling are different, and are discussed in the subsections below. These methods are suitable for use not just with the TAPPMOG modeling scheme described in Section 2, but with most other TAPPMOG-based methods. Figure 2 summarizes the feedback process.
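The map combination itself is simple; a minimal sketch, assuming each module returns a full-frame array of real-valued votes:

```python
import numpy as np

def combine_feedback(maps):
    """Sum the real-valued feedback maps from all high-level modules and
    threshold the result at +1 / -1 to obtain the bitmaps P and N."""
    total = np.sum(maps, axis=0)
    P = total >= 1.0    # positive feedback: protect these pixels
    N = total <= -1.0   # negative feedback: treat these pixels as background errors
    return P, N
```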
3.1 Positive Feedback
The goal of positive feedback is to prevent observations associated with correctly segmented foreground from being incorporated into the pixel-level background model. In the test sequence described in section 2.3, the two cases for which this would be most helpful are those of the relatively static person toward the back of the room, and the high-traffic area in the lower right of the frame. Our implementation of positive feedback is quite simple. First, one or more high-level modules detect correct foreground segmentation results, by one or more definitions, and contribute positive feedback at all pixels responsible for these results. This feedback propagates to the bitmap P as described above.
Next, for all pixels at which P = 1, we do not use the current pixel observation to update the Gaussian mixture model of the observation history. This results in no change in the background model at those pixels, and prevents the foreground objects from becoming part of the background over time. It is relatively unimportant that P be precise at the pixel level. Pixels mistakenly omitted from P cause some portions of true foreground to be incorporated into the observation history model. Extra pixels included in P cause the true background to be learned slightly more slowly. In both cases, the same error must repeat many times before the effect is significant.
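A minimal sketch of this gating, assuming a per-pixel update callback such as the mixture update sketched in Section 2.1 (the function and argument names are illustrative):

```python
def masked_update(frame, P, update_pixel):
    """Update the observation-history model only where no positive feedback was given."""
    H, W = P.shape
    for i in range(H):
        for j in range(W):
            if not P[i, j]:
                update_pixel(i, j, frame[i, j])   # e.g. the mixture update of Section 2.1
```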
3.2 Negative Feedback
An erroneous inclusion in the segmented foreground is, by definition, something that we would prefer to be well-described by the background model. The goal of negative feedback, therefore, is to adjust the background model so that it better describes such errors, without disrupting its ability to describe other aspects of the background. We achieve this with a combination of two processes: TAPPMOG error modeling: We model the distribution of observations associated with foreground errors at each pixel using almost exactly the same TAPPMOG process, described in Section 2.1, that is employed for modeling the full observation history. We denote the two TAPPMOG models of the observation history and the foreground errors as O and E, respectively. As described above, O receives the camera input directly, except where positive feedback, as indicated by P, occurs. E, on the other hand, receives only the portions of the camera input thought to be associated with foreground inclusion errors (background modeling failures), as indicated by the negative feedback bitmap N . No update of E occurs at pixels for which N contains a zero. A TAPPMOG is used to model foreground errors because the distribution of observations associated with errors at a given pixel can, at worst, be as complex as observation distributions for highly dynamic, variable backgrounds. Model merging: We periodically merge the error model E into the observation history model O, in the hope that changes in O will propagate into the subset of Gaussians selected as background. We merge, rather than replace O with E, because portions of O may still be accurate and useful. This is particularly true when the errors result from inadequate modeling of dynamic background objects. In general, errors occur because too little evidence in the observation history supports building an accurate model of them with sufficient weight to be chosen as background. Hence, the merging process aims to boost the relative proportion of evidence corresponding to things that were incorrectly omitted from the background model, without obliterating other highly-weighted evidence. Because the maximum complexities of what may be modeled by O and E are similar, we use mostly the same parameters for each. The main exception is that we use a higher learning rate, which we denote as αe , for E. Because error examples may be presented to this TAPPMOG rather infrequently, error model
means and variances might converge very slowly if we were to use the same learning rate as for O. In addition, from equation (3), we see that a higher learning rate will cause the weights associated with E to increase more quickly. When O and E are merged as described below, these higher weights help compensate for the under-representation of the errors in the observation history. While update of E with error information is done for every frame, we merge E with O at a low rate θ in the range of 0.2-1.0Hz, to avoid excessive computational cost. When merging the Gaussian mixtures of O and E at a particular pixel, we do not simply make a new mixture that has one mode for each of the modes in the two original mixtures, since this would cause the complexity of O to grow without bound. Instead, we seek to keep the number of Gaussians at each pixel in O at or below the limit K. A well-principled way to merge the two mixtures under this constraint would be to convert each to a histogram representation, and then use an iterative Expectation-Maximization method to fit a mixture of K Gaussians to the sum of the two histograms. This would be a rather costly procedure, particularly as the dimensionality of the observation feature space increases, so we instead adopt the following, more approximate approach that allows for real-time implementation:
1. Merging is performed at a pixel only if the total weight of the error Gaussians at that pixel exceeds a threshold κ. As κ is decreased, less error evidence is needed to trigger a modification of O, and each modification will have smaller impact. For larger κ, errors tend to be better modeled before they are merged into O, and individual merges occur less frequently but are more influential.
2. To merge E and O at a pixel, we first attempt to find observation history Gaussians ηko whose variance parameters may be increased to include the spaces spanned by nearby error model Gaussians ηje . Expansion of ηko to include ηje is permitted if their means are separated by less than β times the sum of their standard deviations. A single ηko may be expanded to include more than one ηje . When merging two Gaussians, we add their weights, set the new mean equal to their weighted sum, and select the minimum variance large enough to cause all points that would have matched either one of the original Gaussians to also match the merge result.
3. If an error Gaussian ηje is not near any ηko , we substitute ηje for the “poorest” observation model Gaussian ηko - namely that with the lowest weight/variance ratio - provided that we would not be replacing one Gaussian with another supported by far less evidence. This latter criterion is enforced by ensuring that the ratio of the weight of ηje to that of ηko is above a threshold minweightratio.
4. Gaussians ηje that are merged in either of the prior two steps are removed from E, while the others remain in it and will continue to evolve over time.
5. After merging is completed, we normalize the weights of the observation Gaussians so that they add up to their sum prior to the merge. This prevents the scale of these weights from fluctuating depending on the rate at which segmentation errors are occurring.
6. To prevent the accumulation of errors over arbitrarily long times, all weights for error Gaussians not merged into the observation history model are decayed by a multiplicative factor 0 < τ < 1. Values of τ at the higher end of this range allow for errors to be accumulated over longer time periods without being disregarded as noise. Examination of the negative feedback process reveals that it is relatively unimportant for N to be precisely correct at the pixel level. If N extends slightly beyond the true bounds of some erroneous foreground inclusion, the result will usually be the addition to E of further evidence to support the current background model. If N fails to include some of the pixels associated with a true error, E will just build up a little more slowly at those locations.
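To illustrate steps 2 and 3 of the merge, here is a hedged sketch for a single error Gaussian at one pixel; the bookkeeping of steps 1, 4, 5, and 6 (the κ trigger, removal from E, weight renormalization, and decay by τ) is omitted, and the dictionary layout and names are assumptions:

```python
import numpy as np

def merge_error_gaussian(obs, err_k, beta=2.5, min_weight_ratio=0.1):
    """Merge one error-model Gaussian into the observation-history mixture (Section 3.2).

    obs: dict with 'means' (K,d), 'vars' (K,d), 'weights' (K,);
    err_k: tuple (mean vector, variance vector, weight)."""
    me, ve, we = err_k
    means, vars_, weights = obs['means'], obs['vars'], obs['weights']
    # Step 2: try to expand a nearby observation Gaussian so that it covers the error Gaussian.
    dist = np.abs(means - me)
    near = np.where(np.all(dist < beta * (np.sqrt(vars_) + np.sqrt(ve)), axis=1))[0]
    if len(near):
        k = near[0]
        w = weights[k] + we
        new_mean = (weights[k] * means[k] + we * me) / w   # weight-weighted mean of the two means
        # Smallest per-component spread that still covers the matching zones of both originals.
        spread = np.maximum(np.abs(means[k] - new_mean) + beta * np.sqrt(vars_[k]),
                            np.abs(me - new_mean) + beta * np.sqrt(ve))
        means[k], vars_[k], weights[k] = new_mean, (spread / beta) ** 2, w
        return True
    # Step 3: otherwise replace the poorest observation Gaussian, if the error weight is not far smaller.
    poorest = np.argmin(weights / vars_.mean(axis=1))
    if we / weights[poorest] > min_weight_ratio:
        means[poorest], vars_[poorest], weights[poorest] = me, ve, we
        return True
    return False
```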
4 Examples of Feedback Generation and Usage
In this section, we illustrate the beneficial effects of using the feedback correction framework of Section 3 to connect two types of “high-level” analysis modules to the TAPPMOG method of Section 2. Since these are just two of the infinite number of high-level modules that could be employed in our framework, and since the emphasis of this paper is on the feedback itself, and not on the high-level analysis that generates it, we omit most of the algorithmic details of the high-level modules. However, we note that, like the TAPPMOG method of Section 2, both modules make interesting use of dense depth imagery. One of these modules is a person detection and tracking method. We use the results of this method to produce feedback that enhances the TAPPMOG's segmentation of people, and helps it to ignore all else. Positive feedback is generated for image regions where the person tracker believes people are present, while all other foreground pixels are assumed not to pertain to people, and are associated with negative feedback. Our person tracking method is described in [7], and has similarities to the methods of [1,3,18]. In brief, it uses depth data to create overhead, “plan-view” images of the foreground produced by the TAPPMOG of Section 2, and then uses templates and Kalman filtering to detect and track people in these images. For each tracked person, positive feedback (with a value of 1) is generated at all pixels within the camera-view bounding box of the set of pixels that contributed to the person's plan-view image representation. This generally causes some true background pixels to be incorrectly labeled with positive feedback, but, as discussed in section 3.1, it is generally harmless when feedback maps are imprecise in this way. The overall positive feedback map is produced by summing the maps generated for the individual people. Negative feedback (with a value of -1) is generated at all foreground pixels not inside any of the individual bounding boxes. Note that other person detection and tracking methods, including ones that make no use of depth, can be substituted here, provided that the methods label the approximate image locations of people in the scene with a reasonable degree of reliability. In general, it is preferable that the person tracking methods produce false positives (image locations incorrectly labeled as people) rather than false negatives (person detection failures).
Fig. 3. Pixel-level segmentation results with (third row) and without (second row) high-level feedback correction. Some significant differences are circled in second row; second row also contains some object labels. Input color (top row) and depth (not shown) were captured using the Tyzx real-time stereo camera head [18]. Feedback maps (bottom row) show N in white, P in gray, and “no feedback” in black. TAPPMOG parameters are the same as for Figure 1, but no cleanup of isolated foreground pixels or holes has been done. Feedback parameters: κ = 0.3, αe = 0.1, τ = 0.95, θ = 1.0, minweightratio = 0.1.
This is because repeated failures to detect a person at a particular image location may produce strong negative feedback that causes the person's appearance to be quickly incorporated into the background model. Repeated misclassification of an image region as a person, on the other hand, merely slows adaptation of the background model to changes in that region. Our second module detects rapid changes in global illumination, camera gain, or camera position. When it detects any of these events, it produces negative feedback (with a value of -1) at all current foreground locations so that the TAPPMOG will quickly update itself to reflect the new scene appearance. The feedback is generated not just at the event onset, but for a time window long enough to allow for good TAPPMOG adaptation to the changes. The module decides that one of these events may have occurred whenever the TAPPMOG suddenly produces a large amount of foreground that is well-distributed about the image. Shape information, from the depth data, is then used to discriminate these events from the possibility that a foreground object has closely approached the camera, occupying most of its field of view. In this case, we do not want to rapidly update the background model. The final feedback bitmaps P and N are produced by thresholding the summed feedback from the two modules at +1 and -1, respectively.
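As an illustration of the trigger described above, a possible check for “a large amount of well-distributed foreground” might look as follows; the fraction threshold and the coarse grid test are assumptions, and the depth-based shape check that the paper uses to rule out a close foreground object is omitted:

```python
import numpy as np

def looks_like_global_change(foreground_mask, frac_thresh=0.7, grid=4):
    """Heuristic check: a large amount of foreground, well distributed over the image."""
    H, W = foreground_mask.shape
    if foreground_mask.mean() < frac_thresh:
        return False
    # Require foreground in most cells of a coarse grid so that the foreground is not
    # concentrated in one region of the image.
    cells = [foreground_mask[i * H // grid:(i + 1) * H // grid,
                             j * W // grid:(j + 1) * W // grid].any()
             for i in range(grid) for j in range(grid)]
    return np.mean(cells) > 0.9
```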
In Figure 3, we compare foreground segmentation results with and without this feedback for the same Tyzx test sequence described in Section 2.3. Several important differences are evident. First, without feedback, the relatively static person toward the back of the room (upper-middle of images) begins to fade into the background after less than 90 seconds of standing at his position (see figure column 4). After a couple more minutes, he becomes difficult to separate from noise, so that tracking and any other analysis of this person becomes very challenging. However, when feedback from the person-tracker is used to prevent background model update at his location, he is well-segmented as foreground for the entire five minutes that he stands at his position, and would remain so indefinitely if the test sequence were longer. This is achieved without sacrificing model adaptation to other scene changes such as the moving of the chair near the midpoint of the sequence. The moved chair causes temporary foreground errors for both methods (see figure column 2). With feedback, these errors are corrected in about 2 seconds, without disrupting segmentation elsewhere. Without feedback, chair errors linger for nearly two minutes, during which the person-tracker must detect and ignore them. Inclusion of dynamic foreground objects such as the video displays and the rotating sign was virtually eliminated by using negative feedback. Early in the sequence, negative feedback is generated when the sign and displays occasionally appear in the foreground (figure column 1), until they stop appearing. When connected-components analysis is used to extract “significant” foreground objects, the rotating sign appears in no frames beyond the 2-minute mark of the sequence. In contrast, without feedback, the sign appears in 74% of foreground frames beyond the 2-minute mark, including all frames in Figure 3. Similarly, in both sets of results, “display3” sometimes appears in the foreground soon after it is turned on around the 5.5-minute mark. After 2 minutes pass, significant parts of “display3” appear in only 5% of the foreground frames for the feedback-aided method, in contrast to 68% without feedback (figure columns 4 and 5). The foreground noise levels for the two methods are noticeably different toward the later frames of the test sequence. Without feedback, the variances in the TAPPMOG model drop to very low levels over time, so that imager noise frequently results in color or depth measurements that exceed the tolerance for matching to the background model. With feedback, the system learns to increase these variances where they become problematically small. The lower noise levels result in more cleanly-segmented foreground regions, and less higher-level processing and analysis dedicated to the removal of image clutter. Segmentation results with and without feedback are compared for a simulated change in global illumination in Figure 4. The lighting change was simulated by applying a gamma correction of 2.0 to a one-minute portion of the test sequence of Section 2.3, beginning at the 1000th frame (a little over a minute into the sequence). For both methods, almost all of the image appears as foreground immediately after the illumination change (see figure column 2). Without feedback, this condition persists for over 30 seconds while the background model adapts. 
The adaptation time could be reduced by increasing the TAPPMOG learning rate, but this would further lessen the time that a person, such as the one who appears near the back of the room later in the sequence, can remain relatively still before becoming part of the background.
Fig. 4. Pixel-level segmentation results with (third row) and without (second row) high-level feedback correction, for a simulated rapid illumination change. Input color (top row) and depth (not shown) were captured using the Tyzx real-time stereo camera [18]. Feedback maps (bottom row) show N in white, P in gray, and “no feedback” in black. TAPPMOG parameters are the same as for Figure 1, and feedback parameters are the same as for Figure 3. No cleanup of isolated foreground or holes has been done.
In contrast, when feedback is used, the illumination change is detected and causes negative feedback maps covering most of the frame to be generated for a short period of time. One such map is shown in figure column 2; note that the negative feedback is canceled where the person-tracker module estimates that people are present. Within 2 seconds, the background model is almost completely repaired, except where people occluded the background during the correction period (see figure column 3). When these regions are disoccluded, the person tracker module identifies them as not being people, and generates further, more localized negative feedback that repairs the background model here over the next 2 seconds (see figure columns 3-5). It is important to note that the correction of most of the background model in under 2 seconds is fast enough to allow person tracking methods to continue to operate through rapid, frame-wide changes. By using prior history and probabilistic methods to estimate person locations during a brief interruption of reliable measurements, tracking can recover gracefully when the background is repaired quickly. Obviously, continuity of tracking is much less likely when an interruption lasts for more than 30 seconds, as was the case without feedback.
5 Conclusions
Although TAPPMOG modeling is one of the leading choices for general, robust pixel-level foreground segmentation, we have shown that this robustness can be greatly improved, and tailored to the needs of a specific application, by applying a general framework for guiding the TAPPMOG process with corrective feedback from high-level system modules. Rather than seek to assess high-level semantics using low-level statistics, this framework encourages each module to focus on tasks appropriate to it. The pooling of feedback influences from different high-level modules into a common TAPPMOG also creates a mechanism for high-level modules to share knowledge of errors they detect so that all modules that make use of the pixel-level segmentation may benefit. Finally, feedback can provide a means for fast background adaptation in response to selected events, thereby making it more feasible to slow the learning rate α for the basic TAPPMOG modeling process. The basic TAPPMOG process would therefore better reflect longer-term, persistent features of the scene, and would be less susceptible to influence from foreground objects that remain static too long. Because the most compute-intensive part of the feedback framework - namely, the error-modeling component - is itself based on TAPPMOG principles, any effort spent on speeding up TAPPMOG background modeling implementations (e.g. in MMX or hardware) can also be leveraged to make feedback correction a relatively light-weight addition. Also, the similar processes for background and error modeling at the pixel level, the guiding of these processes by feedback map connections from high-level modules, and the resultant focusing of higher-level attention on the foreground of greatest interest, are all reminiscent of aspects of biological visual cortex. We hope to further explore these similarities.

Acknowledgments. The author would like to thank Tyzx Inc. and Interval Research Corp. for their support in publishing this work and their permission to use data sets obtained at Interval.
References
1. D. Beymer. “Person Counting using Stereo”. In Wkshp. on Human Motion, 2000.
2. K. Bhat, M. Saptharishi, P. Khosla. “Motion Detection and Segmentation Using Image Mosaics”. In IEEE Intl. Conf. on Multimedia and Expo 2000, Aug. 2000.
3. T. Darrell, D. Demirdjian, N. Checka, P. Felzenszwalb. “Plan-view Trajectory Estimation with Dense Stereo Background Models”. In ICCV’01, July 2001.
4. A. Elgammal, D. Harwood, L. Davis. “Non-Parametric Model for Background Subtraction”. In ICCV Frame-rate Workshop, Sep 1999.
5. N. Friedman, S. Russell. “Image Segmentation in Video Sequences: a Probabilistic Approach”. In 13th Conf. on Uncertainty in Artificial Intelligence, August 1997.
6. X. Gao, T. Boult, F. Coetzee, V. Ramesh. “Error Analysis of Background Adaptation”. In CVPR’00, June 2000.
7. M. Harville. “Stereo Person Tracking with Adaptive Plan-View Statistical Templates”. HP Labs Technical Report, April 2002.
8. M. Harville, G. Gordon, J. Woodfill. “Foreground Segmentation Using Adaptive Mixture Models in Color and Depth”. In Proc. of IEEE Workshop on Detection and Recognition of Events in Video, July 2001.
9. K. Konolige. “Small Vision Systems: Hardware and Implementation”. 8th Int. Symp. on Robotics Research, 1997.
10. A. Mittal, D. Huttenlocher. “Scene Modeling for Wide Area Surveillance and Image Synthesis”. In CVPR’00, June 2000.
11. H. Nakai. “Non-Parameterized Bayes Decision Method for Moving Object Detection”. In ACCV’95, 1995.
12. J. Ng, S. Gong. “Learning Pixel-Wise Signal Energy for Understanding Semantics”. In Proc. Brit. Mach. Vis. Conf., Sep. 2001.
13. J. Orwell, P. Remagnino, G. Jones. “From Connected Components to Object Sequences”. In Wkshp. on Perf. Evaluation of Tracking and Surveillance, April 2000.
14. Point Grey Research, http://www.ptgrey.com
15. S. Rowe, A. Blake. “Statistical Background Modeling for Tracking with a Virtual Camera”. In Proc. Brit. Mach. Vis. Conf., 1995.
16. C. Stauffer, W.E.L. Grimson. “Adaptive Background Mixture Models for Real-Time Tracking”. In CVPR’99, Vol.2, pp.246-252, June 1999.
17. K. Toyama, J. Krumm, B. Brumitt, B. Meyers. “Wallflower: Principles and Practice of Background Maintenance”. In ICCV’99, pp.255-261, Sept 1999.
18. Tyzx Inc. “Real-time Stereo Vision for Real-world Object Tracking”. Tyzx White Paper, April 2000. http://www.tyzx.com
19. J. Woodfill, B. Von Herzen. “Real-Time Stereo Vision on the PARTS Reconfigurable Computer”. Symposium on Field-Programmable Custom Computing Machines, April 1997.
20. M. Xu, T. Ellis. “Illumination-Invariant Motion Detection Using Colour Mixture Models”. In Proc. Brit. Mach. Vis. Conf., Sep. 2001.
21. R. Zabih, J. Woodfill. “Non-parametric Local Transforms for Computing Visual Correspondence”. In ECCV’94, 1994.
Multivariate Saddle Point Detection for Statistical Clustering

Dorin Comaniciu¹, Visvanathan Ramesh¹, and Alessio Del Bue²

¹ Imaging and Visualization Department, Siemens Corporate Research, 755 College Road East, Princeton, NJ 08540, USA
² Department of Biophysical and Electronic Engineering, University of Genova, Via All’Opera Pia 11a, 16145 Genova, Italy
Abstract. Decomposition methods based on nonparametric density estimation define a cluster as the basin of attraction of a local maximum (mode) of the density function, with the cluster borders being represented by valleys surrounding the mode. To measure the significance of each delineated cluster we propose a test statistic that compares the estimated density of the mode with the estimated maximum density on the cluster boundary. While for a given kernel bandwidth the modes can be safely obtained by using the mean shift procedure, the detection of maximum density points on the cluster boundary (i.e., the saddle points) is not straightforward for multivariate data. We therefore develop a gradient-based iterative algorithm for saddle point detection and show its effectiveness in various data decomposition tasks. After finding the largest density saddle point associated with each cluster, we compute significance measures that allow formal hypothesis testing of cluster existence. The new statistical framework is extended and tested for the task of image segmentation.

Keywords: grouping and segmentation; image features; nonparametric clustering; cluster significance.
1 Introduction
Data clustering as a problem in pattern recognition and statistics belongs to the class of unsupervised learning. It essentially involves the search through the data for observations that are similar enough to be grouped together. There is a large body of literature on this topic [5,12,17]. Algorithms from graph theory [6,10], matrix factorization [25,27], deterministic annealing [16], scale-space theory [20,26], and mixture models [7,23] were successfully used to delineate relevant structures within the input data. However, the clustering task is inherently subjective. There is no accepted definition of the term cluster and any clustering algorithm will produce some partitions. Therefore, the ability to statistically characterize the decomposition
and to assess the significance of the resulting number of clusters is an important aspect of the problem. Approaches for estimating the number of clusters can be divided into global and local methods. The former evaluate some measure over the entire data set and optimize it as a function of the number of clusters. The latter consider individual pairs of clusters and test whether they should be joined together. A general description of methods used to estimate the number of clusters is provided in [18], while a study described in [21] conducts a Monte Carlo evaluation of 30 indices for cluster validation. These indices are typically functions of the within and between cluster distances and belong to the class of internal measures, in the sense that they are computed from the same observations used to create a partition. Consequently, their distribution is intractable and they are not suitable for hypothesis testing. As a result, the majority of existing methods for estimating the validity of the decomposition do not attempt to perform a formal statistical procedure but rather look for a clustering structure under which the statistic of interest is optimal, i.e., they maximize or minimize an objective function [19,24]. Validation methods that do not suffer from this limitation were recently proposed [29,30], but are computationally expensive, since they require simulating multiple datasets from the null distribution. In this paper we present a new and practical approach to compute the statistical significance (p-value) of each delineated cluster. A nonparametric model is assumed in which the clusters correspond to local maxima (modes) in the probability density function of the data [8, p.533]. The test statistic that we develop compares the estimated density of the mode with the estimated density of the highest saddle point on the cluster boundary. To find the saddle points in the multivariate density surface we develop and test a gradient-based iterative algorithm. Note that for both the mode and saddle point finding only the density estimate and its normalized gradient are used (and not the Hessian matrix of second derivatives). Because of this, our methods scale well with the space dimension. The organization of the paper is as follows. Section 2 discusses the importance of the modes and saddle points of the density for characterizing the underlying data structure. The mean shift-based data decomposition is briefly reviewed in Section 3. Section 4 describes the new algorithm for saddle point detection, while the test statistic is developed in Section 5. Experimental results are shown in Section 6. The application of our new statistical framework for image segmentation is discussed in Section 7.
2 Importance of Modes and Saddle Points
Clustering using the nonparametric estimation of the data density is achieved by identifying local maxima (modes) and their basins of attraction in the multivariate surface of the data density function. The modes of the density are usually detected using the gradient ascent mean shift procedure [4,3], discussed in the
next section. All the data points lying in the basin of attraction of a mode will form a separate cluster. In the case of a density function with constant values at a peak, the points on this peak are considered a single mode, called a plateau. Similarly, all the data points lying in the basin of attraction of a plateau form a separate cluster. The number of observed modes depends on the bandwidth of the kernel used to compute the density estimate. In general the number of modes decreases with the increase of the bandwidth. The most common test for the true number of modes in a population [28] is based on critical bandwidths, the infimum of those bandwidths for which the kernel density estimate is at most m-modal. A different approach is proposed for the univariate case in [22], where the validity of each mode is tested separately. The test statistic is a measure of the size of the mode, the absolute integrated difference between the estimated density and the same density with the mode in question removed at the level of the higher of its two surrounding antimodes. The p-value of the test is estimated through resampling. Note that an antimode is defined for the univariate data as the location with the lowest density between two modes. The main advantage of this technique is that each individual suspected mode is examined, while the bandwidth used in the test can be selected adaptively as the smallest bandwidth at which the mode still remains a single object. Our work is related to the technique in [22] by defining a similar procedure for the testing of multivariate data. However, since we would like to avoid resampling, we will define a test statistic whose distribution can be evaluated through statistical inference, by taking into account its sampling properties. In addition, since the antimodes defined for univariate data translate into saddle points for the multivariate case we will need an algorithm for saddle point computation. To give the reader an initial view on the problem we present in Figure 1a a sample data set drawn from two bivariate normals, while Figure 1b, Figure 1c, and Figure 1d show the corresponding probability density estimate obtained with a two dimensional normal kernel with bandwidth h = 0.6, h = 0.9, and h = 1.35, respectively. The detected modes are marked with green dots, while the saddle points are marked with red dots. A number of observations can be made using Figure 1. First, for a given bandwidth, the number of observed modes determines the number of distinct structures in the density estimate. The mode density is an indication of the compactness of the associated structure. The difference between the mode density and the saddle point density is an indication of the isolation of the observed structure. In addition, both the mode density and the mode-saddle density difference decrease with the increase of the bandwidth. When the mode density becomes equal to the saddle density the observed structures are amalgamated into a new one. Hence, the appropriate analysis bandwidth should be the smallest bandwidth at which the mode in question still remains a single object. For a rigorous treatment of the evolution of the zero crossings of the gradient of a function along the bandwidth see [1,32].
Fig. 1. Importance of modes and saddle points. (a) Input data containing 100 points from each bivariate N([-1.8,0],I) and N([1.8,0],I). (b) Density estimate with a two dimensional symmetric normal kernel with h = 0.6. The modes are marked with green dots, while the saddle point is marked with a red dot. (c) Density estimate with h = 0.9. (d) Density estimate with h = 1.35. The two modes and the saddle point collapse into one mode.
Catastrophe theory [11] investigates the behavior of the singularities of a function in families of functions such as the family of densities generated by using various bandwidths.
3 Mean Shift Based Data Decomposition
In this section we define the mean shift vector, introduce the iterative mean shift procedure, and describe its use in the data decomposition.
3.1 The Mean Shift Procedure
Given n data points x_i, i = 1 ... n in the d-dimensional space R^d, the multivariate mean shift vector computed with kernel K in the point x is given by [9,3]

m_K(x) \equiv \frac{\sum_{i=1}^{n} x_i\, K\!\left(\frac{x - x_i}{h}\right)}{\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)} - x,    (1)

where h is the kernel bandwidth. In the following we will use the symmetric normal kernel defined as

K(x) = (2\pi)^{-d/2} \exp\!\left(-\tfrac{1}{2}\|x\|^2\right)    (2)

It can be shown that the mean shift vector at location x is proportional to the normalized density gradient estimate computed with kernel K

m_K(x) = h^2\, \frac{\nabla \hat{f}_K(x)}{\hat{f}_K(x)}.    (3)
The normalization is by the density estimate in x obtained with kernel K. Note that this formula changes slightly for kernels other than the normal [3]. The relation captured in (3) is intuitive: the local mean is shifted toward the region in which the majority of the points reside. Since the mean shift vector is aligned with the local gradient estimate it can define a trajectory leading to a stationary point of the estimated density. Local maxima of the underlying density, i.e., the modes, are such stationary points. The mean shift procedure is obtained by successive
– computation of the mean shift vector m_K(x),
– translation of the kernel K(x) by m_K(x),
and is guaranteed to converge at a nearby point where the density estimate has zero gradient [3].
3.2 Data Decomposition

Let us denote by {y_j}_{j=1,2,...} the sequence of successive locations of the kernel K, where

y_{j+1} = \frac{\sum_{i=1}^{n} x_i\, K\!\left(\frac{y_j - x_i}{h}\right)}{\sum_{i=1}^{n} K\!\left(\frac{y_j - x_i}{h}\right)}, \qquad j = 1, 2, \ldots    (4)

is the weighted mean at y_j computed with kernel K and y_1 is the center of the initial kernel. By running the mean shift procedure for all input data, each point x_i, i = 1, ..., n becomes associated to a point of convergence denoted by y_i where the underlying density has zero gradient. A test for local maximum is therefore needed. This test can involve a check on the eigenvalues of the Hessian matrix
of second derivatives, or a check for the stability of the convergence point. The latter property can be tested by perturbing the convergence point by a random vector of small norm, and letting the mean shift procedure converge again. Should the convergence point be the same, the point is a local maximum. Depending on the local structure of the density hypersurface, the convergence points can form ridges or plateaus. Therefore, the mean shift procedure should be followed by a simple clustering which links together the convergence points that are sufficiently close to each other. The algorithm is given below [3].

Mean Shift Based Decomposition
1. For each i = 1, ..., n run the mean shift procedure for x_i and store the convergence point in y_i.
2. Identify clusters {B_u}_{u=1...m} of convergence points by linking together all y_i which are closer than h from each other.
3. For each u = 1 ... m join together in cluster D_u all the data points x_i having the corresponding convergence point in B_u.
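A minimal sketch of this decomposition with a normal kernel follows; function names such as mean_shift_decompose are illustrative, and the mode-linking step is simplified to comparing each convergence point against one representative per cluster:

```python
import numpy as np

def mean_shift(x, X, h, tol=1e-6, max_iter=500):
    """Run the mean shift procedure from start point x over the data X (eq. 4)."""
    y = x.astype(float)
    for _ in range(max_iter):
        w = np.exp(-0.5 * np.sum(((y - X) / h) ** 2, axis=1))  # normal-kernel weights
        y_new = (w[:, None] * X).sum(axis=0) / w.sum()          # weighted mean, eq. (4)
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y

def mean_shift_decompose(X, h):
    """Steps 1-3: converge every point, then link convergence points closer than h."""
    conv = np.array([mean_shift(x, X, h) for x in X])   # step 1
    labels = -np.ones(len(X), dtype=int)
    modes = []
    for i, y in enumerate(conv):                         # steps 2-3
        for k, m in enumerate(modes):
            if np.linalg.norm(y - m) < h:
                labels[i] = k
                break
        else:
            modes.append(y)
            labels[i] = len(modes) - 1
    return labels, np.array(modes)
```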
Fig. 2. Decomposition example. The mean shift procedure with h = 0.6 is applied for the data set presented in Figure 1a. The trajectory of each point is shown together with the two modes superimposed on the density surface. The view angle is from above.
A decomposition example is shown in Figure 2. Observe that the mean shift trajectories are smooth, verifying a property remarked in [3], that the cosine of the angle between two consecutive mean shift vectors is strictly positive when a normal kernel is employed. The advantage of this type of decomposition is twofold. First, it requires a weak assumption about the underlying data structure, namely, that a probability density can be estimated nonparametrically. In addition, the method scales well with the space dimension, since the mean shift vector is computed directly from the data.
4 Saddle Point Detection
This section presents an algorithm for finding the saddle points associated with a given bandwidth h and a partition {D_u}_{u=1...m} obtained through mean shift decomposition. We are interested in the detection of saddle points of first order, having the Hessian matrix with one positive eigenvalue and all other eigenvalues negative. Select a cluster index v and define the complementary cluster set

C_v \equiv \bigcup_{u \neq v} D_u    (5)

In the following we will drop the index v for the simplicity of the equations. We define two functions

\hat{f}_{D,K}(x) = \frac{1}{nh^d} \sum_{x_D \in D} K\!\left(\frac{x - x_D}{h}\right)    (6)

and

\hat{f}_{C,K}(x) = \frac{1}{nh^d} \sum_{x_C \in C} K\!\left(\frac{x - x_C}{h}\right)    (7)

whose superposition at x equals the density estimate at x

\hat{f}_K(x) \equiv \frac{1}{nh^d} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) = \hat{f}_{D,K}(x) + \hat{f}_{C,K}(x).    (8)

Computing now the gradient of expression (8), multiplying by h^2, and normalizing by \hat{f}_K, it results that

m_K(x) = h^2\, \frac{\nabla \hat{f}_K(x)}{\hat{f}_K(x)} = \alpha_D(x)\, m_{D,K}(x) + \alpha_C(x)\, m_{C,K}(x)    (9)

where

m_{D,K}(x) = \frac{\sum_{x_D \in D} x_D\, K\!\left(\frac{x - x_D}{h}\right)}{\sum_{x_D \in D} K\!\left(\frac{x - x_D}{h}\right)} - x    (10)

m_{C,K}(x) = \frac{\sum_{x_C \in C} x_C\, K\!\left(\frac{x - x_C}{h}\right)}{\sum_{x_C \in C} K\!\left(\frac{x - x_C}{h}\right)} - x    (11)

are the mean shift vectors computed only within the sets D and C respectively, and

\alpha_D(x) = \frac{\hat{f}_{D,K}(x)}{\hat{f}_K(x)}, \qquad \alpha_C(x) = \frac{\hat{f}_{C,K}(x)}{\hat{f}_K(x)}    (12)

with \alpha_D(x) + \alpha_C(x) = 1. Equation (9) shows that the mean shift vector at any point x is a weighted sum of the mean shift vectors computed separately for the points in the sets D and C.
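These split quantities are straightforward to compute; a small sketch with a normal kernel follows (the common factor 1/(nh^d) cancels in α_D and α_C and is therefore dropped; a small constant guards against division by zero for isolated clusters):

```python
import numpy as np

def kernel_weights(x, pts, h):
    """Normal-kernel weights K((x - x_i)/h) for a set of points."""
    return np.exp(-0.5 * np.sum(((x - pts) / h) ** 2, axis=1))

def split_mean_shift(x, D, C, h, eps=1e-12):
    """Return m_D, m_C (eqs. 10-11) and alpha_D, alpha_C (eq. 12) at x."""
    wD, wC = kernel_weights(x, D, h), kernel_weights(x, C, h)
    fD, fC = wD.sum() + eps, wC.sum() + eps   # proportional to f_D,K and f_C,K
    mD = (wD[:, None] * D).sum(axis=0) / fD - x
    mC = (wC[:, None] * C).sum(axis=0) / fC - x
    aD, aC = fD / (fD + fC), fC / (fD + fC)
    return mD, mC, aD, aC
```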
Our idea is to exploit this property for finding the saddle points. Let us assume that x_s is a saddle point of first order located on the boundary between D and C. The boundary condition is

m_K(x_s) = 0    (13)

which means that the vectors α_D(x_s) m_{D,K}(x_s) and α_C(x_s) m_{C,K}(x_s) have equal magnitude, are collinear, but point towards opposite directions. This explains the instability of x_s. Indeed, a slight perturbation of x_s towards C and along the line defined by α_D(x_s) m_{D,K}(x_s) and α_C(x_s) m_{C,K}(x_s) will induce an increase in the magnitude of α_C(x_s) m_{C,K}(x_s), which will move x_s further towards C, and so on. The same effect occurs when x_s is perturbed towards D. Note that the process is one dimensional, regardless of the dimensionality d of the space.
Fig. 3. Solving first order saddle point instability. (a) A slight perturbation of xs towards C along the line defined by αD (xs )mD,K (xs ) and αC (xs )mC,K (xs ) will determine the point xs to start moving towards C. (b) By employing the new vectors rD (xs ) and rC (xS ) the saddle point becomes stable and has a basin of attraction.
A simple modification, however, can make a saddle point stable. Let us define the vectors

r_D(x) = \frac{\|\alpha_C(x)\, m_{C,K}(x)\|}{\|\alpha_D(x)\, m_{D,K}(x)\|}\; \alpha_D(x)\, m_{D,K}(x)    (14)

and

r_C(x) = \frac{\|\alpha_D(x)\, m_{D,K}(x)\|}{\|\alpha_C(x)\, m_{C,K}(x)\|}\; \alpha_C(x)\, m_{C,K}(x)    (15)

obtained by switching the norms of α_D(x_s) m_{D,K}(x_s) and α_C(x_s) m_{C,K}(x_s). This time, in the case of a perturbation, the resultant

r(x) = r_D(x) + r_C(x)    (16)

will point towards the saddle point and not away from the saddle point. Since the saddle point is of first order, it will also be stable for the directions perpendicular to r(x), hence it will be a stable point with a basin of attraction. Our algorithm will use the newly defined basin of attraction to converge to the saddle point. The convergence proof will be given in a subsequent paper, the
flavor of the proof being similar to the mean shift convergence discussed in [4,3]. However, while all the mean shift paths converge to a local stationary point, the saddle point detection algorithm should be started close to a valley, i.e., at locations having divergent mean shift vectors coming from the sets D and C:

\alpha_D(x)\,\alpha_C(x)\; m_{D,K}(x)^{\top} m_{C,K}(x) < 0    (17)
Since the data is already partitioned it is simple to search for points that verify condition (17). If one starts the search from a point in D, just follow the mean shift path defined by m_{C,K}(x) until condition (17) is satisfied. Nevertheless, if the cluster D is isolated, the function fˆC,K (x) (7) will be close to zero for the data points x ∈ D and can generate numerical instability. Therefore a threshold should be imposed on this function before computing m_{C,K}(x). The overall algorithm for finding the saddle points lying on the border of D is given below.

Saddle Point Detection
Given a data partitioning into a cluster D and another set C containing the rest of the data points. For each x_D ∈ D, if the value of fˆC,K (x_D) (7) is larger than a threshold:
1. Follow the mean shift path defined by m_{C,K}(x) (11) until condition (17) is satisfied.
2. Follow the mean shift path defined by r(x) (16) until convergence.

Observe that, similarly to the mean shift iterations, the saddle point iterations can stop at a saddle point of order larger than one. Therefore, the convergence point should be tested, either using the Hessian, or through the stability method discussed in Section 3.2. Also note that the algorithm from above uses multiple initializations in the search for the saddle points. Prior knowledge regarding the border of cluster D can be used to reduce the number of executions. An example of saddle point finding is shown in Figure 4. The two stages of the algorithm are visible for some of the trajectories. When the second part of the algorithm is initialized, the trajectory can have a sharp turn, to return back towards the valley. Afterwards, however, the trajectory converges smoothly to a saddle point.
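A sketch of the two-stage search for a single start point, reusing split_mean_shift from the sketch given earlier in this section; the stopping tolerances and the guard against vanishing magnitudes are assumptions:

```python
import numpy as np

def find_saddle(x0, D, C, h, mag_min=1e-8, tol=1e-6, max_iter=1000):
    """Two-stage saddle search: ride m_C until (17) holds, then follow r(x) of eq. (16)."""
    x = x0.astype(float)
    # Stage 1: follow the mean shift path defined by m_C until the D- and C-vectors diverge (17).
    for _ in range(max_iter):
        mD, mC, aD, aC = split_mean_shift(x, D, C, h)
        if aD * aC * np.dot(mD, mC) < 0:        # condition (17)
            break
        x = x + mC
    # Stage 2: follow r(x) = r_D(x) + r_C(x), obtained by swapping the norms (eqs. 14-16).
    for _ in range(max_iter):
        mD, mC, aD, aC = split_mean_shift(x, D, C, h)
        vD, vC = aD * mD, aC * mC
        nD, nC = np.linalg.norm(vD), np.linalg.norm(vC)
        if nD < mag_min or nC < mag_min:
            break
        r = (nC / nD) * vD + (nD / nC) * vC     # r_D + r_C
        if np.linalg.norm(r) < tol:
            break
        x = x + r
    return x
```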
5 Significance Test for Cluster Validity
Denote by x_s the saddle point with the largest density lying on the border of a given cluster characterized by the mode y_m. The point x_s represents the “weakest” point of the cluster border. It requires the least amount of probability mass to be taken from the neighborhood of y_m and placed in the neighborhood of x_s such that the cluster mode disappears, as described in [22].
Fig. 4. Saddle point finding example. The algorithm was applied twice, for the data clustered as in figure Figure 2, once for the left cluster and once for the right cluster. The trajectories are shown together with the two modes and the detected saddle point superimposed on the density surface. The view angle is from above.
To characterize this process, we will assume in the following that the amount of probability mass in the neighborhood of the mode is proportional to fˆK (ym ), the probability density at the mode location, and the amount of probability mass in the neighborhood of the saddle point is proportional to fˆK (xs ), the density at xs . This approximation is shown in Figure 5, which presents a vertical slice through the density function. Note that more elaborate formulas can be derived based on the mean shift trajectory starting from the saddle point; however, for larger dimensions it is difficult to compute the exact amount of probability mass in a neighborhood.
Fig. 5. Approximation of the probability mass. The probability mass in the neighborhood of ym and xs is assumed proportional to fˆK (ym ) and fˆK (xs ), respectively.
Using the approximation from above, we model the location of data points belonging to the cluster of cardinality n_c as a Bernoulli random variable which has a probability of

\hat{p} = \frac{\hat{f}_K(y_m)}{\hat{f}_K(y_m) + \hat{f}_K(x_s)}    (18)

to lie in the mode neighborhood, and a probability of

\hat{q} = 1 - \hat{p} = \frac{\hat{f}_K(x_s)}{\hat{f}_K(y_m) + \hat{f}_K(x_s)}    (19)

to lie in the saddle point neighborhood. Taking now into account the sampling properties of the estimator \hat{p} (seen here as a random variable), the distribution of \hat{p} can be approximated under weak conditions as normal, with mean and variance given by

\mu_p = p, \qquad \sigma_p^2 = \frac{\hat{p}(1 - \hat{p})}{n_c}    (20)

The null hypothesis which we test is the mode existence

H_0 : p \geq 0.5 \quad \text{versus} \quad H_1 : p < 0.5    (21)

Hence, the test statistic is written as

z = \frac{\hat{p} - 0.5}{\sigma_p}    (22)

and using (18) and (20) we have

z = \frac{\sqrt{n_c}\,\bigl(\hat{f}_K(y_m) - \hat{f}_K(x_s)\bigr)}{2\sqrt{\hat{f}_K(y_m)\,\hat{f}_K(x_s)}}    (23)
The p-value of the test is the probability that z, which is distributed with N(0, 1), is positive:

\mathrm{Prob}(z \geq 0) = \frac{1}{\sqrt{2\pi}} \int_{-z}^{\infty} \exp(-t^2/2)\, dt    (24)

A confidence of 0.95 is achieved when z = 1.65. Using the framework from above, the clusters delineated with h = 0.6 shown in Figure 1b have a confidence of 0.99 and 0.98, respectively, derived using the mode densities of 0.0614 and 0.0598 and a saddle point density of 0.0384. When h = 0.9 (Figure 1c) the two clusters have a confidence of 0.82 and 0.86, their mode densities are 0.0444 and 0.0460, while the saddle point density is 0.0369.
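As a small numerical illustration of equations (18)-(24) (the function name is illustrative):

```python
import math

def cluster_confidence(f_mode, f_saddle, n_c):
    """Return (z, confidence) for a cluster with mode density f_mode,
    highest boundary saddle density f_saddle and n_c member points (eqs. 18-24)."""
    z = math.sqrt(n_c) * (f_mode - f_saddle) / (2.0 * math.sqrt(f_mode * f_saddle))
    confidence = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # Prob(z >= 0), eq. (24)
    return z, confidence

# Example: the left cluster of Figure 1b (100 points, densities 0.0614 and 0.0384)
# yields a confidence close to the 0.99 reported above.
print(cluster_confidence(0.0614, 0.0384, 100))
```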
6 Clustering Experiments
Ideally, the input data should be analyzed for many different bandwidths and the confidence of each delineated cluster computed. This will guarantee the detection of significant clusters even when they exhibit different scales. An alternative, less expensive method is to choose one scale and join the least significant clusters until they become significant. We should, however, be cautious in joining too many clusters, because the approximation used in the computation of the p-value of the test assumes a certain balance between the peak and the saddle point neighborhood. We applied the agglomerative strategy for the decomposition of the nonlinear structures presented in Figure 6a. A bandwidth h = 0.15 was employed. In the final configuration two modes and two saddle points were detected. The two clusters have a confidence equal to 1.00. The mode densities are 0.7744 and 0.8514 while the saddle densities are 0.2199 and 0.1957. The right saddle point has the largest density. Note that the density values are one order of magnitude larger than in the previous experiment. This is not a concern, since both coordinates were rescaled. Also, note that our test statistic (23) accepts the rescaling of the measured density.
Fig. 6. Clustering of nonlinear structures. (a) Input data containing 270 data points. (b) The trajectories for saddle point detection are shown. Our algorithm detected two modes and two saddle points. The view angle is from above.
The next experiment was performed with h = 0.6 for the data shown in Figure 7a. Initially the algorithm detected four peaks that can be seen in the density surface shown in Figure 7b. However, the upper left peaks were joined together, their clusters having low statistical confidence (0.73 and 0.57). The cluster confidences for the final configuration are 0.921, 0.96, and 0.95. One can observe that the upper left cluster has the lowest confidence.
Fig. 7. Three clusters. (a) Input data containing 100 data points. (b) The trajectories for saddle point detection are shown. Our algorithm detected three modes with high confidence. Only the largest density saddle point for each cluster is shown. The view angle is from above.
7 Application to Image Segmentation
We apply the image segmentation framework described in [2], which employs mean shift to delineate clusters in a joint space of dimension d = r + 2 which includes the spatial coordinates. For the gray level case, r = 1, while for color images r = 3. A clustering example for image-like data is shown in Figure 8. The 200 points have the x coordinate data points in increasing order (for each unit x coordinate there is one data point of variable y). A bandwidth of h = 0.4 was employed. The algorithm detected first 5 clusters of confidence 0.59, 0.79, 0.61, 0.99, and 0.99, which were reduced to 4 clusters of confidence 0.83, 0.61, 0.99, and 0.99, and finally to three clusters of confidence 1.00, 0.99, and 0.99. All the clusters that were merged belong to the elongated structure from the left. In addition, we performed a Monte Carlo test to verify the stability of the decomposition. Out of 50 simulations only in 7 cases was the number of clusters different from three. A confidence of 0.9 was used. The same agglomerative algorithm was employed for the segmentation of two test images. Only clusters with confidence larger than 0.9 were retained. The contours of the decomposition are shown in Figure 9. The same bandwidth hr = 20 was used for the color information, while for the spatial domain we used hs = 3 and hs = 4 (the second test image is 512 pixels, much larger than the first one). The segmentation exhibits high quality contours. Compare for example our result on the Woman image with the segmentation in [31].
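A minimal sketch of assembling the joint spatial-range feature vectors; normalizing each domain by its own bandwidth so that a unit-bandwidth kernel can then be used is an assumption consistent with the joint-domain formulation of [2], not a verbatim description of it:

```python
import numpy as np

def joint_features(image, hs, hr):
    """Stack (x, y, color...) per pixel, scaled by the spatial and range bandwidths."""
    H, W = image.shape[:2]
    r = 1 if image.ndim == 2 else image.shape[2]
    ys, xs = np.mgrid[0:H, 0:W]
    spatial = np.stack([xs, ys], axis=-1).reshape(-1, 2) / hs
    rng = image.reshape(-1, r).astype(float) / hr
    return np.hstack([spatial, rng])   # d = r + 2 feature space

feats = joint_features(np.zeros((32, 32, 3)), hs=4.0, hr=20.0)
```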
Fig. 8. Clustering of image-like data. (a) Input data containing 200 data points. (b) The trajectories for saddle point detection are shown. Our algorithm detected three modes with high confidence. Only the largest density saddle point for each cluster is shown. The view angle is from above.
Fig. 9. Clustering of image data. (a) Woman. (b) Lenna. Only clusters with confidence larger than 0.9 are selected.
8 Discussion
The results presented in this paper show that hypothesis testing for nonparametric clustering is a promising direction for solving decomposition problems and evaluating the significance of the results. Although our simulations are not comprehensive, we believe that the proposed algorithms are powerful tools for image data analysis. The natural way to continue this research is to investigate the
data in a multiscale approach and use our confidence measure to select clusters across scales. We recently discovered that the problem of finding the saddle points of a multivariate surface appears in condensed matter physics and theoretical chemistry [13]. The computation of the energy barrier for the atomic transitions from one stable configuration to another requires the detection of the saddle point of the potential energy surface corresponding to a maximum along a minimum energy path. Numerical algorithms for solving this problem were developed for the case when both the initial and final states of the transitions are known [15] or only the initial state of the transition is known [14]. Compared to these methods, which perform constrained optimization on one surface, our technique exploits the clustering of the data points to guide the optimization relative to two surfaces whose superposition represents the initial surface.
References 1. J. Babaud, A. Witkin, M. Baudin, and R. Duda. Uniqueness of the gaussian kernel for scale-space filtering. IEEE Trans. Pattern Anal. Machine Intell., 8(1):26–33, 1986. 2. D. Comaniciu and P. Meer. Mean shift analysis and applications. In Proc. Int. Conf. Computer Vision, Kerkyra, Greece, pages 1197–1203, September 1999. 3. D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Trans. Pattern Anal. Machine Intell., 24(5):To appear, 2002. 4. D. Comaniciu, V. Ramesh, and P. Meer. The variable bandwidth mean shift and data-driven scale selection. In Proceedings International Conference on Computer Vision, Vancouver, Canada, volume I, pages 438–445, July 2001. 5. R. Duda, P. Hart, and D. Stork. Pattern Classification. Wiley, second edition, 2001. 6. P. F. Felzenszwalb and D. P. Huttenlocher. Image segmentation using local variation. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition, Santa Barbara, CA, pages 98–103, June 1998. 7. C. Fraley and A. Raftery. How many clusters? which clustering method? - answers via model-based cluster analysis. Computer Journal, 41:578–588, 1998. 8. K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, second edition, 1990. 9. K. Fukunaga and L. D. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans. Information Theory, 21:32–40, 1975. 10. Y. Gdalyahu, D. Weinshall, and M. Werman. Self organization in vision: Stochastic clustering for image segmentation, perceptual grouping, and image database organization. IEEE Trans. Pattern Anal. Machine Intell., 23(10):1053–1074, 2001. 11. R. Gilmore. Catastrophe Theory for Scientists and Engineers. Dover, 1993. 12. J. Hartigan. Statistical theory in clustering. Journal of Classification, 2:63–76, 1985. 13. G. Henkelman, G. Johannesson, and H. Jonsson. Methods for finding saddle points and minimum energy paths. In S. Schwartz, editor, Progress on Theoretical Chemistry and Physics, pages 269–300. Kluwer, 2000.
14. G. Henkelman and H. Jonsson. A dimer method for finding saddle points on high dimensional potential surfaces using only first derivatives. Journal of Chemical Physics, 111:7010–7022, 1999. 15. G. Henkelman, B. Uberuaga, and H. Jonsson. A climbing image nudged elastic band method for finding saddle points and minimum energy paths. Journal of Chemical Physics, 113:9901–9904, 2000. 16. T. Hofmann and J. Buhmann. Pairwise data clustering by deterministic annealing. IEEE Trans. Pattern Anal. Machine Intell., 19(1):1–13, 1997. 17. A. Jain, M. Murty, and P. Flyn. Data clustering: A review. ACM Computing Surveys, 31(3):264–323, 1999. 18. A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988. 19. L. Kauffman and P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. J. Wiley & Sons, 1990. 20. Y. Leung, J. Zhang, and Z. Xu. Clustering by scale-space filtering. IEEE Trans. Pattern Anal. Machine Intell., 22(12):1396–1410, 2000. 21. G. Milligan and M. Cooper. An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50:159–179, 1985. 22. M. Minnotte. Nonparametric testing of the existence of modes. The Annals of Statistics, 25(4):1646–1660, 1997. 23. A. Moore. Very fast EM-based mixture model clustering using multiresolution kd-trees. In Advances in Neural Information Processing Systems 11. MIT Press, 1999. 24. E. J. Pauwels and G. Frederix. Finding salient regions in images. Computer Vision and Image Understanding, 75:73–85, 1999. 25. P. Perona and W. Freeman. A factorization approach to grouping. In Proceedings 5th European Conference on Computer Vision, Freiburg, Germany, pages 655–670, 1998. 26. S. J. Roberts. Parametric and non-parametric unsupervised cluster analysis. Pattern Recog., 30:261–272, 1997. 27. J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Machine Intell., 22(8):888–905, 2000. 28. B. Silverman. Using kernel density estimates to investigate multimodality. J. R. Statist. Soc. B, 43(1):97–99, 1981. 29. R. Tibshirani, G. Walther, D. Botstein, and P. Brown. Cluster validation by prediction strength. Technical Report 21, Dept. of Statistics, Stanford University, 2001. 30. R. Tibshirani, G. Walther, and T. Hastie. Estimating the number of clusters in a data set via the gap statistic. Technical Report 208, Dept. of Statistics, Stanford University, 2000. 31. Z. Tu, S. Zhu, and H. Shum. Image segmentation by data driven markov chain monte carlo. In Proceedings International Conference on Computer Vision, Vancouver, Canada, volume II, pages 131–138, July 2001. 32. A. Yuille and T. Poggio. Scaling theorems for zero crossings. IEEE Trans. Pattern Anal. Machine Intell., 8(1):15–25, 1986.
Parametric Distributional Clustering for Image Segmentation
Lothar Hermes, Thomas Zöller, and Joachim M. Buhmann
Rheinische Friedrich-Wilhelms-Universität, Institut für Informatik III, Römerstr. 164, D-53117 Bonn, Germany
{hermes, zoeller, jb}@cs.uni-bonn.de, http://www-dbv.informatik.uni-bonn.de
Abstract. Unsupervised Image Segmentation is one of the central issues in Computer Vision. From the viewpoint of exploratory data analysis, segmentation can be formulated as a clustering problem in which pixels or small image patches are grouped together based on local feature information. In this contribution, Parametric Distributional Clustering (PDC) is presented as a novel approach to image segmentation. In contrast to noise-sensitive point measurements, local distributions of image features provide a statistically robust description of the local image properties. The segmentation technique is formulated as a generative model in the maximum likelihood framework. Moreover, there exists an insightful connection to the novel information theoretic concept of the Information Bottleneck (Tishby et al. [17]), which emphasizes the compromise between efficient coding of an image and preservation of characteristic information in the measured feature distributions. The search for good grouping solutions is posed as an optimization problem, which is solved by deterministic annealing techniques. In order to further increase the computational efficiency of the resulting segmentation algorithm, a multi-scale optimization scheme is developed. Finally, the performance of the novel model is demonstrated by segmentation of color images from the Corel database. Keywords: Image Segmentation, Clustering, Maximum Likelihood, Information Theory
1 Introduction
Image understanding and visual object recognition crucially rely on image segmentation as an intermediate level representation of image content. Approaches to image segmentation which lack supervision information are often formulated as data clustering problems. Regardless of the particular nature of the image primitives in question, these methods share as a common trait that they search for a partition of pixels or pixel blocks with a high degree of homogeneity. The specific choice of a clustering algorithm, however, depends on the nature of the given image primitives, which might be feature vectors, feature relations, or feature histograms. In this paper, we advocate characterizing an image site
by the empirical color distributions extracted from its neighborhood, which we regard as a robust and statistically reliable descriptor of local color properties. One way to design a clustering technique for this type of data is to apply a statistical test to the measured histograms. This processing step yields pairwise dissimilarity values, for which a multitude of grouping techniques can be found in the literature (e.g. [7,13,16]). Alternatively, feature histograms can be grouped directly by histogram clustering [10,12]. The histogram clustering approach characterizes each cluster by a prototypical feature distribution, and it assigns feature histograms to the nearest prototype distribution, where closeness is measured by the Kullback–Leibler divergence. As a consequence, this method retains the efficiency of central clustering approaches like k-means clustering, but it avoids the restrictive assumption that features are vectors in a Euclidean space. Histogram clustering in its original form is invariant to permutations of histogram bins. In computer vision, where the histogramming process is prone to noise-induced errors, this invariance neglects information about the order of bins and the distance of bin centers in feature space. We therefore suggest replacing the non-parametric density estimation via histograms by a continuous mixture model, which no longer suffers from the shortcomings of a non-adaptive discrete binning process. The resulting statistical model can be interpreted as a generative model for pixel colors, but we can also establish an interesting connection to the novel information theoretic concept of the Information Bottleneck principle [17]. The search for a good grouping solution is posed as a combinatorial optimization problem. Due to the fact that the cost landscape may have a very jagged structure, powerful optimization techniques with regularization or smoothing behavior should be applied to avoid poor local minima. We use deterministic annealing, which is embedded in a multi-scale framework for additional computational efficiency. To give an impression of the performance of the new parametric distributional clustering algorithm, color segmentations of pictures taken from the Corel gallery are shown in the results section.
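For reference, the histogram-clustering assignment rule mentioned above can be written down in a few lines. This is a minimal sketch (not taken from [10,12]): each site histogram is assigned to the prototype distribution that minimizes the Kullback–Leibler divergence.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D(p || q) for two discrete distributions over the same bins."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def assign_histograms(histograms, prototypes):
    """Assign each site histogram to the closest prototype distribution
    in the Kullback-Leibler sense (central clustering, k-means style)."""
    labels = np.empty(len(histograms), dtype=int)
    for i, h in enumerate(histograms):
        labels[i] = int(np.argmin([kl_divergence(h, q) for q in prototypes]))
    return labels
```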
2 The Clustering Model
Notation & Model Definition: To stress the generality of the proposed clustering model, the discussion of the cost function is initially detached from the application domain of image segmentation. Assume a set of objects o_i, i = 1, ..., n to be given. These entities are supposed to be clustered into k groups. The cluster memberships are encoded by Boolean assignment variables M_iν, ν = 1, ..., k, which are summarized in a matrix M ∈ M = {0, 1}^{n×k}. We set M_iν = 1 if object o_i is assigned to cluster ν. To avoid multiple group associations, we furthermore enforce ∑_{ν≤k} M_iν = 1. Each object o_i is equipped with a set of n_i observations X_i = {x_{i1}, ..., x_{i n_i}}, x_{ij} ∈ R^d. These observations are assumed to be drawn according to a particular Gaussian mixture model, which is characteristic for the respective cluster ν of the object. Thus, the generative model for an observation x given the group membership of its associated object is defined as
p(x \mid \nu) = \sum_{\alpha=1}^{l} p_{\alpha\mid\nu}\, g(x \mid \mu_\alpha, \Sigma_\alpha).   (1)
Here, g_α(x) = g(x | µ_α, Σ_α) denotes a multivariate Gaussian distribution with mean µ_α and covariance matrix Σ_α. In order to achieve parsimonious models, the Gaussians g_α are considered to form a common alphabet from which the cluster-specific distributions are synthesized by a particular choice of mixture coefficients p_{α|ν}. In order to further limit the number of free parameters, the covariance matrices Σ_α, α = 1, ..., l are not altered after being initialized by a preprocessing step, i.e. conventional mixture model estimation. Thus, the remaining free continuous parameters are the means of the Gaussians, the mixture coefficients, and the probabilities of the various groups p_ν, ν = 1, ..., k. Gathering these parameters in the set Θ = {p_ν, p_{α|ν}, µ_α | α = 1, ..., l; ν = 1, ..., k}, the complete data likelihood P(X, M | Θ) is given by

p(\mathcal{X}, M \mid \Theta) = p(\mathcal{X} \mid M, \Theta) \cdot P(M \mid \Theta) = \prod_{i \le n} \prod_{\nu \le k} \big[\, p_\nu \, p(X_i \mid \nu, \Theta) \,\big]^{M_{i\nu}}.   (2)
Replacing the sum by a product in eq. (2) is justified since the binary assignment variables M_iν select one out of k terms. In the special case of color image segmentation, the abstract objects o_i can be identified with individual pixel positions or sites. The observations X correspond to locally measured color values. The discrete nature of image data induces a partition of the color space. Computational reasons suggest further coarsening this partition, which leads to a discretization of the color space into regions R_j. Considering the different color channels as being independent, these regions correspond to one-dimensional intervals I_j. In practice, the intervals I_j are chosen to cover a coherent set of different color values to alleviate the computational demands. If other image features are available, they can be integrated in this framework as well. For instance, the application of our method to combined color and texture segmentation has been studied and will be discussed in a forthcoming publication. Denote by n_ij the number of observations at site i that fall inside the interval I_j. Inserting in (2) and setting G_α(j) = \int_{I_j} g_α(x)\, dx, the complete data likelihood is given by

p(\mathcal{X}, M \mid \Theta) = \prod_{i \le n} \prod_{\nu \le k} \Big[\, p_\nu \prod_{j \le m} \Big( \sum_{\alpha \le l} p_{\alpha\mid\nu}\, G_\alpha(j) \Big)^{n_{ij}} \Big]^{M_{i\nu}}.   (3)
Model Identification: Determining the values of the free parameters is the key problem in model identification for a given data set, which is accomplished by maximum likelihood estimation. To simplify the subsequent computations, the log-likelihood corresponding to equation (2) is considered:

L(\Theta \mid \mathcal{X}, M) = \log p(\mathcal{X}, M \mid \Theta) = \sum_{i} \sum_{\nu} M_{i\nu} \Big[ \log p_\nu + \sum_{j} n_{ij} \log \Big( \sum_{\alpha} p_{\alpha\mid\nu}\, G_\alpha(j) \Big) \Big].   (4)
This equation has to be optimized with respect to the following entities: (1) p_ν, (2) p_{α|ν}, (3) the means of the Gaussians µ_α, and (4) the hidden variables M. The method of choice for these kinds of problems is the well-known Expectation–Maximization (EM) algorithm [4]. It proceeds iteratively by computing posterior probabilities P(M | Θ_old) in the E-step and maximizing the averaged complete data log-likelihood E[L(Θ | M)] with respect to Θ in the M-step. Extending this interpretation, EM can be viewed as maximizing the following joint function of the parameters Θ and the hidden states M (see [3,5,9]):

F = E\big[ \log p(\mathcal{X}, M \mid \Theta) + \log p(M) \big].   (5)

Apart from a difference in the sign, this equation is identical to the generalized free energy F at temperature T = 1 known from statistical physics. Setting the corresponding cost function C = −L, the free energy for arbitrary temperatures T is given by the following expression:

F = E[C] − T \cdot H.   (6)
Here, H denotes the entropy of the distribution over the states M. This formal equivalence provides an interesting link to another well-known optimization paradigm called Deterministic Annealing (DA) [14]. The key idea of this approach is to combine the advantages of a temperature-controlled stochastic optimization method with the efficiency of a purely deterministic computational scheme. A given combinatorial optimization problem over a discrete state space is relaxed into a family of search problems in the space P(M) of probability distributions over that space. In this setting, the generalized free energy takes the role of the objective function. The temperature parameter T controls the influence of the entropic term, leading to a convex function in the limit of T → ∞. At T = 0 the original problem is recovered. The optimization strategy starts at high temperature and tracks local minima of the objective function while gradually lowering the computational temperature. Setting q_iν = E[M_iν] = p(M_iν = 1), the expected cost of a given configuration is given by:

E[C] = − \sum_{i} \sum_{\nu} q_{i\nu} \Big[ \log p_\nu + \sum_{j} n_{ij} \log \Big( \sum_{\alpha} p_{\alpha\mid\nu}\, G_\alpha(j) \Big) \Big].   (7)
E-Step Equations: Maximizing eq. (7) with respect to P(M), which basically recovers the E-step of the EM scheme, requires evaluating the partial costs of assigning an object o_i to cluster ν. The additive structure of the objective function allows us to determine these partial costs h as

h_{i\nu} = − \log p_\nu − \sum_{j} n_{ij} \log \Big( \sum_{\alpha} p_{\alpha\mid\nu}\, G_\alpha(j) \Big).   (8)

Utilizing the well-known fact from statistical physics that the generalized free energy at a certain temperature is minimized by the corresponding Gibbs distribution, one arrives at the update equations for the various q_iν:

q_{i\nu} \propto \exp\Big(−\frac{1}{T} h_{i\nu}\Big) = \exp\Big[ \frac{1}{T} \Big( \log p_\nu + \sum_{j} n_{ij} \log \sum_{\alpha} p_{\alpha\mid\nu}\, G_\alpha(j) \Big) \Big].   (9)
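In code, the E-step update (9) is a temperature-controlled softmax over the partial costs (8). The following is a minimal sketch under the notation above (not the authors' implementation); n_ij, p_nu, p_alpha_given_nu, and G are assumed to be supplied as dense arrays.

```python
import numpy as np

def e_step(n_ij, p_nu, p_alpha_given_nu, G, T=1.0, eps=1e-12):
    """Tempered Gibbs assignments, eq. (9).

    n_ij             : (n_sites, m) histogram counts per site
    p_nu             : (k,)  cluster probabilities
    p_alpha_given_nu : (k, l) mixture coefficients per cluster
    G                : (l, m) bin integrals G_alpha(j)
    Returns q of shape (n_sites, k).
    """
    # log p(j | nu) = log sum_alpha p_{alpha|nu} G_alpha(j), shape (k, m)
    log_p_bin = np.log(p_alpha_given_nu @ G + eps)
    # partial costs h_{i,nu} = -log p_nu - sum_j n_ij log p(j | nu), eq. (8)
    h = -(np.log(p_nu + eps)[None, :] + n_ij @ log_p_bin.T)
    # Gibbs distribution at temperature T, normalized per site
    logits = -h / T
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    q = np.exp(logits)
    return q / q.sum(axis=1, keepdims=True)
```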
M-Step Equations: In accordance with [9], the estimates for the class probabilities p_ν must satisfy

\frac{\partial}{\partial p_\nu} \Big[ F − \lambda \cdot \Big( \sum_{\mu=1}^{k} p_\mu − 1 \Big) \Big] = 0,   (10)

where λ is a Lagrange parameter enforcing a proper normalization of p_ν. Expanding F and solving for p_ν leads to the M-step formulae

p_\nu = \frac{1}{n} \sum_{i=1}^{n} q_{i\nu}, \qquad \nu = 1, \ldots, k.   (11)
While lacking a closed-form solution for the second set of parameters pα| ν , their optimal values can be found by an iterated numerical optimization. Instead of directly solving
\frac{\partial}{\partial p_{\alpha_1\mid\nu}} \Big[ F − \lambda \cdot \Big( \sum_{\gamma=1}^{L} p_{\gamma\mid\nu} − 1 \Big) \Big] = 0,   (12)

which would be the analog to eq. (10), we repeatedly select two Gaussian components α_1 and α_2. Keeping p_{γ|ν} fixed for γ ∉ {α_1, α_2}, p_{α_2|ν} is directly coupled to p_{α_1|ν} via

p_{\alpha_2\mid\nu} = 1 − \sum_{\gamma \notin \{\alpha_1, \alpha_2\}} p_{\gamma\mid\nu} − p_{\alpha_1\mid\nu},   (13)
so that only one free parameter remains. Inserting (13) into (12), we obtain
\frac{\partial}{\partial p_{\alpha_1\mid\nu}} F(\alpha_1, \alpha_2) = − \sum_{j=1}^{m} \sum_{i=1}^{n} q_{i\nu}\, n_{ij}\, \frac{G_{\alpha_1}(j) − G_{\alpha_2}(j)}{\sum_{\gamma=1}^{L} p_{\gamma\mid\nu}\, G_\gamma(j)}   (14)

and

\frac{\partial^2}{\partial p_{\alpha_1\mid\nu}^2} F(\alpha_1, \alpha_2) = \sum_{j=1}^{m} \sum_{i=1}^{n} q_{i\nu}\, n_{ij}\, \frac{\big(G_{\alpha_1}(j) − G_{\alpha_2}(j)\big)^2}{\Big(\sum_{\gamma=1}^{L} p_{\gamma\mid\nu}\, G_\gamma(j)\Big)^2} \ge 0.   (15)
The joint optimization of α_1 and α_2, therefore, amounts to solving a one-dimensional convex optimization problem. The optimal value of p_{α_1|ν} is either located on the boundary of the interval [0; 1 − ∑_{γ∉{α_1,α_2}} p_{γ|ν}], or is equal to the zero-crossing of (14). In the latter case, it can be determined by the Newton method or by an interval bisection algorithm, which were both found to achieve sufficient precision after a few optimization steps. The computational demands of this algorithm are dominated by the evaluation of ∑_{i=1}^{n} q_{iν} n_{ij}, which is linear in the number of sites, n. The computation of the remaining parts of (14) scales with the number of clusters, k, and the number of bins, m, and can thus be done efficiently. Some care should also be spent on the selection of α_1 and α_2. Although the free energy will monotonously decrease even if α_1 and α_2 are randomly drawn, the convergence can be enhanced by choosing, in each iteration, α_1 and α_2 such that ∂F(α_1, α_2)/∂p_{α_1|ν} is maximal. To adjust the mixture distribution p_{α|ν} for a fixed cluster ν, it is usually sufficient to repeat the selection and subsequent optimization of pairs (α_1, α_2) c · L times, where c is a small constant (e.g. c = 3). Although the optimization process might not have found the exact position of the global cost minimum at this time (incomplete M-step), any further optimization is unlikely to substantially influence the M-step result, and can thus be skipped. Finally, it is possible to adapt the means µ_α. To improve the readability, we restrict our calculations to one-dimensional data (when operating in d dimensions, we assume diagonal covariance matrices, so that the estimation of the d-dimensional vector µ_α reduces to d one-dimensional optimization problems). Denote by x_j^⊖ and x_j^⊕ the boundaries of the interval I_j = [x_j^⊖; x_j^⊕], so that G_α(j) = \int_{x_j^⊖}^{x_j^⊕} g_α(x)\, dx. µ_α can then be determined by gradient or Newton descent, the first derivative of F being given by

\frac{\partial}{\partial \mu_\alpha} F = − \sum_{i=1}^{n} \sum_{\nu=1}^{k} \sum_{j=1}^{m} q_{i\nu}\, n_{ij}\, p_{\alpha\mid\nu}\, \frac{g_\alpha(x_j^⊕) − g_\alpha(x_j^⊖)}{\sum_{\gamma=1}^{L} p_{\gamma\mid\nu}\, G_\gamma(j)}.   (16)
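A compact sketch of the M-step just described (an illustration, not the authors' code): the class probabilities follow (11) in closed form, and for a selected pair (α_1, α_2) the coefficient p_{α_1|ν} is found by bisection on the derivative (14) over the admissible interval, falling back to the boundary when no zero-crossing exists; the means are kept fixed here.

```python
import numpy as np

def m_step_p_nu(q):
    """Closed-form update (11): p_nu = (1/n) sum_i q_{i,nu}."""
    return q.mean(axis=0)

def pairwise_update(q_nu, n_ij, p_a, G, a1, a2, iters=30, eps=1e-12):
    """One-dimensional convex update for the pair (a1, a2) of a single
    cluster. q_nu: (n,) soft assignments to this cluster, p_a: (l,) its
    mixture coefficients (updated in place), G: (l, m) bin integrals."""
    w = q_nu @ n_ij                               # sum_i q_{i,nu} n_ij, shape (m,)
    rest = 1.0 - p_a.sum() + p_a[a1] + p_a[a2]    # probability mass available to the pair
    lo, hi = 0.0, rest

    def dF(x):                                    # derivative (14) at p_{a1|nu} = x
        p = p_a.copy(); p[a1], p[a2] = x, rest - x
        denom = p @ G + eps                       # shape (m,)
        return -np.sum(w * (G[a1] - G[a2]) / denom)

    # F is convex in x (eq. (15)), so its derivative is monotone: bisect for
    # the zero, or take the better boundary when there is no sign change.
    if dF(lo) * dF(hi) > 0:
        x = lo if dF(lo) > 0 else hi
    else:
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if dF(lo) * dF(mid) <= 0:
                hi = mid
            else:
                lo = mid
        x = 0.5 * (lo + hi)
    p_a[a1], p_a[a2] = x, rest - x
    return p_a
```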
We observed in color segmentation experiments that fixed means µ_α, initialized by a conventional mixture model procedure, produced satisfactory segmentation results. Adapting the means, however, can improve the generative performance of the PDC model. Multi-Scale Techniques: If the number of objects is large, e.g. in the case of large images, the proposed approach is computationally demanding, even if comparatively efficient optimization techniques like DA are used. In order to arrive
at improved running times for the PDC algorithm, a multi-scale optimization scheme [6,11] is applied. The idea of multi-scale optimization is to lower the computational complexity by decreasing the number of considered entities in the object space. In most application domains for image segmentation it is a natural assumption that neighboring image sites contain identical, or at least similar, feature histograms. This domain-inherent structure is exploited to create a pyramid of coarsened data and configuration spaces by tying neighboring assignment variables. It is a well-known fact that the reliable estimation of a given number of clusters requires a sufficient amount of data. Since the multi-scale optimization greatly reduces the cardinality of the configuration spaces at coarser levels, the splitting strategy and the coarse-to-fine optimization have to be brought in line. The inherent splitting behavior of DA optimization supports the coarse-to-fine hierarchy. Clusters degenerate at high temperatures, leaving only a reduced effective number k_T of groups visible. While the computational temperature is continuously lowered during optimization, clusters successively split at phase transitions [14]. Therefore, a scheme known as multi-scale annealing [11] is applied, which couples the splitting strategy and the annealing process.
3 Relation to the Information Bottleneck Framework
The Information Bottleneck principle has recently been proposed as a general information theoretical framework for describing clustering problems [17]. Essentially, it formalizes the idea that a given input signal X has to be efficiently encoded by a cluster variable X̃, and that on the other hand the relevant information about a context variable Y should be preserved as well as possible. This tradeoff is made explicit by the difference between two mutual information terms,

I(X; \tilde{X}) − \lambda\, I(\tilde{X}; Y),   (17)

which has to be minimized to determine the optimal cluster variables X̃. The quantity I(A; B) := H(A) − H(A|B) is the mutual information between two random variables A, B [2]. H(A) and H(A|B) are the entropy of A and the conditional entropy of A given B, respectively. λ > 0 is a control parameter that adjusts the tradeoff between good compression on the one hand and the level of information preservation on the other hand. The application of this general framework to our generative model is depicted in fig. 1. In our case, the signal X can be identified with the decision to select a single object i ∈ {1, ..., n}. We can assume that objects are drawn according to a uniform distribution, i.e. p_i = 1/n. The object i is then encoded by mapping it to a cluster ν ∈ {1, ..., k}, which corresponds to a cluster variable X̃ in the Information Bottleneck framework. As we assume deterministic, unique assignment of objects to clusters, the probability of cluster ν given the object i is a Boolean variable p(ν|i) = M_iν.

Fig. 1. The generative model and its relation to the Information Bottleneck principle.

Accordingly, the conditional entropy
H(\tilde{X} \mid X) = − \sum_{i} p_i \sum_{\nu} M_{i\nu} \log M_{i\nu} = 0 vanishes, which implies

I(X; \tilde{X}) = H(\tilde{X}) = − \sum_{\nu \le k} p_\nu \log p_\nu.   (18)

As the next step, it is necessary to define the context variable Y, which in the Information Bottleneck framework is used to measure the relevant information preserved by X̃. As it is desirable to retain the information by which typical observations of an object i are characterized, it is the natural choice to let Y encode the observed bin indices j. n_i denotes the number of observations for object i, so that the relative frequencies n_ij/n_i, j ∈ {1, ..., m}, form an object-specific normalized histogram. Furthermore, let p_j denote the marginal probability that an observation is attributed to bin j. The conditional entropy between Y and X̃ can be rewritten using the Markov dependency between X, X̃ and Y, i.e. p(X̃|X, Y) = p(X̃|X):

H(Y \mid \tilde{X}) = − \sum_{\tilde{X}, Y} p(Y, \tilde{X}) \log p(Y \mid \tilde{X}) = − \sum_{X, \tilde{X}, Y} p(Y, \tilde{X}, X) \log p(Y \mid \tilde{X}) = − \sum_{X, \tilde{X}, Y} p(\tilde{X} \mid X)\, P(Y \mid X)\, \frac{1}{n} \log p(Y \mid \tilde{X}).   (19)
Inserting these terms and replacing I(X̃; Y) = H(Y) − H(Y | X̃) yields the bottleneck functional

I(X; \tilde{X}) − \lambda\, I(\tilde{X}; Y) = H(\tilde{X}) − \lambda \sum_{X, \tilde{X}, Y} p(\tilde{X} \mid X)\, P(Y \mid X)\, \frac{1}{n} \log p(Y \mid \tilde{X}) − \lambda H(Y) = − \frac{1}{n} \sum_{i=1}^{n} \sum_{\nu=1}^{k} M_{i\nu} \Big[ \log p_\nu + \lambda \sum_{j \le m} \frac{n_{ij}}{n_i} \log \sum_{\alpha \le L} p_{\alpha\mid\nu}\, G_\alpha(j) \Big] − \lambda H(Y).   (20)

The entropy of Y is a constant and does not influence the search for the optimal parameters. We can, therefore, drop the term λH(Y) in (20) and multiply the equation by n without changing the minimum w.r.t. M_iν and p_{α|ν}. This operation yields the function

C^{IB} = − \sum_{i=1}^{n} \sum_{\nu=1}^{k} M_{i\nu} \Big[ \log p_\nu + \frac{\lambda}{n_i} \sum_{j=1}^{m} n_{ij} \log \sum_{\alpha=1}^{L} p_{\alpha\mid\nu}\, G_\alpha(j) \Big].   (21)

Compared to (4), it is equipped with an additional weighting factor λ that explicitly controls the influence of the cluster probabilities p_ν. If the number of observations is identical for each site i, i.e. n_i = const ∀i ∈ {1, ..., n}, we can set λ = n_i to obtain our original cost function C = − log L. This calculation proves that the generative model introduced in this paper is equivalent to the Information Bottleneck framework suggested in [17].
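The equivalence can also be verified numerically; the following small sanity check (an illustration, not part of the paper) draws random parameters and histograms with a constant number of observations per site and confirms that C^IB from (21) with λ = n_i equals the negative log-likelihood from (4).

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, l, m = 20, 3, 4, 8                             # sites, clusters, Gaussians, bins

# random but valid model parameters and data
p_nu = rng.dirichlet(np.ones(k))
p_alpha = rng.dirichlet(np.ones(l), size=k)          # (k, l)
G = rng.dirichlet(np.ones(m), size=l)                # (l, m), rows sum to 1
n_i = 50                                             # observations per site (constant)
n_ij = rng.multinomial(n_i, rng.dirichlet(np.ones(m)), size=n)
M = np.eye(k)[rng.integers(0, k, size=n)]            # Boolean assignment matrix

log_p_bin = np.log(p_alpha @ G)                      # (k, m)
inner = np.log(p_nu)[None, :] + n_ij @ log_p_bin.T   # (n, k)

neg_log_L = -np.sum(M * inner)                       # -L from eq. (4)
lam = n_i                                            # lambda = n_i
C_IB = -np.sum(M * (np.log(p_nu)[None, :]
                    + (lam / n_i) * (n_ij @ log_p_bin.T)))

assert np.isclose(neg_log_L, C_IB)                   # identical cost functions
```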
4 Experimental Results
Implementation Details: Although the proposed approach to clustering histogram data is of general applicability, our primary interest is in the domain of image segmentation. In this contribution, we put a focus on segmentation according to color features. In this setting, the basic measurements are given by the three-dimensional color vectors. The objects o_i, i = 1, ..., n correspond to image sites located on a rectangular grid. In each dimension, the color values are discretized into 32 bins. For all sites, marginal feature histograms are computed in a local neighborhood. In order to determine initial values for the involved Gaussian distributions, a mixture model estimation step is performed prior to the PDC model optimization.
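A minimal sketch of this preprocessing step (an illustration of the setup just described, not the authors' code); the neighborhood half-width is an arbitrary illustrative choice, as the window size is not specified in this excerpt.

```python
import numpy as np

def local_marginal_histograms(image, n_bins=32, half_window=8):
    """Per-site marginal color histograms.

    image : (H, W, 3) array with values in [0, 256)
    Returns an (H, W, 3 * n_bins) array: for every pixel, the concatenated
    per-channel histograms of its local neighborhood.
    (half_window is an illustrative value, not one taken from the paper.)
    """
    H, W, C = image.shape
    bins = np.minimum((image.astype(int) * n_bins) // 256, n_bins - 1)
    feats = np.zeros((H, W, C * n_bins))
    for y in range(H):
        for x in range(W):
            y0, y1 = max(0, y - half_window), min(H, y + half_window + 1)
            x0, x1 = max(0, x - half_window), min(W, x + half_window + 1)
            patch = bins[y0:y1, x0:x1]                   # local neighborhood
            for c in range(C):
                hist = np.bincount(patch[..., c].ravel(), minlength=n_bins)
                feats[y, x, c * n_bins:(c + 1) * n_bins] = hist
    return feats
```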
The Generative Model: One of the essential properties of our model is its generative nature. It is, therefore, reasonable to evaluate its quality by generating a new image from the learned statistics, i.e., we conducted experiments in which a learned model was used to re-generate its input by sampling. Two examples of this procedure are depicted in fig. 2. These results demonstrate that the color content of the original image is well represented in its generated counterpart. However, the spatial relationships between the pixels, and thus the texture characteristics, are lost. This effect is due to the histogramming process, which destroys these relations. Consequently, they cannot be taken into account by our generative model.

Fig. 2. Sampling from the learned model: a) original image, b) sampled image with four segments.

Evolution of Cluster-Assignments: a) The Multi-Scale Framework. In order to give some intuition about the dynamics of the multi-scale optimization, we produced a set of snapshots of the group assignments at various stages of the multi-scale pyramid (fig. 3). Cluster memberships are encoded by color/grey values. The series of images starts in the top left with a grouping at the coarsest stage, continuing to finer levels in a left-to-right and top-to-bottom fashion. For reference, the corresponding input image is also depicted. The interplay of the coarse-to-fine optimization and the splitting strategy is clearly visible. At the coarsest stage, the grouping starts with two clusters. Then, groups are successively split as long as there is sufficient data for a reliable estimation. Upon convergence, results are mapped to the next finer level, at which the grouping cost function, i.e. the generalized free energy resulting from the complete data log-likelihood, is further minimized. These steps are repeated until final convergence is reached.

Fig. 3. Evolution of group assignments in the multi-scale framework.

Fig. 4. Evolution of group assignments in deterministic annealing.

Evolution of Cluster-Assignments: b) Phase Transitions in DA. Another interesting phenomenon in the development of assignments in the framework of deterministic annealing is the occurrence of phase transitions. To illustrate that point, we visualized a set of group assignments at various stages of the annealing process (fig. 4). For a better exposition of that particular point we dispensed with multi-scale optimization in these examples. Again, group memberships are visualized by different colors / grey levels. At high computational temperatures, the entropic term of the free energy dominates, leading to random cluster assignments. As the temperature parameter is gradually lowered, the most prominent structural properties of the data begin to emerge in the form of stable group memberships. This process continues with the lesser pronounced characteristics of the data manifesting themselves, until a grouping solution with the predefined number of five clusters is reached at low temperatures.
Fig. 5. Segmentation results: a) human segmentation, b) PDC segmentation.
Fig. 6. Segmentation results: a) human segmentation, b) PDC segmentation.
A theoretical investigation of the critical temperatures for phase transitions in the case of K-Means clustering is given by Rose et al. in [15]. It is shown that the first split is determined by the variance along the first principal axis of the data. The critical temperature for further phase transitions is more difficult to compute due to inter-cluster influences. Because of the structural analogies of our method to K-Means, comparable results are expected to hold for PDC. Comparative Evaluation: Judging the quality of a given segmentation is difficult due to the fact that ground truth is unavailable in most cases. Furthermore, the segmentation is often only one item in a larger context of processing steps. In those cases it is only natural, as Borra and Sakar point out [1], to judge the segmentation quality with respect to the overall task. In contrast to this view, Malik et al. examined human image segmentation [8] experimentally. Their results indicate a remarkable consistency in the segmentation of given images among different human observers. This finding motivates their current effort to construct a database of human-segmented images from the Corel collection for evaluation purposes, which is publicly available. This set of images has been chosen as our testbed, making a direct comparison between our novel segmentation model and human performance possible. Figures 5 and 6 depict the best (w.r.t. PDC) human segmentation in comparison to the segmentation results achieved by parametric distributional clustering for four segments. Segment boundaries for both human and machine segmentation are given by thick white lines. It is obvious that segmentations which require high-level semantic knowledge, like shadows, cannot be reproduced by our method, but segmentations based on low-level color information are reliably inferred.
5 Conclusion
In this contribution, a novel model for unsupervised image segmentation has been proposed. It is based on robust measurements of local image characteristics given by feature histograms. As one of the main contributions, it contains a continuous model for the group-specific distributions. In contrast to existing approaches, our method thus explicitly models the noise-induced errors in the histogramming of image content. Being based on the theoretically sound maximum likelihood framework, our approach makes all modeling assumptions explicit in the cost function of the corresponding generative model. Moreover, there exists an informative connection to information theoretic concepts. The Information Bottleneck model offers an alternative interpretation of our method as a way to construct a simplified representation of a given image while preserving as much of its relevant information as possible. Finally, the results demonstrate the good performance of our model, often yielding close to human segmentation quality on the testbed. Acknowledgement. The authors are grateful to N. Tishby for various discussions on the bottleneck framework for learning. We also thank J. Puzicha for
valuable contributions at the beginning of this project. This work has been supported by the German Research Foundation (DFG) under grant #BU 914/3–1 and by cooperations with INFINEON and InfoTerra GmbH.
References 1. S. Borra and S. Sakar. A framework for performance characterization of intermediate–level grouping modules. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(11):1306–1312, 1997. 2. Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 1991. 3. I. Csiz` ar and G. Tusnady. Information geometry and alternating minimization procedures. In E. J. Dudewicz et al, editor, Recent Results in Estimation Theory and Related Topics, Statistics and Decisions, Supplement Issue No. 1. Oldenbourg, 1984. 4. A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1–38, 1977. 5. R. J. Hathaway. Another interpretation of the EM algorithm for mixture distributions. Statistics and Probability Letters, 4:53–56, 1986. 6. F. Heitz, P. Perez, and P. Bouthemy. Multiscale minimization of global energy functions in some visual recovery problems. CVGIP: Image Understanding, 59(1):125– 134, 1994. 7. A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, NJ 07632, 1988. 8. D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. ICCV’01, 2001. 9. R. M Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan, editor, Learning in Graphical Models. MIT Press, 1999. 10. F. Pereira, N. Tishby, and L. Lee. Distributional clustering of english words. In 30th International Meeting of the Association of Computational Linguistics, pages 183–190, Columbus, Ohio, 1993. 11. J. Puzicha and J. M. Buhmann. Multiscale annealing for unsupervised image segmentation. Computer Vision and Image Understanding, 76(3):213–230, 1999. 12. J. Puzicha, T. Hofmann, and J. M. Buhmann. Histogram clustering for unsupervised segmentation and image retrieval. Pattern Recognition Letters, 20:899–909, 1999. 13. J. Puzicha, T. Hofmann, and J. M. Buhmann. A theory of proximity based clustering: Structure detection by optimization. Pattern Recognition, 2000. 14. K. Rose, E. Gurewitz, and G. Fox. A deterministic annealing approach to clustering. Pattern Recognition Letters, 11:589–594, 1990. 15. K. Rose, E. Gurewitz, and G. Fox. Statistical mechanics and phse transitions in clustering. Physical Review Letters, 65(8):945–948, 1990. 16. Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000. 17. N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method. In Proc. of the 37th annual Allerton Conference on Communication, Control, and Computing, 1999.
Probabilistic Models and Informative Subspaces for Audiovisual Correspondence
John W. Fisher III and Trevor Darrell
Massachusetts Institute of Technology, Artificial Intelligence Laboratory, Cambridge, Massachusetts, USA
Abstract. We propose a probabilistic model of single source multimodal generation and show how algorithms for maximizing mutual information can find the correspondences between components of each signal. We show how non-parametric techniques for finding informative subspaces can capture the complex statistical relationship between signals in different modalities. We extend a previous technique for finding informative subspaces to include new priors on the projection weights, yielding more robust results. Applied to human speakers, our model can find the relationship between audio speech and video of facial motion, and partially segment out background events in both channels. We present new results on the problem of audio-visual verification, and show how the audio and video of a speaker can be matched even when no prior model of the speaker's voice or appearance is available.
1 Introduction
Relating multi-modal signals is an important and challenging task for machine perception systems. For example, given audio and video signals, one would like to find the audio-visual correspondences: that portion of the signals which can be inferred to have come from the same underlying source. Applied to observation of human speakers, this ability can be useful for several different tasks. It can help localize the position of the speaker in the video frame (video localization), enhance the audio quality of the speaker by segmenting it from other sources (audio localization), and finally verify whether the person being observed is the person speaking (audio-visual verification). We propose an independent cause model to capture the relationship between generated signals in each individual modality. Using principles from information theory and nonparametric statistics, we show how an approach for learning maximally informative joint subspaces can find cross-modal correspondences. We analyze the graphical model of multi-modal generation and show under what conditions related subcomponents of each signal have high mutual information. Non-parametric statistical density models can be used to measure the degree of mutual information in complex phenomena [6], which we apply to audio/visual data. This technique simultaneously learns projections of images in the video sequence and projections of sequences of periodograms taken from the audio
sequence. The projections are computed adaptively such that the video and audio projections have maximum mutual information (MI). Applied to audiovisual data, early results with this technique on video and audio localization have been reported [4], but without any derivation from a probabilistic framework. In practice, these techniques may fail to converge to a useful solution, since the projection parameters were underconstrained. In this paper we ground the mutual information algorithm in a probabilistic model, and extend the informative subspace algorithm to include a prior bias towards small projection coefficients. We also present new results on the problem of audio-visual verification without prior models of user speech or appearance, an application not previously addressed in the literature. In the next section we review related work on audio-visual fusion and information theoretic adaptive methods. We present our probabilistic model for cross-modal signal generation, and show how audio-visual correspondences can be found by identifying components with maximal mutual information. We then review techniques for efficient estimation of mutual information using nonparametric entropy models. Finally, we show a new application to a verification task where we detect whether audio and video come from a single speaker. In an experiment comparing the audio and video of every combination of a group of eight users, our technique was able to perfectly match the corresponding audio and video from a single user. These results are based purely on the instantaneous cross-modal mutual information between the projections of the two signals, and do not rely on any prior experience or model of the user's speech or appearance.
2 Related Work
Humans routinely perform tasks in which ambiguous auditory and visual data are combined in order to support accurate perception. In contrast, automated approaches for processing multi-modal data sources lag far behind. This is primarily due to the fact that few methods adequately model the complexity of the audio/visual relationship. Classical approaches to multi-modal fusion usually either assume a statistical relationship which is too simple (e.g. jointly Gaussian) or defer fusion to the decision level, when many of the joint (and useful) properties have been lost. While such pragmatic choices may lead to simple statistical measures, they do so at the cost of modeling capacity. Information theory motivates fusion at the measurement level without regard to specific parametric densities. The idea of using information-theoretic principles in an adaptive framework is not new (e.g. see [3] for an overview), with many approaches suggested over the last 30 years. A critical distinction among most information theoretic approaches lies in how densities are modeled (either explicitly or implicitly), how entropy (and by extension mutual information) is approximated or estimated, and the types of mappings which are used (e.g. linear vs. nonlinear). Approaches which use a Gaussian assumption include Plumbley [12, 11] and Becker [1]. Additionally, [1] applies the method to fusion of artificially generated random dot stereograms.
With regard to audio/visual processing, there has been substantial progress on feature-level integration of speech and vision. For example, Meier et al. [9], Stork [14], and others have built visual speech reading systems that can improve speech recognition results dramatically. However, many of these systems assume that no significant motion distractors are present and that the camera was "looking" at the user uttering the audio signal. Speech systems (both those that integrate viseme features and those that do not) are easily confused if there are nearby speakers also making utterances, either directed at the speech recognition system or not. Our system, described below, is designed to be able to detect and disambiguate cases where audio and video signals are coming from different sources. Since our method is not dependent on speech content, it would have the advantage of working on non-verbal utterances. Other audio/visual work which is closely related to ours is that of Hershey and Movellan [7], which examined the per-pixel correlation relative to an audio track, detecting which pixels have related variation. Again, an inherent assumption of this method was that the joint statistics were Gaussian. Slaney and Covell [13] looked at optimizing temporal alignment between audio and video tracks using canonical correlations (equivalent to mutual information in the jointly Gaussian case), but did not address the problem of detecting whether two signals came from the same person or not. The idea of simply gating audio input with a face detector is related to ours, but would not solve our target scenario above, where the user is facing the screen and a nearby person makes an utterance that can be mistakenly interpreted as a system command. We are not aware of any prior work which can perform audio-visual verification at a signal level, without any prior speech or appearance model.
3 Probabilistic Models of Audio-Visual Fusion
We consider multimodal scenes which can be modeled probabilistically with one joint audio-visual source and distinct background interference sources for each modality. We note that the proposed method can be extended to multiple multimodal sources. Each observation is a combination of information from the joint source and information from the background interferer for that channel. We use a graphical model, figure 1, to represent this relationship. In the diagrams, B represents the joint source, while A and C represent single-modality background interference. Our purpose here is to analyze under which conditions our methodology should uncover the underlying cause of our observations. Figure 1a shows an independent cause model for our typical case, where {A, B, C} are unobserved random variables representing the causes of our (high-dimensional) observations in each modality {X^a, X^v}. In general there may be more causes and more measurements, but this simple case can be used to illustrate our algorithm. An important aspect is that the measurements have dependence on only one common cause. The joint statistical model consistent
with the graph of figure 1a is

P(A, B, C, X^a, X^v) = P(A)\, P(B)\, P(C)\, P(X^a \mid A, B)\, P(X^v \mid B, C).

Fig. 1. Graphs illustrating the various statistical models exploited by the algorithm: (a) the independent cause model - X^a and X^v are independent of each other conditioned on {A, B, C}, (b) information about X^a contained in X^v is conveyed through joint statistics of A and B, (c) the graph implied by the existence of a separating function, and (d) two equivalent Markov chains which can be extracted from the graphs if the separating functions can be found.

Given the independent cause model, a simple application of Bayes' rule (or
the equivalent graphical manipulation) yields the graph of figure 1b, which is consistent with

P(A, B, C, X^a, X^v) = P(X^a)\, P(C)\, P(A, B \mid X^a)\, P(X^v \mid B, C),

which shows that information about X^a contained in X^v is conveyed through the joint statistics of A and B. The consequence is that, in general, we cannot disambiguate the influences that A and B have on the measurements. A similar graph is obtained by conditioning on X^v. Suppose decompositions of the measurements X^a and X^v exist such that the following joint density can be written:

P(A, B, C, X^a, X^v) = P(A)\, P(B)\, P(C)\, P(X^a_A \mid A)\, P(X^a_B \mid B)\, P(X^v_B \mid B)\, P(X^v_C \mid C),

where X^a = [X^a_A, X^a_B] and X^v = [X^v_B, X^v_C]. An example for our specific application would be segmenting the video image (or filtering the audio signal). In this case we get the graph of figure 1c, and from that graph we can extract the Markov chain which contains elements related only to B. Figure 1d shows equivalent graphs of the extracted Markov chain. As a consequence, there is no influence due to A or C. Of course, we are still left with the formidable task of finding a decomposition, but given the decomposition it can be shown, using the data processing inequality [2], that the following inequalities hold:

I(X^a_B, X^v_B) \le I(X^a_B, B)   (1)
I(X^a_B, X^v_B) \le I(X^v_B, B)   (2)
More importantly, these inequalities hold for functions of X^a_B and X^v_B (e.g. Y^a = f(X^a; h_a) and Y^v = f(X^v; h_v)). Consequently, by maximizing the mutual information I(Y^a; Y^v) we must necessarily increase the mutual information between Y^a and B and between Y^v and B. The implication is that fusion in such a manner discovers the underlying cause of the observations, that is, the joint density p(Y^a, Y^v) is strongly related to B. Furthermore, with an approximation, we can optimize this criterion without estimating the separating function directly. In the event that a perfect decomposition does not exist, it can be shown that the method will approach a "good" solution in the Kullback-Leibler sense.
4 Informative Subspaces from Linear Projections
From the perspective of information theory, estimating separate projections of the audio and video measurements which have high mutual information makes intuitive sense, as such features will be predictive of each other. The advantage is that the form of those statistics is not subject to the strong parametric assumptions (e.g. joint Gaussianity) which we wish to avoid. We can find these projections using a technique that maximizes the mutual information between the projections of the two spaces. Following [4], we use a nonparametric model of the joint density for which an analytic gradient of the mutual information with respect to the projection parameters is available. First we review that technique, and then present new extensions that add a prior bias term on the projection coefficients. We have found that without this extension the technique may not converge to a useful subspace. In principle the method may be applied to any function of the measurements, Y = f(X; h), which is differentiable in the parameters h (e.g. as shown in [4]). Here we consider a linear fusion model which results in a significant computational savings at a minimal cost to the representational power (largely due to the nonparametric density modeling of the output):

\begin{bmatrix} y^v_1 \cdots y^v_N \\ y^a_1 \cdots y^a_N \end{bmatrix} = \begin{bmatrix} h_v^T & 0^T \\ 0^T & h_a^T \end{bmatrix} \begin{bmatrix} x^v_1 \cdots x^v_N \\ x^a_1 \cdots x^a_N \end{bmatrix}   (3)

where x^v_i ∈ R^{N_v} and x^a_i ∈ R^{N_a} are lexicographic samples of images and periodograms, respectively, from an A/V sequence. The linear projection defined by h_v^T ∈ R^{M_v × N_v} and h_a^T ∈ R^{M_a × N_a} maps A/V samples to low-dimensional features y^v_i ∈ R^{M_v} and y^a_i ∈ R^{M_a}. Treating the x_i's and y_i's as samples from a random variable, our goal is to choose h_v and h_a to maximize the mutual information, I(Y^a; Y^v), of the derived measurements.
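In code, the linear fusion model (3) amounts to two matrix products; the following minimal sketch (not the authors' implementation) assumes the frames and periodograms have already been unrolled lexicographically into the columns of two data matrices.

```python
import numpy as np

def project_sequences(X_v, X_a, h_v, h_a):
    """Linear fusion model of eq. (3).

    X_v : (N_v, N) lexicographically ordered video frames (one per column)
    X_a : (N_a, N) audio periodograms (one per column)
    h_v : (N_v, M_v) video projection;  h_a : (N_a, M_a) audio projection
    Returns Y_v of shape (M_v, N) and Y_a of shape (M_a, N).
    """
    Y_v = h_v.T @ X_v
    Y_a = h_a.T @ X_a
    return Y_v, Y_a

# Example with M_v = M_a = 1, as in the experiments reported below:
# y_v, y_a = project_sequences(frames, pgrams, h_v, h_a)   # each (1, N)
```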
Mutual information, which is a combination of entropy terms, is defined for continuous random variables as [2]

I(Y^v; Y^a) = h(Y^a) + h(Y^v) − h(Y^a, Y^v) = − \int_{R_{Y^a}} p_{Y^a}(y) \log p_{Y^a}(y)\, dy − \int_{R_{Y^v}} p_{Y^v}(y) \log p_{Y^v}(y)\, dy + \int_{R_{Y^a} \times R_{Y^v}} p_{Y^a, Y^v}(x, y) \log p_{Y^a, Y^v}(x, y)\, dx\, dy.   (4)
Mutual information indicates the amount of information that one random variable conveys on average about another. The usual difficulty of MI as a criterion for adaptation is that it is an integral function of probability densities. Furthermore, in general we are not given the densities themselves, but samples from which they must be inferred. To overcome this problem, we replace each entropy term in equation (4) with a second-order Taylor series approximation as in [6]:

\hat{I}(Y^v, Y^a) = \hat{H}(Y^a) + \hat{H}(Y^v) − \hat{H}(Y^v, Y^a)   (5)
= \int_{R_{Y^a}} (\hat{p}_{Y^a}(y) − p_u(y))^2\, dy + \int_{R_{Y^v}} (\hat{p}_{Y^v}(y) − p_u(y))^2\, dy − \int_{R_{Y^a} \times R_{Y^v}} (\hat{p}_{Y^v, Y^a}(x, y) − p_u(x, y))^2\, dx\, dy   (6)

where R_{Y^a} is the support of one feature output, R_{Y^v} is the support of the other, p_u is the uniform density over that support, and p̂(x) is a Parzen density estimate [10] computed from the projected samples. The Parzen density estimate is defined as

\hat{p}(y) = \frac{1}{N} \sum_{i} \kappa(y − y_i, \sigma)   (7)

where κ(·) is a Gaussian kernel (in our case) and σ is the standard deviation. The Parzen density estimate has the capacity to capture relationships with more complex structure than typical parametric families of densities. Note that this is essentially an integrated squared error comparison between the density of the projections and the uniform density (which has maximum entropy over a finite region). An advantage of this particular combination of second-order entropy approximation and nonparametric density estimator is that the gradient terms (appropriately combined to approximate mutual information as in (6)) with respect to the projection coefficients can be computed exactly by evaluating a finite number of functions at a finite number of sample locations in the output space, as shown in [5, 6]. The update term for the individual entropy terms in (6) (note the negative sign on the third term) of the ith feature vector at iteration k, as a function of the value of the feature vector at iteration k − 1, is (where y_i denotes a sample of either Y^a or Y^v or their concatenation, depending on which term of (6) is being computed)

\Delta y_i(k) = \frac{1}{N} \sum_{j \ne i} \Big[ b_r\big(y_i(k−1)\big) − \kappa_a\big( y_i(k−1) − y_j(k−1), \Sigma \big) \Big]   (8)
b_r(y_i)_j \approx \frac{1}{d} \Big[ \kappa\big( y_i + \tfrac{d}{2}, \Sigma \big)_j − \kappa\big( y_i − \tfrac{d}{2}, \Sigma \big)_j \Big]   (9)

\kappa_a(y, \Sigma) = \kappa(y, \Sigma) \ast \kappa'(y, \Sigma) = − \big( 2^{M+1} \pi^{M/2} \sigma^{M+2} \big)^{-1} \exp\Big( − \frac{y^T y}{4\sigma^2} \Big)\, y   (10)

where M = M_a, M_v, or M_a + M_v depending on the entropy term. Both b_r(y_i) and κ_a(y_i, σ) are vector-valued functions (M-dimensional) and d is the support of the output (i.e. a hyper-cube with volume d^M). The notation b_r(y_i)_j indicates the jth element of b_r(y_i). Adaptation consists of the update rule above followed by a modified least squares solution for h_v and h_a until a local maximum is reached. In the experiments that follow, M_v = M_a = 1 with 150 to 300 iterations.
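The quantities in (5)-(7) can be sketched as follows for the one-dimensional case M_v = M_a = 1 used in the experiments. This is an illustration rather than the authors' implementation: it evaluates the integrated-squared-error entropy approximations numerically on a discretized support instead of using the exact finite-sample expressions, and the support, kernel width, and grid size are arbitrary choices.

```python
import numpy as np

def parzen(samples, grid, sigma):
    """Parzen estimate (7) with a Gaussian kernel, evaluated on `grid`."""
    d2 = (grid[:, None] - samples[None, :]) ** 2
    k = np.exp(-d2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    return k.mean(axis=1)

def mi_second_order(y_v, y_a, sigma=0.1, support=(0.0, 1.0), n_grid=64):
    """Second-order approximation (5)-(6): integrated squared error between
    each estimated density and the uniform density over the support."""
    lo, hi = support
    grid = np.linspace(lo, hi, n_grid)
    du = grid[1] - grid[0]
    pu1 = 1.0 / (hi - lo)                          # uniform marginal density
    pu2 = pu1 ** 2                                 # uniform joint density

    H_a = np.sum((parzen(y_a, grid, sigma) - pu1) ** 2) * du
    H_v = np.sum((parzen(y_v, grid, sigma) - pu1) ** 2) * du

    # joint Parzen estimate on a 2-D grid (product Gaussian kernel)
    ga = np.exp(-(grid[:, None] - y_a[None, :]) ** 2 / (2 * sigma ** 2))
    gv = np.exp(-(grid[:, None] - y_v[None, :]) ** 2 / (2 * sigma ** 2))
    joint = (ga[:, None, :] * gv[None, :, :]).mean(axis=2) / (2 * np.pi * sigma ** 2)
    H_av = np.sum((joint - pu2) ** 2) * du * du

    return H_a + H_v - H_av                        # eq. (5)
```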
In [4], early results were demonstrated using this method for the video-based localization of a speaking user. However, the technique often failed to converge, since the projection coefficients were under-determined. To improve on the method, we thus introduce a capacity control mechanism in the form of a prior bias towards small weights. Capacity Control. The method of [6] requires that the projection be differentiable, which it is in this case. Additionally, some form of capacity control is necessary, as the method results in a system of underdetermined equations. To address this problem we impose an L2 penalty on the projection coefficients of h_a and h_v. Furthermore, we impose the criterion that, if we consider the projection h_v as a filter, it has low output energy when convolved with images in the sequence (on average). This constraint is the same as that proposed by Mahalanobis et al. [8] for designing optimized correlators, the difference being that in their case the projection output was designed explicitly, while in our case it is derived from the MI optimization in the output space. The adaptation criterion, which we maximize in practice, is then a combination of the approximation to MI (equation 5) and the regularization terms:

J = \hat{I}(Y^v, Y^a) − \alpha_v h_v^T h_v − \alpha_a h_a^T h_a − \beta\, h_v^T \bar{R}_V^{-1} h_v   (11)

where the last term derives from the output energy constraint and \bar{R}_V is the average autocorrelation function (taken over all images in the sequence). This term is more easily computed in the frequency domain (see [8]) and is equivalent to pre-whitening the images using the inverse of the average power spectrum. The scalar weighting terms α_v, α_a, and β were set using a data-dependent heuristic for all experiments. The interesting thing to note is that computing h_v can be decomposed into three stages: 1. Pre-whiten the images once (using the average spectrum of the images), followed by iterations of
2. Updating the feature values (the y^v_i's), and 3. Solving for the projection coefficients using least squares and the L2 penalty. The pre-whitening interpretation makes intuitive sense in our case, as it accentuates edges in the input image. It is the moving edges (lips, chin, etc.) which we expect to convey the most information about the audio. The projection coefficients related to the audio signal, h_a, are solved for in a similar manner (and simultaneously), without the initial pre-whitening step.
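The pre-whitening stage and the regularized least-squares stage can be sketched as follows. This is a schematic illustration (not the authors' code): whitening is done by dividing each frame's spectrum by the square root of the average power spectrum of the sequence, and the projection update is shown as a plain ridge-regression solve; the output-energy term of (11) is assumed to be absorbed by the whitening.

```python
import numpy as np

def prewhiten_frames(frames, eps=1e-6):
    """Step 1: divide each frame's spectrum by the average magnitude
    spectrum of the sequence.

    frames : (N, H, W) image sequence
    """
    F = np.fft.fft2(frames, axes=(1, 2))
    avg_power = np.mean(np.abs(F) ** 2, axis=0)
    white = F / np.sqrt(avg_power + eps)
    return np.real(np.fft.ifft2(white, axes=(1, 2)))

def ridge_projection(X, y, alpha):
    """Step 3: least-squares fit of a projection to the current feature
    values with an L2 penalty on the coefficients (schematic)."""
    # X: (N, D) whitened, unrolled frames; y: (N,) current feature values
    A = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)
```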
5 Empirical Results
We now present new experimental results in which the general method described previously is used first to localize the speaker in the video (i.e., audio-based video localization) and second to measure whether the audio signal is consistent with the video signal (i.e., audio-visual verification). Our motivating scenario for this application is a user interacting with an anonymous handheld device or kiosk using spoken commands. Given a received audio signal, we would like to verify whether the person speaking the command is in the field of view of the camera on the device, and if so to localize which person is speaking. If there are numerous handheld devices in an area, one would like them not all to respond to a command given to one of them. Simple techniques which check only for the presence of a face (or moving face) would fail when two people were looking at their individual devices and one spoke a command. Since future devices may be anonymous and interchangeable, we are interested in the case where no prior model of the voice or appearance of users is available to perform the verification and localization. The technique described in the previous sections is thus an appropriate approach. We collected audio-video data from eight subjects. In all cases the video data was collected at 29.97 frames per second at a resolution of 360x240. The audio signal was collected at 48 kHz, but only 10 kHz of frequency content was used. All subjects were asked to utter the phrase "How's the weather in Taipei?". This typically yielded 2-2.5 seconds of data. Video frames were processed as is, while the audio signal was transformed to a series of periodograms. The window length of the periodogram was 2/29.97 seconds (i.e., spanning the width of two video frames). Upon estimating projections, the mutual information between the projected audio and video data samples is used as the measure of consistency. All values for mutual information are reported relative to the maximum possible value, which is the value obtained (in the limit) if the two variables are uniformly distributed and perfectly predict one another. In all cases we assume that there is no significant head movement on the part of the speaker during the utterance of the sentence. While this assumption might be violated in practice, one might account for head movement using a tracking algorithm, in which case the algorithm as described would process the images after tracking.
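The audio preprocessing described above can be sketched as follows; the Hann window, the non-overlapping windows, and the function name are assumptions not specified in the text.

```python
import numpy as np

def audio_periodograms(audio, fs=48000, fps=29.97):
    """Split audio into windows spanning two video frames and return |FFT|^2 per window."""
    win = int(round(2 * fs / fps))                       # window = 2/29.97 s of samples
    n_win = len(audio) // win
    frames = audio[:n_win * win].reshape(n_win, win)
    spectra = np.abs(np.fft.rfft(frames * np.hanning(win), axis=1)) ** 2
    freqs = np.fft.rfftfreq(win, d=1.0 / fs)
    keep = freqs <= 10_000                               # retain only 0-10 kHz content
    return spectra[:, keep], freqs[keep]
```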
5.1 Video Localization of Speaker
Figure 2a shows a single video frame from one sequence of data. In the figure there is a single speaker and a video monitor. Throughout the sequence the video monitor exhibits significant flicker. Figure 2c shows an image of the pixel-wise standard deviations of the image sequence. As can be seen, the energy associated with changes due to monitor flicker is greater than that due to the speaker. Figure 2b shows the absolute value of the output of the pre-whitening stage for the video frame in the same figure. Note that the output we use is signed. The absolute value is shown instead because it illustrates the enhancement of edges in the image. Figure 4a shows the associated periodogram sequence where the horizontal axis is time and the vertical axis is frequency (0-10 kHz). Figure 2d shows the coefficients of the learned projection when fused with the audio signal. As can be seen, the projection highlights the region about the speaker's lips. Figure 3a
Fig. 2. Video sequence contains one speaker and monitor which is flickering: (a) one image from the sequence, (b) magnitude of the image after pre-whitening, (c) pixel-wise image of standard deviations taken over the entire sequence, (d) image of the learned projection, hv , (e) image of hv for incorrect audio
shows results from another sequence in which there are two people. The person on the left was asked to utter the test phrase, while the person on the right moved their lips, but did not speak. This sequence is interesting in that a simple face detector would not be sufficient to disambiguate the audio and video stream. Furthermore, viseme-based approaches might be confused by the presence of two faces. Figures 3b and 3c show the pre-whitened images as before. There are significant changes about both subjects' lips. Figure 3d shows the coefficients of the learned projection when the video is fused with the audio, and again the region about the correct speaker's lips is highlighted.
Fig. 3. Video sequence containing one speaker (person on left) and one person who is randomly moving their mouth/head (but not speaking): (a) one image from the sequence, (b) magnitude of the image after pre-whitening, (c) pixel-wise image of standard deviations taken over the entire sequence, (d) image of the learned projection, hv , (e) image of hv for incorrect audio.
5.2 Quantifying Consistency between the Audio and Video
In addition to localizing the audio source in the image sequence we can also check for consistency between the audio and video. Such a test is useful in the case that the person to which a system is visually attending is not the person who actually spoke. Having learned a projection which optimizes MI in the output feature space, we can then estimate the resulting MI and use that estimate to quantify the audio/video consistency. Using the sequences of figures 2 and 3, we compared the fusion result when using a separately recorded audio sequence from another speaker. The periodogram of the alternate audio sequence is shown in figure 4b. Figures 2e and 3e show the resulting h_v when the alternate audio sequence is used. When the alternate audio is used, we see that coefficients related to the video monitor increase significantly in 2e, while energy is distributed throughout the image in 3e. For figure 2 the estimate of mutual information was 0.68 relative to the maximum possible value for the correct audio sequence. In contrast, when compared to the periodogram of 4b, the value drops to 0.08 of maximum. For the sequence of figure 3, the estimate of mutual information for the correct sequence was 0.61 relative to maximum, while it drops to 0.27 when the alternate audio is used.
5.3 Eight-Way Test
Finally, data was collected from six additional subjects. These data were used to perform an eight-way test. Each video sequence was compared to each audio sequence. No attempt was made to optimally align the mismatched audio
Fig. 4. Gray-scale magnitude of audio periodograms. Frequency increases from bottom to top, while time is from left to right. (a) audio signal for the image sequence of figure 2. (b) alternate audio signal recorded from a different subject.
sequences. Table 1 summarizes the results. The previous sequences correspond to subjects 1 and 2 in the table. In every case the matching audio/video pairs exhibited the highest mutual information after estimating the projections.

Table 1. Summary of results over eight video sequences. The columns indicate which audio sequence was used while the rows indicate which video sequence was used. In all cases the correct audio/video pair has the highest relative MI score.

        a1    a2    a3    a4    a5    a6    a7    a8
  v1   0.68  0.19  0.12  0.05  0.19  0.11  0.12  0.05
  v2   0.20  0.61  0.10  0.11  0.05  0.05  0.18  0.32
  v3   0.05  0.27  0.55  0.05  0.05  0.05  0.05  0.05
  v4   0.12  0.24  0.32  0.55  0.22  0.05  0.05  0.10
  v5   0.17  0.05  0.05  0.05  0.55  0.05  0.20  0.09
  v6   0.20  0.05  0.05  0.13  0.14  0.58  0.05  0.07
  v7   0.18  0.15  0.07  0.05  0.05  0.05  0.64  0.26
  v8   0.13  0.05  0.10  0.05  0.31  0.16  0.12  0.69

6 Conclusions and Future Work
We have presented an information-theoretic approach to the problem of finding cross-modal correspondence. A new probabilistic formulation of joint signal generation was proposed, and we showed how maximizing mutual information could find desired correspondences. Informative subspaces can be found with non-parametric density models and second-order approximations to mutual information. To find these robustly, we proposed new prior terms on projection
coefficients. Our approach was applied to the problem of audio-visual localization and verification, detecting the correspondence between the speech and appearance of a human speaker without having any prior model of that user in either domain. In all the cases that we tried, our technique was able to correctly pair the video with the corresponding audio from a particular individual, and localize where in a video frame the user's face was present.
References 1. Suzanna Becker. An Information-theoretic Unsupervised Learning Algorithm for Neural Networks. PhD thesis, University of Toronto, 1992. 2. Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, Inc., New York, 1991. 3. G. Deco and D. Obradovic. An Information Theoretic Approach to Neural Computing. Springer-Verlag, New York, 1996. 4. John W. Fisher III, Trevor Darrell, William T. Freeman, and Paul. Viola. Learning joint statistical models for audio-visual fusion and segregation. In Advances in Neural Information Processing Systems 13, 2000. 5. John W. Fisher III and Jose C. Principe. Entropy manipulation of arbitrary nonlinear mappings. In J.C. Principe, editor, Proc. IEEE Workshop, Neural Networks for Signal Processing VII, pages 14–23, 1997. 6. John W. Fisher III and Jose C. Principe. A methodology for information theoretic feature extraction. In A. Stuberud, editor, Proceedings of the IEEE International Joint Conference on Neural Networks, 1998. 7. John Hershey and Javier Movellan. Using audio-visual synchrony to locate sounds. In S. A. Solla, T. K. Leen, and K-R. Mller, editors, Advances in Neural Information Processing Systems 12, pages 813–819, 1999. 8. A. Mahalanobis, B. Kumar, and D. Casasent. Minimum average correlation energy filters. Applied Optics, 26(17):3633–3640, 1987. 9. Uwe Meier, Rainer Stiefelhagen, Jie Yang, and Alex Waibel. Towards unrestricted lipreading. In Second International Conference on Multimodal Interfaces (ICMI99), 1999. 10. E. Parzen. On estimation of a probability density function and mode. Ann. of Math Stats., 33:1065–1076, 1962. 11. M. Plumbley. On information theory and unsupervised neural networks. Technical Report CUED/F-INFENG/TR. 78, Cambridge University Engineering Department, UK, 1991. 12. M. Plumbley and S Fallside. An information-theoretic approach to unsupervised connectionist models. In D. Touretzky, G. Hinton, and T. Sejnowski, editors, Proceedings of the 1988 Connectionists Models Summer School, pages 239–245. Morgan Kaufman, San Mateo, CA, 1988. 13. Malcolm Slaney and Michele Covell. Facesync: A linear operator for measuring synchronization of video facial images and audio tracks. In T. K. Leen, Thomas G. Dietterich, and Volker Tresp, editors, Advances in Neural Information Processing Systems 13, 2000. 14. G. Wolff, K. V. Prasad, D. G. Stork, and M. Hennecke. Lipreading by neural networks: Visual preprocessing, learning and sensory integration. In Proc. of Neural Information Proc. Sys. NIPS-6, pages 1027–1034, 1994.
Volterra Filtering of Noisy Images of Curves
Jonas August
Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA
Abstract. How should one filter very noisy images of curves? While blurring with a Gaussian reduces noise, it also reduces contour contrast. Both non-homogeneous and anisotropic diffusion smooth images while preserving contours, but these methods assume a single local orientation and therefore they can merge or distort nearby or crossing contours. To avoid these difficulties, we view curve enhancement as a statistical estimation problem in the three-dimensional (x, y, θ)-space of positions and directions, where our prior is a probabilistic model of an ideal edge/line map known as the curve indicator random field (cirf). Technically, this random field is a superposition of local times of Markov processes that model the individual curves; intuitively, it is an idealized artist’s sketch, where the value of the field is the amount of ink deposited by the artist’s pen. After reviewing the cirf framework and our earlier formulas for the cirf cumulants, we compute the minimum mean squared error (mmse) estimate of the cirf embedded in large amounts of Gaussian white noise. The derivation involves a perturbation expansion in an infinite noise limit, and results in linear, quadratic, and cubic (Volterra) cirf filters for enhancing images of contours. The self-avoidingness of smooth curves in (x, y, θ) simplified our analysis and the resulting algorithms, which run in O(n log n) time, where n is the size of the input. This suggests that the Gestalt principle of good continuation may not only express the likely smoothness of contours, but it may have a computational basis as well.
1 Introduction
Is noise a help or a hindrance? Despite intriguing phenomena such as stochastic resonance, noise is largely viewed as a necessary evil for applied science. Filtering images of contours, the problem studied here, is especially susceptible to noise because spatial differentiation is often involved in obtaining initial measurements (Fig. 1, top). Large noise levels, however, can actually simplify the filtering problem analytically, making difficult problems (asymptotically) manageable [22]. For example, here we derive nonlinear contour enhancement filters that operate in a neighborhood of an infinite noise limit. This work is part of a larger project to understand visual contour computations as the Bayesian inference of a random field [2]. Beginning with a random
I am grateful to Steven Zucker for his deep insights and his caring and patient advice in supporting this work. I thank Patrick Huggins, Athinodoros Georghiades, and Ohad Ben-Shahar for helpful discussions. Funding was provided by AFOSR and the Heinz Endowment. Please direct correspondence to [email protected].
process model of individual curves, we have developed a prior for images of these curves, which we call the curve indicator random field (cirf), and we have obtained the complete cirf statistics. The main contributions of this paper are three new linear and nonlinear Volterra filters for computing the posterior mean of the cirf for noisy contour images.
Fig. 1. (a) (Left) A guide wire imaged fluoroscopically during surgery. Finding the wire in noise is important [11] because surgery can take hours, all the while exposing the patient to radiation. Reducing exposures increases the noise level, however, and makes local filtering perform poorly (e.g., logical/linear positive-contrast line operator response tuned to thin curve, center, and its thresholding, right). In addition, local operators often employ numerical differentiation, which increases noise. (b) Crossing contours (left) in noise (center left) are distorted by Gaussian blurring (center right and right with blur sizes 5 and 10, respectively), revealing loss of distinction among overlapping contours, as well as reduced contrast. In this paper we exploit a model of random images of contours to derive nonlinear filters for overcoming noise without merging or attenuating contours.
Given the apparently slight distinction between a set of random curves and an image of them (the cirf), it is not surprising that the cirf has been overlooked as an object of study in vision. But there are significant advantages to making the cirf a focus for understanding visual contours. First, it provides an exact notion of an ideal edge or line map (Fig. 1, top), satisfying the two desiderata of being (a) nonzero-valued along the true contours, and (b) zero-valued else-
where. The cirf therefore provides a basis for formalizing saliency [24,13] as the problem of estimating the cirf. Second, unlike dictionary-based relaxation labeling [8] and line processes in Markov random fields [7,14], the cirf does not require an explicit and arbitrary enumeration and weighting of local field configurations to specify its joint statistical structure. Instead, by formally defining the field as a function of the curves depicted by it, the statistical properties of the field become a derivable consequence of those of the contours. Third, being a field, the cirf makes it possible to formalize what is even meant by observing contours under various forms of corruption. Without the cirf, a concept of a noisy contour might mean a smooth curve made rough; with the cirf, a noisy image of a contour is simply the pointwise addition of one field (the cirf) with another (white noise), for example. Fourth, the filters we derive provide a different notion of blur scale than convolution by a Gaussian (Fig. 1, bottom), or linear scale-space [12]. In particular, smoothing will take place along the (putative) contours, and unlike nonhomogeneous [19] and anisotropic diffusion [26], it allows for multiple, nearby and crossing curves. Finally, the cirf provides a local representation of contour intersections; in contrast, parameterized curves require a strictly global computation to determine intersections. As ink builds up where contours cross in an artist’s sketch, so too does the (local) value of the cirf increase at intersections. There are two major sources of inspiration for this project. The first is the set of relaxation labeling [21] approaches to contour computation, which explicitly emphasize the need to include contextual interaction in inference, in contrast to purely local techniques [4] that detect edges largely independently. These approaches formalize contour inference as a labeling problem using a notion of label compatibility derived from good continuation in orientation [28] or curvature as well [18,10]. The second source of inspiration is the work on curves of least energy [9] or elastica [16] and its probabilistic expression as stochastic completion fields [27]. The cirf was introduced to unify these works in a probabilistic framework.
2 A Prior on Images of Curves
Our random field inference framework for contour enhancement begins with a Markov process model for each contour. Mumford [16] and Williams and coworkers [27] imagined a particle at R_t (in the space of positions (x, y) and directions θ in R² × S) whose direction is slightly perturbed at each time instant t before taking its next step forward. The particle's probability density p(x, y, θ, t) diffuses according to ∂p/∂t = Qp, where

Q := \frac{\sigma_\kappa^2}{2}\frac{\partial^2}{\partial\theta^2} - \cos\theta\,\frac{\partial}{\partial x} - \sin\theta\,\frac{\partial}{\partial y} - \lambda^{-1}

is the generator of the Markov process R_t (the direction process), σ_κ bounds the direction perturbations, and λ is the mean of the (exponentially distributed) contour length T. In our framework the Markov process R_t models all image contours, irrespective of how they are observed. Indeed, the particular (stationary) Markov process contour model is unspecified in our framework; more exotic processes, which include scale [25] or curvature κ [3], can be used as well. At this level of
generality, the Markov process R_t takes on values (states or sites) i in a finite state space I, e.g., i = (x, y, θ) (where space is discretized). The Markov process is summarized by its Green's operator G = (g_ij), for i, j ∈ I, where g_ij is the average amount of time the process spends in site j given that it started in site i (Fig. 3). A natural image will have an unknown number N (assumed to be Poisson-distributed with mean N̄) of contours, R_t^(1), ..., R_t^(N), which are independent and identically distributed (i.i.d.) with lengths T_1, ..., T_N. So far we have only described the individual contours, but we know of them only through a (spatially distributed) field of measurements M = (M_i) from an image (e.g., orientation-selective edge operator responses), which are corrupted due to noise and blur. To cope with such ambiguity, we suggest that contour organization be formalized as the estimation of an ideal field U of assertions of local contour existence.

Definition 1. The curve indicator random field (cirf) U = (U_i) is

U_i := \sum_{n=1}^{N} \int_0^{T_n} \mathbf{1}\{R_{t_n}^{(n)} = i\}\, dt_n, \qquad \forall i \in I.
Here, the integral over t_n is called the local time at i for process R_{t_n}^{(n)}; the different processes (curves) are combined via superposition. In words, U_i is the (random) amount of time that particles spent in state i, and so non-zero values of U_i indicate that a contour passed through i (Fig. 2). We suggest that the cirf be used as a prior on images of curves not only because it captures an intuitive notion of the "perfect" edge map, but also because it can be tractably and fully characterized. In particular, the following is a formula for the cirf cumulants¹ (a complete set of joint statistics of this prior) and provides, in effect, an exact description of contextual interactions in an image of curves.

Proposition 1. Suppose the Markov process R_t is spatially homogeneous and the endpoint distributions are spatially uniform. Let the contour density be η := N̄λ|I|^{-1}. The k-th (joint) cumulant of the cirf at sites i_1, ..., i_k is

\mathrm{cum}_{i_1,\ldots,i_k} := \mathrm{cum}\{U_{i_1}, \ldots, U_{i_k}\} = \eta \sum g_{j_1 j_2} \cdots g_{j_{k-1} j_k},

where the sum is over all permutations j_1, ..., j_k of i_1, ..., i_k.

The proof requires the Feynman-Kac formula, a broadly-applied result in quantum physics (see [2] for details). Since cirf cumulants of order greater than two are (generally) nonzero, the cirf is therefore a non-Gaussian random field [17]. The cumulants are the coefficients of a Taylor series expansion of the cumulant generating functional cgf_U(c) := ln mgf_U(c) of U, where mgf_U(c) := E exp⟨c, U⟩ is the moment generating functional of U at c and ⟨a, b⟩ denotes the inner product between vectors a, b ∈ R^{|I|}. We now turn to the main question for this paper: how can the cirf be used in a filter?
Cumulants are a function of the moments of a distribution, but can be easier to work with. The lowest-order cumulants are the mean and covariance.
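A minimal sketch of Proposition 1 for small k, assuming the Green's operator is available as a dense matrix G; the function name and the brute-force sum over permutations are mine.

```python
import numpy as np
from itertools import permutations

def cirf_cumulant(G, sites, eta):
    """k-th joint CIRF cumulant at the given sites (Prop. 1): eta times the sum, over all
    permutations of the sites, of products of Green's-operator entries along the permutation."""
    total = 0.0
    for perm in permutations(sites):
        total += np.prod([G[perm[t], perm[t + 1]] for t in range(len(perm) - 1)])
    return eta * total

# Example: mean and covariance entries of the prior.
# cirf_cumulant(G, [i], eta)      -> eta                (empty product over permutations)
# cirf_cumulant(G, [i, j], eta)   -> eta * (G[i, j] + G[j, i])
```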
Fig. 2. Modeling images of curves with the cirf prior. The contours in these images (left: angiogram (top) and surface of a Jupiter moon (bottom)) are primarily characterized by their number (N ), smoothness (σκ−1 ), and extent (λ). Given only noisy local measurements m = (mi ), i = (x, y, θ), (center: logical/linear line operator responses [10]), we seek to enhance contours using a prior on images of curves. Several random samples of the cirf prior in R2 ×S (right: various parameter settings) resemble (up to N , σκ , and λ) the qualitative structure of the local responses, but without the corruption. These random samples of the prior are intended only to reflect the “contourness” of the local responses, not the particular contours, which are modeled by the posterior. At each (x, y), the maximum over θ was taken for both local responses and the cirf. The goal is to bias the enhancement method toward contour images (as opposed to pebble or sand texture images, for example).
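The prior samples shown in Fig. 2 can be imitated with a simple simulation of Definition 1; the unit-time Euler stepping, the wrap-around boundaries, and the parameter names below are assumptions, not the authors' sampler.

```python
import numpy as np

def sample_cirf(shape=(64, 64), n_directions=32, n_curves=20,
                sigma_kappa=0.1, mean_length=100, rng=np.random.default_rng(0)):
    """Accumulate occupancy (local time) of simulated direction-process particles (Def. 1)."""
    H, W = shape
    U = np.zeros((H, W, n_directions))
    dtheta = 2 * np.pi / n_directions
    for _ in range(n_curves):
        x, y = rng.uniform(0, W), rng.uniform(0, H)
        theta = rng.uniform(0, 2 * np.pi)
        T = rng.exponential(mean_length)            # exponentially distributed length
        for _ in range(int(T)):
            site = (int(y) % H, int(x) % W, int(theta / dtheta) % n_directions)
            U[site] += 1.0                          # one unit of time spent in this site
            x += np.cos(theta)
            y += np.sin(theta)
            theta += sigma_kappa * rng.normal()     # Brownian perturbation of direction
    return U
```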
3 Estimating Noisy CIRFs
3.1 The Likelihood
The likelihood represents how our ideal contour image U relates to the measurements M, characterizing the corruptions in local edge and line operator responses. We focus on noise contamination in the measurements because (1) edge/line detection requires differentiation, which amplifies noise, and (2) the local signal-to-noise ratio (snr) drops, for fixed sensor noise, when the local contrast (i.e., signal) is low. For concreteness and simplicity, our measurement model is M = U + N, where N is white Gaussian noise with zero mean and variance σ_N², and the likelihood is

p(m|u) = \left(\sqrt{2\pi}\,\sigma_N\right)^{-|I|} \exp\!\left(-\frac{\sigma_N^{-2}}{2}\,\|m - u\|^2\right).    (1)
By a change in variables in m, other observation models with i.i.d. noise can be included, including those estimated from histograms of edge/line operator responses both on and off the actual contours [6,2].
3.2 The Posterior
Using Bayes' rule, the posterior distribution P(du|m) is proportional to p(m|u) times P(du), where du is an infinitesimal region around realization u. The cirf prior is written P(du) := P{U ∈ du} because a density for U exists only in a generalized sense². Expanding the norm in the likelihood (1), the posterior therefore becomes

P_\varepsilon(du) := Z_\varepsilon^{-1} \exp\!\left(\varepsilon\langle m - \tfrac{1}{2}u,\, u\rangle\right) P(du) = P(du|m),

where the subscript ε := σ_N^{-2} (the inverse noise variance) indicates conditioning on the measurements m, and the normalizing constant is

Z_\varepsilon := \int \exp\!\left(\varepsilon\langle m - \tfrac{1}{2}u,\, u\rangle\right) P(du).

Although we do not have an explicit expression for the prior P(du) (and so neither for the posterior), in §4 we shall find that we can indirectly use the prior through its cumulants in order to obtain Volterra filters.
3.3 Bayes Estimation
Filtering here means the estimation of the (unknown) cirf U given a noisy realization m. In Bayes decision theory, estimation is formalized by specifying the posterior and a loss function loss(U, u) that penalizes estimate u when the true unknown is U; the Bayes estimate is then u := arg min_u E[loss(U, u)|m]. Maximum a posteriori (map) estimation corresponds to the 0/1 loss function, which penalizes all errors equally; this loss is zero only when the estimate is perfect (at all sites i), and 1 otherwise. Despite the popularity of the map estimator, we believe an additive loss function such as the squared error ||U − u||² is more appropriate for high-noise filtering because it penalizes small errors at a few sites much less than large errors at many sites, unlike the 0/1 loss function (map). Therefore we define our filtering task to be the computation of the minimum mean squared error (mmse) estimate

u_\varepsilon = \arg\min_u \mathbb{E}_\varepsilon \|U - u\|^2,    (2)

where ε again denotes the conditioning on m. By a standard argument [5], the mmse estimate u_ε is equal to the posterior mean E_ε U of the cirf U.
Specifically, at each site i there is positive mass for ui = 0.
Since the posterior mean is generally difficult to compute, we seek simplifications. A typical approach would be to approximate the cirf prior with a Gaussian, giving rise to mmse linear filtering [2]. We avoid this method, however, because it ignores all the cirf cumulants beyond the mean and covariance. Instead, we focus on an aspect of our problem that we have not yet considered: extremely noisy conditions. In this way we can simplify our filtering problem tremendously, while exploiting the higher-order statistical structure of the cirf.
4 Volterra Filtering at Large Noise Levels
Volterra filters constitute one class of nonlinear filters that subsumes linear filters [23]. By definition, the output of a k-th order Volterra filter is a k-th degree polynomial³ function of the input signal. Analogous to the connection between mmse linear filters and second-order statistics, mmse Volterra filters are related to higher-order statistics. In particular, a mmse k-th order Volterra filter generally requires up to the (2k)-th order moments of the input and unknown [1,20]. Unfortunately, the computational complexity of simply applying the general k-th order Volterra filter is O(n^{k+1}), where n = |I|; solving for the mmse filter coefficients is much more costly in general. Here we derive linear, quadratic and cubic Volterra filters for approximating the posterior mean of the cirf corrupted by large amounts of white Gaussian noise. Our approach is based on the following observation: in such noisy conditions, the inverse noise variance ε will be small. This suggests that we consider the infinite noise limit where ε approaches zero; in particular, we perform a Taylor series expansion of the posterior mean E_ε U around ε = 0. We then exploit simplifying aspects of the cirf, especially the self-avoiding nature of the direction process, to obtain Volterra filters that run in O(n log n) time. Now we proceed with the derivation. Observe that the posterior mean can be computed by differentiating the logarithm of the normalizing constant, or

\frac{\partial \ln Z_\varepsilon}{\partial(\varepsilon m_i)} = Z_\varepsilon^{-1} \int \frac{\partial \exp(\varepsilon\langle m - \tfrac{1}{2}u, u\rangle)}{\partial(\varepsilon m_i)}\, P(du) = \int u_i\, P_\varepsilon(du) = \mathbb{E}_\varepsilon U_i.    (3)

Therefore, to get a Taylor series for the posterior mean, we need only get a Taylor series for ln Z_ε and then differentiate with respect to εm_i. Where do we get a Taylor series for ln Z_ε? Observe that the normalizing constant Z_ε is just the moment generating functional of the random vector W = ⟨m − ½U, U⟩ in disguise, or, equivalently, W has the cumulant generating functional cgf_W(ε) such that

\ln Z_\varepsilon = \ln \int \exp(\varepsilon w)\, P(du) = \ln \mathrm{mgf}_W(\varepsilon) = \mathrm{cgf}_W(\varepsilon).
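The identity (3) can be checked numerically by reweighting prior samples; this Monte Carlo sketch is only an illustration (the paper never forms the posterior mean this way), and the variable names are mine.

```python
import numpy as np

def posterior_mean_mc(prior_samples, m, eps):
    """Approximate E_eps[U] by importance-weighting prior samples u with exp(eps*<m - u/2, u>)."""
    U = np.asarray(prior_samples)                     # shape (n_samples, n_sites)
    logw = eps * np.einsum('ij,ij->i', m - 0.5 * U, U)
    w = np.exp(logw - logw.max())                     # stabilize before normalizing
    return (w[:, None] * U).sum(axis=0) / w.sum()
```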
Note the distinction between the (nonlinear) polynomial filters here and those linear filters whose kernel is based on some (e.g., Hermite) polynomial.
But the coefficients in a Taylor series expansion of a cumulant generating functional are by definition the cumulants, and therefore we need to obtain the cumulants of W, a (quadratic) polynomial in U. We now only sketch the rest of our derivation, as the full version takes 20 pages [2]. Since the cumulants of W are not directly known, we computed them in terms of McCullagh's so-called generalized cumulants (an algebraic structure that eases computations of polynomials of moments and cumulants) of U [15]. We then differentiated with respect to εm_i to get the first few terms in the Taylor expansion of E_ε U_i as a function of the generalized cumulants of U. Besides spatial homogeneity, two properties of the cirf played a significant role in simplifying our calculations. First, the contour density η is small (0 ≤ η ≪ 1) in a large number of images, as contours constitute a small fraction of the total image (which is why pen ink does not run out quickly). In the continuum, an extreme form of this claim certainly holds: the bounding contours in an image of smooth surfaces in an occlusion relationship constitute a set of measure zero (with respect to the standard area measure in the plane). This low density is even more appropriate for the cirf based on the direction process, where the curves live in a three-dimensional space. The low density of the cirf was crucial for simplifying McCullagh's formula for expressing generalized cumulants in terms of a sum of products of (ordinary) cumulants: we neglect all terms which are a product of two or more cirf cumulants (having magnitude O(η²) by Prop. 1). Thus the low density of the cirf allowed us to express the posterior mean as a weighted O(η)-sum of cumulants of U, each of the form \sum_{i,j,k,l,\ldots} \mathrm{cum}_{r,i,j,k,\ldots}\, m_i m_j \cdots \delta_{kl} \cdots. The second cirf property that simplified our derivation is the approximate self-avoidingness of the direction process. If our Markov processes were perfectly straight, they would never return to their starting points, even if infinitely long. We observe in Fig. 3 that the direction process rarely returns to its starting position. Many permutations in cumulants of the cirf (Prop. 1) could therefore be neglected. To support the error-prone task of explicitly summing over the permutations in the cirf cumulants in the O(η)-sum for the posterior mean, we constructed a diagram corresponding to each permutation (Fig. 4). Nodes indicate sites, arrows indicate ordering in the permutation, and multiplicative constants such as η are temporarily suppressed. In Fig. 4a, permutation (r, i) of term \sum_i \mathrm{cum}_{r,i} m_i is \sum_i g_{ri} m_i = (Gm)_r, or "G acts on input m, and the result is evaluated at readout r" (G is understood for arrows away from r, G* is understood for arrows toward r, and "X" represents an input m at a node). Using the diagram we see that the reverse permutation (i, r) gives rise to G*m (evaluated at r). In Fig. 4b, permutation (j, i, r) of term \sum_{i,j} \mathrm{cum}_{r,i,j} m_i m_j is G* diag(m) G* m, where diag(m) is a diagonal matrix with m along the diagonal; the reverse permutation (r, i, j) is G diag(m) Gm using the reversed diagram. In Fig. 4c, permutation (r, i, j, k) of term \sum_{i,j,k} \mathrm{cum}_{r,i,j,k} m_i \delta_{jk} has a loop, which represents contour intersection (at node j = k) and arises from the Kronecker delta δ_jk, which in turn comes from the quadratic term in W. Terms that have larger loops can require taking
Fig. 3. (Left) The Green’s operator of the direction process (see §2); brightness indicates average time spent at a site, given that process started near lower left, traveling diagonally to the upper right (parameters: σκ = 1/24 with 44 directions; λ = 100 throughout this paper). (Right) The self-avoidingness of the direction process: a comparison of average number of returns to starting location for various Markov processes (chains), plotted as a function of total contour length. The average number of returns is at least one because the first visit is included. Unlike random walks on the line, in the plane, or even in space, the average number of returns of the direction process (“plane with orientation”) does not increase with contour length; the direction process is therefore (approximately) self-avoiding. Spatial stepsize (∆x, etc.) is one, except ∆θ = 2π/(number of directions); also, σκ ≈ ∆θ throughout this paper unless specified explicitly.
componentwise products of operators such as G ⊙ G*, and are neglected due to the self-avoidingness of the direction process. See [2] for details. These diagrams are loosely analogous to Feynman diagrams, which support perturbation calculations in statistical physics and quantum field theory. Independently of this work, Ruderman [22, pp. 29–33] used Feynman diagrams to calculate the posterior mean of an underlying signal observed in Gaussian white noise, where the perturbation is in terms of the non-Gaussian underlying signal. In contrast, the perturbation used here is due to the Gaussian noise; the underlying non-Gaussian signal (the cirf) is modeled exactly.
Fig. 4. Diagrams dramatically simplified the cumulant permutation computations in the lengthy derivation of the Volterra cirf filters. See text.
Using this diagram-based technique, we obtained linear, quadratic, and cubic Volterra filters which asymptotically approximate the posterior mean of the cirf. Let a ⊙ b denote the vector which is the componentwise product of vectors a and b, i.e., (a ⊙ b)_i = a_i b_i.

Result 1 (High-noise MMSE Volterra CIRF Filters). Suppose that the curve indicator random field U (for approximately self-avoiding Markov processes) is corrupted with additive Gaussian white noise of variance σ_N² =: ε^{-1} to produce measurement vector m. Let ζ := g_{ii}λ, where λ is the average curve length. Then the minimum mean squared error estimate of U given m has the following approximate asymptotic expansion as σ_N → ∞ (ε → 0):

Linear filter:
\tilde u^{(1)} = \eta\{1 - 2\varepsilon\zeta + \varepsilon(Gm + G^*m)\} + O(\eta\varepsilon^2 + \eta^2\varepsilon)    (4)

Quadratic filter:
\tilde u^{(2)} = \eta\{1 - 2\varepsilon\zeta + \tfrac{3}{2}\varepsilon^2\zeta^2 + \varepsilon(1 - 2\varepsilon\zeta)(Gm + G^*m) + \varepsilon^2(G\,\mathrm{diag}(m)\,Gm + Gm \odot G^*m + G^*\,\mathrm{diag}(m)\,G^*m)\} + O(\eta\varepsilon^3 + \eta^2\varepsilon)    (5)

Cubic filter:
\tilde u^{(3)} = \eta\{1 - 2\varepsilon\zeta + \tfrac{3}{2}\varepsilon^2\zeta^2 - \tfrac{4}{3}\varepsilon^3\zeta^3 + \varepsilon(1 - 2\varepsilon\zeta + \tfrac{3}{2}\varepsilon^2\zeta^2)(Gm + G^*m) + \varepsilon^2(1 - 2\varepsilon\zeta)(G\,\mathrm{diag}(m)\,Gm + Gm \odot G^*m + G^*\,\mathrm{diag}(m)\,G^*m) + \varepsilon^3(G\,\mathrm{diag}(m)\,G\,\mathrm{diag}(m)\,Gm + G\,\mathrm{diag}(m)\,Gm \odot G^*m + Gm \odot G^*\,\mathrm{diag}(m)\,G^*m + G^*\,\mathrm{diag}(m)\,G^*\,\mathrm{diag}(m)\,G^*m)\} + O(\eta\varepsilon^4 + \eta^2\varepsilon)    (6)

These filters compute quickly: by implementing the operator G in the Fourier domain [2], its computational complexity is O(n log n) for the direction process, where n = |I|. Since multiplying a diagonal matrix with a vector has O(n) complexity, the net complexity for all of these filters is therefore O(n log n) for the direction process. This is far better than for the general k-th order Volterra filters, which have O(n^{k+1}) complexity.
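A sketch of how (4) and (6) can be evaluated once the Green's operator is available as a black box; applyG and applyGstar (e.g., Fourier-domain implementations of G and G*), the flattened array layout, and the function name are assumptions.

```python
import numpy as np

def volterra_cirf_filters(m, applyG, applyGstar, eta, eps, zeta):
    """Linear and cubic high-noise CIRF filters of Result 1, with G supplied as a black box."""
    Gm, Gsm = applyG(m), applyGstar(m)
    u1 = eta * (1 - 2 * eps * zeta + eps * (Gm + Gsm))
    # Higher-order terms reuse G applied to elementwise products, since diag(m)v = m*v.
    GmG = applyG(m * Gm)                  # G diag(m) G m
    GsmG = applyGstar(m * Gsm)            # G* diag(m) G* m
    quad = GmG + Gm * Gsm + GsmG
    cubic = (applyG(m * GmG) + GmG * Gsm + Gm * GsmG + applyGstar(m * GsmG))
    u3 = eta * (1 - 2 * eps * zeta + 1.5 * (eps * zeta) ** 2 - (4 / 3) * (eps * zeta) ** 3
                + eps * (1 - 2 * eps * zeta + 1.5 * (eps * zeta) ** 2) * (Gm + Gsm)
                + eps ** 2 * (1 - 2 * eps * zeta) * quad
                + eps ** 3 * cubic)
    return u1, u3
```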
5 Results
Here we apply our high-noise mmse Volterra cirf filters to some noisy synthetic and real images. To test robustness, we evaluated responses to an image of a horizontal straight line under varying noise variance σ_N². For the synthetic images in this paper, we mapped from the planar image to a discrete R² × S by copying the image to each direction, so that our measurements m = m(x, y, θ) were constant in θ. Noise was quantified using the peak signal-to-noise ratio (snr), or
10 log_{10}[(I_max − I_min)²/σ_N²] (in decibels), for an image I having minimum value I_min and maximum I_max. The 0 dB result is shown in Fig. 5 (left), showing significant noise cleaning (at each (x, y), the maximum response over θ is shown). The error between the noisy and noise-free responses (Fig. 5, right) shows improvement for increasing order of nonlinearity.
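For reference, the peak SNR used throughout the experiments is, under the definition above:

```python
import numpy as np

def peak_snr_db(image, sigma_n):
    """Peak SNR (dB) of an image with additive noise of standard deviation sigma_n."""
    return 10.0 * np.log10((image.max() - image.min()) ** 2 / sigma_n ** 2)
```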
Fig. 5. Noisy image of a line (top left) is cleaned via linear, quadratic, and cubic cirf filters (second from top to bottom, respectively; parameters: ε = 10, ζ = 0, σκ = 1/10, with 64 directions). (Right) Noise performance of the high-noise mmse Volterra filters in R² × S. Plotted is the angle between the mean-shifted noisy and noise-free filter responses for each filter type and at each noise level. Angle is measured between two vectors in R^{|I|}. Notice the advantage of the cubic filter over the others.
Next we tested the filter's response on both open and closed curves (Fig. 6). Observe that the response is well resolved at the contour crossings, even though we have no initial estimate of local direction. Fig. 7 is a more strenuous test at contour crossings. Unlike isotropic, nonhomogeneous, or even anisotropic diffusion, the Volterra filter performs smoothing along even overlapping contours, and therefore the preservation of junctions is possible. In addition, quite powerful orientation-selectivity is apparent (fixed θ slices of the response are shown in the bottom rows). We studied the effect of parameter variations in Fig. 8. For most examples, we let ζ ≥ 0 be small (usually 0) and let ε > 0 be about 10, although the exact settings did not appear to have a great effect. To apply the Volterra filters to real images, we set the filter input m to be the (directional) output of a logical/linear operator (with parameter degree
Fig. 6. Filtering a noisy image of a circle with a diagonal line (top left, peak snr=4.2 dB) with linear and cubic cirf filters (responses at top center and right, resp.). The cubic response was thresholded and the resulting isosurface in R² × S is shown from different viewpoints (bottom, with θ ∈ [0, π)). The filter is orientation selective and separates the contours in R² × S. Parameters were ε = 10.5, ζ = 0, σκ = 1/5, with 32 directions.
d = 0.1 in Iverson's implementation⁴, and other parameters at their defaults), instead of copying over direction. For the ship's wake in Fig. 9 (top left), we zoom in on the highlighted portion (top right) and then obtain logical/linear negative-contrast line responses and their thresholding (second row). Responses to the high-noise Volterra filters (ε = 10.5, ζ = 0, σκ = 1/15, λ = 100, and 96 directions) are shown in the bottom three rows of Fig. 9. Observe how the many bogus responses in the thresholded local responses are not present in the quadratic and cubic cirf filter responses. A prostate boundary is enhanced with the cubic cirf filter in Fig. 10. Returning to our opening example of a noisy guide wire image (Fig. 1), we see that the cirf Volterra filters reduce noise while enhancing the guide wire (Fig. 11).
6 Conclusion
In this paper we took a Bayesian estimation approach to contour enhancement at large noise levels, where we used an exact prior for images of curves, the cirf. 4
Code and settings available at http://www.ai.sri.com/˜leei/loglin.html.
Fig. 7. Crossing lines are teased apart with the cubic cirf filters. From left to right in top row: original image, image corrupted with additive white Gaussian noise (peak snr=8 dB), and cubic cirf filter response (parameters were ε = 10.5, ζ = 0, σκ = 1/10, with 64 directions). Observe the noise reduction without merging the nearby junctions, unlike Gaussian blurring (Fig. 1). The direction-specific (i.e., without taking the max over direction) cubic responses (middle row, left to right: θ = 0°, ..., 135°) and the corresponding thresholded responses (bottom row) reveal that the initially strong responses at the inappropriate directions are chiseled away by the action of the filter.
Fig. 8. The effect of parameter variations. Since η amounts to a simple threshold, we only varied ε and ζ in the cubic cirf filter (6) applied to Fig. 6 (top left), as follows: ζ = 0 (a,b,c), 1 (d,e,f), 50 (g,h,i); ε = 0.5 (a,d,g), 10.5 (b,e,h), 50 (c,f,i). At large ζ, cubic responses (g,h,i) resembled the linear response (Fig. 6, center); otherwise, the cubic cirf filter was insensitive to parameter changes.
Exploiting both the high-noise limit and the approximate self-avoidingness of the direction process model of random, smooth curves, we obtained mmse Volterra filters for contour enhancement that run in O(n log n) time, where n is the size of the input. These filters perform well even for crossing or nearby curves; thus they might be interpreted as an extension of anisotropic diffusion ideas in which multiple local directions are allowed.
[Fig. 9 panel labels: Image (Original, Zoom); Local Responses (Response, Thresholding); Filter (Linear, Quadratic, Cubic)]
Fig. 9. Finding a ship’s wake (see text for explanation).
Fig. 10. To find the boundary of a prostate imaged with ultrasound (left), we obtained blurred, directed local edge measurements (center, region around transducer and upper corners were cropped out), and then applied the cubic cirf filter (right).
[Fig. 11 panel labels: Linear, Quadratic, Cubic; columns: Volterra cirf Filter Responses, Thresholded Responses]
Fig. 11. Finding a guide wire in the image of Fig. 1. Responses to the high-noise Volterra cirf filters (ε = 10.5, ζ = 0, λ = 100, σκ = 1/10, with 64 directions). Observe how the cubic filter at σκ = 1/10 enhances the guide wire, by increasing, not reducing, contrast (cf. Fig. 1). To produce these two-dimensional images from the actual discrete (x, y, θ)-space responses, the maximum over θ was taken.
References 1. P. Alper. A consideration of discrete volterra series. IEEE Transactions on Automatic Control, 10:322–327, 1965. 2. J. August. The Curve Indicator Random Field. PhD thesis, Yale University, 2001. 3. J. August and S. W. Zucker. A markov process using curvature for filtering curve images. In Energy Minimization Methods for Computer Vision and Pattern Recognition, 2001. 4. J. Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8:679–698, 1986. 5. G. Casella and R. L. Berger. Statistical Inference. Duxbury, Belmont, CA, 1990. 6. D. Geman and B. Jedynak. An active testing model for tracking roads in satellite images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(1):1– 14, 1996. 7. S. Geman and D. Geman. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741, 1984. 8. E. R. Hancock and J. Kittler. Edge-labeling using dictionary-based relaxation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(2):165–181, 1990. 9. B. K. P. Horn. The Curve of Least Energy. ACM Transactions on Mathematical Software, 9:441–460, 1983. 10. L. A. Iverson. Toward Discrete Geometric Models for Early Vision. PhD thesis, McGill University, Montreal, 1994. 11. S. N. Kalitzin, B. M. ter Haar Romeny, and M. A. Viergever. Invertible orientation bundles on 2d scalar images. In Proc. Scale-Space ’97, LICS, pages 77–88. Springer, 1997. 12. J. J. Koenderink. The structure of images. Biological Cybernetics, 50:363–370, 1984. 13. M. Lindenbaum and A. Berengolts. A probabilistic interpretation of the saliency network. In European Conf. on Computer Vision, 2000. 14. J. L. Marroquin. A markovian random field of piecewise straight lines. Biological Cybernetics, 61:457–465, 1989. 15. P. McCullagh. Tensor Methods in Statistics. Monographs in Statistics and Applied Probability. Chapman and Hall, Cambridge, England, 1987. 16. D. Mumford. Algebraic Geometry and Its Applications, chapter Elastica and Computer Vision, pages 491–506. Springer-Verlag, 1994. 17. C. L. Nikias and A. P. Petropulu. Higher-Order Spectra Analysis: A Nonlinear Signal Processing Framework. Prentice Hall, Englewood Cliffs, NJ, 1993. 18. P. Parent and S. W. Zucker. Trace inference, curvature consistency, and curve detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(8):823–839, August 1989. 19. P. Perona and J. Malik. Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(7):629–39, jul 1990. 20. B. Picinbono and Y. Gu. Mean square estimation and projections. Signal Processing, 19:1–8, 1990. 21. A. Rosenfeld, R. A. Hummel, and S. W. Zucker. Scene labeling by relaxation operations. IEEE Transactions on Systems, Man, and Cybernetics, 6(6):420–433, 1976.
22. D. L. Ruderman. Natural Ensembles and Sensory Signal Processing. PhD thesis, U. C. Berkelely, 1993. 23. M. Schetzen. The Volterra and Wiener Theories of Nonlinear Systems. Wiley, New York, 1980. 24. A. Sha’ashua and S. Ullman. Structural saliency: The detection of globally salient structures using a locally connected network. In Proceedings, Second International Conference on Computer Vision, 1988. 25. K. Thornber and L. Williams. Analytic solution of stochastic completion fields. Biological Cybernetics, 75:141–151, 1996. 26. J. Weickert. Anisotropic diffusion in image processing. ECMI. Teubner, Stuttgart, 1998. 27. L. Williams and D. Jacobs. Stochastic completion fields: A neural model of illusory contour shape and salience. Neural Computation, 9(4):837–858, 1997. 28. S. W. Zucker, R. Hummel, and A. Rosenfeld. An application of relaxation labelling to line and curve enhancement. IEEE Trans. Computers, C-26:393–403, 922–929, 1977.
Image Segmentation by Flexible Models Based on Robust Regularized Networks
Mariano Rivera¹,² and James Gee¹
¹ Department of Radiology, University of Pennsylvania, 3600 Market Street, Suite 370, Philadelphia, PA, USA 19104, {mrivera,gee}@grasp.cis.upenn.edu
² Centro de Investigacion en Matematicas A.C., Apdo. Postal 402, Guanajuato, Gto., Mexico 36020
[email protected] http://www.cimat.mx/˜mrivera
Abstract. The object of this paper is to present a formulation for the segmentation and restoration problem using flexible models with a robust regularized network (RRN). A two-step iterative algorithm is presented. In the first step an approximation of the classification is computed by using a local minimization algorithm, and in the second step the parameters of the RRN are updated. The use of robust potentials is motivated by (a) classification errors that can result from the use of local minimizer algorithms in the implementation, and (b) the need to adapt the RN using local image gradient information to improve fidelity of the model to the data.
Keywords. Segmentation, Restoration, Edge-preserving Regularization.
1 Introduction
Image segmentation from different cues (such as gray level, local orientation or frequency, texture, and motion) is typically formulated as a clustering problem, where the consideration of spatial interactions among pixel labels provides additional, useful constraints on the problem. Thus, in order to classify a pixel, one takes into account the observed value (which is generally noisy) and the classification of neighboring pixels. Specifically, let g be the image that is to be segmented into K classes according to its gray level; then, in the context of Markov random fields [1], the classification c and the means of the gray levels µ = {µ₁, µ₂, ..., µ_K} can be computed by minimizing a cost functional of the general form:

U(c, \mu) = \sum_{r\in L}\Big[ V_1(\mu_{c_r} - g_r) + \lambda \sum_{s\in N_r} V_2(c_r, c_s) \Big],    (1)

where r indicates the position of a pixel in the regular lattice L, c_r ∈ C (with C = {1, 2, ..., K}) is the classification for pixel r, µ_q is the centroid of class q, and the neighborhood N_r = {s : |s − r| = 1}. Functional (1) has two terms, their
contribution to the total cost controlled by the positive parameter λ. Within the context of Bayesian regularization, the first term corresponds to the negative of the log-likelihood and promotes fidelity of the segmentation to the observed data. Assuming Gaussian noise, for instance, V₁(x) = |x|². The second term, the regularization term, encodes our a priori knowledge that the images contain spatially extended regions. To achieve this, the potential of the regularization term is constructed to penalize differences in the classification of pixel r with respect to the pixels within its neighborhood N_r, one of the most commonly used being the Ising potential [3]:

V_2(x, y) = 1 - 2\delta(x - y),    (2)
where δ(·) ∈ {0, 1} is the Kronecker delta function. The trouble with the Ising potential (2), however, is that it leads to a combinatorial optimization problem. Stochastic algorithms that can minimize (1), such as simulated annealing [2][3], are extremely time consuming. In contrast, fast deterministic algorithms such as Iterated Conditional Modes (ICM) [4] are easily trapped by "bad" local minima. Consequently, investigators have proposed fast segmentation schemes that produce "good results with reasonable computational cost" based on approximations of the cost functional (1) or variants of deterministic minimization algorithms; for example, techniques such as Mean Field theory [5][6][7], Multigrid-ICM [8], Relaxation Labeling [9][10], Annealing-ICM [11], or more recently Gauss-Markov Measure Fields [12]. In the early work of Darrell and Pentland [13], simple multi-layered models (based on polynomials) are used to represent and segment smooth piecewise gray scale images or optical flow fields for regions containing multiple motions. More recently, other authors have reported different approaches toward or applications of layered models. In the work of Black et al. [14], models were proposed for computing the optical flow from a pair of images under changes of form, illumination, iconic content (those produced by pictorial changes in the scene) and specular reflections. Specifically, affine transformations or linear combinations of basis displacement fields model the motion in the regions. The model parameters and the membership weights (that define the support region for each model) are computed by using the Expectation-Maximization (EM) algorithm. Recently, Marroquin et al. [15] introduced the MPM-MAP algorithm, an EM-like algorithm in which the classification is computed by approximating the MPM estimator using Gauss-Markov measure field models [12] and applying parametric models based on finite element basis functions. The resultant models have greater flexibility for capturing the inherent geometry of the images. Samson et al. [16] have reported a variational approach to classification (segmentation) and restoration, in which robust multimodal potentials are used to perform piecewise smooth restoration of images. The segmentation is automatically obtained by detecting the valleys (local minima) within which the restored pixels lie. The minimization of the corresponding energy functional is achieved by means of an efficient deterministic algorithm. However, it is not clear how spatially varying classes can be introduced, which may be required to represent, for example, illumination gradients.
This work builds on the aforementioned methods and recasts them in a general variational framework. We develop a formulation for the segmentation and restoration problem using flexible models within a robust regularized network (RRN). Without loss of generality, the technique (based on an iterative algorithm) is presented in the context of image segmentation based on gray level intensity. In the first step of the iterative algorithm, an approximation of the classification is computed by using a local minimization algorithm. In the second step, the parameters of the RRN are updated. The use of robust potentials is necessary for two reasons, which will be developed in full in the remainder of the text: 1. To account for errors in the classification step that result from the use of a local minimizer. 2. To improve the fidelity of the segmentation to the data, the flexibility of the RN must be adapted according to the local image gradient. The plan of the paper is as follows. In the next section, we present a general framework for adaptive flexible models. The formulation of our models is based on Regularized Networks (RN) [17] with a robust potential for the data term and a quasi-robust potential for the regularization term [18]. Then, in subsection 2.2, the complete cost functional is described. Next, a general iterative minimization algorithm is proposed. In subsection 2.4, details of the implementation are discussed. Numerical experiments that demonstrate the performance of the method are presented in section 3. Finally, our conclusions are given in section 4.
2 Segmentation and Restoration with Flexible Models
In order to motivate our cost functional, first let us assume that the classification is given and it contains some errors as a product of noisy data or inexact models. The formulation of the energy functional for a specific class must therefore take into account that the data may be corrupted by outliers. Then, based on the introduced robust models, the complete functional for classification and restoration is presented. For minimizing the cost functional (which by construction is bounded by zero), we propose a deterministic iterative algorithm which alternates between two steps: updates of the classification (by means of a local minimization); and updates of the restoration (by means of a half-quadratic minimization scheme [20][21][22][23]).
2.1 Flexible Models Based on Robust Regularized Networks (RRN)
Without loss of generality, the algorithm is presented in the context of image segmentation based on gray intensity level. First, we note that an image can be viewed as an ensemble of regions each with smooth intensity gradient (i.e., we are assuming a piecewise smooth image model), so the parametric model that we use for representing the data in each region must be consistent with this
assumption. For representing the restored data φ that belongs to the class ĉ, we choose a model based on a regularized network [17]. Given a class c = ĉ, the associated restored data φ is computed as the minimizer of the cost functional:

R(\theta) = \sum_{r\in L:\, c_r=\hat c} \rho_1\big(\phi_r(\theta_{\hat c}, \hat c) - g_r;\, k_1\big) + \gamma\, P\big(\phi(\theta_{\hat c}, \hat c)\big),    (3)
where P is a regularization term over the model whose contribution to the total cost is controlled by the parameter γ, and φ_r(θ_ĉ, ĉ) is the computed value for pixel r (that belongs to class ĉ) with corresponding parameter θ_ĉ:

\phi_r(\theta_{\hat c}, \hat c) = \sum_j \theta_{\hat c, j}\, \Phi_j(r),    (4)
where Φ_j(r) represents the j-th interpolation function evaluated at the pixel r, whose contribution is controlled by the parameter θ_{ĉ,j}. ρ₁ is a robust potential function for outlier rejection and k₁ is the associated positive scale parameter [20][22][23][24][18][25]. In section 2.4 (Implementation Details), we specify the robust potentials ρ₁ and ρ₂ and the form of the interpolation functions Φ. Examples of models with the form (4) include Radial Basis Functions (RBF) or Additive Splines [17][26][27]. We use a robust potential in the data term because the classification field c_r may be corrupted by outliers as a product of noisy data or inexact models. The classical (L2) regularized network of RBF corresponds to choosing ρ₁(x) = x². For choosing the form of the regularization term P, we take into account that: 1. The data term in (3) is only evaluated at the pixels that belong to class ĉ. Therefore, the regularization term P allows us to interpolate over missing data (those sites that do not belong to ĉ). 2. Regions may have an inhomogeneous smoothness. Therefore, we propose the following regularization term P in (3):

P\big(\phi(\theta, \hat c)\big) = \sum_{r\in L:\, c_r=\hat c}\ \sum_{s\in N_r:\, c_s=\hat c} \rho_2\big(\phi_r(\theta_{c_r}, c_r) - \phi_s(\theta_{c_s}, c_s);\, k_2\big) + \mu \sum_{r\in L}\sum_{s\in N_r} \big(\phi_r(\theta_{\hat c}, \hat c) - \phi_s(\theta_{\hat c}, \hat c)\big)^2,    (5)
where the first term in (5) uses a robust potential function ρ_2 (with scale parameter k_2) over the discrete version of the gradient of φ at those sites that belong to ĉ. This allows the model to adapt its stiffness according to the data. The second term implements a quadratic potential over the gradient of φ on the whole domain. The quadratic potential has two effects: it performs a smooth interpolation (or extrapolation), and it avoids an over-relaxation of the stiffness of the adaptive model. The trade-off between the two terms is controlled by the positive parameter µ.
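As a minimal illustration of how (3)-(5) can be evaluated in practice, the following Python sketch computes the robust data term and the mixed robust/quadratic regularizer of one class on a regular image lattice. The helper names, the 4-neighbourhood (with each neighbouring pair counted once), and the use of the Geman-McClure potential (introduced only later, in section 2.4) are assumptions made for the sketch, not the authors' implementation.

    import numpy as np

    def geman_mcclure(x, k):
        """Robust potential rho(x; k) = (x/k)^2 / (1 + (x/k)^2)."""
        t = (x / k) ** 2
        return t / (1.0 + t)

    def rrn_energy(phi, g, mask, gamma, mu, k1, k2):
        """Energy R(theta) of eqs. (3)-(5) for one class.

        phi  : restored image for the class (H x W array)
        g    : observed image (H x W array)
        mask : boolean array, True where the pixel is assigned to the class
        """
        # Data term: robust fidelity, evaluated only on pixels of the class.
        data = geman_mcclure(phi[mask] - g[mask], k1).sum()

        # Discrete gradients over the 4-neighbourhood (right and down neighbours).
        dx = phi[:, 1:] - phi[:, :-1]
        dy = phi[1:, :] - phi[:-1, :]

        # Robust smoothness only where both neighbours belong to the class ...
        in_x = mask[:, 1:] & mask[:, :-1]
        in_y = mask[1:, :] & mask[:-1, :]
        robust = geman_mcclure(dx[in_x], k2).sum() + geman_mcclure(dy[in_y], k2).sum()

        # ... plus a quadratic membrane over the whole domain (interpolation term).
        quad = (dx ** 2).sum() + (dy ** 2).sum()

        return data + gamma * (robust + mu * quad)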
2.2
Cost Functional for Restoration and Segmentation Based on Robust Regularized Network (RRN) Models
Given the flexible models (3), (4) and (5), we can specify the complete cost functional for restoration and segmentation:

U_R(c, \theta) = \sum_{r \in L} \left[ H(c_r, \theta_{c_r}; C_r, \Theta_r) + \lambda_3 \sum_{s \in N_r} \sum_{q \in C} \left(\phi_r(\theta_q, q) - \phi_s(\theta_q, q)\right)^2 \right], \qquad (6)

where H(c_r, θ_{c_r}) is a local energy functional:

H(c_r, \theta_{c_r}; C_r, \Theta_r) = \rho_1\!\left(\phi_r(\theta_{c_r}, c_r) - g_r;\, k_1\right) + \sum_{s \in N_r} \Big[ \lambda_1 \left(1 - \delta(c_r - c_s)\right) + \lambda_2\, \rho_2\!\left(\phi_r(\theta_{c_r}, c_r) - \phi_s(\theta_{c_r}, c_r);\, k_2\right) \delta(c_r - c_s) \Big], \qquad (7)

C_r = {c_s : s ∈ N_r} and Θ_r = {θ_{c_s} : s ∈ N_r}, and λ_1, λ_2 and λ_3 redefine the regularization parameters λ, γ and µ. Functional (6) was derived by assuming that the classification was given. Note that in order to compute the minimization with respect to the classification c (keeping θ fixed), we need to take into account an additional regularization term [weighted by λ_2 in (7)] that does not appear in the original functional (1). We can understand the effect of this new term as follows: the first term [in (7)] penalizes unconditionally the granularity of the classification (an over-segmentation), while the second term [in (7)] promotes a change of data model only if the gradient of the model φ is large enough (via the robust potential ρ_2).
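To make the role of the terms in (7) concrete, the following sketch evaluates the local energy at a single pixel; the argument layout (neighbours passed as a list of label/model-value pairs, robust potentials passed as callables) is an assumption made for the illustration.

    def local_energy(c_r, phi_r, g_r, neighbours, rho1, rho2, lam1, lam2, k1, k2):
        """Local energy H of eq. (7) at one pixel.

        c_r        : label of pixel r
        phi_r      : value of the class-c_r model at pixel r
        g_r        : observed gray level at pixel r
        neighbours : list of (c_s, phi_s) pairs, where phi_s is the value of the
                     same class-c_r model at neighbour s
        rho1, rho2 : robust potentials, e.g. the Geman-McClure function of section 2.4
        """
        h = rho1(phi_r - g_r, k1)                    # robust data term
        for c_s, phi_s in neighbours:
            if c_s != c_r:
                h += lam1                            # Potts-like penalty on label changes
            else:
                h += lam2 * rho2(phi_r - phi_s, k2)  # robust intra-region smoothness
        return h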
2.3
Minimization Algorithm
The computation of the global minimum of (6) (the segmentation c* and the parameters of the restoration defined by the set of models θ*) corresponds to

\{c^*, \theta^*\} = \arg\min_{c, \theta} U_R(c, \theta). \qquad (8)

An effective approximation is to alternately minimize (6) with respect to c and θ by iterating between the following two steps until convergence (given an initial guess for the model parameters θ^t and setting t = 0):
1. Update the segmentation c, keeping the restoration θ^t fixed:
   c^{t+1} = \arg\min_{c \in C} U_R(c, \theta^t).
2. Update the restoration θ, keeping the segmentation c^{t+1} fixed:
   \theta^{t+1} = \arg\min_{\theta} U_R(c^{t+1}, \theta),
   and set t = t + 1.
The combinatorial optimization implied in step 1 can be performed by stochastic algorithms (SA) such as simulated annealing or Gibbs sampling. However, it is well known that stochastic methods are extremely time consuming, and for this reason deterministic approaches have been preferred (e.g., Iterated Conditional Modes, Mean Field Annealing, Label Relaxation, and others [1]), in spite of the fact that only the computation of a local minimum is guaranteed. The use of heuristics, such as multigrid algorithms, graduated non-convexity, or the provision of a good initial guess, reduces the risk of being trapped by a "bad" local minimum. Thus, good quality results can be computed in a fraction of the time required by SA algorithms. With this in mind, we present a deterministic algorithm for minimizing (6). First, we note that for a given classification c, (6) is half-quadratic with respect to the restoration parameters θ [20][21][22][23]. Thus, in order to perform this minimization, we can use efficient algorithms (as reported in [18][23][28]) that guarantee convergence to a local minimum. Our iterative algorithm consists of the following two steps:
1. Update the segmentation c (keeping θ fixed) by performing a local optimization using ICM:
   c_r^{t+1} = \arg\min_{c_r \in C} H(c_r, \theta_r^t) \quad \text{for all } r \in L,
   where H is given in (7).
2. Update the parameters of the models θ (keeping c fixed) by using a half-quadratic minimization:
   \theta^{t+1} = \arg\min_{\theta} U_R(c^{t+1}, \theta),
   and set t = t + 1.
In practice, step 1 is performed more than once (in our experiments we performed three iterations), which guarantees that U_R(c^{t+1}, θ^t) ≤ U_R(c^t, θ^t). On the other hand, the minimization indicated in step 2 is not carried to completion: in order to guarantee convergence it is only necessary that U_R(c^{t+1}, θ^{t+1}) ≤ U_R(c^{t+1}, θ^t), so we performed only three iterations of the weighted Gauss-Seidel algorithm [18][23].
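A schematic version of this alternation is sketched below. The per-pixel energy of (7) and the weighted Gauss-Seidel (half-quadratic) update are passed in as callbacks, since their details depend on the chosen models; the function and argument names are assumptions made for the sketch and are not the authors' code.

    import numpy as np

    def segment_restore(g, labels, models, local_energy_at, update_theta,
                        n_outer=20, n_icm=3):
        """Alternating minimization of U_R in eq. (6), schematically.

        g               : observed image, shape (H, W)
        labels          : initial label image (same shape as g), modified in place
        models          : dict mapping class label -> model parameters theta_c
        local_energy_at : callable (r, q, g, labels, models) -> H of eq. (7)
        update_theta    : callable performing a few weighted Gauss-Seidel
                          (half-quadratic) sweeps on the model parameters
        """
        classes = list(models)
        for _ in range(n_outer):
            # Step 1: a few ICM sweeps over the labels, which guarantees
            # U_R(c^{t+1}, theta^t) <= U_R(c^t, theta^t).
            for _ in range(n_icm):
                for r in np.ndindex(*g.shape):
                    labels[r] = min(classes,
                                    key=lambda q: local_energy_at(r, q, g, labels, models))
            # Step 2: partial half-quadratic minimization over theta; a few sweeps
            # suffice to guarantee U_R(c^{t+1}, theta^{t+1}) <= U_R(c^{t+1}, theta^t).
            models = update_theta(g, labels, models)
        return labels, models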
2.4
Implementation Details
The use of the ICM algorithm for computing the segmentation step results in a significant computational time reduction. However, it also increases the number
of errors in the classification when compared to the use of a stochastic algorithm or other deterministic approaches, such as mean field theory based algorithms. Therefore, a potential term with outlier rejection needs to be used in the data term. On the other hand, we chose a quasi-robust potential for the regularization term [18], since we want the model to adapt its stiffness to smooth changes in the data while, at the same time, disallowing internal discontinuities within regions; strong discontinuities must instead result in a change of the model. The quasi-robust potentials ρ(t) correspond to (in general) non-convex potentials that grow at a slower rate than quadratic ones, and have the following characteristics [18]:
1. ρ(t) ≥ 0 for all t, with ρ(0) = 0.
2. ρ'(t) ≡ ∂ρ(t(f))/∂f exists.
3. ρ(t) = ρ(−t), i.e., the potential is symmetric.
4. ρ'(t)/(2t) exists.
5. lim_{t→+∞} ρ'(t)/(2t) = m and lim_{t→0+} ρ'(t)/(2t) = M, with 0 ≤ m < M < +∞.    (9)
Fig. 1. (a) Potential function corresponding to the quasi–robust regularization potential (1 − µ)t2 + µρ2 (t2 ; k2 ), and its (b) corresponding weight function.
Condition (9.1) establishes that a residual equal to zero produces the minimum cost. Condition (9.2) constrains ρ to be differentiable, so that one can use efficient deterministic algorithms for minimizing the cost function. Condition (9.3) constrains ρ to penalize positive and negative values of t equally. Finally, conditions (9.4) and (9.5) impose the outlier rejection condition (in particular, (9.5) corresponds to the quasi-robust condition [18]; for m = 0 the potential is robust).
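Condition (9.5) is easy to inspect numerically for a concrete choice. The sketch below uses a simple quadratic plus Geman-McClure mixture as an assumed stand-in for the potential of figure 1 (the exact form used there may differ) and prints finite-difference estimates of the two limits m and M of the weight function ρ'(t)/(2t).

    def gm(t, k):
        """Geman-McClure potential (t/k)^2 / (1 + (t/k)^2)."""
        u = (t / k) ** 2
        return u / (1.0 + u)

    def quasi_robust(t, k, mu):
        """Assumed quadratic + Geman-McClure mixture (an illustration only)."""
        return (1.0 - mu) * t ** 2 + mu * gm(t, k)

    def weight(t, k, mu, h=1e-6):
        """w(t) = rho'(t) / (2 t), with a central finite difference for rho'."""
        d = (quasi_robust(t + h, k, mu) - quasi_robust(t - h, k, mu)) / (2 * h)
        return d / (2 * t)

    # m = limit at infinity, M = limit at 0+: 0 <= m < M < +inf (condition 9.5)
    print(weight(1e4, k=1.0, mu=0.5))    # approximately m = 0.5
    print(weight(1e-4, k=1.0, mu=0.5))   # approximately M = 1.0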
Specifically, our method implements the Geman-McClure robust potential [19]:

\rho(x; k) = \frac{(x/k)^2}{1 + (x/k)^2}

for both the data and regularization potentials (ρ_1 and ρ_2, respectively). We recall that the combination of a hard-redescending robust potential (such as ρ) and a quadratic one results in a quasi-robust potential (see figure 1). We assume a smooth intensity variation for each class, so various parametric models are possible, such as interpolation polynomials (for example, Hermite or Lagrange), radial basis functions (RBF), splines, or, as in [27][17], the function |x|. In this work, because of their computational efficiency, we use the finite-support functions

\Phi(x, y) = \begin{cases} \psi(x)\,\psi(y) & \text{for } |x|, |y| \le 1 \\ 0 & \text{otherwise,} \end{cases}

where two choices for ψ are employed:
1. First-order Lagrange interpolation polynomials (which correspond to bilinear interpolation): ψ_1(t) = 1 − |t|, see figure 2(a).
2. The twice-differentiable Gaussian-like function: ψ_2(t) = 0.5 + 0.5 cos(πt), see figure 2(b).
Fig. 2. Interpolation functions Φ corresponding to (a) ψ_1(t) = 1 − |t| and (b) ψ_2(t) = 0.5 + 0.5 cos(πt).
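The two interpolation choices and the separable finite-support basis Φ translate directly into a short vectorized sketch (the mapping of pixel coordinates to node-centred units is left to the caller and is an assumption of the illustration):

    import numpy as np

    def psi1(t):
        """First-order Lagrange (bilinear) interpolation kernel."""
        return np.maximum(0.0, 1.0 - np.abs(t))

    def psi2(t):
        """Twice-differentiable cosine ('Gaussian-like') kernel."""
        t = np.asarray(t, dtype=float)
        return np.where(np.abs(t) <= 1.0, 0.5 + 0.5 * np.cos(np.pi * t), 0.0)

    def Phi(x, y, psi=psi1):
        """Separable finite-support basis Phi(x, y) = psi(x) psi(y) for |x|, |y| <= 1."""
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        inside = (np.abs(x) <= 1.0) & (np.abs(y) <= 1.0)
        return np.where(inside, psi(x) * psi(y), 0.0)

With the interpolation nodes placed on a regular grid (for instance the 17 × 17 grid used in the experiments of section 4), the model (4) is then a weighted sum of shifted copies of Φ.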
3
Related Work
Segmentation algorithms are, in general, implemented as iterative two-step procedures of the following form:
1. Compute a classification keeping the parameters of the models fixed.
2. Compute the parameters of the models keeping the classification fixed.
The specific implementation of each step depends on the algorithm. It is evident that a more accurate classification is obtained when the models on which the classification is based are themselves accurately estimated. In this work we compute the model parameters using robust methods and adapt the values according to intensity variations in the image.
Ref. [15] presents an MPM-MAP segmentation method in which the classification step is computed by maximizing the posterior marginals using a Gauss-Markov measure field (GMMF) [12]. A probability measure field representing the class membership of each pixel for every class is computed. The label field is then obtained by choosing the maximally probable class at each pixel. The advantage of the formulation is that the probability measure field can be computed using deterministic means, but at the cost of storing in memory the measure field for each class. The implementation in [15] uses a coarse membrane based on finite element (FE) basis functions to represent each class. Thus the parameters of the models are the nodal values of the FE model. The deformation energy of the membrane is used as the regularization term. The nodes of the FE-mesh are distributed in a regular lattice. Since linear interpolation functions are used, the model corresponds to the use of linear splines, or ψ_1 in our RN formulation. Specifically, the classification stage in the MPM-MAP was implemented as follows:

c_r = \arg\max_{q \in C} p_r^T(q), \qquad (10)

where p^T is the result at the T-th iteration of the Gauss-Seidel homogeneous diffusion over the likelihood p^0 (see [12]):

p_r^{t+1}(q) = \frac{1}{|N_r|} \sum_{s \in N_r} p_s^t(q) \quad \text{for } t = 0, 1, \ldots, T-1, \qquad (11)
where, by assuming Gaussian noise, p^0 is the likelihood of the data g given the models φ:

p_r^0(q) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( - \frac{\left(\phi_r(\theta_q, q) - g_r\right)^2}{\sigma^2} \right), \qquad (12)

where σ is a parameter (that in general depends on the class, i.e., σ = σ_q). Then, in the MAP step the models φ_r(θ_r, c_r) are updated by minimizing (3) with the regularization term

P(\phi(\theta, c)) = \sum_{q \in C} \sum_{r \in L} \sum_{s \in N_r} \left(\phi_r(\theta_q, q) - \phi_s(\theta_q, q)\right)^2. \qquad (13)
The parameters of the algorithm are the variance of the noise σ², the number of iterations T, and the smoothness of the FE-membrane γ.
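For comparison purposes, the MPM classification step (10)-(12) amounts to a few averaging sweeps over per-class likelihood maps followed by a pixelwise arg-max. The sketch below is an approximation of that step (a synchronous update with edge padding is used instead of the Gauss-Seidel sweep of [12]); array shapes and names are assumptions of the illustration.

    import numpy as np

    def mpm_classify(g, phi_per_class, sigma, T=10):
        """MPM step of MPM-MAP: diffusion of the likelihood (eq. 11), then arg-max (eq. 10).

        g             : observed image, shape (H, W)
        phi_per_class : array of model images, shape (Q, H, W)
        """
        # eq. (12): Gaussian likelihood of the data given each class model
        p = np.exp(-((phi_per_class - g[None]) ** 2) / sigma ** 2)
        p /= np.sqrt(2.0 * np.pi) * sigma
        # eq. (11): T iterations of averaging over the 4-neighbourhood
        for _ in range(T):
            padded = np.pad(p, ((0, 0), (1, 1), (1, 1)), mode="edge")
            p = 0.25 * (padded[:, :-2, 1:-1] + padded[:, 2:, 1:-1]
                        + padded[:, 1:-1, :-2] + padded[:, 1:-1, 2:])
        # eq. (10): maximally probable class at each pixel
        return np.argmax(p, axis=0)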
In the next section, we present numerical experiments demonstrating that our formulation, based on an RRN, produces better results than the MPM-MAP method, even when the classification step is computed with a simple local optimization method (the ICM algorithm).
4
Experiments
In this section, we show numerical experiments that demonstrate the performance of the presented method on the segmentation and restoration of real data. The first set of experiments presents a comparison of the proposed algorithm and the MPM-MAP method with flexible models. Specifically, the flexible model is identical for the two algorithms (but with a quasi-robust potential in the RN in our case). The task is to segment a noisy image containing bright and dark objects. We assume that the objects have a relatively large size and smooth variations in their gray level. Figure 3 shows the test data: (a) the original image (128 × 128 pixels and normalized to the interval [0, 1]), (b) the hand segmented mask (for comparison purposes); (c) the noisy test image generated by adding Gaussian noise with σ equal to 10% of the original dynamic range. Panel 3-(d) shows the "ideal" segmented data.
Figure 4 shows the computed results: the top row shows the results of the MPM-MAP algorithm and the bottom row corresponds to the results of our method. The interpolation nodes of the RN were distributed in a regular 17×17 grid. For the case of the MPM-MAP, the likelihood was computed using the true value of the noise standard deviation (σ = 0.1) and the regularization parameter was set to γ = 15. The smoothing for each measure field was performed by computing 10 iterations (T) of homogeneous diffusion. For the proposed algorithm, the parameters were k1 = 1, k2 = 0.1, λ1 = 0.24, λ2 = 0.3 and λ3 = 3. Images in the first column of figure 4 correspond to the "restorations" computed with the two methods, while the second column shows the computed segmentation masks. The third column shows the masked data, and finally, the error maps for the computed segmentation masks are shown in the last column. The percentage of misclassified pixels was 4.9% and 1.8% for the MPM-MAP and the proposed algorithm, respectively. Note that the error of the reported method is mainly located within a thin region close to the assumed "true border", where many of the pixels were difficult to classify manually. In contrast, MPM-MAP fails in areas where the gray levels of two regions are similar (for example, the bottom of the ball), while the proposed method takes better account of the gradients present in the image. Indeed, the error at the top of the ball is the result of a gradient in the gray level. Furthermore, the borders are not well located using MPM-MAP, because of the tendency of the algorithm to over-smooth shapes.
In order to explain the last result, we note that the Gauss-Markov measure field (GMMF) approach to maximizing the a posteriori marginals (the MPM step in MPM-MAP) smooths the shape of regions. As a consequence, region borders do not coincide with the edges in the image. This effect is more noticeable for
Fig. 3. Test data: (a) Original image, (b) manual segmentation mask, (c) noisy version of the original data, and (d) masked version of (c).
Fig. 4. Restoration-segmentation computed with the two methods. Top row (results of the MPM-MAP): restoration, segmentation mask, masked data using the segmentation mask, and errors in the mask. Second row: results computed with the proposed method.
large values of the regularization parameter in the GMMF algorithm. In figure 5, we illustrate this effect. The task consists of segmenting the image into three constant levels with simplified shapes (i.e., for a large value of the regularization parameter). Figure 5(a) shows the initial guess (maximum likelihood estimate for the MPM–MAP). Figure 5(b) shows the computed segmentation using MPM– MAP, and 5(c), the corresponding segmentation computed with the proposed method (specifically, functional (1) with a robust data term, given that we are using constant models).
Fig. 5. Segmentation with constant levels: (a) Initial guess (maximum likelihood estimate), (b) MPM–MAP and (c) proposed method.
5
Conclusion
We have presented a novel and general variational approach to segmentation and restoration based on the use of robust regularized networks (adaptive flexible models). The introduction of a robust potential in the data term allows us to use fast deterministic algorithms for estimating the segmentation (the support region for each model), by compensating for the outliers that typically confound such algorithms. Moreover, the quasi-robust potential in the regularization term allows us to adapt the stiffness of the flexible model according to the image gradient, which results in more accurate segmentation along edges in the image.
Acknowledgments. This work was funded in part by the U.S.P.H.S. under grants NS33662, LM03504, MH62100, AG15116 and AG17586. M. Rivera was also supported in part by CONACYT, Mexico, under grant 34575-A and a sabbatical stay.
References
1. Li, S.Z.: Markov Random Field Modeling in Image Analysis, Springer-Verlag, New York (2001)
2. Kirkpatrick, S., Gellatt, C.D., Vecchi, M.P.: Optimization by simulated annealing, Science, 220 (1983) 671–680
3. Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions and Bayesian restoration of images, IEEE Trans. Pattern Anal. Machine Intell., 6 (1984) 721–741
4. Besag, J.: On the statistical analysis of dirty pictures (with discussions), Journal of the Royal Statistical Society, Series B, 48 (1986) 259–302
5. Peterson, C., Soderberg, B.: A new method for mapping optimization problems onto neural networks, International Journal of Neural Systems, 1 (1989) 3–22
6. Geiger, D., Girosi, F.: Parallel and deterministic algorithms for MRFs: Surface reconstruction and integration, IEEE Trans. Pattern Anal. Machine Intell., 12 (1991) 401–412
7. Zhang, J.: Mean field theory in EM procedures for MRFs, IEEE Trans. Signal Processing, 40 (1992) 2570–2583
8. Heitz, F., Perez, P., Bouthermy, P.: Multiscale minimization of global energy functions in some visual recovery problems, Computer Vision, Graphics, and Image Processing: Image Understanding, 59 (1994) 125–136
9. Faugeras, O., Price, K.: Semantic description of aerial images using stochastic labeling, IEEE Trans. Pattern Anal. Machine Intell., 3 (1981) 638–642
10. Hummel, R.A., Zucker, S.W.: On the foundations of relaxation labeling process, IEEE Trans. Pattern Anal. Machine Intell., 5 (1983) 267–286
11. Li, S.Z.: MAP image restoration and segmentation by constrained optimization, IEEE Trans. Image Processing, 9 (1998) 1134–1138
12. Marroquin, J.L., Velazco, F.A., Rivera, M., Nakamura, M.: Gauss–Markov Measure Field Models for low-level vision, IEEE Trans. Pattern Anal. Machine Intell., 23 (2001) 337–348
13. Darrell, T.J., Pentland, A.P.: Cooperative robust estimation using layers of support, IEEE Trans. Pattern Anal. Machine Intell., 17 (1995) 474–487
14. Black, M.J., Fleet, D.J., Yacoob, Y.: Robustly estimating changes in image appearance, Comput. Vision Image Understand., 78 (2000) 8–31
15. Marroquin, J.L., Botello, S., Calderon, F., Vemuri, B.C.: MPM-MAP for image segmentation, In Proc. ICPR 2000, Barcelona, Spain, 4 (2000) 300–310
16. Samson, C., Blanc-Feraud, L., Aubert, G., Zerubia, J.: A variational model for image classification and restoration, IEEE Trans. Pattern Anal. and Machine Intell., 22 (2000) 460–472
17. Girosi, F., Jones, M., Poggio, T.: Regularization theory and neural networks architectures, Neural Computing, 7 (1995) 219–269
18. Rivera, M., Marroquin, J.L.: Efficient Half-Quadratic Regularization with Granularity Control, Technical Report 05.06.2001, I-01-07 (CC), CIMAT, Mexico (2001)
19. Geman, S., McClure, D.E.: Bayesian image analysis methods: An application to single photon emission tomography, Proc. Statistical Computation Section, Amer. Statistical Assoc., Washington, DC (1985) 12–18
20. Geman, D., Reynolds, G.: Constrained restoration and the recovery of discontinuities, IEEE Trans. Image Processing, 14 (1992) 367–383
21. Geman, D., Yang, C.: Nonlinear image recovery with half-quadratic regularization, IEEE Trans. Image Processing, 4 (1995) 932–946
22. Black, M.J., Rangarajan, A.: Unification of line process, outlier rejection, and robust statistics with application in early vision, Int. J. Comput. Vis., 19 (1996) 57–92
23. Charbonnier, P., Blanc-Féraud, L., Aubert, G., Barlaud, G.: Deterministic edge-preserving regularization in computer imaging, IEEE Trans. Image Processing, 6 (1997) 298–311
24. Rivera, M., Marroquin, J.L.: The adaptive rest condition spring model: an edge preserving regularization technique, in Proc. International Conference on Image Processing, IEEE-ICIP 2000, Vol. II, Vancouver, BC, Canada (2000) 805–807
25. Rivera, M., Marroquin, J.L.: Adaptive Rest Condition Potentials: Second Order Edge-Preserving Regularization, In Proc. European Conference on Computer Vision, ECCV 2002, Springer-Verlag, Copenhagen, Denmark, May (2002)
26. Girosi, F., Poggio, T., Caprile, B.: Extensions of a theory of networks for approximation and learning: outliers and negative examples, In R. Lippmann, J. Moody and D. Tourezky, editors, Advances in Neural Information Processing Systems 3, San Mateo, CA, Cambridge University Press (1991)
27. Girosi, F., Jones, M., Poggio, T.: Priors, stabilizers and basis functions: from regularization to radial, tensor and additive splines, MIT AI Lab Memo No. 1430, June (1993)
28. Weickert, J.: Efficient image segmentation using partial differential equations and morphology, Pattern Recognition, 34 (2001) 1813–1824
Principal Component Analysis over Continuous Subspaces and Intersection of Half-Spaces
Anat Levin and Amnon Shashua
Computer Science Department, Stanford University, Stanford, CA 94305
Abstract. Principal Component Analysis (PCA) is one of the most popular techniques for dimensionality reduction of multivariate data points with application areas covering many branches of science. However, conventional PCA handles the multivariate data in a discrete manner only, i.e., the covariance matrix represents only sample data points rather than higher-order data representations. In this paper we extend conventional PCA by proposing techniques for constructing the covariance matrix of uniformly sampled continuous regions in parameter space. These regions include polytops defined by convex combinations of sample data, and polyhedral regions defined by intersection of half spaces. The applications of these ideas in practice are simple and shown to be very effective in providing much superior generalization properties than conventional PCA for appearance-based recognition applications.
1
Introduction
Principal Component Analysis (PCA) [12], also known as the Karhunen-Loeve transform, has proven to be an exceedingly useful tool for dimensionality reduction of multivariate data, with many application areas in image analysis, pattern recognition and appearance-based visual recognition, data compression, time series prediction, and analysis of biological data, to mention a few. The typical definition of PCA calls for a given set of vectors a_1, ..., a_k in an n-dimensional space, with zero mean, arranged as the columns of an n × k matrix A. The output set of principal vectors u_1, ..., u_q is an orthonormal set of vectors representing the eigenvectors of the sample covariance matrix AA^T associated with the q < n largest eigenvalues. The matrix UU^T is a projection onto the principal components space with the property that (i) the projection of the original sample is "faithful" in a least-squares sense, i.e., it minimizes

\min \sum_{i=1}^{k} \left| a_i - U U^\top a_i \right|^2,
This work was done while A.S. was a visiting faculty at Stanford University. The permanent address of the authors is at the Hebrew University of Jerusalem, Israel.
(ii) equivalently, that the projection of the sample set onto the lower dimensional space maximally retains the variance, i.e., the first principal vector u_1 maximizes |A^T u_1|^2, and so forth. The representation of a sample point a_i in the lower dimensional feature space is defined by x_i = U^T a_i, and (iii) the covariance matrix Q = \sum_i x_i x_i^T of the reduced-dimension feature representation is diagonal, i.e., PCA decorrelates the sample data (thus, if the sample data is drawn from a normal distribution then the variables in the feature space are statistically independent).
The strength of PCA for data analysis comes from its efficient computational mechanism, the fact that it is well understood, and from its general applicability. For example, a sample of applications in computer vision includes the representation and recognition of faces [25,26,16,3], recognition of 3D objects under varying pose [17], tracking of deformable objects [6], and representations of 3D range data of heads [1].
Over the years there have been many extensions to conventional PCA. For example, Independent Component Analysis (ICA) [8,5] is the attempt to extend PCA to go beyond decorrelation and to perform a dimension reduction onto a feature space with statistically independent variables. Other extensions address the situation where the sample data live in a low-dimensional (non-linear) manifold in an effort to retain a greater proportion of the variance using fewer components (cf. [7,11,10,13,27,21]); and yet other (related) extensions derive PCA from the perspective of density estimation (which facilitates modeling nonlinearities in the sample data) and the use of a Bayesian formulation for modeling the complexity of the sample data manifold [28].
In this paper we propose a different kind of extension to conventional PCA, which is orthogonal to the extensions proposed in the past, i.e., what we propose could be easily retrofitted to most of the PCA extensions. The extension we propose has to do with the representation of the sample data. Rather than having the data consist of points alone, we allow for the representation of continuous regions described by (i) convex combinations of sample points (polytops), and (ii) convex regions defined by intersection of half-spaces. In other words, we show how to construct the covariance matrix of a uniformly sampled polytop described by a finite set of sampled points (the generators of the polytop), and of a uniformly sampled polyhedral defined by intersection of half-spaces. In the former case, the integration over the polytop region boils down to a very simple modification of the original covariance matrix of the sampled point set: replace AA^T with AΦA^T, where Φ is a symmetric positive definite matrix which is described analytically per region. We show that despite the simplicity of the result the approach has a significant effect on the generalization properties of PCA in appearance-based recognition, especially when the raw data is not uniformly sampled. In the case of polyhedral regions described by intersections of half-spaces we show that although the concept of integration over the bounded region is not obvious, it can be done at a cost of O(n^3) in certain cases where the half-spaces are defined by inequalities over pairs of variables forming a tree structure. We demonstrate the application of this concept to intensity-ratio representations in appearance-based recognition and show a much superior generalization over conventional PCA.
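For reference, the conventional construction that the rest of the paper modifies can be written in a few lines. The sketch below (zero-mean data assumed; a thin SVD is used instead of an explicit eigendecomposition of AA^T) is an illustration, not code from the paper.

    import numpy as np

    def pca(A, q):
        """Conventional PCA: principal vectors U (n x q) and features X = U^T A.

        A : n x k matrix of zero-mean sample points (one column per sample)
        q : number of principal components to keep
        """
        # Eigenvectors of the sample covariance A A^T, obtained via the thin SVD of A.
        U_full, s, _ = np.linalg.svd(A, full_matrices=False)
        U = U_full[:, :q]          # principal vectors (q largest singular values)
        X = U.T @ A                # reduced-dimension feature representation
        return U, X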
1.1
Region-Based PCA — Definition and Motivation
We will start with a simple example. Given two arbitrary points a_1, a_2 in the two-dimensional plane, the first principal component u (a unit vector) maximizes the scatter (a_1^T u)^2 + (a_2^T u)^2, i.e., it is the first eigenvector (associated with the largest eigenvalue) of the 2 × 2 matrix a_1 a_1^T + a_2 a_2^T. Consider the case where we would like the entire line segment λa_1 + (1 − λ)a_2, 0 ≤ λ ≤ 1, to be sampled; how would that change the direction of the principal component u? In other words, if W is a polytop defined by the convex combination of a set of points (in this case W is a line segment), one is looking for the evaluation of the integral:

\max_{|u|=1} \int_{a \in W} |a^\top u|^2\, da = u^\top \left( \int_{a \in W} a a^\top\, da \right) u. \qquad (1)

By substituting λa_1 + (1 − λ)a_2 for a in the integral ∫ aa^T da and noting that

\int_0^1 \lambda^2\, d\lambda = \int_0^1 (1 - \lambda)^2\, d\lambda = \frac{1}{3}, \qquad \int_0^1 \lambda (1 - \lambda)\, d\lambda = \frac{1}{6},
we obtain the optimization problem:

\max_{|u|=1}\ u^\top \left[ a_1 a_1^\top + a_2 a_2^\top + \frac{1}{2}\left(a_1 a_2^\top + a_2 a_1^\top\right) \right] u.

Therefore, the first principal component u is the largest¹ eigenvector of the matrix

A \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix} A^\top = A \Phi A^\top, \qquad (2)

where A = [a_1, a_2] is the matrix whose columns consist of the sample points. In the following section we will generalize this simple example and handle any polytop (represented by its vertices). Once we have that at hand it is simple to accommodate a collection of polytops, or a combination of points and polytops, as input to the PCA process.
The motivation for representing polytops in a PCA framework arises from the fact that in many instances in visual recognition it is known a priori that the data resides in polytops: probably the most well known example is the case of varying illumination over Lambertian surfaces. There are both empirical and analytic justifications to the fact that a relatively small number of images are necessary to model the image variations of human faces under different lighting conditions. In this context, a number of researchers have raised the issue of how to optimally construct the subspace using a sample of images which may be biased. Integration over the polyhedral defined by a sample, even though it is a biased sample, would be a way to construct the image subspace. This is addressed in Section 2.1 of this paper.
In the second part of the paper we consider polyhedrals defined by the intersection of half-spaces. Let the variables of a data point be denoted by x_1, ..., x_n
¹ We mean by "largest" the eigenvector associated with the largest eigenvalue.
where the range of each variable is finite (say, they denote pixel values), and consider the polyhedral defined by the relations α_{ij}x_j < x_i < β_{ij}x_j for a number of pairs of variables. Each inequality defines a pair of half-spaces (the area on one side of a hyperplane); thus a (sufficient and consistent) collection of inequalities will define a polyhedral cone whose apex is at the origin. As before, we would like to represent the entire bounded region in the PCA analysis, and we will show how this can be done in the sequel. Our motivation for considering regions defined by inequalities comes from studies in visual recognition showing that the ratio alone between pre-selected image patches provides a very effective mechanism for matching under variability of illumination. For example, [24,18] propose a graph representation (see Fig. 3b for an example) where regions in the image correspond to nodes in the graph and the edges connect neighboring regions which have a consistent relation, for instance "the average image intensity of node i is between 0.35 and 0.75 of node j". A match between a test image and the model reduces to a graph matching problem. In this paper we show that the idea of a graph representation can be embedded into the PCA framework by looking for the principal components which best describe the region in space bounded by the hyperplanes associated with those inequalities. In other words, it is possible to recast data analysis problems defined by inequalities within a PCA framework, whatever the application may be.
2
PCA over Polytops Defined by Convex Combinations
Let W denote a polytop defined by the convex combinations of a set of linearly independent points a_1, ..., a_d, and let D_d be the (d − 1)-dimensional manifold

D_d = \{\mu = (\mu_1, ..., \mu_d) \in R^d \mid \sum_i \mu_i = 1,\ \mu_i \ge 0\},

and let V(D_d) = \int_{D_d} 1\, d\mu be the volume of D_d. The principal components are the eigenvectors of the covariance matrix

Cov(W) = \frac{1}{V(W)} \int_{a \in W} a a^\top\, da,

where V(W) denotes the volume of the polytop W, and the inverse volume outside the integral indicates that we have assumed a uniform density function when sampling the points a ∈ W. Let A = [a_1, ..., a_d] be the n × d matrix whose columns are the generators of W; then for every vector µ ∈ D_d we have that Aµ ∈ W. Therefore, the covariance matrix representing the dense (uniform) sampling of the polytop W takes the form:

Cov(W) = A \left( \frac{1}{V(D_d)} \int_{\mu \in D_d} \mu \mu^\top\, d\mu \right) A^\top = A \Phi_d A^\top. \qquad (3)
Note that the matrix Φ_d = (1/V(D_d)) ∫ µµ^T dµ does not depend on the choice of the generators a_1, ..., a_d; thus the integral needs to be evaluated only once for every choice of d. Note that since D_d is symmetric under the group of permutations of d letters, ∫ µ_i µ_j dµ = ∫ µ_1 µ_2 dµ for all i ≠ j, and ∫ µ_i^2 dµ = ∫ µ_1^2 dµ. Thus, the matrix Φ_d has the form

\Phi_d = \frac{1}{V(D_d)} \begin{pmatrix} \alpha_d & \beta_d & \cdots & \beta_d \\ \beta_d & \alpha_d & \cdots & \beta_d \\ \vdots & \vdots & \ddots & \vdots \\ \beta_d & \beta_d & \cdots & \alpha_d \end{pmatrix},

where α_d = ∫ µ_1^2 dµ and β_d = ∫ µ_1 µ_2 dµ. We are therefore left with the task of evaluating the three integrals α_d, β_d and V(D_d). By induction on d one can evaluate those integrals (the derivation is omitted due to lack of space) and obtain:

V(D_d) = \int_{D_d} 1\, d\mu = \frac{1}{(d-1)!}, \qquad \alpha_d = \int_{D_d} \mu_1^2\, d\mu = \frac{2}{(d+1)!}, \qquad \beta_d = \int_{D_d} \mu_1 \mu_2\, d\mu = \frac{1}{(d+1)!}.
We therefore obtain the following result:

Theorem 1. The covariance matrix of the uniformly sampled polytop W defined by the convex linear combinations of a set of d points a_1, ..., a_d, arranged as the columns of an n × d matrix A, has the form:

Cov(W) = \frac{1}{V(W)} \int_{a \in W} a a^\top\, da = \frac{1}{d(d+1)}\, A\, (I + e e^\top)\, A^\top, \qquad (4)

where e = (1, 1, ..., 1)^T and I is the identity matrix.

There are a few points worth noting. First, when the data consist of a single polytop, then after centering the data, i.e., subtracting the mean so that Ae = 0, the discrete covariance matrix AA^T over the vertices of W is the same as the covariance matrix Cov(W) of the entire polytop. Therefore, the integration makes a difference only when there are a number of polytops. Second, the matrix Φ_d = I + ee^T (dropping the constant factor, which does not affect the eigenvectors) can be factored as Q_d Q_d^T:

I + e e^\top = (I + c\, e e^\top)(I + c\, e e^\top)^\top = Q_d Q_d^\top, \qquad \text{where} \quad c = \frac{\sqrt{d+1} - 1}{d}.

This property can be used to perform PCA on a d × d matrix instead of the n × n covariance matrix when d << n, as follows. Denote Â = AQ_d (an n × d matrix); thus ÂÂ^T = AΦ_dA^T. In case d << n, let y be an eigenvector of Â^TÂ (a d × d matrix); then Ây is the corresponding eigenvector of ÂÂ^T. The importance of this property is that the computational complexity of recovering the principal vectors is proportional to the dimension of the polytop rather than the dimension n of the vector space.
A. Levin and A. Shashua
The third point to note is that the covariance matrix of two uniformly sampled polytops of dimensions d1 − 1 and d2 − 1 is the sum of the covariance matrices corresponding to each polytop separately. In other words, let a1 , ..., ad1 , arranged as columns of a matrix A, be the generators of the first polytop, and let b1 , ..., bd2 , arranged as columns of a matrix B, be the generators of the second polytop. The covariance matrix of the data covering the uniform sampling of both polytops is simply AΦd1 A + BΦd2 B . Thus, for example, given a collection of triplets of images of a class of 3D objects (say, frontally viewed human faces) where the i’th triplet, represented by a n × 3 matrix Ai , spans the illuminationcone of the i’th object, then the covariance matrix of the entire class is simply i Ai Φ3 A i . In case the number of triplets k << n the cost of computing the principal vectors is proportional to 3k rather than to n by noting that one can compute the eigenvectors of the 3k × 3k matrix B B where B = [A1 Q3 , ..., Ak Q3 ] is an n × 3k matrix. 2.1
Demonstrating the Strength of Results
In this section we will provide one example to illustrate the strength of our result. The simplicity of our result, replace AAT by AΦA , may be somewhat misleading — the procedure is indeed trivial, but its effects could be very significant as we show next. As mentioned in the introduction, empirical and analytic observations on the class of human faces have shown that a relatively small number of sample images of a Lambertian object are sufficient to model the image space of the object under varying lighting conditions. Early work showed that when no surface point is shadowed, as little as three images suffice [23,22]. Empirical results have shown that even with cast and attached shadows, the set of images is still well approximated by a low dimensional space [9]. Later work has shown that the set of all images form a polyhedral cone which is well approximated (for human faces) by a low dimensional linear subspace [4], and more recently that the illumination cone (for convex Lambertian objects) can be represented by a 9-dimensional linear subspace [2,19]. In this context, researchers [29,14] have also addressed the issue of how to construct the illumination cone from sample images, i.e., what would be the best representative sample?. A biased set of samples would produce a PCA space which is not effective for recognition. Closest to our work, [29] proposed a technique for integration over any three images employing a spherical parameterization of the 3D space, which in turn is specific to 3-dimensional subspaces. Therefore, the problem of “relighting” provides a good testing grounds for the integration over polytops idea. The integration can turn a biased sample into non-biased — and this is exactly the nature of the experiment below. Consider a training set of images of human frontal faces covering different people and covering various illumination conditions (direction of light sources). One would like to represent the training set by a small number of principal components [26,9]. The fact that the image space of a 3D object with matte (Lambertian) surface properties is known to occupy a small dimensional subspace suggests that the collection of training images per person forms a polytop
Fig. 1. Representing polytops constructed by images of different people, each person sampled by a biased collection of light source directions. (a.1-6) a sample of training images, (c.1-3) the first three principal vectors of the raw data — notice that the light source is biased compared to the principal vectors in (c.4-6) computed from the polytop representations. (b.1-4) novel images of two persons (test images), and their projections on the principal space: (b.5-8) when computed from the raw data, and (b.9-12) when computed from the polytop representation. Note that the latter has much superior generalization performance.
which will be uniformly sampled when creating the covariance matrix of the entire training set. In other words, the training set would consist of a collection of polytops, one per person, where each polytop is defined by the convex combinations of the set of images of that person. In order to appreciate the difference between representing the polytops versus representing the sample data points alone, we constructed a biased set of images with respect to the illumination. We used the training set provided by Yale University's "illumination dome", where for each of the 38 objects we sampled 42 images: 40 of them illuminated from light sources in the left hemisphere and only 2 from light sources located in the right hemisphere. Since each of the faces is represented by a (biased) sample of 42 images, whereas the polytop we wish to construct is only 3-dimensional, we do the following. For each PCA-plane corresponding to the sample of a face, we reparameterize the space such that all the 42 images are represented with respect
to their first 3 principal components, i.e., each image is represented by a 3-coordinate vector. Those vectors are normalized, thus they reside on a sphere. The normalized 2D coordinates then undergo a triangulation procedure. Let A_i be the n × 3 matrix representing the i-th triangle; then A_i Φ_3 A_i^T represents the covariance matrix of the uniformly sampled triangle. As a result, \sum_i A_i Φ_3 A_i^T represents the covariance matrix of the continuous space defined by the 42 sample images of the face. This is done for each of the 38 faces and the final covariance matrix is the sum of all the individual covariance matrices.
Fig. 1a.1-6 shows a sample of images of a person in the training set. In row c.1-3 we show the first three principal vectors when the covariance matrix is constructed in the conventional way (i.e., AA^T where the columns of A contain the entire training set). Note that the first principal vector (which typically represents the average data) has a strong shadow on the right hand side of the face. In row c.4-6, on the other hand, we show the corresponding principal vectors when the covariance matrix is constructed by summing up the individual covariance matrices, one for each polytop (as described above). Note that the principal vectors represent an unbiased illumination coverage. The effect of representing the polytops is very noticeable when we look at the projection of novel images onto the principal vector set. In row b.1-4 we consider four novel images, two images per person. The projections of the novel images onto the subspace spanned by the first 40 principal vectors are shown in row b.5-8 for conventional PCA and in row b.9-12 when polytops are represented. The difference is especially striking when the illumination of the novel image is from the right (where the original training sample was very small). One can clearly see that the region-based PCA has a much superior generalization property compared to the conventional approach, despite the fact that only a very simple modification was made to the conventional construction of the covariance matrix.
3
PCA over Polyhedrals Defined by Inequalities
In this section we turn our attention to polyhedrals defined by the intersection of half-spaces. Specifically, we will focus on the family of polyhedrals defined by the inequalities

\alpha_{ij} x_j \le x_i \le \beta_{ij} x_j, \qquad (5)

where x_1, ..., x_n are the coordinates of the vector space representing the input parameter space, and α_{ij}, β_{ij} are scalars. Each such inequality between a pair of variables x_i, x_j represents a pair of hyperplanes, passing through the origin, bounding the possible data points in the desired (convex) polyhedral. These types of inequalities arise in appearance-based visual recognition where only ratios among certain key regions are considered when matching an image to a class of objects. Fig. 3b illustrates a sketch of a face image with the key regions marked and the pairs of regions for which the ratio of the average grey value is being considered for matching. In the following section we will discuss this type of application in more detail. In order to make clear the relation between a set of inequalities, as defined above, and polyhedrals bounding a finite volume in parameter space, it would be useful
to consider the following graph structure. Let the graph G(V, E) be defined such that the set of vertices V = {x_1, ..., x_n} represents the coordinates of the parameter space, and for each inequality relation between x_i, x_j there is a corresponding edge e_{ij} ∈ E coincident with vertices x_i, x_j in the graph. Tree structures (connected graphs with n − 1 edges) are of particular interest because they correspond to a convex polyhedral:

Claim. Let the associated graph of the set of inequalities of the type α_{ij}x_j ≤ x_i ≤ β_{ij}x_j form a tree. Then, given that an arbitrary variable x_1 is bounded, 0 ≤ x_1 ≤ 1, all other variables are defined in a finite interval, and as a result the collection of resulting hyperplanes bounds a finite volume in space.

Proof: Due to the connectivity and lack of cycles in the graph, one can chain the inequality relations along paths of the graph leading to x_1 and obtain a set of new inequalities of the form δ_j x_1 ≤ x_j ≤ γ_j x_1 for some scalars δ_j, γ_j. Therefore, the hyperplane x_1 = constant, which does not pass through the origin, bounds a finite volume because the range of all other variables is finite.

From this point until the end of this section, we will assume that the associated graph representing the set of input inequalities is a connected tree (i.e., it has n − 1 edges and no cycles). Our goal is to compute the integral

\int_{x \in W} x x^\top\, dx_1 \cdots dx_n,

where W is the region bounded by the hyperplanes corresponding to the inequalities and the additional hyperplane corresponding to x_o = 1, where x_o is one arbitrarily chosen variable (we will discuss how to choose x_o later in the implementation section). Since the entries of the matrix xx^T are bilinear products of the variables, we need to find a way of evaluating the integral on monomials x_1^{\mu_1} \cdots x_n^{\mu_n}, where the µ_i are non-negative natural numbers. For a single constraint α_{ij}x_j ≤ x_i ≤ β_{ij}x_j the integration over dx_i is straightforward:

\int_{\alpha_{ij} x_j}^{\beta_{ij} x_j} x_1^{\mu_1} \cdots x_n^{\mu_n}\, dx_i = \frac{1}{\mu_i + 1}\left(\beta_{ij}^{\mu_i + 1} - \alpha_{ij}^{\mu_i + 1}\right) x_j^{\mu_i + \mu_j + 1} \prod_{k \ne i, j} x_k^{\mu_k}. \qquad (6)
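The single-constraint rule (6) is easy to verify symbolically for a small case (a sketch using sympy; the particular exponents are arbitrary):

    import sympy as sp

    x_i, x_j, x_k = sp.symbols('x_i x_j x_k', positive=True)
    alpha, beta = sp.symbols('alpha beta', positive=True)
    mu_i, mu_j, mu_k = 2, 1, 3          # arbitrary exponents for the check

    monomial = x_i**mu_i * x_j**mu_j * x_k**mu_k
    lhs = sp.integrate(monomial, (x_i, alpha * x_j, beta * x_j))
    rhs = (beta**(mu_i + 1) - alpha**(mu_i + 1)) / (mu_i + 1) \
          * x_j**(mu_i + mu_j + 1) * x_k**mu_k
    print(sp.simplify(lhs - rhs))        # prints 0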
For multiple constraints our challenge is to perform the integration without breaking W into sub-regions. For example, consider the two inequalities below:

\alpha_{ij} x_j \le x_i \le \beta_{ij} x_j, \qquad \alpha_{ik} x_k \le x_i \le \beta_{ik} x_k.

Then, the integration over the variable x_i (which is bounded both by x_j and by x_k) takes the form

\int_{\max\{\alpha_{ij} x_j,\, \alpha_{ik} x_k\}}^{\min\{\beta_{ij} x_j,\, \beta_{ik} x_k\}} x_1^{\mu_1} \cdots x_n^{\mu_n}\, dx_i,

which requires breaking up the region W into 4 pieces. Alternatively, by noticing that α_{ij}x_j ≤ x_i ≤ β_{ij}x_j is equivalent to (1/β_{ij})x_i ≤ x_j ≤ (1/α_{ij})x_i, the integration would take the form

\int_{\alpha_{ik} x_k}^{\beta_{ik} x_k} \int_{\frac{1}{\beta_{ij}} x_i}^{\frac{1}{\alpha_{ij}} x_i} x_1^{\mu_1} \cdots x_n^{\mu_n}\, dx_j\, dx_i.
Therefore, in order to simplify the complexity of the integration process one must permute the variables i → π(i) and switch the variables inside the inequalities such that, after the re-ordering, the following condition holds: for every i, there exists at most a single constraint α_{iρ(i)} x_{ρ(i)} ≤ x_i ≤ β_{iρ(i)} x_{ρ(i)} with π(ρ(i)) > π(i), i.e., the integration over x_i is performed before the integration over x_{ρ(i)}. In this case the integration over the region W takes the form

\int_0^1 \int_{\alpha_{\pi_{n-1}, \rho(\pi_{n-1})} x_{\rho(\pi_{n-1})}}^{\beta_{\pi_{n-1}, \rho(\pi_{n-1})} x_{\rho(\pi_{n-1})}} \cdots \int_{\alpha_{\pi_1, \rho(\pi_1)} x_{\rho(\pi_1)}}^{\beta_{\pi_1, \rho(\pi_1)} x_{\rho(\pi_1)}} x x^\top\, dx_{\pi_1} \cdots dx_{\pi_n}, \qquad (7)
where π_i stands for π(i). Before we explain how this could be achieved via the associated graph, consider the following example for clarification. Let n = 4 and we are given the following inequalities:

x_2 \le x_1 \le 2 x_2, \qquad x_3 \le x_1 \le 3 x_3, \qquad x_1 \le x_4 \le 2 x_1.

Since x_1 is bounded twice, we replace the first inequality with its equivalent form

\frac{1}{2} x_1 \le x_2 \le x_1.

We therefore have ρ(1) = 3, ρ(2) = 1 and ρ(4) = 1. Select the permutation (143), i.e., π(1) = 4, π(2) = 2, π(3) = 1 and π(4) = 3. The integration of the monomial x_3^2 (for instance) over the bounded region W is therefore

\int_0^1 \int_{x_3}^{3 x_3} \int_{\frac{1}{2} x_1}^{x_1} \int_{x_1}^{2 x_1} x_3^2\, dx_4\, dx_2\, dx_1\, dx_3.
The integration over x_4 is performed first: \int_{x_1}^{2x_1} x_3^2\, dx_4 = x_1 x_3^2 (according to eqn. 6); then the integration over x_2 is performed: \int_{\frac{1}{2}x_1}^{x_1} x_1 x_3^2\, dx_2 = \frac{1}{2} x_1^2 x_3^2; followed by the integration over x_1: \int_{x_3}^{3x_3} \frac{1}{2} x_1^2 x_3^2\, dx_1 = \frac{13}{3} x_3^5; and finally the integration over x_3 (the free variable): \int_0^1 \frac{13}{3} x_3^5\, dx_3 = \frac{13}{18}.
The decision of which inequality to "turn around" and how to select the order of integration (the permutation π(i)) can be made through simple graph algorithms, as follows. We will assign directions to the edges of the graph G with the convention that a directed edge x_i → x_j represents the inequality α_{ij}x_j ≤ x_i ≤ β_{ij}x_j. The condition that for every i there should exist at most a single inequality α_{ij}x_j ≤ x_i ≤ β_{ij}x_j is equivalent to the condition that the associated directed graph has at most one outgoing edge for every node. The algorithm for directing the edges of the undirected graph starts from some degree-1 node (a node with a single incident edge) and traces a path until a degree-1 node is reached again. The direction of the edges then follows the path. The process repeats itself with a new degree-1 node until no new nodes remain. Since G is a tree this process is well defined. The selection of the order of integration is then simply obtained by a topological sort procedure. The reason for that is that one can view every pair x_i → x_j as
Fig. 2. (a.1-6) A sample of the training set from the AR dataset. (b.1-4) the first four principal vectors computed by integrating over the polyhedral region defined by the inequalities, and (b.5-8) are the principal vectors computed from the raw point data (in feature space).
a partial ordering (x_i comes before x_j). The topological sort provides a complete ordering (which is not necessarily unique) that satisfies the partial orderings. The complete order is the desired permutation. The example above is displayed graphically in Fig. 3a, where the directed 4-node graph is shown together with the topological sort result x_4, x_2, x_1, x_3 (note that x_2, x_4, x_1, x_3 is also a complete ordering, which yields the same integration result).
To summarize, given a set of n − 1 inequalities that form a connected tree, the covariance matrix of the resulting polyhedral is computed as follows (a sketch of step 4 is given after the list):
1. Direct the edges of the associated graph so that there is at most a single outgoing edge from each node.
2. "Turn around" the inequalities which do not conform to the edge direction convention.
3. Perform a topological sort on the resulting directed tree.
4. Evaluate the integral in eqn. (7), where the complete ordering from the topological sort is substituted for the permutation π(i).
The complexity of this procedure is O(n) for every entry of the n × n matrix xx^T.
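Step 4 lends itself to a computer algebra system. The sketch below reproduces the n = 4 example above with the integration order x4, x2, x1, x3 produced by the topological sort; the graph steps 1-3 are hard-coded for brevity, so this is an illustration rather than a general implementation.

    import sympy as sp

    x1, x2, x3, x4 = sp.symbols('x1 x2 x3 x4', positive=True)

    # Integrand: here the monomial x3**2; for the covariance matrix one would
    # loop over all bilinear monomials x_i * x_j.
    expr = x3**2

    # Innermost to outermost, following the order x4, x2, x1, x3 from the topological sort:
    expr = sp.integrate(expr, (x4, x1, 2 * x1))      # x1   <= x4 <= 2 x1
    expr = sp.integrate(expr, (x2, x1 / 2, x1))      # x1/2 <= x2 <= x1
    expr = sp.integrate(expr, (x1, x3, 3 * x3))      # x3   <= x1 <= 3 x3
    expr = sp.integrate(expr, (x3, 0, 1))            # free variable
    print(expr)                                      # 13/18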
3.1
Experimental Details
In this section we illustrate the application of principal vectors defined by a set of inequalities in the domain of representing a class of images by intensity ratios — an idea first introduced by [24,18]. Consider a training set of human frontal faces, roughly aligned, where certain key regions have been identified. For example, [24] illustrates a manual selection of key regions and a manual determination of the inequalities on the average intensity of the key regions. The associated graph becomes the model of the class of objects and the matching against a novel image is reduced to a graph matching procedure.
Fig. 3. (a) The associated tree of the n = 4 example. (b) A graphical description of the associated tree in the face detection experiment using inequalities. (c) Typical examples of true and false positive and negative detections for the leading technique (first row in table 4). (d) Typical examples for the worst technique (third row in table 4).
In this section we will re-implement the intensity-ratio inequality approach, but instead of using a graph matching procedure we will apply a PCA representation on the resulting polyhedral defined by the associated tree. There are a number of advantages of doing so: for example, the PCA approach allows us to combine both raw data points, polytops defined by convex combinations of raw data points, and the polyhedrals defined by the inequalities. In other words, rather than viewing the intensity ratio approach as the engine for classification it could be just another cue integrated in the global covariance matrix. Second, by representing the polyhedral by its principal vectors one can make “soft” decisions based on the projection onto the reduced space, which is less natural to have in a graph matching approach. As for training set, we used 100 images from the AR set [15] representing aligned frontal human faces (see Fig. 2a). The key regions were determined by applying a K-means clustering algorithm on the covariance matrix; five clusters were found and those were broken down based on connectivity to 13 key regions. The average intensity value was recorded per region thus creating the vector x = (x1 , ..., x13 ) as the feature representation of the original raw images. For
every pair of variables x_i, x_j we recorded the sine of the angle between the vector of x_i values recorded over the entire training set and the corresponding vector of x_j values, thus defining a complete associated graph with weights inversely proportional to the correlation between the pairs of variables. The minimal spanning tree of this graph was selected as the associated tree. Fig. 3b shows the key regions and the edges of the associated tree. Finally, for every pair of variables x_i, x_j which has an incident edge in the associated tree, we determined the upper and lower bounds of the inequality by constructing the histogram of x_i/x_j and selected a_ij to be at the lower 30% point of the histogram and b_ij to be at the upper 30% point. This completes the data preparation phase for the region-based PCA applied to the polyhedral region defined by the associated tree.
Fig. 2b.1-4 shows the first four principal vectors of the region-based PCA using the integration formulas described in the previous section, while Fig. 2b.5-8 shows the principal vectors using conventional PCA on the feature space vectors. One can see that the first principal vectors (b.1 and b.5) are very similar, yet the remaining principal vectors are quite different.
In table 4 we compare the performance of various uses of PCA on the CMU [20] test set of faces (which constitute postcards of people). The best technique was the product of the conventional PCA score on the raw image representation and the region-based PCA score. The results are displayed in the first row of the table. The false detections (false positives) are measured as a fraction of the total number of faces in the CMU test set. The missed detections (false negatives) are measured as the percentage of the total number of true faces in the test set. Each column in the table represents a different tradeoff between the false positives and negatives: the better detection performance is at the expense of false positives. Thus, for example, when the detection rate was set to 96% (the highest possible in this technique) the false detection rate was 1.7 times the total number of faces in the test set, whereas when the detection rate was set to 89% the false detection rate went down to 0.67 of the total number of faces. In the second row we use only conventional PCA: the score on the raw image representation multiplied by the score on the clustered image (feature vector of 13 dimensions). The reduced performance is noticeable and significant. The worst performance is presented in the third row, where only conventional PCA was used on the raw image representation. The region-based PCA performance is shown in the 4th row: the performance is lower than the leading approach, but not much lower. And finally, conventional PCA on the clustered representation (13-dimensional feature vector) is shown in the 5th row: note that the performance compared to the 4th row is significantly reduced. Taken together, the region-PCA approach provides significant superiority in generalization properties compared to conventional PCA, despite the fact that it is essentially a PCA approach. The fact that the relevant region of the parameter space is sampled correctly is the key factor behind the superior performance. In Fig. 3c-d we show some typical examples of detections, which contain true detections, false positives and negatives, for the leading technique (first row in the table) and the worst technique (third row in the table).
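A sketch of the data preparation just described (sin-of-angle weights, minimum spanning tree, histogram-based bounds at the 30% points) is given below; the feature-matrix layout and function names are assumptions of the illustration, not the authors' code.

    import numpy as np
    from scipy.sparse.csgraph import minimum_spanning_tree

    def associated_tree_and_bounds(X, lower_pct=30, upper_pct=70):
        """X : m x n matrix, one row per training image, one column per key region."""
        Xn = X / np.linalg.norm(X, axis=0, keepdims=True)     # unit-norm columns
        cos = np.clip(Xn.T @ Xn, -1.0, 1.0)
        weights = np.sqrt(1.0 - cos ** 2)                     # sin of the angle between columns
        tree = minimum_spanning_tree(weights).tocoo()         # associated tree (n - 1 edges)
        edges, bounds = [], []
        for i, j in zip(tree.row, tree.col):
            ratio = X[:, i] / X[:, j]                         # x_i / x_j over the training set
            a_ij = np.percentile(ratio, lower_pct)            # lower bound alpha_ij
            b_ij = np.percentile(ratio, upper_pct)            # upper bound beta_ij
            edges.append((i, j))
            bounds.append((a_ij, b_ij))
        return edges, bounds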
False detections              1.7    1.1    0.67
raw-PCA & region-PCA          96%    91%    89%
raw-PCA & PCA(13-dim)         80%    76%    75%
raw-PCA                       60%    52%    54%
region-PCA                    90%    86%    83%
conventional-PCA(13-dim)      79%    76%    72%
Fig. 4. Comparison of detection performance. The false detections (false positives) are measured as a fraction of the total number of faces in the CMU test set. The misdetections (false negatives) are measured as the percentage of the total number of true faces in the test set. Each column in the table represents a different tradeoff between the false positives and negatives — the better detection performance is at the expense of false positives. The rows in the table represent the different techniques being used. See text for further details.
4
Summary
The paper makes a number of statements, which include: (i) in some data analysis applications it becomes important to represent (a uniform sampling of) continuous regions of the parameter space as part of the global covariance matrix of the data; (ii) in the case where the continuous regions are polytops, defined by the convex combinations of sample data, the construction of the covariance matrix is extremely simple: replace the conventional AA^T covariance matrix with AΦA^T, where Φ is described analytically in this paper; and (iii) the general idea extends to challenging regions such as those defined by intersections of half-spaces, for which we have derived the equations for constructing the covariance matrix when the regions are formed by n − 1 inequalities on pairs of variables forming an associated tree structure. The concepts laid down in this paper are not restricted to computer vision applications and possibly have a wider range of applications, just as conventional PCA is widely applicable. In the computer vision domain we have shown that these concepts are effective in appearance-based visual recognition where continuous regions are defined by the illumination space (Section 2), which is known to occupy low-dimensional subspaces, and in intensity-ratio representations. In the former case the regions form polytops, and we have seen that the representation of those polytops has a big effect on the generalization properties of the principal vectors (Fig. 1), yet the price of applying the proposed approach is minimal. In the case of intensity-ratio representations, the notion of representing bounded spaces, defined by inequalities, by integration over the bounded region is not obvious, but it is possible and at a low cost of O(n^3). We have shown that the application of this concept provides much superior generalization properties compared to conventional PCA (Table 4). Future work on these ideas includes non-uniform sampling of regions in the case of polytops, handling the integration for general associated graphs (although in general the amount of work is exponential in the size and number of cycles in the graph), and exploring more applications for these basic concepts.
Acknowledgements. A.S. thanks Leo Guibas for hosting him during the academic year 2001/2. We thank Michael Elad and Gene Golub for insightful comments on the draft of this paper.
References

1. J.J. Atick, P.A. Griffin, and N.A. Redlich. Statistical approach to shape-from-shading: deriving 3d face surfaces from single 2d images. Neural Computation, 1997.
2. R. Basri and D. Jacobs. Photometric stereo with general, unknown lighting. In ICCV, Vancouver, Canada, July 2001.
3. P.N. Belhumeur, J.P. Hespanha, and D.J. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. In Proceedings of the European Conference on Computer Vision, 1996.
4. P.N. Belhumeur, J.P. Hespanha, and D.J. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. In Proceedings of the European Conference on Computer Vision, 1996.
5. A.J. Bell and T.J. Sejnowski. An information maximization approach to blind separation and blind deconvolution. Neural Computation 7(6), pages 1129–1159, 1995.
6. Michael J. Black and D. Jepson. Eigen-tracking: Robust matching and tracking of articulated objects using a view-based representation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 329–342, Cambridge, England, 1996.
7. C. Bregler and S.M. Omohundro. Nonlinear manifold learning for visual speech recognition. In ICCV, Boston, June 1995.
8. P. Comon. Independent component analysis, a new concept? Signal Processing 36(3), pages 11–20, 1994.
9. P. Hallinan. A low-dimensional representation of human faces for arbitrary lighting conditions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 995–999, 1994.
10. T. Hastie and W. Stuetzle. Principal curves. Journal of the American Statistical Association 84, pages 502–516, 1989.
11. T. Heap and D. Hogg. Wormholes in shape space: Tracking through discontinuous changes in shape. In ICCV, 1998.
12. I.T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.
13. M.A. Kramer. Nonlinear principal component analysis using autoassociative neural networks. AI Journal 37(2), pages 233–243, 1991.
14. K.C. Lee, J. Ho, and D. Kriegman. Nine points of light: Acquiring subspaces for face recognition under variable lighting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2001.
15. A.M. Martinez and A.C. Kak. PCA versus LDA. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(1):228–233, 2001.
16. B. Moghaddam, A. Pentland, and B. Starner. View-based and modular eigenspaces for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 84–91, 1994.
17. H. Murase and S.K. Nayar. Learning and recognition of 3D objects from appearance. In IEEE 2nd Qualitative Vision Workshop, pages 39–50, New York, NY, June 1993.
18. E. Grimson, P. Lipson, and P. Sinha. Configuration based scene classification and image-indexing. In CVPR, San Juan, Puerto Rico, 1997.
19. R. Ramamoorthi and P. Hanrahan. On the relationship between radiance and irradiance: Determining the illumination from images of a convex Lambertian object. Journal of the Optical Society of America A, pages 2448–2459, Oct. 2001.
20. H. Schneiderman and T. Kanade. A statistical model for 3d object detection applied to faces and cars. In CVPR, South Carolina, June 2000.
21. V. Silva, J.B. Tenenbaum, and J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290, December 2000.
22. A. Shashua. Illumination and view position in 3D visual recognition. In Proceedings of the conference on Neural Information Processing Systems (NIPS), Denver, CO, December 1991.
23. A. Shashua. On photometric issues in 3D visual recognition from a single 2D image. International Journal of Computer Vision, 21:99–122, 1997.
24. P. Sinha. Object recognition via image invariances. Investigative Ophthalmology and Visual Science 35/4:#1735, 1994.
25. L. Sirovich and M. Kirby. Low dimensional procedure for the characterization of human faces. Journal of the Optical Society of America, 4(3):519–524, 1987.
26. M. Turk and A. Pentland. Eigenfaces for recognition. J. of Cognitive Neuroscience, 3(1), 1991.
27. A.R. Webb. An approach to nonlinear principal components analysis using radially symmetrical kernel functions. Statistics and Computing 6(2), pages 159–168, 1996.
28. J.M. Winn and C.M. Bishop. Non-linear Bayesian image modelling. In Proceedings of the European Conference on Computer Vision, Dublin, Ireland, June 2000.
29. L. Zhao and Y.H. Yang. Theoretical analysis of illumination in PCA-based vision systems. Pattern Recognition, 32:547–564, 1999.
On Pencils of Tangent Planes and the Recognition of Smooth 3D Shapes from Silhouettes Svetlana Lazebnik1 , Amit Sethi1 , Cordelia Schmid2 , David Kriegman1 , Jean Ponce1 , and Martial Hebert3 1
Beckman Institute for Advanced Science and Technology University of Illinois at Urbana-Champaign 405 North Mathews Avenue Urbana, IL 61801 USA {slazebni, asethi, kriegman, j-ponce}@uiuc.edu 2 INRIA Rhône-Alpes 665, avenue de l’Europe 38330 Montbonnot, FRANCE
[email protected] 3 Carnegie Mellon University Robotics Institute 5000 Forbes Avenue Pittsburgh, PA 15213 USA
[email protected]
Abstract. This paper presents a geometric approach to recognizing smooth objects from their outlines. We define a signature function that associates feature vectors with objects and baselines connecting pairs of possible viewpoints. Feature vectors, which can be projective, affine, or Euclidean, are computed using the planes that pass through a fixed baseline and are also tangent to the object’s surface. In the proposed framework, matching a test outline to a set of training outlines is equivalent to finding intersections in feature space between the images of the training and the test signature functions. The paper presents experimental results for the case of internally calibrated perspective cameras, where the feature vectors are angles between epipolar tangent planes.
1 Introduction
Many recognition systems represent objects using features derived directly from image intensity patterns. These systems work well on textured animals [10] or objects with distinctive markings, like faces and cars [11,12]. However, they are limited in their ability to distinguish between objects based on true 3D shape. They may fail, for instance, to tell the difference between a tiger and a tiger-skin rug. Another difficulty is that some classes of objects do not have intensity or color descriptors with sufficient discriminative power: in the absence of surface texture or markings, the silhouette becomes the main clue to the object’s identity.
A common approach to silhouette-based matching consists of finding rich local descriptors for a set of contour points. This process may involve computing orientation information associated with pairs and triples of points [3] or attaching two-dimensional “shape context” histograms to each point [2]. An important advantage of these methods is that they do not require complete segmentation, working instead with a scattered set of edge points. However, most known contour descriptors are mainly suitable for 2D recognition — from a geometric point of view, it is hard to justify the appropriateness of arbitrary outline statistics for matching multiple views of the same 3D object. In this paper, we present a true geometric approach to recognizing smooth 3D objects. We follow the general philosophy of deriving a rich silhouette description to build a highly descriptive feature space, while taking care to define features that have a rigorous 3D interpretation. In our framework, a potential match between two outlines is a hypothesis of a consistent epipolar geometry between the two respective viewpoints. Previous work on geometric silhouette matching has been limited, considering only weak perspective or restricting the set of allowable camera movements [1,7,9]. The approach proposed in this paper is fully general, encompassing the cases of uncalibrated and internally calibrated perspective projection, as well as affine projection. Our method is not restricted to outlines taken from nearby viewpoints, and explicitly accounts for self-occlusion. In the following section, we give a conceptual introduction to our recognition framework. In Sections 3 and 4, we discuss the properties of the feature space and the conditions for matching outlines. In Section 5, we address implementation issues and report results from a preliminary recognition experiment.
2 Recognition Framework
Assume that we are given a training set of outlines of a single object. We want to construct a representation suitable for recognizing instances of this object based on outlines in test images from viewpoints not present in the training set. The key idea is the following: for each image, we associate a set of invariants with each possible baseline connecting its viewpoint to any other viewpoint. The baseline between two camera centers determines the epipolar geometry of the scene: the epipoles are the intersections of the baseline with the image planes, and the epipolar lines are intersections of the image planes with the pencil of epipolar planes passing through the baseline (Figure 2). The epipolar geometry does not change when we translate the camera centers while keeping the baseline fixed. We take advantage of this invariance property by introducing a signature function S that assigns a vector from some feature space F to each possible baseline. The feature vectors (to be defined precisely in Section 3.3) can be computed in the image given an outline and a hypothesized epipole, but they measure properties of the 3D object in space. Let Γ denote the collection of the outlines in the training images (a discrete set of pictures or a video clip). Any point e on the projective image plane of some γ in Γ is a potential epipole, and is identified with a unique line (the line
Fig. 1. Outlines γ and γ′ are connected in space by the baseline through their respective camera centers, O and O′. This baseline intersects the image planes in epipoles e and e′. Computing signature functions S(γ, e) and S(γ′, e′) yields a vector f in feature space F that lies on the intersection of the two signature surfaces Σ and Σ′.
passing through the camera center and piercing the image plane at the epipole location). Thus, sampling all possible epipoles is equivalent to sampling the two-dimensional set of all baselines passing through the camera center. Indeed, the space of all lines through the origin in 3-space is topologically equivalent to P2, the projective plane. The signature function S encapsulates the relationship between outlines and epipoles/viewing rays as follows:

S : Γ × P2 → F ,    (γ, e) ↦ f .

Since the feature space F contains information about the 3D properties of the scene, it is actually possible to express S as a function of an object and a line in space. However, the definition above emphasizes the 2D information that is directly accessible to the recognition algorithm, namely, an outline and a point on the image plane. Let γ and γ′ denote a training and a test outline of the same object (keep in mind that the relative camera positions are unknown). Consider the baseline connecting the camera centers of γ and γ′. This baseline yields a pair of epipoles: e in the image plane of γ, and e′ in the image plane of γ′. Since the training and the test images capture the same object and the two epipoles refer to the same line in space, we must have S(γ, e) = S(γ′, e′). Now, suppose that this equality holds for some particular γ, γ′, e, and e′. Then the two epipole positions define a baseline for which the two pictured outlines are consistent with a single object. If the feature space F is sufficiently high-dimensional, then a match of signature functions is a strong indication that two outlines belong to a single object. Here is an alternative way to think about matching: the images of the whole signature functions for γ and γ′, denoted Σ = S({γ} × P2) and Σ′ = S({γ′} × P2), form signature surfaces in the feature space F. If γ and γ′ come from the same object, then the intersection Σ ∩ Σ′
yields the consistency hypothesis between the training and test outlines, and its preimage yields the unique baseline joining the camera centers of the training and test images (Figure 1). Thus, a hypothesis of a possible match for recognition is equivalent to a hypothesis of the epipolar geometry of camera pairs in the scene.

So far, we have only discussed matching between a pair of outlines. In principle, since any two views of the same object can be connected by a baseline, it is always possible to match a novel view of an object given a single training image. In practice, however, a single view of an object may be ambiguous, and widely separated pairs of views may fail to produce descriptive features. For these reasons, we should collect training sequences consisting of a few representative views of each object. A recognition algorithm that works with multiple training outlines and a single test outline will have the following conceptual structure:

1. Training.
   a) Feature Extraction. For each training object i, acquire a training set of outlines Γi = {γij | j = 1, . . . , mi} and compute the signatures Σij = S({γij} × P2).
2. Testing.
   a) Feature Extraction. Acquire a test outline γ′ and compute the signature function Σ′ = S({γ′} × P2).
   b) Matching. For each training object i and outline j, compute the intersections Σ′ ∩ Σij. If γ′ is a view of object i, then each Σ′ ∩ Σij should consist of a unique feature vector. Otherwise, each Σ′ ∩ Σij should be empty.
3 Feature Space
Consider the set of all lines passing through the epipole that are also tangent to the contour. These lines back-project to planes passing through the baseline that are tangent to the object at isolated frontier points (Figure 2). The points of epipolar tangency on the image contour are projections of these frontier points, and it is well known that they are the only true stereo matches between a pair of view-dependent contours [8]. Even though a single image does not constrain the depth of frontier points in space, it is still possible to reconstruct the tangent epipolar planes by back-projecting the observed tangent epipolar lines. Notice that the epipolar planes are defined by the baseline and the geometry of the object — they do not depend on image plane orientations or on the exact positions of the camera centers along the baseline. Thus, we can derive feature vectors for baselines by computing the tangent epipolar planes associated with them. The kinds of features we can use — projective, affine, or Euclidean — depend on the imaging model we wish to adopt. In the following subsections, we briefly review these three models in turn. Along the way, refer to Figure 3 for an example of each kind of feature vectors.
Fig. 2. (a) Epipolar geometry of frontier points. The epipolar plane Π is defined by camera centers O and O′, and the frontier point P. Π intersects the image planes in two lines l and l′ that pass through the epipoles and are tangent to the respective contours at the projections p and q of the frontier point. (b) Planes Π1, Π2, Π3, and Π4 pass through the baseline and are tangent to the object at four frontier points. These planes intersect the image plane in epipolar tangents l1, l2, l3, and l4. Note that the epipole e is hypothetical — it does not correspond to a second camera center.
3.1 Projective Cameras
When the internal camera parameters are unknown, a pinhole camera allows us to measure only the properties of the image that remain invariant under projective transformations. In particular, projective measurements are sufficient to define the epipolar geometry between pairs of cameras. For any two cameras along a fixed baseline, the pencil of epipolar lines tangent to the contour is the projection of the same pencil of planes through the baseline. The cross-ratio of four tangent epipolar planes through this baseline is the same as the cross-ratio of the corresponding epipolar lines observed by any camera along the baseline. Given four or more lines, we can compute all possible cross-ratios between each quadruple of lines and store these cross-ratios in a feature vector.
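As an aside, the cross-ratio of four concurrent lines can be computed by cutting the pencil with any transversal line and taking the ordinary cross-ratio of the four intersection points; a minimal sketch (lines in homogeneous coordinates, names illustrative):

import numpy as np

def cross_ratio_of_pencil(lines, transversal):
    # lines: four concurrent lines (a, b, c) with a*x + b*y + c = 0.
    # transversal: a line, in the same representation, not through the common
    # point and not parallel to any of the four lines.
    pts = []
    for l in lines:
        p = np.cross(l, transversal)     # homogeneous intersection point
        pts.append(p[:2] / p[2])
    d = pts[1] - pts[0]                  # direction along the transversal
    t = [float(np.dot(p - pts[0], d)) for p in pts]   # affine parameters of the points
    t1, t2, t3, t4 = t
    return ((t1 - t3) * (t2 - t4)) / ((t1 - t4) * (t2 - t3))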
3.2 Affine Cameras
Affine cameras are cameras whose centers and focal planes are located on the plane at infinity in three-dimensional space [6]. Affine projection preserves parallelism and maps points on the plane at infinity to points on the line at infinity. The notion of epipolar geometry still applies to the affine case: the baseline between two affine cameras is a line at infinity, and since all the epipolar planes intersect in this line, they are actually parallel to each other. In the image, epipolar lines are also parallel, and the epipoles can be thought of as direction vectors. An affine epipole has only one degree of freedom, instead of the two degrees of freedom in the perspective case, and this reduces the intrinsic dimension of the feature space [9]. As for vectors of invariants in the feature space, they are naturally given by ratios of distances between parallel tangent epipolar planes.
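For the affine feature itself, a small sketch under the assumption that the parallel epipolar tangents are given as a*x + b*y + c_i = 0 with a shared normal (a, b): the perpendicular distance between two of them is |c_i − c_j| / hypot(a, b), and since only ratios of these distances are used, the common 1/hypot(a, b) factor cancels.

import numpy as np

def affine_distance_ratios(offsets):
    # offsets: the c_i of the parallel tangent lines, ordered along the pencil,
    # reference line first.  Returns Dist(l1, li) / Dist(l1, ln) for i = 2, ..., n-1.
    c = np.asarray(offsets, dtype=float)
    d = np.abs(c - c[0])
    return d[1:-1] / d[-1]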
[Figure 3 panels: (a) Cross-ratio, (b) Distance ratios, (c) Angles]
Fig. 3. An example of different kinds of feature vectors for an epipole with four tangents. (a) Projective (uncalibrated cameras): f = (Cross(l1, l2, l3, l4)). (b) Affine: f = (Dist(l1, l2)/Dist(l1, l4), Dist(l1, l3)/Dist(l1, l4)). (c) Euclidean (calibrated cameras): f = (Angle(l1, l2), Angle(l1, l3), Angle(l1, l4)). The angles are not between the lines themselves, but between the corresponding epipolar planes in space. Note that the different feature vectors have dimensions 1, 2, and 3, respectively.
3.3 Euclidean Cameras
Many reliable procedures exist for measuring the internal parameters of the camera (skew, magnification factors, and image center) [6]. Internal calibration gives us a mapping from points in the image plane to lines through the camera center in three-dimensional Euclidean space, expressed in a canonical coordinate system attached to the camera. The projection matrix of the internally calibrated camera may be written as M = K[I | 0], where K is the 3 × 3 matrix of internal parameters. Then for each line l tangent to the contour and passing through a particular epipole position we obtain a plane

Π = M^T l = (K^T l, 0)^T

in the canonical camera system. Given coordinate vectors of two tangent epipolar planes Π1 = M^T l1 and Π2 = M^T l2, we may measure their angle as

cos θ = l1^T (K K^T) l2 / ( sqrt( l1^T (K K^T) l1 ) · sqrt( l2^T (K K^T) l2 ) ) .    (1)
Given a contour γ and a fixed epipole position e, how do we define the corresponding value of the signature function? Consider the ordered set (l1 , . . . , ln ) of lines that pass through e and are tangent to the contour (the ordering is circular about e, with l1 serving as a specially chosen reference line). The planes formed by back-projecting the lines make up a corresponding ordered set, denoted (Π1 , . . . , Πn ). Let θi be the angle in space between Π1 and Πi+1 , computed according to (1). Finally, we are ready to define the value of the signature function as S(γ, e) = (θ1 , . . . , θn−1 ). In stating the above definition, we have left a few things deliberately unspecified. For one, the order of the angles in the feature vector depends on the
choice of the reference line and on the circular orientation convention (clockwise vs. counterclockwise). In addition, the number n of angles is not a global constant; it may vary for different contours and positions of the epipole. Because of self-occlusion, the number of tangent epipolar planes may actually vary for different camera positions along the same baseline. We will return to these issues in Sections 4.1 and 5.2. Overall, angles have significant advantages over cross-ratios as primitives making up the feature space. We need fewer measurements to compute angles — only two epipolar tangents suffice, whereas we need at least four to get a cross-ratio. As a result, the “calibrated” feature space has higher dimension than the “uncalibrated” one, which improves the ability to discriminate between different objects at recognition time. For the rest of the paper, we will focus on the calibrated case.
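A sketch of the calibrated feature computation: back-project each tangent line l to the epipolar plane M^T l and evaluate Equation (1) against the reference plane. The function below assumes the tangent lines are already ordered about the epipole with the reference line first; names are illustrative.

import numpy as np

def plane_angles(tangent_lines, K):
    # tangent_lines: list of image lines (a, b, c); K: 3x3 intrinsics matrix.
    # Returns (theta_1, ..., theta_{n-1}), the angles in space between the
    # reference epipolar plane and each of the others, using the metric K K^T
    # of Equation (1).
    M = K @ K.T
    l_ref = np.asarray(tangent_lines[0], dtype=float)
    angles = []
    for l in tangent_lines[1:]:
        l = np.asarray(l, dtype=float)
        c = (l_ref @ M @ l) / np.sqrt((l_ref @ M @ l_ref) * (l @ M @ l))
        angles.append(np.arccos(np.clip(c, -1.0, 1.0)))
    return np.array(angles)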
4 Properties of the Signature Function
In the following sections, we briefly describe the smoothness and continuity properties of the signature function, and present informal arguments about the existence and uniqueness of matches in feature space.

4.1 Critical Events
For the rest of this section, let us regard the contour γ as being fixed, so that the signature function depends only on the epipole position e. For a given e, the number of angles in the feature vector (θ1, . . . , θn−1) is one less than the number of epipolar tangents (l1, . . . , ln), and n is also a function of e. For generic epipole positions, a contour will have an even number of epipolar tangents, so the corresponding feature vector will have odd dimension. This dimension will remain constant if we perturb the epipole a little, unless the epipole lies on a critical event boundary: the contour itself, an inflectional tangent, or a bitangent to the contour. Crossing an inflectional tangent or the contour itself will make a pair of lines appear or disappear (increasing or decreasing the dimension of F by two), while crossing a bitangent will reverse the order of a pair of lines (giving no net change in dimension). Away from critical boundaries, however, the values of the angles (θ1, . . . , θn−1) vary smoothly as a function of the epipole position. Thus, even though the signature surface Σ ⊂ F may have a very complicated global structure, with different subsets immersed in spaces of a different dimension, its local structure is quite simple. If e is a generic epipole position giving rise to n epipolar tangents, then inside a sufficiently small neighborhood of e, S is an immersion of R2 into Rn−1.

4.2 Matching and the Intersection of Signature Surfaces
Let us take two contours γ and γ′ and consider the intersection Σ ∩ Σ′ of their signature surfaces, Σ = S({γ} × P2) and Σ′ = S({γ′} × P2). Take some f ∈ Σ ∩ Σ′
where F is locally m-dimensional (that is, f ∈ Rm). Then there exist e, e′ ∈ P2 such that f = S(γ, e) = S(γ′, e′). Moreover, we can find neighborhoods U and U′ of e and e′, respectively, where F = S({γ} × U) and F′ = S({γ′} × U′) are two-dimensional surfaces in Rm. If we expect F and F′ to intersect transversally, then additivity of codimension [5] yields the following:

m − dim(F ∩ F′) = (m − dim F) + (m − dim F′) = 2m − 4 ,
dim(F ∩ F′) = 4 − m .

Thus, for m > 4, any transversal intersection of two signature surfaces would have to be empty (note that since m can only be odd, we need not be concerned with the case m = 4). In other words, if we take two arbitrary contours γ and γ′ and a random feature vector f consisting of five or more angle values, we will not find e and e′ such that S(γ, e) = S(γ′, e′).

Observation 1. If the feature space has sufficiently high dimension, the possibility of “accidental” matches is in principle ruled out.

But what if γ and γ′ are outlines of the same object seen from two different viewpoints? Then the baseline connecting these viewpoints gives two true epipole positions e and e′. Clearly, there exists a unique set of planes that are tangent to the object and pass through this baseline. If we assume the object is transparent, then we will be able to observe exactly the same planes for γ and γ′ by looking at the respective epipolar tangents. In this case, we must have S(γ, e) = S(γ′, e′).

Observation 2. For transparent objects, signatures will always match at true epipole positions for two different views of the same object.

Combined, the two observations above suggest that a match between signature functions for two epipoles in two images indeed offers strong evidence of consistency between two outlines. As long as the dimension of the feature space is high enough, our ability to discriminate approaches the idealization of Section 2. Nevertheless, we cannot claim that a signature match exists if and only if γ and γ′ are two views of the same object, and e and e′ are the projections of the camera centers that produced γ′ and γ, respectively. Various non-generic properties of the contours, such as symmetries, may conspire to produce multiple signature matches. A far more important problem, however, is self-occlusion — as noted in Section 3.3, two camera positions along the same baseline may fail to see the same epipolar tangents, when some of them become obscured by parts of the surface. In the next sections, we discuss the implementation of our approach, and show how to deal with occlusion.
5 Implementation

5.1 Sampling of Epipoles
We represent signature functions for a small number of input pictures by sampling the two-dimensional space of possible epipoles. We find a set of uniformly
distributed viewing directions by recursively tessellating a unit sphere, and then project these directions onto the image plane (directions lying on the focal plane project to epipoles at infinity). Figure 4 (a) shows a tessellation of the sphere projected onto the image plane.
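A generic way to obtain such a sampling (not the authors' code) is to subdivide an icosahedron recursively, re-normalizing the vertices to the unit sphere, and then to map each viewing direction through the camera intrinsics; directions whose third coordinate is near zero correspond to epipoles near infinity and can be kept in homogeneous form.

import numpy as np

def icosphere_directions(levels=3):
    # Approximately uniform unit directions from a recursively subdivided icosahedron.
    t = (1.0 + np.sqrt(5.0)) / 2.0
    verts = [(-1, t, 0), (1, t, 0), (-1, -t, 0), (1, -t, 0),
             (0, -1, t), (0, 1, t), (0, -1, -t), (0, 1, -t),
             (t, 0, -1), (t, 0, 1), (-t, 0, -1), (-t, 0, 1)]
    faces = [(0, 11, 5), (0, 5, 1), (0, 1, 7), (0, 7, 10), (0, 10, 11),
             (1, 5, 9), (5, 11, 4), (11, 10, 2), (10, 7, 6), (7, 1, 8),
             (3, 9, 4), (3, 4, 2), (3, 2, 6), (3, 6, 8), (3, 8, 9),
             (4, 9, 5), (2, 4, 11), (6, 2, 10), (8, 6, 7), (9, 8, 1)]
    verts = [np.array(v, dtype=float) / np.linalg.norm(v) for v in verts]
    for _ in range(levels):
        cache, new_faces = {}, []
        def midpoint(i, j):
            key = (min(i, j), max(i, j))
            if key not in cache:
                m = verts[i] + verts[j]
                verts.append(m / np.linalg.norm(m))
                cache[key] = len(verts) - 1
            return cache[key]
        for a, b, c in faces:
            ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
            new_faces += [(a, ab, ca), (b, bc, ab), (c, ca, bc), (ab, bc, ca)]
        faces = new_faces
    return np.array(verts)

def directions_to_epipoles(directions, K):
    # Each row is a homogeneous image point K*d; a third coordinate near zero
    # indicates an epipole near infinity.
    return (K @ directions.T).T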
Fig. 4. (a) A synthetic image of a rhino with a projected 1313-vertex triangulation of the sphere overlaid. (b) Sample points from a 20609-vertex triangulation with 15-dimensional feature vectors. (c) Sample points with 17-dimensional feature vectors.
After choosing a sampling pattern of a given density, we find all epipolar tangents and compute the signature functions for each sample point. During a pre-processing step, contours are segmented using thresholding followed by Gaussian smoothing [9]. To facilitate the computation of epipolar tangents, the contours are represented as cubic B-splines. Recall from Section 4.1 that for different epipole positions, the number of epipolar tangents and the dimension of feature vectors vary as certain critical boundaries are crossed. Patterns of sample points with feature vectors of different dimension shown in Figures 4 (b) and (c) reveal these boundaries. To visualize the computed signature functions, refer to Figure 5: part (a) shows a plot of the largest angle in the feature vector as a function of the epipole, and part (b) shows some patches of a 15-dimensional signature surface projected into three dimensions.

5.2 Ordering of Feature Vectors
Let us return to the definition of a feature vector given in Section 3.3. At that stage, we have not committed to a choice of a reference plane Π1 or to the orientation of the circular ordering of planes around the baseline. However, for best recognition results, the choice of Π1 should not be arbitrary. Whenever the baseline passes outside the convex hull of the object, we can identify two extremal planes that make contact with the object only at the respective frontier points. These planes are robust to self-occlusion in opaque objects — they will necessarily be observed between any two views on the same baseline. By contrast, frontier points due to non-extremal planes can become occluded (in addition, segmentation algorithms usually miss parts of the contour that are visible, but
Fig. 5. (a) Angle (in radians) between extremal planes as a function of epipole position (in pixel coordinates) for the rhino image shown in Figure 4. Note that the extremal angle approaches π as the epipole approaches the contour. (b) A three-dimensional immersion of a 15-dimensional subset of the signature surface for the rhino contour.
fall inside the silhouette). For these reasons, our implementation arbitrarily selects one of the two extremal planes as the reference, and computes the angles (θ1, . . . , θn−1) with respect to this reference plane. While matching feature vectors from two images, it is impossible to determine whether the two reference planes correspond to each other in space, or whether the reference plane in one image corresponds to the second extremal plane in the other image. For each of the two possible orderings, we compute a matching cost as described in the next section, and select the smaller cost as the “winner”. When the baseline enters the convex hull of the object, there are no extremal planes. In this case, matching becomes more combinatorially complex, and more difficult to implement. However, since only a small fraction of all sample epipoles fails to produce extremal planes, excluding these points from matching has a negligible effect on performance.

5.3 Matching Feature Vectors
In the idealized recognition framework of Section 2, matching reduces to finding intersections of signature surfaces. Unfortunately, this formulation is difficult to implement directly. Since we are using a sampled representation of signature surfaces, we cannot locate exact matches by simply comparing discrete feature vectors. Also, Observation 2 of Section 4.2 is not true for opaque objects. Self-occlusion can make some tangent planes invisible from a particular camera position along the baseline, and introduce T-junctions that show up as false frontier points on the silhouette. Because of these effects, a successful matching algorithm must be able to compare feature vectors with different numbers of components and find subsequences of these vectors that minimize some matching cost. To
this end, we have implemented a dynamic programming algorithm that, given two feature vectors of length m and n, f = (θ1, . . . , θm) and f′ = (θ′1, . . . , θ′n), finds two subsequences of length k, f̃ = (θi1, . . . , θik) and f̃′ = (θ′j1, . . . , θ′jk), that minimize the average distance function

D(f̃, f̃′) = (1/k) Σ_{l=1}^{k} |θ_{i_l} − θ′_{j_l}| .    (2)
The subsequence length k can either be given to the matching algorithm as a parameter, or used as another optimization variable. The dynamic programming formulation is relatively efficient, and it is the natural way to enforce ordering constraints — e.g., the extremal angles always have to match, and the indices in the two subsequences must increase monotonically.
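A sketch of this dynamic program for a fixed subsequence length k; how the extremal-angle constraint and the optimization over k are handled is an implementation choice here, not taken from the authors' code.

import numpy as np

def subsequence_distance(f, g, k):
    # Minimum of (1/k) * sum |f[i_l] - g[j_l]| over strictly increasing index
    # subsequences of length k in both vectors (Equation (2)).  To force the
    # extremal angles to match, fix the first and last pairs before running this.
    f, g = np.asarray(f, float), np.asarray(g, float)
    m, n = len(f), len(g)
    best = np.full((k, m, n), np.inf)
    best[0] = np.abs(np.subtract.outer(f, g))            # cost of a single matched pair
    for l in range(1, k):
        # prefix[i, j] = min over i' <= i, j' <= j of best[l-1, i', j']
        prefix = np.minimum.accumulate(np.minimum.accumulate(best[l - 1], axis=0), axis=1)
        for i in range(1, m):
            for j in range(1, n):
                best[l, i, j] = prefix[i - 1, j - 1] + abs(f[i] - g[j])
    return best[k - 1].min() / k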
5.4 Matching Signature Surfaces
Once the matching score for a pair of feature vectors has been defined, we can proceed to match pairs of signature surfaces. In Section 4.2, we established that, provided the dimension of the feature space is sufficiently high, we can expect a unique match between two signature surfaces Σ = S({γ} × P2) and Σ′ = S({γ′} × P2). Thus, in the implementation, it is sufficient to look for a single pair of “closest” feature vectors. The signature matching cost is then simply

C(Σ, Σ′) = min_{f ∈ Σ, f′ ∈ Σ′} D(f, f′) .

In practice, because of measurement noise and discretization error due to sampling, C will not vanish even if Σ and Σ′ intersect. When comparing a test signature surface Σ′ to training surfaces Σ1, . . . , Σm, we can assign matches based on minimum cost:

Match(Σ′) = arg min_{Σi} C(Σi, Σ′) .    (3)
Let f = S(γ, e) and f′ = S(γ′, e′) be two feature vectors giving the lowest-cost match. The two points e and e′ represent a hypothesis of the epipolar geometry between the views that produced outlines γ and γ′. When full calibration is available, comparing the locations of these points to the true epipole positions allows us to verify the matching procedure. To conduct the verification, we experimented with a synthetic rhino data set (Figure 4 shows one of the rhino images). First, we computed matching costs for the signature of the true epipole in one view and the signatures of all sample points in a second view. Figure 6 shows the resulting plots for two different sampling resolutions. Well-defined local minima exist in the vicinity of the true epipoles, although the discrepancies between the minima and the true matches vary with the quality of the sampling. Next, we computed cost functions for all pairs of sample points whose viewing directions fall within 10◦ of the true epipoles. The results are summarized
[Figure 6 panels: View 1 and View 2 at 4° sampling; View 1 and View 2 at 1° sampling]
Fig. 6. Matching costs between true epipoles and sample points plotted on the sphere for directions within 10◦ of the true match. Darker shading indicates lower cost. The local minimum of the sampled cost function is marked with a cross, and the true epipole location is marked with a diamond.
in Table 1. Our experiments show that the minimum cost over all pairs of sampled feature vectors may be an order of magnitude larger than the cost for the true match (of course, the actual numerical values are an artifact of our definition of the cost function). However, as sampling density increases, the minimum cost computed over all pairs of sample points approaches the global minimum (compare rows 1 and 2 of the table). By examining row 3, we can also see that denser sampling improves the accuracy of hypothesized epipoles. Interestingly, though, the minimum cost match is not found at sample points that are closest to the true epipole directions. Overall, our results confirm that it is in principle possible to find reliable epipole estimates through matching signature surfaces — empirically, the cost of the true match always appears to be the global minimum. However, the actual quality of local minima found by our algorithm is dependent on the density of the sampling. Table 1. The effect of sampling density on local minima of the matching cost function. The third row shows the angle differences between true epipoles and minimum-cost sample points in views 1 and 2, respectively.
                             Sampling density 4°   Sampling density 2°   Sampling density 1°
                             (1313 points)         (5185 points)         (20509 points)
Actual min. cost (×10^4)     4.136                 4.136                 4.136
Sampled min. cost (×10^4)    20.078                12.147                4.803
Angle discrepancy            7.34° and 7.57°       5.41° and 4.99°       3.91° and 3.68°

5.5 Recognition
In this section, we present a matching experiment on a real data set containing two views each of three objects: a toy dinosaur, a gargoyle statuette, and a
cowboy (see Figure 7). The data set shows a substantial amount of self-occlusion: notice, for instance, the tail and the forelegs of the dinosaur, and the left ear and wing of the gargoyle. For each input picture, signature surfaces were computed at 4◦ resolution. As Table 1 indicates, the local minima of the cost function computed at this sampling density are not very reliable. To diffuse the sampling artifacts and to pool evidence from multiple locations in the cost landscape, we modified the matching criterion of Equation 3 to take the average of a fixed number of the lowest-cost feature matches. Specifically, to classify each outline, we computed the mean of the lowest 50 matching costs of its signature surface with the signature surfaces of every other outline, and picked the smallest-cost outline as the winner. Figure 8 presents the complete matching statistics. Note that each outline is correctly assigned to the other outline from the same object.
[Figure 7 panels: dino1, dino2, gar1, gar2, toy1, toy2]
Fig. 7. Outlines of three objects used in the recognition experiment.
Our recognition experiment allows us to draw several conclusions. First of all, dense sampling does not appear to be necessary for successful matching. Even though the lowest-cost matches may be far away from the true epipoles, the relative magnitudes of the costs give a good indication of proximity between different signature surfaces. Secondly, reliability of matching depends on the complexity of the contours. For instance, the toy outlines are the most complex, giving rise to the highest-dimensional signature surfaces. Feature vectors from these surfaces offer a large number of possible combinations for matching, raising the likelihood that a spurious low-cost match will be found.
6 Discussion and Conclusions
The preliminary implementation of Section 5 confirms the validity of our recognition framework, but it cannot serve as a prototype for a working real-world system. To make our method truly practical, we need to address several issues.
– Segmentation: since contour extraction is not the focus of the current paper, we assume that all input images can be segmented using naive techniques like thresholding. This restrictive assumption is not unique to our approach, but is common to most silhouette-based recognition or reconstruction schemes. Overall, the development of robust and general segmentation algorithms remains a significant challenge.
[Figure 8: six panels plotting matching cost (×10^4) for each test outline against dino1, dino2, gar1, gar2, toy1, toy2]
Fig. 8. Mean and standard deviation of matching costs for each test outline vs. all the other test outlines. The dashed horizontal lines indicate the lowest cost matches.
– Occlusion and clutter: the feature matching algorithm of Section 5.3 only deals with measurement noise and self-occlusion. We are currently modifying our framework to account for occlusion of the target object by other objects, and for segmentation errors due to background clutter.
– Efficiency: our implementation involves sampling two-dimensional sets of epipoles, and matching between all pairs of feature vectors in two images. We need to optimize these computationally expensive tasks, or develop alternative signature function representations and matching procedures.
One interesting extension of our approach is to combine the discrete feature matching procedure with nonlinear optimization methods that solve for camera motion based on frontier points [4]. The main problem with these methods is initialization — it is difficult to find an initial guess of epipole positions that would make the system converge to the right solution. We could start an optimization at the local minima produced by our matching algorithm, and use an iterative technique to improve the estimates of the epipoles.
Another important long-term direction is class-based object recognition. In this paper, we developed a representation framework that captures the geometric constraints between different views of a single object instance. A far more challenging question is, what geometric features derived from image data would allow us to classify pictures drawn from an object category? Developing algorithms that reason directly about 3D geometry, rather than about 2D image patterns, may be the key to building recognition systems that discriminate between classes of objects related by similarity of 3D shape. Acknowledgments. This work was supported in part by the Beckman Institute, the National Science Foundation under grants IRI-990709 and IIS 00-85980, and a UIUC-CNRS collaboration agreement. We would also like to thank Steve Sullivan for providing the data used in the experiment of Section 5.5.
References

1. R. Basri and S. Ullman, “The Alignment of Objects with Smooth Surfaces”, Proc. of Int. Conf. on Computer Vision, 1988, pp. 482-488.
2. S. Belongie, J. Malik and J. Puzicha, “Shape Matching and Object Recognition Using Shape Contexts”, to appear, IEEE Trans. on Pattern Analysis and Machine Intelligence, 24(3), 2002.
3. S. Carlsson, “Order Structure, Correspondence, and Shape Based Categories”, Proc. of Int. Workshop on Shape, Contour and Grouping, 1999, pp. 58-71.
4. R. Cipolla, K. Astrom and P.J. Giblin, “Motion from the Frontier of Curved Surfaces”, Proc. of IEEE Int. Conf. on Computer Vision, 1995, pp. 269-275.
5. V. Guillemin and A. Pollack, Differential Topology, Prentice Hall, Englewood Cliffs, New Jersey, 1974.
6. R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, Cambridge, 2000.
7. D. Jacobs, P. Belhumeur and I. Jermyn, “Judging Whether Multiple Silhouettes Can Come from the Same Object”, to appear, Proc. of the 4th Int. Workshop on Visual Form, 2001.
8. J. Porrill and S. Pollard, “Curve Matching and Stereo Calibration”, Image and Vision Computing, 9(1), 1991, pp. 45-50.
9. D. Renaudie, D. Kriegman and J. Ponce, “Duals, Invariants, and the Recognition of Smooth Objects from their Occluding Contour”, Proc. of Eur. Conf. on Computer Vision, 2000, pp. 784-798.
10. C. Schmid, “Constructing models for content-based image retrieval”, Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, 2001.
11. H. Schneiderman and T. Kanade, “A Statistical Method of 3D Object Detection Applied to Faces and Cars”, Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, 2000, pp. 746-751.
12. M. Weber, M. Welling and P. Perona, “Towards Automatic Discovery of Object Categories”, Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, 2000.
Estimating Human Body Configurations Using Shape Context Matching Greg Mori and Jitendra Malik Computer Science Division University of California at Berkeley Berkeley, CA 94720 {mori,malik}@cs.berkeley.edu
Abstract. The problem we consider in this paper is to take a single twodimensional image containing a human body, locate the joint positions, and use these to estimate the body configuration and pose in threedimensional space. The basic approach is to store a number of exemplar 2D views of the human body in a variety of different configurations and viewpoints with respect to the camera. On each of these stored views, the locations of the body joints (left elbow, right knee, etc.) are manually marked and labelled for future use. The test shape is then matched to each stored view, using the technique of shape context matching in conjunction with a kinematic chain-based deformation model. Assuming that there is a stored view sufficiently similar in configuration and pose, the correspondence process will succeed. The locations of the body joints are then transferred from the exemplar view to the test shape. Given the joint locations, the 3D body configuration and pose are then estimated. We can apply this technique to video by treating each frame independently – tracking just becomes repeated recognition! We present results on a variety of datasets.
1 Introduction
As indicated in Figure 1, the problem we consider in this paper is to take a single two-dimensional image containing a human body, locate the joint positions, and use these to estimate the body configuration and pose in three-dimensional space. Variants include the case of multiple cameras viewing the same human, tracking the body configuration and pose over time from video input, or analogous problems for other articulated objects such as hands, animals or robots. A robust, accurate solution would facilitate many different practical applications– e.g. see Table 1 in Gavrila’s survey paper[1]. From the perspective of computer vision theory, this problem offers an opportunity to explore a number of different tradeoffs – the role of low level vs. high level cues, static vs. dynamic information, 2D vs. 3D analysis, etc. in a concrete setting where it is relatively easy to quantify success or failure. There has been considerable previous work on this problem [1]. Broadly speaking, it can be categorized into two major classes. The first set of approaches use a 3D model for estimating the positions of articulated objects.
Fig. 1. The goal of this work. (a) Input image. (b) Automatically extracted keypoints. (c) 3D rendering of estimated body configuration. In this paper we present a method to go from (a) to (b) to (c).
Pioneering work was done by O’Rourke and Badler [2], Hogg[3] and Yamamoto and Koshikawa [4]. Rehg and Kanade [5] track very high DOF articulated objects such as hands. Bregler and Malik [6] use optical flow measurements from a video sequence to track joint angles of a 3D model of a human, using the product of exponentials representation for the kinematic chain. Kakadiaris and Metaxas[7] use multiple cameras and match occluding contours with projections from a deformable 3D model. Gavrila and Davis [8] is another 3D model based tracking approach, as is the work of Rohr [9] for tracking walking pedestrians. It should be noted that pretty much all the tracking methods require a hand-initialized first video frame. The second broad class of approaches does not explicitly work with a 3D model, rather 2D models trained directly from example images are used. There are several variations on this theme. Baumberg and Hogg[10] use active shape models to track pedestrians. Wren et al. [11] track people as a set of colored blobs. Morris and Rehg [12] describe a 2D scaled prismatic model for human body registration. Ioffe and Forsyth [13] perform low-level processing to obtain candidate body parts and then use a mixture of trees to infer likely configurations. Song et al. [14] use a similar technique involving feature points and inference on a tree model. Toyama and Blake [15] use 2D exemplars to track people in video sequences. Brand [16] learns a probability distribution over pose and velocity configurations of the moving body and uses it to infer paths in this space. Carlsson [17,18] uses order structure to compare exemplar shapes with test images. In our previous work [19] we used shape context matching to localize keypoints directly. In this paper we consider the most basic version of the problem–estimating the 3D body configuration based on a single uncalibrated 2D image. The basic idea is to store a number of exemplar 2D views of the human body in a variety of different configurations and viewpoints with respect to the camera. On each of these stored views, the locations of the body joints (left elbow, right knee, etc.) are manually marked and labelled for future use. This is the only user input required in our method. The test shape is then matched to each stored view, using
the shape context matching technique of Belongie, Malik and Puzicha [20,21]. This technique is based on representing a shape by a set of sample points from the external and internal contours of an object, found using an edge detector. Assuming that there is a stored view “sufficiently” similar in configuration and pose, the correspondence process will succeed. The locations of the body joints are then “transferred” from the exemplar view to the test shape. Given the joint locations, the 3D body configuration and pose are estimated using the algorithm of Taylor [22]. The structure of the paper is as follows. In section 2 we describe the correspondence process mentioned above. Section 3 provides details on a parts-based extension to our keypoint estimation method. We describe the 3D estimation algorithm in section 4. We show experimental results in section 5. Finally, we conclude in section 6.
2 Estimation Method
In this section we provide the details of the configuration estimation method proposed above. We first obtain a set of boundary sample points from the image. Next, we estimate the 2D image positions of 14 keypoints (hands, elbows, shoulders, hips, knees, feet, head and waist) on the image by deformable matching to a set of stored exemplars that have hand-labelled keypoint locations. These estimated keypoints can then be used to construct an estimate of the 3D body configuration in the test image.

2.1 Deformable Matching Using Shape Contexts
Given an exemplar (with labelled keypoints) and a test image, we cast the problem of keypoint estimation in the test image as one of deformable matching. We attempt to deform the exemplar (along with its keypoints) into the shape of the test image. Along with the deformation, we compute a matching score to measure similarity between the deformed exemplar and the test image.

In our approach, a shape is represented by a discrete set of n points P = {p1, . . . , pn}, pi ∈ IR2, sampled from the internal and external contours on the shape. We first perform Canny edge detection [23] on the image to obtain a set of edge pixels on the contours of the body. We then sample some number of points (around 300 in our experiments) from these edge pixels to use as the sample points for the body. Note that this process will give us not only external, but also internal contours of the body shape. The internal contours are essential for estimating configurations of self-occluding bodies. See Figure 2 for examples of sample points.

The deformable matching process consists of three steps. Given sample points on the exemplar and test image:
1. Obtain correspondences between exemplar and test image sample points
2. Estimate deformation of exemplar
3. Apply deformation to exemplar sample points
Fig. 2. Iterations of Deformable matching. Column (a) shows sample points from the two figures to be matched. The bottom figure (exemplar) in (a) is deformed into the shape of the top figure (test image). Columns (b,c,d) show successive iterations of deformable matching. The top row shows the correspondences obtained through the shape context matching. The bottom row shows the deformed exemplar figure at each step. The right arm of the exemplar is deformed to match the right arm of the test image. The left thigh of the figure is also correctly deformed. However, the left lower leg is not deformed properly, due to a failure in the correspondence procedure.
We perform a small number (4 in experiments) of iterations of this process to match an exemplar to a test image. Figure 2 illustrates this process. Sample Point Correspondences In the correspondence phase, for each point pi on a given shape, we want to find the “best” matching point qj on another shape. This is a correspondence problem similar to that in stereopsis. Experience there suggests that matching is easier if one uses a rich local descriptor. Rich descriptors reduce the ambiguity in matching. The shape context was introduced in [20,21] to play such a role in shape matching. Consider the set of vectors originating from a point to all other sample points on a shape. These vectors express the configuration of the entire shape relative to the reference point. Obviously, this set of n − 1 vectors is a rich description, since as n gets large, the representation of the shape becomes exact. The full set of vectors as a shape descriptor is much too detailed since shapes and their sampled representation may vary from one instance to another in a category. The distribution over relative positions is a more robust and compact,
Fig. 3. Shape contexts. (a,b) Sampled edge points of two shapes. (c) Diagram of logpolar histogram bins used in computing the shape contexts. We use 5 bins for log r and 12 bins for θ. (d-f) Example shape contexts for reference samples marked by ◦, , in (a,b). Each shape context is a log-polar histogram of the coordinates of the rest of the point set measured using the reference point as the origin. (Dark=large value.) Note the visual similarity of the shape contexts for ◦ and , which were computed for relatively similar points on the two shapes. By contrast, the shape context for is quite different.
yet highly discriminative descriptor. For a point pi on the shape, compute a coarse histogram hi of the relative coordinates of the remaining n − 1 points,

hi(k) = #{ q ≠ pi : (q − pi) ∈ bin(k) } .

This histogram is defined to be the shape context of pi. The descriptor should be more sensitive to differences in nearby pixels, which suggests the use of a log-polar coordinate system. An example is shown in Fig. 3(c). Note that the scale of the bins for log r is chosen adaptively, on a per-shape basis. This makes the shape context feature invariant to scaling. As in [20,21], we use χ2 distances between shape contexts as a matching cost between sample points.

We would like a correspondence between sample points on the two shapes that enforces the uniqueness of matches. This leads us to formulate our matching of a test body to an exemplar body as an assignment problem (also known as the weighted bipartite matching problem) [24]. We find an optimal assignment between sample points on the test body and those on the exemplar. To this end we construct a bipartite graph. The nodes on one side represent sample points on the test body, on the other side the sample points on the exemplar. Edge weights between nodes in this bipartite graph represent the
costs of matching sample points. Similar sample points will have a low matching cost, dissimilar ones will have a high matching cost. ε-cost outlier nodes are added to the graph to account for occluded points and noise: sample points missing from a shape can be assigned to be outliers for some small cost ε. We use the assignment problem solver in [25] to find the optimal matching between the sample points of the two bodies. Note that the output of more specific filters, such as face or hand detectors, could easily be incorporated into this framework. The matching cost between sample points can be measured in many ways.
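A sketch of this pipeline (log-polar histograms, χ2 costs, and a one-to-one assignment), using SciPy's Hungarian solver in place of the assignment solver cited above; bin counts, radial scaling, and the constant outlier cost are illustrative choices, not the authors' settings.

import numpy as np
from scipy.optimize import linear_sum_assignment

def shape_contexts(points, n_r=5, n_theta=12):
    # Log-polar histogram of relative positions for every sample point; radial
    # bins are scaled by the mean pairwise distance, giving scale invariance.
    P = np.asarray(points, dtype=float)
    diff = P[None, :, :] - P[:, None, :]
    r = np.hypot(diff[..., 0], diff[..., 1])
    theta = np.arctan2(diff[..., 1], diff[..., 0]) % (2 * np.pi)
    r_edges = r[r > 0].mean() * np.logspace(-1.0, 0.5, n_r + 1)
    H = np.zeros((len(P), n_r * n_theta))
    for i in range(len(P)):
        mask = np.arange(len(P)) != i
        rb = np.digitize(r[i, mask], r_edges) - 1
        tb = np.minimum((theta[i, mask] / (2 * np.pi / n_theta)).astype(int), n_theta - 1)
        ok = (rb >= 0) & (rb < n_r)
        np.add.at(H[i], rb[ok] * n_theta + tb[ok], 1.0)
        H[i] /= max(H[i].sum(), 1.0)
    return H

def chi2_costs(H1, H2, eps=1e-9):
    # Chi-squared distance between every pair of shape-context histograms.
    num = (H1[:, None, :] - H2[None, :, :]) ** 2
    den = H1[:, None, :] + H2[None, :, :] + eps
    return 0.5 * (num / den).sum(axis=-1)

def match_points(H1, H2, outlier_cost=0.3):
    # One-to-one correspondences; padding rows/columns with a fixed small cost
    # plays the role of the epsilon-cost outlier nodes.
    C = chi2_costs(H1, H2)
    n = max(C.shape)
    padded = np.full((n, n), outlier_cost)
    padded[:C.shape[0], :C.shape[1]] = C
    rows, cols = linear_sum_assignment(padded)
    return [(i, j) for i, j in zip(rows, cols) if i < C.shape[0] and j < C.shape[1]]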
Fig. 4. The deformation model. (a) Underlying kinematic chain. (b) Automatic assignment of sample points to kinematic chain segments on an exemplar. Each different symbol denotes a different chain segment. (c) Sample points deformed using the kinematic chain.
In order to estimate a deformation or deform a body’s sample points, we must know to which kinematic chain segment each sample point belongs. On the exemplars we have hand-labelled keypoints – we use these to automatically assign the hundreds of sample points to segments.
Since we know the segment S(pi) that each exemplar sample point pi belongs to, given correspondences {(pi, p′i)} we can estimate a deformation D of the points {pi}. Our deformation process starts at the torso. We find the best least-squares estimate of the translation for the sample points on the torso:

D_torso = T̂ = arg min_T Σ_{pi : S(pi)=torso} ||T(pi) − p′i||^2 ,
T̂ = (1/N) Σ_{pi : S(pi)=torso} (p′i − pi) , where N = #{pi : S(pi) = torso} .

Subsequent segments along the kinematic chain have rotational joints. We again obtain the best least-squares estimates, this time for the rotations of these joints. Given previous deformation Dprev along the chain up to this segment, we estimate Dlimbj as the best rotation around the joint location cj:

Pj = {pi : S(pi) = limbj} ,
Dlimbj = R_{θ̂,cj} = arg min_{R_{θ,cj}} Σ_{pi ∈ Pj} ||R_{θ,cj}(Dprev · pi) − p′i||^2 .

The best angle of rotation θ̂ for this joint is then found as:

θ̂ = arg min_θ Σ_{pi ∈ Pj} ||R_θ(Dprev · pi − cj) − (p′i − cj)||^2 ,
θ̂ = arctan( Σ_i (q_ix q′_iy − q_iy q′_ix) / Σ_i (q_ix q′_ix + q_iy q′_iy) ) ,
where qi = Dprev · pi − cj and q′i = p′i − cj .

Steps 2 and 3 in our deformable matching framework are performed in this manner. We estimate deformations for each segment of our kinematic chain model, and apply them to the sample points belonging to each segment.

We have now provided a method for estimating a set of keypoints using a single exemplar, along with an associated score (the sum of shape context matching costs for the optimal assignment). The simplest method for choosing the best keypoint configuration in a test image is to find the exemplar with the best score, and use the keypoints predicted using its deformation as the estimated configuration. However, with this simple method there are concerns about the number of exemplars needed for a general matching framework. In the following section we will address this by combining matching results for multiple exemplars.
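The two least-squares estimates above have simple closed forms; a sketch (array names illustrative, with p the already-deformed exemplar points and p_prime their corresponding test points):

import numpy as np

def torso_translation(p, p_prime):
    # T_hat = (1/N) * sum(p'_i - p_i) over the torso sample points.
    return (np.asarray(p_prime, float) - np.asarray(p, float)).mean(axis=0)

def joint_rotation(p, p_prime, c):
    # Closed-form least-squares rotation angle about the joint location c:
    # theta_hat = arctan2(sum(q_x q'_y - q_y q'_x), sum(q_x q'_x + q_y q'_y)),
    # with q_i = p_i - c and q'_i = p'_i - c.
    q = np.asarray(p, float) - c
    qp = np.asarray(p_prime, float) - c
    num = np.sum(q[:, 0] * qp[:, 1] - q[:, 1] * qp[:, 0])
    den = np.sum(q[:, 0] * qp[:, 0] + q[:, 1] * qp[:, 1])
    return np.arctan2(num, den)

def rotate_about(points, c, theta):
    # Apply the estimated rotation R_theta about c to a limb's sample points.
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return (np.asarray(points, float) - c) @ R.T + c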
3 Using Part Exemplars
Given a set of exemplars, we can choose to match either entire exemplars or parts, such as limbs, to a test image. The advantage of a parts-based approach that
matches limbs is that of compositionality, which saves us from an exponential explosion in the required number of exemplars. Consider the case of a person walking while holding a briefcase in one hand. If we already have exemplars for a walking motion, and a single exemplar for holding an object in the hand, we can combine these exemplars to produce correct matching results. However, if we were forced to use entire exemplars, we would require a different “holding object and walking” exemplar for each portion of the walk cycle. Using part exemplars prevents the total number of exemplars from growing to an unwieldy size. As long as we can ensure that the composition of part exemplars yields an anatomically correct configuration, we will benefit from this reduced number of exemplars. The matching process is identical to that presented in the preceding section. For each exemplar, we deform it to the shape of the test image. However, instead of assigning a total score for an exemplar, we give a separate score for each limb on the exemplar. This is done by simply summing the shape context costs for sample points on a limb. With N exemplars we now have N estimates for the location of each of the 6 “limbs” (arms, legs, head, waist). We have to find the “best” combination of these estimates. It is not sufficient to simply choose each limb independently as the one with the best score; there would be nothing to prevent us from violating underlying anatomical constraints. For example, the left leg could be found hovering across the image, disjoint from the rest of the body. We need to enforce the consistency of the final configuration. Consider again the case of using part exemplars to match the figure of a person walking while holding a briefcase. Given a match for the arm grasping the briefcase, and matches for the rest of the body, we know that there are constraints on the distance between the shoulder of the grasping arm and the rest of the body. Motivated by this, the measure of consistency we use is the 2D image distance between the bases b(l_i) (shoulder for the arms, hip for the legs) of limbs l_i. We form a tree structure by connecting the arms and the waist to the head, and the legs to the waist. For each of the 5 links in this tree, we compute the N² 2D image distances between all pairs of bases of limbs from the different exemplars. We now make use of the fact that each whole exemplar on its own is consistent. For a pair of limbs (l_i^1, l_j^2) – limb 1 from exemplar i and limb 2 from exemplar j – we compare the distance d^{12}_{ij} between the bases of the limbs with the same distances using limbs 1 and 2 on the same exemplar. We define the consistency cost of using this pair of limbs (l_i^1, l_j^2) together in matching a test image as the average of the two differences:

    d^{12}_{ij} = ‖b(l_i^1) − b(l_j^2)‖,
    Con^{12}_{ij} = ( |d^{12}_{ij} − d^{12}_{ii}| + |d^{12}_{ij} − d^{12}_{jj}| ) / 2.   (1)

The consistency cost Con_{ii} for using limbs from the same exemplar across a tree link is zero. As the configuration begins to deviate from the consistent exemplars, Con_{ij} increases. We define the total cost of a configuration
c ∈ {1, 2, . . . , N}^6 as the weighted sum of consistency and shape context limb scores:

    Score(c) = (1 − w_con) Σ_{i=1}^{6} SC_{c(i)} + w_con Σ_{links (l_u, l_v)} Con^{uv}_{c(u)c(v)}.
There are N^6 possible combinations of limbs from the N exemplars. However, we can find the optimal configuration in O(5N^2) time using a dynamic programming algorithm along the tree structure (dynamic programming on trees is a well-studied problem in the theory community; for an early example of such methods in the computer vision literature, refer to [26]). Moreover, an extension to our algorithm can produce the top K matches for a given test image. Preserving the ambiguity in this form, instead of making an instant choice, is particularly advantageous for tracking applications, where temporal consistency can be used as an additional filter.

    pre-processing:
      % Compute shape contexts for exemplars.
      % Click 14 keypoint locations on each exemplar.
    matching:
      % Compute shape contexts for test image I.
      foreach exemplar E_i
        [Limbs_i, Scores_i] = match(E_i, I)
      foreach link L_w = (l_u, l_v)
        foreach exemplar E_i
          foreach exemplar E_j
            D(i, j, w) = Con^{uv}_{ij}   % See equation (1)
      Sols = dynamic-solve(Limbs, Scores, D, waist)
      Sol = min(Sols)

    dynamic-solve(Limbs, Unary, Binary, root):
      foreach link L_w = (root, next)
        Sols_rec(w) = dynamic-solve(Limbs, Unary, Binary, next)
      foreach exemplar E_i
        % Find best set of nodes from Sols_rec, while using E_i as root.
        ŝ = best(Sols_rec, E_i, Binary(root), Unary(root))
        Sols(i) = ŝ
      return Sols
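To make the O(5N²) search concrete, here is a minimal min-sum dynamic program over the limb tree. It is an illustrative sketch rather than the authors' implementation; the array layout, the `con` dictionary keyed by tree links, and the choice of limb indices (root assumed to be index 0) are assumptions.

    import numpy as np

    def best_limb_combination(scores, con, links, w_con=0.5):
        """Pick one exemplar per limb by dynamic programming on the limb tree.

        scores : (6, N) shape-context cost of limb i taken from exemplar k.
        con    : dict mapping a link (u, v) to an (N, N) consistency-cost matrix.
        links  : list of (parent_limb, child_limb) pairs forming the tree (5 links).
        """
        n_limbs, N = scores.shape
        unary = (1.0 - w_con) * scores          # running cost per (limb, exemplar)
        children = {u: [] for u in range(n_limbs)}
        for u, v in links:
            children[u].append(v)

        best_child = {}
        def solve(limb):
            # Fold each child's best achievable cost into this limb's unary term.
            for child in children[limb]:
                solve(child)
                pair = w_con * con[(limb, child)] + unary[child][None, :]   # (N, N)
                best_child[(limb, child)] = pair.argmin(axis=1)
                unary[limb] += pair.min(axis=1)
            return unary[limb]

        root = 0                                # assumed root limb (e.g. the waist)
        root_costs = solve(root)
        choice = np.empty(n_limbs, dtype=int)
        choice[root] = root_costs.argmin()

        def backtrack(limb):
            for child in children[limb]:
                choice[child] = best_child[(limb, child)][choice[limb]]
                backtrack(child)
        backtrack(root)
        return choice, root_costs.min()

The work is one N×N matrix per tree link, hence O(5N²) overall; keeping the per-exemplar root costs instead of only the minimum would yield the top-K extension mentioned above.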
4 Estimating 3D Configuration
We use Taylor’s method [22] to estimate the 3D configuration of a body given the keypoint position estimates. Taylor’s method works on a single 2D image, taken with an uncalibrated camera. It assumes that we know:
1. the image coordinates of keypoints (u, v),
2. the relative lengths l of the body segments connecting these keypoints,
3. a labelling of “closer endpoint” for each of these body segments,
4. that we are using a scaled orthographic projection model for the camera.
We can then solve for the 3D configuration of the body {(X_i, Y_i, Z_i) : i ∈ keypoints} up to some ambiguity in scale s. The method considers the foreshortening of each body segment to construct the estimate of body configuration. For each pair of body segment endpoints, we have the following equations:

    l² = (X_1 − X_2)² + (Y_1 − Y_2)² + (Z_1 − Z_2)²
    (u_1 − u_2) = s(X_1 − X_2)
    (v_1 − v_2) = s(Y_1 − Y_2)
    dZ = (Z_1 − Z_2)
    ⟹ dZ = sqrt( l² − ((u_1 − u_2)² + (v_1 − v_2)²)/s² ).

To estimate the configuration of a body, we first fix one keypoint as the reference point and then compute the positions of the others with respect to the reference point. Since we are using a scaled orthographic projection model, the X and Y coordinates are known up to the scale s. All that remains is to compute the relative depths dZ of the endpoints. We compute the amount of foreshortening, and use the user-supplied “closer endpoint” labels from the closest matching exemplar to solve for the relative depths. Moreover, Taylor notes that the minimum scale s_min can be estimated from the fact that dZ cannot be complex:

    s ≥ sqrt( (u_1 − u_2)² + (v_1 − v_2)² ) / l.

This minimum value is a good estimate for the scale since one of the body segments is often perpendicular to the viewing direction.
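A small numeric sketch of the per-segment depth computation above. It is illustrative only; the keypoint array, segment list, relative lengths and closer-endpoint labels are assumed inputs.

    import numpy as np

    def taylor_relative_depths(uv, segments, lengths, closer, s=None):
        """uv: (K, 2) image keypoints; segments: list of (i, j) index pairs;
        lengths: relative 3D segment lengths; closer: +1 if endpoint i is closer, else -1."""
        lengths = np.asarray(lengths, dtype=float)
        d_img = np.array([np.linalg.norm(uv[i] - uv[j]) for i, j in segments])
        if s is None:
            # Minimum admissible scale: keeps every dZ real (Taylor's estimate).
            s = np.max(d_img / lengths)
        dZ = np.sqrt(np.maximum(lengths ** 2 - (d_img / s) ** 2, 0.0))
        return np.array(closer) * dZ, s

The signed relative depths can then be accumulated along the kinematic tree from the reference keypoint to obtain a Z coordinate for every keypoint, up to the overall scale s.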
5 Results

5.1 Pose File
We applied our method to the human figures in Eguchi’s pose file collection [27]. This book contains a thorough collection of artist reference photographs of one model performing typical actions. The book depicts actions such as skipping, jumping, crawling, and walking. Each action is photographed from about 8 camera angles each, and performed in 2 levels of clothing (casual and skirt). We selected 18 training images (two from each clothing-action pair), to be used as exemplars. These exemplars were then used to match with 36 different test images (four from each clothing-action pair). The positions of 14 keypoints (hands, elbows, shoulders, feet, knees, hips, waist and head) were manually labelled on each training image. We automatically
Fig. 5. Example renderings. (a) Original image with located keypoints. (b) 3D rendering (green is left, red is right).
locate the 2D positions of these keypoints in each of our test images, and then estimate a 3D configuration. Figure 5 shows some example 3D configurations that were obtained using our method.
Fig. 6. Distributions of error in 2D location of keypoints. (a) Hands, (b) Feet, (c) Elbows, (d) Knees, (e) Shoulders, (f) Hips, (g) Head, (h) Waist. Error (X-axis) is measured in terms of head-lengths (average length of head in 2D image). Y-axis shows fraction of keypoints in each bin. The average image size is 380 by 205 pixels. Large errors in positions are due to ambiguities regarding left and right limbs.
Fig. 7. Frames of speed skater sequence. The top left image shows an example of sample points extracted using edge detection. The rest of the images show the estimated body configuration of every 3rd frame of the sequence.
Figure 6 shows distributions of error in the location of the 2D keypoints. Ground truth was obtained through user-clicking. The large errors (especially for hands and feet) can be attributed to ambiguities between left and right limbs. In tracking applications, these ambiguities could be resolved by:
1. for each frame returning the K best solutions instead of just the best one,
2. exploiting temporal consistency by using an HMM or other dynamic model.

5.2 Speed Skating
We also applied our method to a sequence of video frames of an Olympic speed skater (Figure 7). We chose 5 frames for use as exemplars, upon which we hand-labelled keypoint locations. We then applied our method for configuration estimation to a sequence of 20 frames. Each frame was processed independently – no dynamics were used, and no temporal consistency was enforced.
6 Conclusion
In this paper we have presented a simple, yet apparently effective, approach to estimating human body configurations in 3D. Our method first matches using 2D exemplars, then estimates keypoint locations, and finally uses these keypoints in a model-based algorithm for determining 3D body configuration. Our method requires minimal user input, only the locations of 14 keypoints on each exemplar. There are several obvious directions for future work: 1. The 2D shape matching could make use of additional attributes such as distances from labelled features such as faces or hands. 2. When video data are available, then estimation can benefit from temporal context. Human dynamic models are most naturally expressed in joint angle space, and our framework provides a natural way to incorporate this information in the 3D configuration estimation stage. Acknowledgements. This research was supported by Office of Naval Research grant no. N00014-01-1-0890, as part of the MURI program.
References 1. Gavrila, D.M.: The visual analysis of human movement: A survey. Computer Vision and Image Understanding: CVIU 73 (1999) 82–98 2. O’Rourke, J., Badler, N.: Model-based image analysis of human motion using constraint propagation. IEEE Trans. PAMI 2 (1980) 522–536 3. Hogg, D.: Model-based vision: A program to see a walking person. Image and Vision Computing 1 (1983) 5–20 4. Yamamoto, M., Koshikawa, K.: Human motion analysis based on a robot arm model. CVPR (1991) 664–665
5. Rehg, J., Kanade, T.: Visual tracking of high dof articulated structures: An application to human hand tracking. Proc. of 3rd ECCV II (1994) 35–46 6. Bregler, C., Malik, J.: Tracking people with twists and exponential maps. Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn. (1998) 8–15 7. Kakadiaris, I., Metaxas, D.: Model-based estimation of 3d human motion. IEEE Trans. PAMI 22 (2000) 1453–1459 8. Gavrila, D., Davis, L.: 3d model-based tracking of humans in action: A multi-view approach. IEEE Computer Society CVPR (1996) 73–80 9. Rohr, K.: Incremental recognition of pedestrians from image sequences. In: CVPR93. (1993) 8–13 10. Baumberg, A., Hogg, D.: Learning flexible models from image sequences. Lecture Notes in Computer Science 800 (1994) 299–308 11. Wren, C., Azarbayejani, A., Darrell, T., Pentland, A.: Pfinder: Real-time tracking of the human body. IEEE Trans. PAMI 19 (1997) 780–785 12. Morris, D., Rehg, J.: Singularity analysis for articulated object tracking. In: Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn. (1998) 289–296 13. Ioffe, S., Forsyth, D.: Human tracking with mixtures of trees. In: Proc. 8th Int. Conf. Computer Vision. Volume 1. (2001) 690–695 14. Song, Y., Goncalves, L., Perona, P.: Monocular perception of biological motion clutter and partial occlusion. In: Proc. 6th Europ. Conf. Comput. Vision. (2000) 15. Toyama, K., Blake, A.: Probabilistic exemplar-based tracking in a metric space. In: Proc. 8th Int. Conf. Computer Vision. Volume 2. (2001) 50–57 16. Brand, M.: Shadow puppetry. Proc. 7th Int. Conf. Computer Vision (1999) 1237– 1244 17. Carlsson, S.: Order structure, correspondence and shape based categories. In: Shape Contour and Grouping in Computer Vision. Springer LNCS 1681 (1999) 58–71 18. Carlsson, S., Sullivan, J.: Action recognition by shape matching to key frames. Workshop on Models versus Exemplars in Computer Vision at CVPR (2001) 19. Mori, G., Malik, J.: Estimating human body configurations using shape context matching. Workshop on Models versus Exemplars in Computer Vision at CVPR (2001) 20. Belongie, S., Malik, J., Puzicha, J.: Matching shapes. In: Eighth IEEE International Conference on Computer Vision. Volume 1., Vancouver, Canada (2001) 454–461 21. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE Trans. PAMI (2002) (in press). 22. Taylor, C.J.: Reconstruction of articulated objects from point correspondences in a single uncalibrated image. CVIU 80 (2000) 349–363 23. Canny, J.: A computational approach to edge detection. IEEE Trans. PAMI 8 (1986) 679–698 24. Papadimitriou, C., Stieglitz, K.: Combinatorial Optimization: Algorithms and Complexity. Prentice Hall (1982) 25. Jonker, R., Volgenant, A.: A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing 38 (1987) 325–340 26. Amit, Y., Kong, A.: Graphical templates for model registration. IEEE Trans. PAMI (1996) 27. Eguchi, H.: Moving Pose 1223. Bijutsu Shuppan-sha (1995)
Probabilistic Human Recognition from Video

Shaohua Zhou and Rama Chellappa

Center for Automation Research (CfAR), Department of Electrical and Computer Engineering, University of Maryland, College Park, MD 20740
{shaohua, rama}@cfar.umd.edu
Abstract. This paper presents a method for incorporating temporal information in a video sequence for the task of human recognition. A time series state space model, parameterized by a tracking state vector and a recognizing identity variable, is proposed to simultaneously characterize the kinematics and identity. Two sequential importance sampling (SIS) methods, a brute-force version and an efficient version, are developed to provide numerical solutions to the model. The joint distribution of both state vector and identity variable is estimated at each time instant and then propagated to the next time instant. Marginalization over the state vector yields a robust estimate of the posterior distribution of the identity variable. Due to the propagation of identity and kinematics, a degeneracy in posterior probability of the identity variable is achieved to give improved recognition. This evolving behavior is characterized using changes in entropy. The effectiveness of this approach is illustrated using experimental results on low resolution face data and upper body data.
1 Introduction
Human recognition has long been an active research area, using a variety of biometrics such as fingerprints, retinal or iris scans, the face, and the human body. The human recognition problem can be stated as follows: given a gallery of still templates and a probe – either a still image or a video sequence containing a certain biometric – determine the closest template in the gallery to the one in the probe, or rank the templates in the gallery according to a similarity measure. Significant research has been conducted on still-to-still human recognition [12,7], where the probe is a still image. Usually, an abstract representation of an image is formed after a suitable geometric and photometric registration, and recognition is then performed based on this new representation. However, research efforts using video as the probe, or still-to-video recognition, are relatively few due to the following challenges [16] in typical applications like surveillance and access control: poor video quality, large illumination and
This work was completed with the support of the DARPA HumanID Grant N0001400-1-0908. All correspondence should be addressed to [email protected].
pose variations, and low image resolution. Most video-based recognition systems [2] do the following: the face or the body is first detected and then tracked over time. Only when a frame satisfying certain criteria (size, pose, etc) is acquired, recognition is performed using a still-to-still recognition technique. For this, the biometric part (face, whole body, etc) is cropped from the frame and transformed or registered with appropriate parameters. There are several unresolved issues in the above recognition systems: criteria for selecting good frames and estimation of parameters for registration. Also, still-to-still recognition does not effectively exploit temporal information. A common strategy that selects several good frames to perform recognition per frame, then votes on these recognition results for a final solution is rather ad hoc. In this paper, we attempt to overcome these difficulties with a unified probabilistic framework. We formulate human recognition as a hypothesis-testing problem and derive the posterior probability of each hypothesis given the observation, which contains a transformed and noise-corrupted version of one hypothesis. To accommodate the above requirement and video sequences, we propose a time series state space model that is parameterized by a tracking state vector (e.g. affine transform parameter) and an identity variable. The model consists of three equations: – a state equation governing the temporal behavior of the kinematics vector, – an identity equation governing temporal evolution of the identity variable, – an observation equation at each time instant. Using the SIS technique, the joint distribution of state vector and identity variable is estimated at each time instant and then propagated to the next time instant governed by the state and identity equations. The marginal distribution of the identity variable is estimated to provide a recognition result. There is no need for selecting good video frames in this framework. Ultimately, the two tasks, namely discriminating the identity and determining the transformation parameter, are unified and solved here. However, a face detector is still needed to provide the prior distribution for the state vector. In the following, Section 2 summarizes some related work in the literature. Section 3 introduces the mathematical framework and establishes the timeevolving behavior of posterior probability of identity variable based on some minor assumptions. Section 4 presents the SIS technique and the actual recognition algorithms. Section 5 presents and discusses experimental results, and Section 6 presents conclusions.
2 Related Work
There are numerous papers in the literature on recognizing human using biometrics. Interested readers may refer to [7] for a general review of statistical pattern recognition, and [16] for a recent survey on face recognition. Probabilistic visual tracking in video sequences has recently gained significant attention. Generally, a state space model is applied to accommodate the
dynamics of a video sequence. The task of visual tracking is reduced to solving the posterior distribution of the state vector given an observation. Isard and Blake [6] proposed the CONDENSATION algorithm to track an object in a cluttered environment. In their work, the object is represented by a robust active contour. A near real-time performance and a high tracking accuracy are reported. However, only the tracking problem is considered. This technique has been extended to perform other tasks beside tracking. Li and Chellappa [9] performed simultaneous tracking and verification via sequential posterior estimation. At each time instant, the posterior probability of the state vector is evaluated using the SIS technique. The verification probability is computed on a proper region of the state space. They presented experimental results on both synthetic data and real sequences (some using face information as well). Our method is somewhat similar to this approach, but there are significant differences from it. We will highlight them in the discussion part in Sec. 5. Black and Jepson [1] use a CONDENSATION-based algorithm to match temporal trajectories. Models of temporal trajectories, such as gestures and facial expressions, are trained beforehand, and are gradually matched against the human motion in a new image sequence. The joint posterior distribution of model selection, local stretching, scaling, and position evolves as time proceeds. Edwards et al [5] learn how individual faces vary through video sequences by decoupling sources of image variations such as expression, lighting and pose. The assumption that the identity remains the same throughout the video sequence is used. Torre et al [14] propose a probabilistic framework to perform tracking and activity recognition using video. They first track the motion of rigid and non-rigid appearance, constrained with a mixture of pre-trained Hidden Markov Models (HMM) [13] with one model for each activity. After tracking, activity recognition is performed using standard Viterbi algorithm [13]. In [10], recognition of face over time is implemented by constructing a face identity surface. The face is first warped to a frontal view, and its KDA (Kernel Discriminant Analysis) features are used to form a trajectory. It is shown that the recognitive evidence accumulates over time along the trajectory. However, this recognition algorithm is deterministic.
3 Mathematical Framework
In this section, we present the mathematical details of how we establish our propagation model for recognition and discuss its impact on the posterior distribution of the identity variable under some minor assumptions.

3.1 A Time Series State Space Model for Recognition
We will use the following mathematical notations: – An image I. It could be represented using raw intensity values on the image region R, i.e. I(R), or abstract features extracted from I(R).
– A transformed version of image I, f(I; θ), with state vector θ ∈ Θ, a continuous state space. The transform can be either geometric or photometric or both.
– A gallery set H = {I_1, I_2, . . . , I_N}, indexed by an identity variable n ∈ N = {1, 2, . . . , N}. I_n is the still template for the identity n, which may be registered beforehand or not.

A time series state space model for recognition can be described as follows:

1. State equation:
       θ_t = g(θ_{t−1}) + u_t;   t ≥ 1,   (1)
   where θ_t is the state vector and u_t the state noise at time instant t. If g(.) is an identity function, this is a Brownian motion model.
2. Identity equation:
       n_t = n_{t−1};   t ≥ 1,   (2)
   assuming that the identity does not change as time proceeds.
3. Observation equation:
       f(y_t; θ_t) = I_{n_t} + v_t;   t ≥ 1,   (3)
   where y_t is the observation and v_t the observation noise at time instant t.
4. Prior distributions:
       p(θ_0|y_0), p(n_0|y_0).   (4)
5. Noise distributions:
       p(u_t) or p(θ_t|θ_{t−1}),  p(v_t) or p(y_t|n_t, θ_t);   t ≥ 1,   (5)
   where p(y_t|n_t, θ_t) is the likelihood.
6. Statistical independence:
       n_0 ⊥ θ_0;   u_t ⊥ v_s for t, s ≥ 1;   u_t ⊥ u_s and v_t ⊥ v_s for t, s ≥ 1 and t ≠ s,   (6)
   where ⊥ denotes statistical independence.

Given this model, our goal is to compute the posterior probability p(n_t|y_{0:t}), where y_{0:t} = {y_0, . . . , y_t}. It is in fact a probability mass function (PMF), as well as a marginal probability of p(n_t, θ_t|y_{0:t}). Therefore, the problem is reduced to computing this posterior probability.
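To make the generative model concrete, the sketch below simulates one time step of Eqns. 1 and 3 under Gaussian noise. It is an illustrative assumption-laden sketch: the inverse warp function, the covariance matrices and the noise level are not specified by the paper and are passed in by the caller.

    import numpy as np

    def propagate_state(theta_prev, state_cov, rng):
        # State equation (1) with g = identity: Brownian motion on the state vector.
        return theta_prev + rng.multivariate_normal(np.zeros(len(theta_prev)), state_cov)

    def synthesize_observation(template, theta, inverse_warp, obs_sigma, rng):
        # Observation equation (3) states f(y_t; theta_t) = I_{n_t} + v_t, so an
        # observation can be synthesized by inverse-warping a noise-corrupted template.
        noisy = template + rng.normal(0.0, obs_sigma, size=template.shape)
        return inverse_warp(noisy, theta)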
3.2 The Posterior Probability of Identity Variable
The evolution of the posterior probability p(nt |y0:t ) as time proceeds is very interesting since the identity variable does not change by assumption, i.e., p(nt |nt−1 ) = δ(nt − nt−1 ).
By time recursion, Markov properties, and the statistical independence embedded in the model, we can easily derive:

    p(n_{0:t}, θ_{0:t}|y_{0:t}) = p(n_{0:t−1}, θ_{0:t−1}|y_{0:t−1}) · p(y_t|n_t, θ_t) p(n_t|n_{t−1}) p(θ_t|θ_{t−1}) / p(y_t|y_{0:t−1})
      = p(n_0, θ_0|y_0) ∏_{s=1}^{t} p(y_s|n_s, θ_s) p(n_s|n_{s−1}) p(θ_s|θ_{s−1}) / p(y_s|y_{0:s−1})
      = p(n_0|y_0) p(θ_0|y_0) ∏_{s=1}^{t} p(y_s|n_s, θ_s) δ(n_s − n_{s−1}) p(θ_s|θ_{s−1}) / p(y_s|y_{0:s−1}).   (7)

Therefore, by marginalizing over θ_{0:t} and n_{0:t−1}, we get

    p(n_t = n|y_{0:t}) = p(n|y_0) ∫_{θ_0} · · · ∫_{θ_t} p(θ_0|y_0) ∏_{s=1}^{t} [ p(y_s|n, θ_s) p(θ_s|θ_{s−1}) / p(y_s|y_{0:s−1}) ] dθ_t . . . dθ_0.   (8)

Thus p(n_t = n|y_{0:t}) is determined by the prior distribution p(n_0 = n|y_0) and the product of the likelihood functions, ∏_{s=1}^{t} p(y_s|n, θ_s). If a uniform prior is assumed as below, then ∏_{s=1}^{t} p(y_s|n, θ_s) is the only determining factor. Suppose that (A) the prior probability for each hypothesis is the same,

    p(n_0|y_0) = 1/N;   n_0 ∈ N,   (9)

and (B) for the correct hypothesis c ∈ N there exists a constant K > 1 such that

    p(y_t|c, θ_t) ≥ K · p(y_t|n_t, θ_t);   t ≥ 1, n_t ∈ N, n_t ≠ c,   (10)

where K is a lower bound on the likelihood ratio. Substitution of Eqns. 9 and 10 into Eqn. 8 gives

    p(c|y_{0:t}) ≥ K^t · p(n_t|y_{0:t});   n_t ∈ N, n_t ≠ c,   (11)

where K^t = ∏_{s=1}^{t} K. Since Σ_{n_t=1}^{N} p(n_t|y_{0:t}) = 1, it is easy to see that for the correct hypothesis

    p(c|y_{0:t}) ≥ p(c|y_{0:t−1}).   (12)

In other words, the posterior probability of the correct hypothesis is nondecreasing as time proceeds. More interestingly, from Eqn. 11 we have

    (N − 1) p(c|y_{0:t}) ≥ K^t · Σ_{n_t=1, n_t≠c}^{N} p(n_t|y_{0:t}) = K^t · (1 − p(c|y_{0:t})),   (13)

i.e.,

    p(c|y_{0:t}) ≥ K^t / (K^t + N − 1).   (14)
Since K > 1 and p(c|y_{0:t}) ≤ 1,

    lim_{t→∞} p(c|y_{0:t}) = 1,   (15)
implying that degeneracy will be reached in the correct identity variable for some sufficiently large t. However, all these derivations are based on conditions (A) and (B). Though it is easy to satisfy condition (A), difficulty arises in practice in satisfying condition (B) for all the frames in the sequence. Fortunately, as we will see in the experiment in Sec. 5, numerically this degeneracy is still reached even if condition (B) is satisfied only for most, but not all, frames in the sequence. This issue is also addressed in Sec. 5. To view this from a different angle, we use the notion of entropy [3]. Given a PMF p(x), x ∈ N, the entropy is defined as:

    H(x) = − Σ_{x∈N} p(x) log_2 p(x).   (16)
Entropy essentially measures the average uncertainty about the random variable x with PMF p(x). It is well known that among all distributions taking values on {1, . . . , N}, the uniform distribution yields the maximum, log_2 N, and the degenerate case yields the minimum, 0, i.e., 0 ≤ H ≤ log_2 N. Similarly, conditional entropy is defined as:

    H(x|y) = − Σ_y p(y) Σ_x p(x|y) log_2 p(x|y).   (17)
In the context of this problem, conditional entropy H(n_t|y_{0:t}) captures the evolving uncertainty of the identity variable given observations y_{0:t}. However, knowledge of p(y_{0:t}) is needed to compute H(n_t|y_{0:t}); we simply assume it is degenerate in the actual observations ỹ_{0:t} since we observe only this particular sequence, i.e., p(y_{0:t}) = δ(y_{0:t} − ỹ_{0:t}). Now,

    H(n_t|y_{0:t}) = − Σ_{n_t∈N} p(n_t|ỹ_{0:t}) log_2 p(n_t|ỹ_{0:t}).   (18)
Under the conditions mentioned above, we expect H(nt |y0:t ) to decrease as time proceeds.
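A small helper for Eqn. 18, useful for plotting the identity-uncertainty curve (plain numpy, illustrative):

    import numpy as np

    def identity_entropy(posterior):
        """Conditional entropy (in bits) of the identity PMF p(n_t | y_0:t)."""
        p = np.asarray(posterior, dtype=float)
        p = p / p.sum()
        nz = p > 0
        return float(-np.sum(p[nz] * np.log2(p[nz])))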
4 SIS Algorithms
If the model is linear with Gaussian noise, it is analytically solvable by a Kalman filter which essentially propagates the mean and variance of a Gaussian distribution over time. For nonlinear and non-Gaussian cases, an extended Kalman filter (EKF) and its variants are then proposed to give an approximate analytic solution. Recently, a sequential Monte Carlo method [6,4,8,11] has been used to provide a numerical solution for propagating an arbitrary distribution.
4.1 Importance Sampling
The key idea of the Monte Carlo method is that an arbitrary probability distribution π(x) can be represented closely by a set of discrete samples. It is ideal to draw i.i.d. samples {x^(m)}_{m=1}^M from π(x); however, this is often difficult to implement, especially for non-trivial distributions. Instead, a set of samples {x^(m)}_{m=1}^M is drawn from an importance function g(x) which is easy to sample from, and a weight w^(m) = π(x^(m))/g(x^(m)) is assigned to each sample. In the ideal case the importance function g(x) is π(x) itself, and each sample has weight 1. This technique is called Importance Sampling (IS). To accommodate a video, importance sampling is used in a sequential fashion, leading to Sequential Importance Sampling (SIS). For a complete SIS recipe, refer to [4,11]. We now introduce the following definition and two propositions.

Definition 1: A set of weighted random samples S = {(x^(m), w^(m))}_{m=1}^M is proper with respect to π(x) if, for any integrable function h(x),

    lim_{M→∞} [ Σ_{m=1}^M h(x^(m)) w^(m) ] / [ Σ_{m=1}^M w^(m) ] = E_π[h(x)].   (19)

Proposition 1: When π(x) is a PMF defined on a finite sample space, the proper sample set should exactly include all samples in the sample space.

Proposition 2: If a set of weighted random samples {(x^(m), y^(m), w^(m))}_{m=1}^M is proper with respect to π(x, y), then a new set of weighted random samples {(y'^(k), w'^(k))}_{k=1}^K, which is proper with respect to π(y), the marginal of π(x, y), can be constructed as follows: 1) remove the repetitive samples from {y^(m)}_{m=1}^M to obtain {y'^(k)}_{k=1}^K, where all y'^(k)'s are distinct; 2) sum the weights w^(m) belonging to the same sample y'^(k) to obtain the weight w'^(k), i.e.,

    w'^(k) = Σ_{m=1, y^(m)=y'^(k)}^{M} w^(m).   (20)
Proof: By Definition 1, for any integrable function h(x, y) we have

    lim_{M→∞} [ Σ_{m=1}^M h(x^(m), y^(m)) w^(m) ] / [ Σ_{m=1}^M w^(m) ] = E_π[h(x, y)].   (21)

By taking the above h(x, y) as a function of only y, g(y) = ∫ h(x, y) dx, we get

    lim_{M→∞} [ Σ_{m=1}^M g(y^(m)) w^(m) ] / [ Σ_{m=1}^M w^(m) ] = E_π[g(y)].   (22)
However, in the above equation some y^(m) might have the same value. It is necessary to merge them to arrive at a new sample set {(y'^(k), w'^(k))}_{k=1}^K, where w'^(k) represents the weights belonging to y'^(k), i.e., using Eqn. 20.
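A small sketch of Proposition 2 for the discrete identity variable (merging sample weights that share the same identity index); the array names are illustrative.

    import numpy as np

    def marginalize_identity(identities, weights, num_identities):
        """Sum sample weights per identity index to get the PMF p(n_t | y_0:t)."""
        beta = np.bincount(identities, weights=weights, minlength=num_identities)
        return beta / beta.sum()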
4.2 Algorithms and Their Efficiency
In the context of this framework, the posterior probability p(n_t, θ_t|y_{0:t}) is represented by a set of indexed and weighted samples

    S_t = {(n_t^(m), θ_t^(m), w_t^(m))}_{m=1}^M,   (23)

with n_t as the above index. By Proposition 2, we can sum the weights of the samples belonging to the same index n_t to obtain a proper sample set {n_t, β_{n_t}}_{n_t=1}^N with respect to the posterior PMF p(n_t|y_{0:t}). We apply the following algorithm to solve this problem. Algorithm I, shown in Fig. 1, is similar to CONDENSATION [6] for computing the joint distribution p(n_t, θ_t|y_{0:t}).

    Initialize a sample set S_0 = {(n_0^(m), θ_0^(m), 1)}_{m=1}^M according to the
    prior distributions p(n_0|z_0) and p(θ_0|z_0).
    For t = 1, 2, . . .
      For m = 1, 2, . . . , M
        Resample S_{t−1} = {(n_{t−1}^(m), θ_{t−1}^(m), w_{t−1}^(m))}_{m=1}^M to obtain
          a new sample (n'_{t−1}^(m), θ'_{t−1}^(m), 1).
        Predict the sample by drawing (n_t^(m), θ_t^(m)) from p(n_t|n'_{t−1}^(m))
          and p(θ_t|θ'_{t−1}^(m)).
        Update the weight using α_t^(m) = p(z_t|n_t^(m), θ_t^(m)).
      End
      Normalize each weight using w_t^(m) = α_t^(m) / Σ_{m=1}^M α_t^(m).
      Marginalize over θ_t to obtain the weight β_{n_t} for n_t.
    End

Fig. 1. Algorithm I

Algorithm I is not efficient in terms of its computational load. Since N = {1, 2, . . . , N} is a countable sample space, we need N samples for the identity variable n_t according to Proposition 1. Assume that, for each identity variable n_t, J samples are needed to represent θ_t; hence we need M = J ∗ N samples in total. Further assume that one resampling step takes T_r unit time (ut), one predicting step T_p ut, one updating step T_u ut, the normalizing step T_n ut, and the marginalizing step T_m ut. In the updating step there are two substeps: computing the transformed image f(y_t; θ_t) and evaluating the likelihood p(y_t|n_t, θ_t), taking T_t and T_l ut respectively. The total computation is therefore J ∗ N ∗ (T_r + T_p + T_t + T_l) + T_n + T_m ut per video frame. It is well known that computing the transformed image is much more expensive than the other operations, i.e., T_t >> max(T_r, T_p, T_l). Therefore, as the number of templates N grows, the computational load increases dramatically. To avoid the ’brute force’ method in Algorithm I, we develop an efficient Algorithm II, which is motivated by the fact that the sample space N is countable.
689
Therefore, an exhaustive search of sample space N is possible. Mathematically, we release the random sampling in the identity variable nt by constructing sam(j) ples as follows: for each θt , (j)
(j)
(j)
(j)
(j)
(j)
(1, θt , wt,1 ), (2, θt , wt,2 ), . . . , (N, θt , wt,N ). In Algorithm II, we in fact use the following notation for the sample set, (j)
(j)
(j)
(j)
(j)
St = {(θt , wt , wt,1 , wt,2 , . . . , wt,N )}Jj=1 ,
(24)
N (j) (j) with wt = n=1 wt,n . Based on this, we can first resample the ’marginal’ sample set (j) (j) (j) (j) {(θt−1 , wt−1 )}Jj=1 to arrive at θt−1 ’s. After resampling, set wt−1 = 1 and
(j)
(j)
(j)
(j)
wt−1,n = wt−1,n /wt−1 for all j = 1, 2, . . . , N . We then predict θt ’s, governed by
(j)
p(θt |θt−1 ). Finally, the propagation of the joint distribution requires the weighting step as follows: (j)
(j)
(j)
wt,n = wt−1,n ∗ p(yt |n, θt ).
(25)
The core of Algorithm II lies in that, instead of propagating random samples on both state vector and identity variable, we can keep the samples on the identity variable fixed and let those on the state vector random. Also, we propagate the marginal distribution for tracking the state vector, but we still propagate the joint distribution for recognition purposes. Algorithm II is summarized in Fig. 2. (j)
Initialize a sample set S0 = {(θ0 , N, 1, ..., 1)}Jj=1 according to prior distribution p(θ0 |z0 ). For t = 1, 2, . . . For j = 1, 2, . . . , J (j) (j) Resample St−1 = {(θt−1 , wt−1 )}Jj=1 to obtain a new sample
(j)
(j)
(j)
(j)
(j)
(j)
(θt−1 , 1, wt−1,1 , . . . , wt−1,N ), where wt−1,n = wt−1,n /wt−1 for n = 1, 2, . . . , N . (j) (j) Predict sample by drawing (θt ) from p(θt |θt−1 ). For n = 1, . . . , N (j) (j) (j) Update weight using αt,n = wt−1,n ∗ p(zt |n, θt ). End End J (j) (j) N (j) (j) Normalize each weight using wt,n = αt,n / n=1 j=1 αt,n and wt =
N
(j)
wt,n . Marginalize over θt to obtain weight βnt for nt . End n=1
Fig. 2. Algorithm II
Compared with Algorithm I, Algorithm II is more efficient and accurate. The total computation of Algorithm II is J ∗ (Tr + Tp + Tt ) + J ∗ N ∗ Tl + Tn + Tm ut, a tremendous saving over Algorithm I when dealing with a large database since the majority computational time J ∗ Tt does not depend on N . To appreciate its accuracy, consider the following case for Algorithm I: all samples belonging to identity nt = c, where c is the correct hypothesis, are drawn with a bias in θt . More specifically, though p(yt |c, θt ) > p(yt |nt , θt ) holds provided that the same θt is applied in both sides, different θt ’s can be chosen in a bias to disobey this inequality. However, Algorithm II elegantly eliminates such a bias.
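The sketch below implements the flavor of Algorithm II in plain numpy. It is illustrative rather than the authors' code: the warp function, the gallery templates, the Laplacian scale and the motion covariance are assumed inputs, the likelihood is a plain (untruncated) Laplacian residual score for brevity, and in practice residuals should be normalized to avoid numerical underflow.

    import numpy as np

    def sis_algorithm_ii(frames, warp, templates, sigma, theta0, state_cov,
                         J=200, rng=None):
        """Efficient SIS: J motion samples, exhaustive identity weights per sample."""
        rng = np.random.default_rng() if rng is None else rng
        N, dim = len(templates), len(theta0)
        theta = theta0 + rng.multivariate_normal(np.zeros(dim), state_cov, size=J)
        w = np.full((J, N), 1.0 / (J * N))            # joint weights w_{t,n}^{(j)}

        posteriors = []
        for frame in frames:
            # Resample the marginal sample set {(theta^(j), sum_n w_{t,n}^(j))}.
            marginal = w.sum(axis=1)
            idx = rng.choice(J, size=J, p=marginal / marginal.sum())
            theta = theta[idx]
            w = w[idx] / w[idx].sum(axis=1, keepdims=True)

            # Predict: Brownian-motion state equation.
            theta = theta + rng.multivariate_normal(np.zeros(dim), state_cov, size=J)

            # Update: warp once per motion sample, then score every identity.
            for j in range(J):
                patch = warp(frame, theta[j])
                for n, tmpl in enumerate(templates):
                    resid = np.abs(patch - tmpl).sum()
                    w[j, n] *= np.exp(-resid / sigma)
            w = w / w.sum()

            posteriors.append(w.sum(axis=0))          # identity PMF p(n_t | y_0:t)
        return posteriors

Because the expensive warp is computed once per motion sample and reused for all N templates, the cost per frame grows as J·T_warp + J·N·T_likelihood, matching the complexity argument above.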
5 Experiments and Discussions

5.1 Experimental Results
We use video sequences with subjects walking towards a camera in order to simulate typical scenarios like in visual surveillance. There are 30 subjects, each having one face template and one upper body template. The face gallery and the upper body gallery are as shown in Fig. 3. The probe set contains 30 video sequences, one for each subject. Fig. 4 gives some example frames in one probe video. Note that both galleries are captured under different circumstances from the probe and that the probe has considerable change in scale. These images were collected, as part of the HumanID project, by National Institute of Standards and Technology and University of South Florida researchers. Model parameters used in the experiments are chosen as follows. 1. Image representation. A reconstructed image from top 300 principal components or eigenfaces [15] is used to represent the face, while the raw intensity array is used to represent the upper body. Fig. 5 shows the top 10 eigenfaces. 2. Geometric transformation is only affine. Specifically, θ = (a1 , a2 , a3 , a4 , tx , ty ) where {a1 , a2 , a3 , a4 } are deformation parameters and {tx , ty } are 2-D translation parameters. It is a reasonable approximation since there is no significant out-of-plane motion in the scenario that the subject walks towards the camera. Regarding photometric transformation, only histogram equalization, a typical but coarse method dealing for enhancement, is adopted for face gallery and no preprocessing operation is performed for body gallery. 3. Prior distribution p(θ0 |y0 ) is Gaussian, whose mean comes from the initial detector and whose covariance matrix is manually specified. 4. Noise distribution p(ut ) or p(θt |θt−1 ) is Gaussian, whose mean and covariance matrix are manually specified. Also function g(.) in Eqn. 1 is assumed to be an identity function. Furthermore, p(ut ) is not time-varying. This is essentially a constant-velocity model. Given the scenario that the subject is walking towards the camera, the scale increases with time. However, under perspective projection, this increase is no longer linear, causing the constant-velocity model to be not optimal. However, experimental results show that as long as the samples of θ can cover the motion, this model can be applied for simplicity.
Fig. 3. The face and body galleries. The face image size is 48x42 and the body image size is 100x75.
Fig. 4. Example frames in one probe video. The image size is 720x480 while the actual face size ranges approximately from 20x20 in the first frame to 60x60 in the last frame.
Fig. 5. The top 10 eigenfaces.
5. Noise distribution p(v_t), or the likelihood p(y_t|n_t, θ_t), is a ’truncated’ Laplacian:

    p(y_t|n_t, θ_t) = { L · exp(−‖v_t‖/σ)   if ‖v_t‖ ≤ λσ
                      { L · exp(−λ)         if ‖v_t‖ > λσ      (26)

where ‖I(R)‖ = Σ_{r∈R} |I(r)|, σ and λ are manually specified, and L is a normalizing constant. Furthermore, p(v_t) is not time-varying. A Gaussian distribution is widely used as a noise model, accounting for sensor noise, digitization noise, etc. However, given the observation equation v_t = f(y_t; θ_t) − I_{n_t}, the dominant part of v_t becomes the high-frequency residual if θ_t is not proper, and it is well known that the high-frequency residual of natural images is more Laplacian-like. The ’truncated’ Laplacian is used to give a ’surviving’ chance for samples, to accommodate abrupt motion changes.
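A direct transcription of Eqn. 26 (the normalizing constant L is dropped, since only weight ratios matter in SIS):

    import numpy as np

    def truncated_laplacian_likelihood(observation, template, sigma, lam):
        """p(y_t | n_t, theta_t) up to the constant L, with v_t = f(y_t; theta) - I_n."""
        residual = np.abs(observation - template).sum()   # ||v_t|| as a sum of |residuals|
        return np.exp(-min(residual / sigma, lam))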
Fig. 6. Posterior probability p(nt |y0:t ) against time instant t. Left: Algorithm I. Right: Algorithm II.
For Algorithms I and II, Fig. 6 presents the plot of the posterior probability p(n_t|y_{0:t}) against frame index t. Fig. 7 presents the conditional entropy H(n_t|y_{0:t}) against t and the MMSE estimate of the scale parameter a_1 against t, both obtained using Algorithm II. In Fig. 4, the tracked face is superimposed on the image using a bounding box. Suppose the correct identity for Fig. 4 is c. From Fig. 6 we can easily observe that the posterior probability p(c|y_{0:t}) increases as time proceeds and eventually approaches 1, while all others, p(n_t ≠ c|y_{0:t}), finally go to 0. Evidenced in Fig. 7 are the decreasing conditional entropy H(n_t|y_{0:t}) and the increasing scale parameter, which matches the scenario of a subject walking towards a camera.
Fig. 7. Left: conditional entropy H(nt |y0:t ) against t. Right: MMSE estimate of a1 against t. Both are obtained using Algorithm II.
Table 1 summarizes the average recognition performance and computational time of the two algorithms. As far as performance is concerned, there is no remarkable difference between the two algorithms, with Algorithm II giving slightly better results. However, there are big differences between the two galleries. The reasons are summarized as follows: (i) the probe video is taken outside with the sunshine casting a strong shadow on the face, while the face gallery is taken inside and the body gallery outside; (ii) the subjects are wearing the same dress for the probe and body galleries; and (iii) the body template is bigger than the face template. Fig. 8 shows a list of facial images cropped by hand from probe videos and then normalized. If we perform still-to-still face recognition on these using the eigenface approach [15], the recognition rate is less than 30% for the top match and less than 50% for the top three matches. Obviously, Algorithm II is more efficient than Algorithm I: it is about 10 times faster, as shown in Table 1. Note that this experiment is implemented in C++ on a PC with a P-III 650 MHz CPU and 512 MB RAM, with the number of motion samples J chosen to be 200 and the number of templates in the gallery N to be 30.

Table 1. Summary of Algorithms I and II.

                                                      Algorithm I        Algorithm II
                                                      Face    Body       Face    Body
    Probability of correct match as the top match      47%     93%        50%     93%
    Probability of correct match within top 3 matches  70%     93%        77%     93%
    Probability of correct match within top 5 matches  83%    100%        87%    100%
    Time per frame                                         22s                2.1s
Fig. 8. Facial images cropped from probe videos and then normalized.
5.2 Discussions and Future Work
The following issues are worthy of investigation in the future.
1. The assumption that the identity does not change as time proceeds, i.e., p(n_t|n_{t−1}) = δ(n_t − n_{t−1}), could be relaxed by having nonzero transition probabilities between different identity variables. Consider this example: initially, the most likely identity might not be the desired one because of the low resolution of the data or some artifacts; then the desired one may pop up. Under our assumption that the identity remains the same, the incorrect choice will grab most of the samples initially. Having nonzero transition probabilities would enable an easier transition to the correct choice, making the algorithm more robust.
2. Choice of the likelihood distribution p(y_t|n_t, θ_t) and condition (B). In general, the smaller v_t is, the higher the likelihood p(y_t|n_t, θ_t) and the higher the posterior p(n_t|y_{0:t}). In this sense, an accurate solution to this problem is determined by the basic question: how can we find an efficient distance metric, or how well is condition (B) satisfied? Fig. 9 plots, against the logarithm of the scale parameter, the ’average’ likelihood of the correct hypothesis, (1/N) Σ_{n∈N} p(I_n|n, θ), and that of the incorrect hypotheses, 1/(N(N−1)) Σ_{m,n∈N, m≠n} p(I_m|n, θ), for the face gallery, as well as the ’average’ likelihood ratio, i.e., the ratio between the above two quantities. The observation is that only within a narrow ’band’ is condition (B) well satisfied. Therefore, the success of the SIS algorithm depends on how well the samples lie in a similar ’band’ in the high-dimensional affine space. Also, the lower bound K in condition (B) is too strict. If we take the mean of the ’average’ likelihood ratio shown in Fig. 9 as an estimate of K (roughly 1.5), Eqn. 14 tells us that, after 20 frames, the probability p(c|y_{0:t}) reaches 0.99! However, this is not reached in the experiments due to noise in the observations and incomplete parameterization of the transformations.
3. Resampling. In Algorithm II, the marginal distribution {(θ_{t−1}^(j), w_{t−1}^(j))}_{j=1}^J
is sampled to obtain the sample set {(θ_t^(j), 1)}_{j=1}^J. This may cause a problem in principle, since there is no conditional independence between θ_t and n_t given y_{0:t}. However, in a practical sense this is not a big disadvantage, because the purpose of resampling is to ’provide chances for the good streams (samples) to amplify themselves and hence rejuvenate the sampler to produce a better result
Fig. 9. Left: the ’average’ likelihood of the correct hypothesis and incorrect hypotheses against the log of scale parameter. Right: The ’average’ likelihood ratio against the log of scale parameter.
for future states as system evolves’[11]. Resampling scheme can be either simple random sampling with weights (like in CONDENSATION), residual sampling, or local Monte Carlo methods. Thus, this can be considered as a special resampling strategy which also amplifies samples with high weights. 4. Computational load. As mentioned earlier, two important numbers affecting computation are J, the number of motion samples, and N , the size of database. (i) The choice of J is an open question in the statistics literature. In general, bigger J produces more accurate results. (ii) The choice of N depends on applications. Since a small database is used in this experiment, it is not a big issue here. However, the computational burden may be excessive if N is large, even when Algorithm II is used. One possibility is to use a continuously parameterized representation, say γ, instead of discrete identity variable n. Now the task reduces to computing p(γt , θt |y0:t ). We then can rank the gallery easily using estimated γt . 5. Mutual dependence of tracking and recognition. Since joint posterior distribution is computed each time, the mutual dependence is obvious. If tracking fails, the recognition is meaningless. If recognition is poor, for instant, some background region in the video might be more favored than the biometric region according to the distance measure, making the tracker stick to the background. In fact, one reason for the low performance of face gallery is this kind of tracking failure. We are now developing an algorithm which cleverly splits the tracking and recognition tasks, but still uses the idea of propagation of posterior probability for recognition. 6. Now we highlight the differences from Li and Chellappa’s approach [9]. In [9], basically only the tracking state vector is parameterized in the state-space model. The identity is involved only in the initialization step to rectify the template onto the first frame of the sequence. However, in our approach both tracking state vector and identity variables are parameterized in the state-space model, which offers us one more degree of freedom and leads to a different approach for deriving the solution. The SIS technique is applied in both approaches to nu-
merically approximate the posterior probability given the observation. Again in [9], it is the posterior probability of state vector and the verification probability is estimated by marginalizing on a proper region of state space redefined at each time instant. However, we always compute the joint density, i.e., the posterior probability of state vector and identity variable and the posterior probability of identity variable is just a free estimate by marginalizing over the state vector. Note that there is no time propagation of verification probability in [9] while we always propagate the joint density. One consequence is that we guarantee that nt ∈N p(nt |y0:t ) = 1, but there is no such guarantee in [9]. Their approach in some sense is more like a batch method by running the algorithm for different templates, while ours is truly recursive. Another important consequence is that in our approach the degeneracy in the correct identity eventually indicates an immediate decision while no such decision could be readily made from the verification probability in [9]. In addition, in terms of tracking accuracy, if the wrong template is rectified on the first frame in the initialization step, the tracking is more likely to be absorbed to the noisy background, while our approach is more robust since we consider all templates at the same time.
6 Conclusion
In this paper, a time series state space model is proposed to solve the two tasks of tracking and recognition simultaneously. This probabilistic framework, which overcomes many difficulties arising in conventional recognition approaches using video, is registration-free and has no need for selecting good frames. More importantly, temporal information is elegantly exploited. However, the model is nonlinear and non-Gaussian, and an analytic solution may not exist. Two algorithms based on the SIS technique have been applied to propagate the posterior distribution. Algorithm I is an extension of CONDENSATION and is not very efficient computationally, while Algorithm II improves both computational efficiency and accuracy by exploiting an inherent property of this mixed model, namely the discreteness of the identity variable. It turns out that an immediate recognition decision can be made in our framework due to the degeneracy of the posterior probability of the identity variable. The conditional entropy can also serve as a good indication of convergence. A final remark is that this framework is not limited to recognizing only the face and upper body, and is easily extended to other recognition tasks from video.
References 1. M. J. Black and A. D. Jepson. A probabilistic frameworkfor matching temporal trajectories. Proc. of ICCV, pages 176–181, 1999. 2. T. Choudhury, B. Clarkson, T. Jebara, and A. Pentland. Multimodal person recognition using unconstrained audio and video. Proc. of Intl. Conf. on Audio- and Video-Based Person Authentication, pages 176–181, 1999.
3. T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 1991. 4. A. Doucet, S. J. Godsill, and C. Andrieu. On sequential monte carlo sampling methods for bayesian filtering. Statistics and Computing, 10(3):197–209, 2000. 5. T. J. Edwards, C. J. Taylor, and T. F. Cootes. Improving identification performance by intergrating evidence from sequences. Proc. of CVPR, pages 486–491, 1999. 6. M. Isard and A. Blake. Contour tracking by stochatic propagation of conditional density. Proc. of ECCV, 1996. 7. A. K. Jain, R. Duin, and J. Mao. Statistical pattern recognition: A review. IEEE Trans. PAMI, 22:4–37, 2000. 8. G. Kitagawa. Monte carlo filter and smoother for non-gaussian nonlinear state space models. J. Computational and Graphical Statistics, 5:1–25, 1996. 9. Baoxin Li and R. Chellappa. Simultaneous tracking and verification via sequential posterior estimation. Proc. of CVPR, pages 110–117, 2000. 10. Yongmin Li, Shaogang Gong, and H. Liddell. Constucting structures of facial identities on the view sphere using kernel discriminant analysis. Proc. of the 2rd Intl. Workshop on SCTV, 2001. 11. J. S. Liu and R. Chen. Sequential monte carlo for dynamic systems. Journal of the American Statistical Association, 93:1031–1041, 1998. 12. P. J. Philipps, H. Moon, S. Rivzi, and P. Ross. The feret testing protocol. Face Recognition: From Theory to Applications, 83:244–261, 1998. 13. L. Rabiner. A tutorial on hidden markov models and selected applications in speech recognition. Proc. of IEEE, 77(2), 1989. 14. De la Torre, F. Y. Yacoob, and L. Davis. A probabilistic framework for rigid and non-rigid appearance-based tracking and recognition. Proc. of 4th Intl. Conf. on Auto. Face and Gesture Recognition, 2000. 15. M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neutoscience, 3:72–86, 1991. 16. W. Y. Zhao, R. Chellappa, A. Rosenfeld, and P. J. Phillips. Face recognition: A literature survey. UMD CfAR Technical Report CAR-TR-948, 2000.
SoftPOSIT: Simultaneous Pose and Correspondence Determination

Philip David1,2, Daniel DeMenthon3, Ramani Duraiswami4, and Hanan Samet1

1 Department of Computer Science, University of Maryland, College Park, MD 20742
2 Army Research Laboratory, 2800 Powder Mill Road, Adelphi, MD 20783-1197
3 Language and Media Processing Laboratory, Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742
4 Perceptual Interfaces and Reality Laboratory, Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742
{phild@cs,daniel@cfar,ramani@umiacs,hjs@cs}.umd.edu ‡
Abstract. The problem of pose estimation arises in many areas of computer vision, including object recognition, object tracking, site inspection and updating, and autonomous navigation using scene models. We present a new algorithm, called SoftPOSIT, for determining the pose of a 3D object from a single 2D image in the case that correspondences between model points and image points are unknown. The algorithm combines Gold’s iterative SoftAssign algorithm [19, 20] for computing correspondences and DeMenthon’s iterative POSIT algorithm [13] for computing object pose under a full-perspective camera model. Our algorithm, unlike most previous algorithms for this problem, does not have to hypothesize small sets of matches and then verify the remaining image points. Instead, all possible matches are treated identically throughout the search for an optimal pose. The performance of the algorithm is extensively evaluated in Monte Carlo simulations on synthetic data under a variety of levels of clutter, occlusion, and image noise. These tests show that the algorithm performs well in a variety of difficult scenarios, and empirical evidence suggests that the algorithm has a run-time complexity that is better than previous methods by a factor equal to the number of image points. The algorithm is being applied to the practical problem of autonomous vehicle navigation in a city through registration of a 3D architectural models of buildings to images obtained from an on-board camera. Keywords: Object recognition, autonomous navigation, POSIT, SoftAssign
1 Introduction
We present an algorithm for solving the model-to-image registration problem, which determines the position and orientation (the pose) of a 3D object with respect to a camera coordinate system given a model of the object with 3D reference points and a single 2D image of these points. We assume no additional information to constrain the ‡
Support of NSF grants EAR-9905844, IIS-0086162, and IIS-9987944 is gratefully acknowledged.
pose of the object or the correspondences. This is also known as the simultaneous pose and correspondence problem. Automatic registration of 3D models to images is important for many applications, including object recognition and tracking, site inspection and updating, and autonomous navigation using scene models. The problem is difficult because it requires solution of two coupled problems, correspondence and pose, each easy to solve only if the other has been solved first: 1. Solving the pose problem consists of finding the rotation and translation of the object with respect to the camera coordinate system. Given matching model and image features, one can determine the pose that best aligns those matches. For 3 to 5 matches, the pose can be found in closed-form by solving polynomial equations [17, 24,37]. For six or more matches, linear and nonlinear approximate methods are generally used [13,16,23,25,29]. 2. Solving the correspondence problem requires matching image and model features. If the object pose is known, one can determine such matches. Projecting the model with known pose into the original image, one can match features that project sufficiently close to an image feature. This is the approach typically taken for pose verification [22]. The classic approach to solving these coupled problems is the hypothesize-and-test approach [21] where a small set of correspondences are first hypothesized, and the corresponding pose of the object is computed. Using this pose, the model points are back-projected into the image. If the original and back-projected images are sufficiently similar, the pose is accepted; else a new hypothesis is formed, and the process repeated. The best known example of this approach is RANSAC [17] for the case that no information is available to constrain the correspondences. When there are J image points and K model points1 , and three correspondences are used to determine pose, a high probability of success can be achieved by the RANSAC algorithm in O(KJ 4 ) operations [11]. The problem we address occurs when taking a model-based approach to objectrecognition. (The other main approach to object recognition is appearance-based [31], where multiple object views are compared to the image. However, since 3D models are not used, accurate object pose is not recovered.) Many investigators (e.g., [8,9,15, 26,28,33]) approximate the nonlinear perspective projection via linear affine approximations. This is accurate when the relative depth of object features is small compared to the distance of the object from the camera. Among the pioneer contributions were Baird’s tree-pruning method [1], with exponential time complexity for unequal point sets, and Ullman’s alignment method [35] with time complexity O(J 4 K 3 log K). Geometric hashing is employed in [28] to determine an object’s identity and pose using a hashing metric computed from image features; because the metric must be invariant to camera viewpoint, the method can only be applied for affine camera models [5]. In [12] an approach using a binary search by bisection of pose boxes in two 4D spaces was used, extending the work of [1,6,7] on affine transforms, however the method was computationally intensive. The approach of [27] is similar: An initial volume of pose 1
Many authors use N and M instead of J and K to denote the numbers of image and model points.
space is guessed, and all correspondences compatible with this volume are accounted for. Then the pose volume is recursively reduced until it can be viewed as a single pose. Few researchers have addressed the full perspective problem. The object recognition approach of [2] extends the geometric hashing approach by using non-invariant image features: off-line training is performed to learn 2D feature groupings associated with a large number of views. An on-line recognition stage then uses new feature groupings to index into a database of learned model-to-image correspondence hypotheses, and these hypotheses are used for pose estimation and verification. In [36], the abstract problem is formalized in a way similar to the present approach, as the optimization of an objective function combining correspondence and pose. However, the correspondence constraints are not represented analytically. Instead, each model feature is explicitly matched to the closest line of sight of the image features. The closest 3D points on the lines of sight are found for each model feature, and then the pose that brings the model features closest to these 3D points is selected; this allows an easier 3D to 3D pose problem to be solved. The process is repeated until a minimum is reached. A randomized pose clustering algorithm is presented in [32] whose time complexity is O(KJ^3). In this approach, instead of testing each hypothesis as it is generated (as in the RANSAC approach), all hypotheses are clustered in a pose space before back-projection and testing. This step is performed only on high-probability poses that are determined from the larger clusters. A method related to ours is presented in [3] that uses random-start local search with a hybrid pose estimation algorithm employing both full and weak perspective models. A steepest descent search in the space of model-to-image line segment correspondences is performed. A weak-perspective algorithm is used in ranking neighboring points in this search space, and a full-perspective algorithm is used to update the model's pose for new correspondence sets. The empirical time complexity is O(K^2 J^2). Our approach, termed the SoftPOSIT algorithm, integrates the iterative pose technique called POSIT (Pose from Orthography and Scaling with ITerations) due to DeMenthon and Davis [13], and the iterative 2D to 2D or 3D to 3D correspondence assignment technique called SoftAssign due to Gold and Rangarajan [19,20]. A global objective function is defined that captures the nature of the problem in terms of both pose and correspondence, which are determined simultaneously by applying a deterministic annealing schedule and by minimizing this global objective function at each step. In the following sections, we describe each step of the method, and provide pseudocode for the algorithm. We then evaluate the algorithm using Monte Carlo simulations with various levels of clutter, occlusion and image noise, and finally apply the algorithm to some imagery of a city scene.
2 POSIT Algorithm
We summarize the original POSIT algorithm [13] and then present a variant that performs closed-form minimization of an objective function. This function is modified below to include the simultaneous pose and correspondence problem in one objective. Consider a pinhole camera of focal length f and an image feature point p with Euclidean and homogeneous coordinates x, y and (wx, wy, w), respectively; p is the perspective projection of 3D point P with homogeneous coordinates (X, Y, Z, 1) in the
Fig. 1. Geometric interpretation of the POSIT computation. Image point p′, the scaled orthographic projection of world point P, is computed by one side of the POSIT equations. Image point p″, the scaled orthographic projection of point PL on the line of sight of p, is computed by the other side of the equation. The equations are satisfied when the two points are superposed, which requires that the world point P be on the line of sight of image point p. The plane of the figure is chosen to contain the optical axis and the line of sight L. The points P0, P, p′, and p″ are generally out of this plane.
frame of an object with origin P0. There is an unknown transformation between the object and the camera coordinates, represented by a rotation matrix R = [R_1 R_2 R_3]^T and a translation vector T = (T_x, T_y, T_z). R_1^T, R_2^T, R_3^T are the row vectors of R. They are the unit vectors of the camera coordinate system expressed in the model coordinate system. T is the vector from the center of projection O of the camera to the origin P0, expressed in the camera coordinate system. The coordinates of the projection p are related to the world point P by
\[
\begin{pmatrix} wx \\ wy \\ w \end{pmatrix} =
\begin{pmatrix} f R_1^T & f T_x \\ f R_2^T & f T_y \\ R_3^T & T_z \end{pmatrix}
\begin{pmatrix} P_0P \\ 1 \end{pmatrix},
\]
where P_0P = (X, Y, Z)^T is the vector from P_0 to P. The homogeneous image coordinates are defined up to a multiplicative constant; therefore the validity of the equality
is not affected if we multiply all elements of the projection matrix by 1/T_z. Introducing the scaling factor s = f/T_z, we obtain
\[
\begin{pmatrix} wx \\ wy \end{pmatrix} =
\begin{pmatrix} s R_1^T & s T_x \\ s R_2^T & s T_y \end{pmatrix}
\begin{pmatrix} P_0P \\ 1 \end{pmatrix},
\qquad w = R_3 \cdot P_0P / T_z + 1.
\tag{1}
\]
In Eq. (1), R_3 · P_0P represents the projection of P_0P onto the optical axis. When the depth range of the model along the optical axis is small compared to the model distance, R_3 · P_0P is small compared to T_z, and w ≈ 1. Then perspective projection is well approximated by scaled orthographic projection (SOP) with scaling factor s:
\[
\begin{pmatrix} x \\ y \end{pmatrix} =
\begin{pmatrix} s R_1^T & s T_x \\ s R_2^T & s T_y \end{pmatrix}
\begin{pmatrix} P_0P \\ 1 \end{pmatrix}.
\tag{2}
\]
The general perspective equation (1) can be rewritten as
\[
\begin{pmatrix} X & Y & Z & 1 \end{pmatrix}
\begin{pmatrix} s R_1 & s R_2 \\ s T_x & s T_y \end{pmatrix}
=
\begin{pmatrix} wx & wy \end{pmatrix}.
\tag{3}
\]
Assume the homogeneous coordinate w for each image point p has been computed at a previous step. We can then calculate wx and wy, and Eq. (3) relates the unknown pose components sR_1, sR_2, sT_x, sT_y to the known image components wx, wy and the known world coordinates X, Y, Z. If we know K world points P_k, k = 1, ..., K, their image points p_k, and their homogeneous components w_k, we can write two linear systems of size K that can be solved for the unknown components of sR_1, sR_2, sT_x, and sT_y, provided at least four of the model points with given image points are noncoplanar. After sR_1 and sR_2 are obtained, we recover s, R_1, and R_2 by imposing that R_1 and R_2 be unit vectors, and take R_3 to be the cross-product of R_1 and R_2:
\[
s = (|sR_1|\,|sR_2|)^{1/2}, \quad R_1 = (sR_1)/s, \quad R_2 = (sR_2)/s, \quad R_3 = R_1 \times R_2,
\]
\[
T_x = (sT_x)/s, \quad T_y = (sT_y)/s, \quad T_z = f/s.
\]
To compute the w_k required in the right-hand side of Eq. (3), we initially set w_k = 1 for every point p_k (corresponding to a SOP model). Once we obtain the pose for this first step, we compute better estimates of the w_k using the expression for w in Eq. (1). We can then solve Eq. (3) again to obtain a progressively refined pose. This iteration is stopped when the process becomes stationary.
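The iteration just described is compact enough to sketch in a few lines. The following NumPy fragment is an illustrative implementation under the stated assumptions (known correspondences, at least four noncoplanar model points); the function and variable names are ours, not the authors'.

    # Sketch of the POSIT iteration described above (assumed interface).
    # P: Kx3 model points P0Pk, p: Kx2 image points (x_k, y_k), f: focal length.
    import numpy as np

    def posit(P, p, f, n_iter=50, tol=1e-6):
        K = P.shape[0]
        A = np.hstack([P, np.ones((K, 1))])        # rows are (X, Y, Z, 1)
        w = np.ones(K)                             # SOP initialization: w_k = 1
        prev = None
        for _ in range(n_iter):
            # Solve the two linear systems of Eq. (3) for sR1,sTx and sR2,sTy.
            u, _, _, _ = np.linalg.lstsq(A, w * p[:, 0], rcond=None)
            v, _, _, _ = np.linalg.lstsq(A, w * p[:, 1], rcond=None)
            sR1, sTx = u[:3], u[3]
            sR2, sTy = v[:3], v[3]
            s = np.sqrt(np.linalg.norm(sR1) * np.linalg.norm(sR2))
            R1, R2 = sR1 / s, sR2 / s
            R3 = np.cross(R1, R2)
            Tx, Ty, Tz = sTx / s, sTy / s, f / s
            # Refine the homogeneous components w_k using Eq. (1).
            w = P @ R3 / Tz + 1.0
            if prev is not None and np.max(np.abs(w - prev)) < tol:
                break
            prev = w.copy()
        return np.vstack([R1, R2, R3]), np.array([Tx, Ty, Tz])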
3 Geometry and Objective Function
We consider a geometric interpretation of POSIT to represent it as the minimization of an objective function. Consider, as in Fig. 1, a pinhole camera with center of projection at O, optical axis along Oz, an image plane Π at distance f from O, and an image center at c. Consider an object, the origin of its coordinate system at P0, an object point P, the corresponding image point p, and the line of sight L through p. The image point p′ is the SOP of the object point P. The image point p″ is the SOP of the point PL obtained by shifting P to the line of sight of p in a direction parallel to the image plane.
One can show [14] that the image plane vector from c to p′ is cp′ = s(R_1 · P_0P + T_x, R_2 · P_0P + T_y). In other words, the left-hand side of Eq. (3) represents the vector cp′ in the image plane. One can also show [14] that the image plane vector from c to p″ is cp″ = (wx, wy) = w cp. In other words, the right-hand side of Eq. (3) represents the vector cp″ in the image plane. The image point p″ can be interpreted as a correction of the image point p from a perspective projection to a SOP of a point PL located on the line of sight at the same distance as P. P is on the line of sight L of p if, and only if, the image points p′ and p″ are superposed; then cp′ = cp″, i.e. Eq. (3) is satisfied. When we try to match the points P_k of an object to the lines of sight L_k of image points p_k, it is unlikely that all or even any of the points will fall on their corresponding lines of sight, or equivalently that cp′_k = cp″_k or p′_k p″_k = 0. We can minimize a global objective function E equal to the sum of the squared distances d_k^2 = |p′_k p″_k|^2 between the image points p′_k and p″_k:
\[
E = \sum_k |cp'_k - cp''_k|^2 = \sum_k d_k^2
  = \sum_k \big( (M \cdot S_k - w_k x_k)^2 + (N \cdot S_k - w_k y_k)^2 \big)
\tag{4}
\]
where, to simplify the subsequent notation, we introduce the vectors (with four homogeneous coordinates) Sk = (P0 Pk , 1), and M = (M1 , M2 , M3 , M4 ) = s(R1 , Tx ),
N = (N1 , N2 , N3 , N4 ) = s(R2 , Ty ).
We call M and N the pose vectors. In Fig. 1, notice that p′p″ = s PPL. Thus, minimizing E corresponds to minimizing the scaled sum of squared distances of the model points from the lines of sight, where distances are taken parallel to the image plane. This objective function is minimized iteratively. Initially, the w_k are all set to one. Then the following two operations take place at each step:
1. Compute the pose vectors M and N assuming the w_k are known (Eq. (4)).
2. Compute the correction terms w_k using the M and N just computed (Eq. (1) for w).
We now focus on the pose vectors M and N. The objective function is minimized when the partial derivatives of E with respect to the coordinates of the pose vectors vanish. This condition provides 4 × 4 linear systems for M and N with solutions
\[
M = \Big(\sum_k S_k S_k^T\Big)^{-1} \Big(\sum_k w_k x_k S_k\Big), \qquad
N = \Big(\sum_k S_k S_k^T\Big)^{-1} \Big(\sum_k w_k y_k S_k\Big).
\tag{5}
\]
The matrix L = \sum_k S_k S_k^T is a 4 × 4 matrix that can be precomputed.
4 Pose Calculation with Unknown Correspondences
The point p″ can be viewed as the image point p "corrected" for SOP using the w computed at the previous step. The next step finds the pose such that the SOP of each point P is as close as possible to its corrected image point. Now, when correspondences are unknown,
each image point p_j can match any of the model points P_k, and must be corrected using the w corresponding to P_k:
\[
w_k = R_3 \cdot P_0P_k / T_z + 1.
\tag{6}
\]
Therefore, for each image point p_j and model point P_k, we generate a corrected image point p″_jk, aligned with the image center c and with p_j, and defined by
\[
cp''_{jk} = w_k\, cp_j.
\tag{7}
\]
The scaled orthographic projections p′_k of the points P_k are
\[
cp'_k = \begin{pmatrix} M \cdot S_k \\ N \cdot S_k \end{pmatrix}.
\tag{8}
\]
The squared distances between the corrected points p″_jk and the SOPs p′_k are
\[
d_{jk}^2 = |p'_k p''_{jk}|^2 = (M \cdot S_k - w_k x_j)^2 + (N \cdot S_k - w_k y_j)^2.
\tag{9}
\]
The simultaneous pose and correspondence problem can then be formulated as the minimization of the global objective function
\[
E = \sum_{j=1}^{J} \sum_{k=1}^{K} m_{jk}\, d_{jk}^2
  = \sum_{j=1}^{J} \sum_{k=1}^{K} m_{jk} \big( (M \cdot S_k - w_k x_j)^2 + (N \cdot S_k - w_k y_j)^2 \big)
\tag{10}
\]
where the m_jk are weights, equal to zero or one, for each of the d_{jk}^2, and J and K are the numbers of image and model points, respectively. The m_jk are correspondence variables that define the assignments between image and model feature points. Note that when all the assignments are well defined, this objective function becomes equivalent to the objective function defined in Eq. (4). The function E is minimized iteratively, as follows:
1. Compute the correspondence variables assuming everything else is fixed (see below).
2. Compute the pose vectors M and N assuming everything else is fixed (see below).
3. Compute the corrections w_k using the computed M and N (described above).

4.1 Pose Problem
We now focus on finding the optimal pose vectors M and N, assuming the correspondence variables m_jk are known and fixed. As before, the minimizing pose vectors of E at a given step are those for which the partial derivatives of E with respect to these vectors vanish. This condition provides 4 × 4 linear systems for the coordinates of M and N whose solutions are
\[
[M, N] = \Big(\sum_{k=1}^{K} m_k S_k S_k^T\Big)^{-1}
\Big[ \sum_{j=1}^{J}\sum_{k=1}^{K} m_{jk} w_k x_j S_k,\;
      \sum_{j=1}^{J}\sum_{k=1}^{K} m_{jk} w_k y_j S_k \Big],
\tag{11}
\]
with m_k = \sum_{j=1}^{J} m_{jk}. The terms S_k S_k^T are 4 × 4 matrices. Computing M and N requires the inversion of a single 4 × 4 matrix, L = \sum_{k=1}^{K} m_k S_k S_k^T, which is inexpensive.
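As an illustration, the weighted pose update of Eq. (11) can be written in a few lines of NumPy; the function and argument names below are ours and assume the homogeneous model points are stacked as rows of a matrix.

    # Sketch of the pose update of Eq. (11): S is Kx4 (homogeneous model points),
    # m is the JxK block of the assignment matrix, w is the K-vector of
    # corrections, x and y are the J-vectors of image coordinates.
    import numpy as np

    def update_pose_vectors(S, m, w, x, y):
        mk = m.sum(axis=0)                      # m_k = sum_j m_jk
        L = (S * mk[:, None]).T @ S             # L = sum_k m_k S_k S_k^T  (4x4)
        Linv = np.linalg.inv(L)
        bx = S.T @ (w * (m.T @ x))              # sum_{j,k} m_jk w_k x_j S_k
        by = S.T @ (w * (m.T @ y))              # sum_{j,k} m_jk w_k y_j S_k
        return Linv @ bx, Linv @ by             # pose vectors M and N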
4.2 Correspondence Problem
We obtain the correspondence variables m_jk by assuming that the d_{jk}^2 are known and fixed, and minimizing E. Our aim is to find a zero-one assignment matrix, M = {m_jk}, that specifies the matchings between a set of J image points and a set of K model points, and that minimizes E. This assignment matrix M has one row for each of the J image points p_j and one column for each of the K model points P_k. M must satisfy the constraint that each image point match at most one model point, and vice versa. A slack row J + 1 and a slack column K + 1 are added for points with no correspondences. A one in the slack column K + 1 at row j indicates that the image point p_j has no match among the model points. A one in the slack row J + 1 at column k indicates that the feature point P_k is not seen in the image. E will be a minimum if the assignment matrix M matches image and model points with the smallest distances d_{jk}^2. This problem is solved by the iterative SoftAssign technique [19,20]. We begin with a matrix M^0 in which element m^0_jk is initialized to exp(−β(d_{jk}^2 − α)), with β very small, and with all slack elements set to a small constant. The parameter α determines how far apart two points must be before being considered unmatchable (see [20]). The continuous match matrix M^0 converges to a discrete matrix M using two concurrent procedures:
1. Each row and column of the correspondence matrix is normalized, alternately, by the sum of its elements. The resulting matrix then has positive elements with all rows and columns summing to one (see Sinkhorn [34]).
2. The term β is increased as the iteration proceeds. As β increases and each row or column of M^0 is renormalized, the terms m^0_jk corresponding to the smallest d_{jk}^2 tend to converge to one while the other elements tend to converge to zero. This is a deterministic annealing process [18] known as Softmax [4].
This is a desirable behavior, since it leads to an assignment of correspondences that satisfies the matching constraints and whose sum of distances is minimized. This procedure was called SoftAssign in [19,20]. The resulting M is the assignment that minimizes E. This procedure, along with the substeps that optimize the pose and correct the image points by SOP, is combined into the iteration loop of SoftPOSIT.
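An illustrative rendering of this SoftAssign step is sketched below; the names and iteration limits are ours, and the slack handling follows the description above.

    # Sketch of the SoftAssign step: d2 is the JxK matrix of squared distances
    # d_jk^2, alpha the outlier threshold, beta the current annealing parameter,
    # gamma the value used for the slack row/column.
    import numpy as np

    def softassign(d2, alpha, beta, gamma, n_sinkhorn=30, tol=1e-3):
        J, K = d2.shape
        M = np.full((J + 1, K + 1), gamma)      # slack row J+1 and column K+1
        M[:J, :K] = gamma * np.exp(-beta * (d2 - alpha))
        for _ in range(n_sinkhorn):             # Sinkhorn normalization
            prev = M.copy()
            M[:J, :] /= M[:J, :].sum(axis=1, keepdims=True)   # normalize rows
            M[:, :K] /= M[:, :K].sum(axis=0, keepdims=True)   # normalize columns
            if np.abs(M - prev).max() < tol:
                break
        return M[:J, :K]                        # correspondence weights m_jk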
4.3 Pseudocode for SoftPOSIT
The SoftPOSIT algorithm can be summarized as follows:
Inputs:
1. A list of J image feature points p_j = (x_j, y_j).
2. A list of K world points S_k = (X_k, Y_k, Z_k, 1) = (P_0P_k, 1) in the object.
Initialize the slack elements of the assignment matrix M to γ = 1/(max{J, K} + 1), and β to β_0 (β_0 ≈ 4 × 10^-4 if the pose is unconstrained, larger if a good initial guess is available). Initialize the pose vectors M and N with the expected pose or a random pose. Initialize w_k = 1.
Do A until β > β_final (β_final around 0.5) (deterministic annealing loop)
– Compute squared distances d_{jk}^2 = (M · S_k − w_k x_j)^2 + (N · S_k − w_k y_j)^2
– Compute m^0_jk = γ exp(−β(d_{jk}^2 − α))
– Do B until ΔM small (Sinkhorn's method)
  • Update matrix M by normalizing across all rows: m^1_jk = m^0_jk / Σ_{k=1}^{K+1} m^0_jk
  • Update matrix M by normalizing across all columns: m^0_jk = m^1_jk / Σ_{j=1}^{J+1} m^1_jk
– End Do B
– Compute the 4 × 4 matrix L = Σ_{k=1}^{K} m_k S_k S_k^T with m_k = Σ_{j=1}^{J} m_jk
– Compute L^{-1}
– Compute M = L^{-1} (Σ_{j=1}^{J} Σ_{k=1}^{K} m_jk w_k x_j S_k)
– Compute N = L^{-1} (Σ_{j=1}^{J} Σ_{k=1}^{K} m_jk w_k y_j S_k)
– Compute s = |(M_1, M_2, M_3)|, R_1 = (M_1, M_2, M_3)/s, R_2 = (N_1, N_2, N_3)/s, R_3 = R_1 × R_2
– Compute w_k = R_3 · P_0P_k / T_z + 1
– β = β_update β (β_update around 1.05)
End Do A
Outputs: Rotation matrix R = [R_1 R_2 R_3]^T, translation vector T = (T_x, T_y, T_z), and assignment matrix M = {m_jk} between image and world points.
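For concreteness, the following sketch strings together the pose-update and SoftAssign fragments given earlier into the annealing loop of the pseudocode; all constants and names are illustrative, and an initial pose (R0, T0) is assumed to be supplied.

    # Sketch of the SoftPOSIT annealing loop, reusing the helpers sketched above.
    # p: Jx2 image points, P: Kx3 model points, f: focal length, R0/T0: initial pose.
    import numpy as np

    def softposit(p, P, f, R0, T0, alpha=1.0, beta0=4e-4,
                  beta_final=0.5, beta_update=1.05):
        J, K = p.shape[0], P.shape[0]
        S = np.hstack([P, np.ones((K, 1))])          # S_k = (P0Pk, 1)
        gamma = 1.0 / (max(J, K) + 1)
        w = np.ones(K)
        s = f / T0[2]
        Mv = s * np.append(R0[0], T0[0])             # M = s(R1, Tx) from the guess
        Nv = s * np.append(R0[1], T0[1])             # N = s(R2, Ty)
        beta = beta0
        while beta < beta_final:
            proj_x, proj_y = S @ Mv, S @ Nv          # M.Sk and N.Sk
            d2 = (proj_x[None, :] - np.outer(p[:, 0], w)) ** 2 \
               + (proj_y[None, :] - np.outer(p[:, 1], w)) ** 2
            m = softassign(d2, alpha, beta, gamma)   # correspondence weights
            Mv, Nv = update_pose_vectors(S, m, w, p[:, 0], p[:, 1])
            s = np.sqrt(np.linalg.norm(Mv[:3]) * np.linalg.norm(Nv[:3]))
            R1, R2 = Mv[:3] / s, Nv[:3] / s
            R3 = np.cross(R1, R2)
            Tz = f / s
            w = P @ R3 / Tz + 1.0                    # correction terms, Eq. (6)
            beta *= beta_update
        return np.vstack([R1, R2, R3]), np.array([Mv[3] / s, Nv[3] / s, Tz]), m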
5 Random Start SoftPOSIT
The SoftPOSIT algorithm described above performs a deterministic annealing search starting from an initial guess for the object's pose. There is no guarantee of finding the global optimum. The probability of finding the globally optimal object pose and correspondences starting from an initial guess depends on a number of factors including the number of model points, the number of image points, the number of occluded model points, the amount of clutter in the image, and the image measurement noise. A common way of searching for a global optimum, and the one taken here, is to run the algorithm starting from a number of different initial guesses, and keep the first solution that meets a specified termination criterion. Our initial guesses range over [−π, π] for the three Euler angles, and over a 3D space of translations containing the true translation. Since each search for correspondence and pose is relatively expensive, we would like to have a mathematical statement that allows us to make the claim that, for a given number of starting points, our starting guesses sample the parameter space in some optimal manner. Fortunately, there is a set of deterministic points that has such properties. These are the quasi-random, or low-discrepancy, sequences [30]. Unlike points obtained from a standard pseudo-random generator, quasi-random points are optimally self-avoiding and uniformly space filling. We use a standard quasi-random generator [10] to generate quasi-random 6-vectors in a unit 6D hypercube. These points are scaled to reflect the expected ranges of translation and rotation.
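As an illustration of this sampling strategy, the fragment below generates quasi-random 6D starting guesses. The paper uses the generator of [10]; SciPy's Sobol sequence is used here only as a stand-in, and the translation bounds are assumed to be supplied by the application.

    # Illustrative generation of quasi-random starting guesses
    # (three Euler angles + a 3D translation box).
    import numpy as np
    from scipy.stats import qmc

    def initial_guesses(n, t_lo, t_hi):
        sampler = qmc.Sobol(d=6, scramble=False)
        u = sampler.random(n)                                  # points in [0,1]^6
        lower = np.concatenate([[-np.pi, -np.pi, -np.pi], t_lo])
        upper = np.concatenate([[ np.pi,  np.pi,  np.pi], t_hi])
        return qmc.scale(u, lower, upper)          # Euler angles and translations

    # Example: 64 starting guesses with translations in a 10x10x10 box around z = 30.
    starts = initial_guesses(64, t_lo=[-5, -5, 25], t_hi=[5, 5, 35])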
5.1 Search Termination
Ideally, one would like to repeat the search from a new starting point whenever the number of correspondences determined is not maximal. However, one usually does not
know what this maximal number is. Instead, we repeat the search when the number of model points matching to image points is less than some threshold t_m. Due to occlusion and imperfect image feature extraction, not all model points will be detected as features in an image. Let p_d be the ratio of the number of model points detected as image features to the total number of model points (p_d ∈ [0, 1]). In the Monte Carlo simulations described below, p_d is known. With real imagery, however, p_d must be estimated based on the scene complexity and the reliability of the feature detection algorithm used. We terminate the search for better solutions when the current solution is such that the number of model points matching to image points is greater than or equal to the threshold t_m = ρ p_d K, where ρ ∈ [0, 1] determines the fraction of model points to be matched and K is the total number of model points. ρ accounts for measurement noise. In the experiments discussed below, we take ρ = 0.8. This test is not perfect, as it is possible for a pose to be very accurate even when the number of matched points is less than this threshold; this occurs mainly in cases of high noise. Conversely, a wrong pose may be accepted when the ratio of clutter features to detected model points is high. However, these situations are uncommon.
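The termination test itself is a one-line comparison; a minimal sketch with assumed argument names is:

    # Sketch of the search-termination test: accept the current solution when
    # the number of matched model points reaches t_m = rho * p_d * K.
    def accept_pose(num_matched, K, p_d, rho=0.8):
        return num_matched >= rho * p_d * K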
5.2 Early Search Termination
The deterministic annealing loop of the SoftPOSIT algorithm iterates over a range of values for the annealing parameter β_k. In the experiments reported here, β_0 is initialized to 0.0004 and is updated according to β_{k+1} = 1.05 × β_k, and the iteration ends when β_k > 0.5, or earlier if convergence is detected. This means that the annealing loop can run for up to 147 iterations. It is usually the case that, by viewing the original image and, overlaid on top of this, the projected model points produced by SoftPOSIT, a person can determine early on in the iteration (e.g., around iteration 30) whether or not the algorithm is going to converge to the correct pose. It is desired that the algorithm make this determination itself, so that it can end the current unfruitful search and restart. A simple test is performed on each iteration to determine if it should continue or restart. At iteration k of SoftPOSIT, the match matrix M^k = {m^k_{i,j}} is used to predict the final correspondences of model to image points: upon convergence we expect image point i to correspond to model point j if m^k_{i,j} > m^k_{u,v} for all u ≠ i and all v ≠ j (however, this is not guaranteed). The number of predicted correspondences at iteration k, n_k, is the number of pairs (i, j) that satisfy this relation. We define the match ratio on iteration k as r_k = n_k/(p_d K), where p_d is defined above. This metric is commonly used at the end of a local search to determine if the current solution for correspondence and pose is good enough to end the search for the global optimum. We, however, use this metric within the local search itself. Let CP denote the event that the SoftPOSIT algorithm eventually converges to the correct pose. Then, the algorithm restarts after the k-th iteration if P(CP | r_k) < α P(CP), where 0 < α ≤ 1. That is, the search is restarted from a new random starting condition whenever the posterior probability of eventually finding a correct pose given r_k drops to less than some fraction of the prior probability of finding the correct pose. A separate posterior probability is required for each k because the ability to predict the outcome using r_k improves as the iteration progresses. Although this test may result in termination of some searches which would eventually produce a good pose, it is expected on average that the total time required to
find a good pose will be less. Our experiments show that this is true; we obtain a speedup by a factor of at least two. The posterior probability function for the k-th iteration can be computed from P(CP), the prior probability of finding a correct pose in one random local search, and from P(r_k | CP) and P(r_k | ¬CP), the probabilities of observing a particular match ratio on the k-th iteration given that the eventual pose is either correct or incorrect, respectively:
\[
P(CP \mid r_k) = \frac{P(CP)\,P(r_k \mid CP)}{P(CP)\,P(r_k \mid CP) + P(\neg CP)\,P(r_k \mid \neg CP)}.
\]
P(CP), P(¬CP), P(r_k | CP), and P(r_k | ¬CP) are estimated in Monte Carlo simulations of the algorithm in which the number of model vertices and the levels of image clutter, occlusion, and noise are all varied. To estimate P(r_k | CP) and P(r_k | ¬CP), the algorithm is repeatedly run on random test data. For each test, the values of the match ratio r_k computed at each iteration are recorded. Once an iteration completes, ground truth information is used to determine whether or not the correct pose was found. If the pose is correct, then the recorded values of r_k are used to update histograms representing P(r_k | CP); otherwise, histograms for P(r_k | ¬CP) are updated. Upon completing the training, the histograms are normalized. See Fig. 2.
Fig. 2. Probability functions estimated for (a) the first iteration, and (b) the 31st iteration, of the SoftPOSIT algorithm.
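A sketch of this restart test, assuming the per-iteration histograms for P(r_k | CP) and P(r_k | ¬CP) have already been estimated and normalized as described above, is:

    # Sketch of the early-termination test: the trained histograms for iteration k
    # give the posterior P(CP | r_k); restart when it drops below alpha * P(CP).
    import numpy as np

    def restart_test(r_k, hist_cp, hist_not_cp, bin_edges, p_cp, alpha=0.5):
        b = np.clip(np.digitize(r_k, bin_edges) - 1, 0, len(hist_cp) - 1)
        lik_cp, lik_not = hist_cp[b], hist_not_cp[b]    # P(r_k|CP), P(r_k|not CP)
        denom = p_cp * lik_cp + (1.0 - p_cp) * lik_not
        if denom == 0:
            return False                                # no evidence either way
        posterior = p_cp * lik_cp / denom
        return posterior < alpha * p_cp                 # True => restart search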
6 Experiments
We investigate two important questions related to the performance of the SoftPOSIT algorithm: (a) How often does it find a “good” pose? (b) How long does it take?
6.1 Synthetic Data
The algorithm has been extensively evaluated in Monte Carlo simulations. The simulations are characterized by 5 parameters: n_t, K, p_d, p_c, and σ. The parameter n_t is the number of trials performed for each combination of values for the remaining 4 parameters. K is the number of points in a 3D model. p_d is the probability that the image of any particular model point will be detected. p_c is the probability that any particular image point is clutter, i.e., is not the image of some 3D model point. Finally, σ is the standard deviation of the normally distributed noise in the image coordinates of the non-clutter feature points, measured in pixels for a 1000 × 1000 image, generated by a simulated camera having a 37 degree field of view (focal length of 1500 pixels). The current tests were performed with n_t = 100, K ∈ {20, 30, 40, 50}, p_d ∈ {0.4, 0.6, 0.8}, p_c ∈ {0.2, 0.4, 0.6}, and σ ∈ {0.5, 1.0, 2.5}. There were 10800 independent trials. For each trial, a 3D model is created in which the K model vertices are randomly located, with uniform probability, in a sphere centered at the origin. Because the algorithm works with points, only the model vertices are important. However, to make human interpretation of results easier, each model vertex is connected to the two closest remaining model vertices. The model is placed with random rotation and translation in the camera view. Each projected model point is detected with probability p_d. We add Gaussian noise N(0, σ) to both x and y image coordinates. Finally, K p_d p_c/(1 − p_c) randomly located clutter feature points are added to the true feature points, so that 100 × p_c % of the feature points are clutter. Fig. 3 shows cluttered images of random models. We consider a pose to be good when it allows 80% (ρ = 0.8 in section 5.1) or more of the detected model points to be matched to image points. The number of random starts for each trial was limited to 10^4. If a good pose is not found after 10^4 starts, the algorithm declares failure. Fig. 4 shows two examples of the pose found. Fig. 5 shows the success rate as a function of the number of model points for the case of σ = 2.5 and for all combinations of the parameters p_d and p_c. (Due to space limitations, we only describe results for the case of σ = 2.5; the algorithm performs better for smaller σ's.) For more than 86% of the different combinations of simulation parameters, a good pose is found in 95% or more of the associated trials. For the remaining 14% of the tests, a good pose is found in 85% or more of the trials. (The overall success rate is 94%.) As expected, the higher the occlusion rate (lower p_d) and the clutter rate (higher p_c), the lower the success rate. For the high-clutter and high-occlusion tests, the success rate increases as the number of model points decreases. This is because a smaller number of model points are more easily matched to clutter than a larger number of model points. Fig. 6 shows the average number of random starts required to find a good pose. These numbers generally increase with increasing image clutter and occlusion. The run-time complexity of SoftPOSIT (for a single start) is easily seen to be O(JK), where J is the number of image points and K is the number of model points. Our results show that the mean number of random starts required to find a good pose, to ensure a probability of success of at least 0.95 in all but the highest clutter and occlusion levels, is bounded by a function that is linear in the size of the input.
That is, the mean number of random starts is O(J), assuming that K < J, as is normally the case. Then, the run-time complexity of SoftPOSIT with random starts is O(KJ^2). This is better than any known algorithm that solves the simultaneous pose and correspondence problem under full perspective.
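One trial of this simulation can be sketched as follows; the constants and helper names are illustrative (the rotation is drawn with SciPy), and the clutter count follows the formula above.

    # Sketch of one synthetic trial: K model points in a unit sphere, a random
    # pose, detection with probability p_d, clutter fraction p_c, and Gaussian
    # image noise of standard deviation sigma (pixels).
    import numpy as np
    from scipy.spatial.transform import Rotation

    def make_trial(K, p_d, p_c, sigma, f=1500.0, img=1000.0, rng=None):
        if rng is None:
            rng = np.random.default_rng()
        # Model points uniformly inside a unit sphere.
        P = rng.normal(size=(K, 3))
        P *= (rng.uniform(size=(K, 1)) ** (1 / 3)) / np.linalg.norm(P, axis=1, keepdims=True)
        R = Rotation.random().as_matrix()
        T = np.array([0.0, 0.0, 10.0])                   # object in front of camera
        cam = P @ R.T + T
        proj = f * cam[:, :2] / cam[:, 2:3]              # perspective projection
        detected = rng.uniform(size=K) < p_d
        feats = proj[detected] + rng.normal(scale=sigma, size=(detected.sum(), 2))
        n_clutter = int(round(detected.sum() * p_c / (1.0 - p_c)))
        clutter = rng.uniform(-img / 2, img / 2, size=(n_clutter, 2))
        return P, R, T, np.vstack([feats, clutter])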
Fig. 3. Typical images of randomly generated models. The number of model points in the models are 20 – (a), 30 – (b), 40 – (c) and (d), and 50 – (e) and (f). In all cases pd = 1.0 and pc = 0.6.
Fig. 4. Two cluttered images with projected models overlaid for which we found a good pose. The black dots are the original image points; white dots are projections of model points for the computed pose, and the gray lines are the initial guess that led to the true pose being found.
Fig. 5. Success rate as a function of the number of model points for fixed values of pd and pc . (Note that pd and pc are denoted D and C, respectively, in the legend of this figure and in the next few figures.)
Fig. 6. Number of random starts required to find a good pose as a function of the number of model points for fixed values of pd and pc. (a) Mean. (b) Standard deviation.
6.2 Experiments with Images
We applied the SoftPOSIT algorithm to imagery generated from a model of a district of Los Angeles by a commercial virtual reality (VR) system. Fig. 7 shows an image
generated and a world model projected into that image using the pose computed by SoftPOSIT. Image feature points are automatically located in the image by detecting corners along the boundary of bright sky regions. Because the 3D world model has over 100,000 data points, we use a rough pose estimate (as may be generated by a GPS system) to cull the majority of model points that are far from the estimated field of view. The world points that do fall into this estimated view are further culled by keeping only those that project near the detected skyline. So far, the results have been very good, and could be used for autonomous navigation. Although this is not real imagery, the VR system used is advanced, and should give a good indication of how the system will perform on real imagery.
Fig. 7. (a) Original image from a virtual reality system. (b) World model (white lines) projected into this image using the pose computed by SoftPOSIT.
7 Conclusions
We have developed and tested the SoftPOSIT algorithm for determining the correspondence and pose of objects in an image. This algorithm will be used as a component in an object recognition system. Our evaluation indicates that the algorithm performs well under a variety of levels of occlusion, clutter, and noise. We are currently collecting imagery and 3D models of city environments for the purpose of evaluating the algorithm on real data. The complexity of SoftPOSIT has been empirically determined to be O(KJ^2). This is better than any known algorithm that solves the simultaneous pose and correspondence problem for a full perspective camera model. Rigorous validation of this claim is an item of future work.
References
1. H.S. Baird. Model-Based Image Matching Using Location, MIT Press, 1985.
2. J.S. Beis & D.G. Lowe, “Indexing Without Invariants in 3D Object Recognition,” IEEE Trans. PAMI, vol. 21, no. 10, pp. 1000–1015, 1999.
3. J.R. Beveridge & E.M. Riseman, “Optimal Geometric Model Matching Under Full 3D Perspective,” CVIU, vol. 61, pp. 351–364, May 1995.
4. J.S. Bridle. “Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters”, Advances in Neural Information Processing Systems, vol. 2, pp. 211–217, 1990.
5. J.B. Burns, R.S. Weiss, & E.M. Riseman, “View Variation of Point-Set and Line-Segment Features,” IEEE Trans. PAMI, vol. 15, pp. 51–68, 1993.
6. T.M. Breuel. “Fast Recognition using Adaptive Subdivisions of Transformation Space,” Proc. 1992 IEEE CVPR, pp. 445–451.
7. T.A. Cass. “Polynomial-Time Object Recognition in the Presence of Clutter, Occlusion, and Uncertainty,” Proc. ECCV’92, pp. 834–842, Springer-Verlag, 1992.
8. T.A. Cass. “Robust Geometric Matching for 3D Object Recognition,” Proc. 12th ICPR, vol. 1, pp. 477–482, 1994.
9. T.A. Cass. “Robust Affine Structure Matching for 3D Object Recognition,” IEEE Trans. PAMI, vol. 20, no. 11, pp. 1265–1274, 1998.
10. “C” code for quasi-random number generation. Available from http://www.taygeta.com.
11. P. David, D. DeMenthon, R. Duraiswami, & H. Samet, “Evaluation of the SoftPOSIT Model to Image Registration Algorithm,” University of Maryland Technical Report CAR-TR-974, February 2002.
12. D. DeMenthon & L.S. Davis. “Recognition and Tracking of 3D Objects by 1D Search,” Proc. DARPA Image Understanding Workshop, Washington, DC, April 1993.
13. D. DeMenthon & L.S. Davis, “Model-Based Object Pose in 25 Lines of Code”, International Journal of Computer Vision, vol. 15, pp. 123–141, 1995.
14. D. DeMenthon & P. David, “SoftPOSIT: An Algorithm for Registration of 3D Models to Noisy Perspective Images Combining SoftAssign and POSIT,” University of Maryland Technical Report CAR-TR-970, May 2001.
15. R.W. Ely, J.A. Digirolamo, & J.C. Lundgren, “Model Supported Positioning,” Integrating Photogrammetric Techniques with Scene Analysis and Machine Vision II, SPIE Aerospace Sensing and Dual Use Sensors and Controls, Orlando, April 1995.
16. P.D. Fiore. “Efficient Linear Solution of Exterior Orientation,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 140–148, February 2001.
17. M.A. Fischler & R.C. Bolles, “Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography,” Comm. Association for Computing Machinery, vol. 24, no. 6, pp. 381–395, June 1981.
18. D. Geiger & A.L. Yuille, “A Common Framework for Image Segmentation”, International Journal of Computer Vision, vol. 6, pp. 227–243, 1991.
19. S. Gold & A. Rangarajan, “A Graduated Assignment Algorithm for Graph Matching”, IEEE Trans. PAMI, vol. 18, pp. 377–388, 1996.
20. S. Gold, A. Rangarajan, C.P. Lu, S. Pappu, and E. Mjolsness, “New Algorithms for 2D and 3D Point Matching: Pose Estimation and Correspondence”, Pattern Recognition, vol. 31, pp. 1019–1031, 1998.
21. E. Grimson, Object Recognition by Computer: The Role of Geometric Constraints, MIT Press, 1990.
22. E. Grimson & D.P. Huttenlocher, “On the Verification of Hypothesized Matches in Model-Based Recognition”, IEEE Trans. PAMI, vol. 13, no. 12, pp. 1201–1213, 1991.
23. R. Hartley & A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2000.
24. R. Horaud, B. Conio, O. Leboulleux, & B. Lacolle, “An Analytic Solution for the Perspective 4-Point Problem,” Proc. 1989 IEEE CVPR, pp. 500–507.
25. B.K.P. Horn. Robot Vision. MIT Press, Cambridge, Massachusetts, 1986.
26. D.W. Jacobs. “Space Efficient 3D Model Indexing,” Proc. 1992 IEEE CVPR, pp. 439–444.
27. F. Jurie, “Solution of the Simultaneous Pose and Correspondence Problem Using Gaussian Error Model,” CVIU, vol. 73, pp. 357–373, 1999.
28. Y. Lamdan & H.J. Wolfson. “Geometric Hashing: A General and Efficient Model-Based Recognition Scheme,” Proc. 1988 IEEE ICCV, pp. 238–249.
29. C.-P. Lu, G.D. Hager, & E. Mjolsness, “Fast and Globally Convergent Pose Estimation from Video Images,” IEEE Trans. PAMI, vol. 22, pp. 610–622, 2000.
30. W.J. Morokoff & R.E. Caflisch, “Quasi-Random Sequences and their Discrepancies”, SIAM J. Sci. Comput., pp. 1251–1279, 1994.
31. H. Murase & S.K. Nayar, “Visual Learning and Recognition of 3-D Objects from Appearance,” IJCV, vol. 14, pp. 5–24, 1995.
32. C.F. Olson, “Efficient Pose Clustering Using a Randomized Algorithm,” IJCV, vol. 23, no. 2, pp. 131–147, 1997.
33. S. Procter & J. Illingworth. “ForeSight: Fast Object Recognition using Geometric Hashing with Edge-Triple Features,” Proc. 1997 ICIP, vol. 1, pp. 889–892.
34. R. Sinkhorn. “A Relationship between Arbitrary Positive Matrices and Doubly Stochastic Matrices”, Annals Math. Statist., vol. 35, pp. 876–879, 1964.
35. S. Ullman, “Aligning Pictorial Descriptions: An Approach to Object Recognition,” Cognition, vol. 32, no. 3, pp. 193–254, 1989.
36. P. Wunsch & G. Hirzinger, “Registration of CAD Models to Images by Iterative Inverse Perspective Matching”, Proc. 1996 ICPR, pp. 78–83.
37. J.-C. Yuan, “A General Photogrammetric Method for Determining Object Position and Orientation,” IEEE Trans. Robotics and Automation, vol. 5, no. 2, pp. 129–142, 1989.
Shock-Based Indexing into Large Shape Databases
Thomas B. Sebastian¹, Philip N. Klein², and Benjamin B. Kimia¹
¹ Division of Engineering, Brown University, Providence, RI, USA {tbs, kimia}@lems.brown.edu
² Department of Computer Science, Brown University, Providence, RI, USA [email protected]
Abstract. This paper examines issues arising in applying a previously developed edit-distance shock graph matching technique to indexing into large shape databases. This approach compares the shock graph topology and attributes to produce a similarity metric, and results in 100% recognition rate in querying a database of approximately 200 shapes. However, indexing into a significantly larger database is faced with both the lack of a suitable database, and more significantly with the expense related to computing the metric. We have thus (i) gathered shapes from a variety of sources to create a database of over 1000 shapes from forty categories as a stage towards developing an approach for indexing into a much larger database; (ii) developed a coarse-scale approximate similarity measure which relies on the shock graph topology and a very coarse sampling of link attributes. We show that this is a good first-order approximation of the similarity metric and is two orders of magnitude more efficient to compute. An interesting outcome of using this efficient but approximate similarity measure is that the approximation naturally demands a notion of categories to give high precision; (iii) developed an exemplar-based indexing scheme which discards a large number of non-matching shapes solely based on distance to exemplars, coarse-scale representatives of each category. The use of a coarse-scale matching measure in conjunction with a coarse-scale sampling of the database leads to a significant reduction in the computational effort without discarding correct matches, thus paving the way for indexing into databases of tens of thousands of shapes. Keywords: Similarity metric, object recognition, shape matching, shape retrieval, categorization, exemplars.
1 Introduction
The recognition of objects from their images is challenging and involves the integration of multiple cues like shape, color, and texture, which typically provide
complementary information. Shape is an important cue since humans can often recognize objects from line drawings as effectively as from images with color and texture information [8]. Indexing by shape content is therefore an important aspect of retrieving images from large databases. While retrieval based on shape can be described in terms of the underlying representation of shape, such as feature sets, point clouds, curves, medial axis, eigenshapes, among others, we present previous work in two classes based on the approach to organizing and querying the database, namely, whether the shape representation is eventually reduced to a finite set of attribute dimensions, or whether it is based on relative distances to other shapes. In the first class, the “Euclidean Space” approach, each shape is reduced to a finite set of “features” and the distance between a pair of shapes is defined as the Euclidean distance between the features, e.g., using moments, Fourier descriptors [17], invariants [22], etc. A good example of this class is the “active shape model” which represents shapes using a mean shape plus a set of linearly independent variation modes that are derived using a training set, e.g., as in face recognition [1]. That each shape is summarized as a point in a low-dimensional Euclidean space allows for an organization and efficient querying of the database based on feature dimensions. Data structures such as KD-tree [5], Quadtree [23], X-Tree [6], and others rely on this Euclidean structure to iteratively divide the database elements into small clusters, some of which are not searched when querying the database [11,14]. In the second class, the “Metric Space” approach, each shape is only relatively represented by distances to other shapes. In this approach, differences in the basic representation of shape, e.g., point clouds, curves, skeletons etc, are summarized in the quality of the metric, and database organization, querying, and matching of shapes are based solely on distances between shapes. This distinction between representations of shape at the two levels is significant. For example, the measurement of shape context based on a point cloud representation [4] gives an alignment and finally a distance. Since at the matching stage only distances are used, this method falls into the metric space class. Also, approaches where an outline curve is elastically deformed into correspondence with another [3,12,33], and those that rely on matching medial axis representations [34,18,26,28,21,30] fall into this class. Unfortunately, unlike the first class of representations where the structure of the underlying Euclidean space is used to organize and query the database, it is not clear how the metric space representations can be used to organize and query a database. The projection of the metric space to a low-dimensional Euclidean space can be problematic if shape variations occur along an infinite number of dimensions, each with a wide range of extents. These variations are not sparse and thus the mapping to a low-dimensional space will necessarily exclude certain dimensions, variations along which will not lead to robust results. Thus, the problem is one of organizing a metric space which can at best be approximated as a very high-dimensional Euclidean space.
Distance-based nearest neighbor search techniques [10,31,32,9] have been used for searching metric spaces. These methods typically use the triangle inequality to avoid computing the distances of the query to all elements in the database. The basic idea is to select certain elements as pivots, group the remaining elements into clusters based on their distances to the pivots, and use the triangle inequality to prune out certain clusters during the search for the nearest neighbor of a query. However, the performance of these methods deteriorates as the dimensionality of the space increases (curse of dimensionality), principally because pairwise distances between elements in a high-dimensional space tend to fall in a narrow range, and the triangle inequality can only eliminate a few elements [9]. In this paper, we examine approaches to make indexing into large databases by matching shock graphs practical. In this approach the edit distance between shock graphs [25,16,15] is used as a metric of similarity between shapes. It is obtained by exhaustively searching for the optimal deformation path between two shapes, and using the cost of this path as the distance between the two shapes. This approach is robust to various transformations like boundary perturbations, articulation, deformations of parts, occlusion, etc., and gives intuitive results for indexing into shape databases. Results on two databases of 99 and 216 shapes show 100% correct recognition rate for the top three rank-ordered matches. However, this robustness to a range of variations is at the cost of a computationally expensive metric, and hence it is not practical for the direct querying of large databases without employing additional steps. A first step to reducing the computational burden is to devise a computationally inexpensive but approximate measure for comparing two shapes [20]. This “coarse” measure can then be used to quickly rule out highly distant shapes, retaining a few candidates for the “fine” comparison stage. Clearly, a reduction in accuracy of the distance measure implies that if only a very small margin of error is allowed, much of the database has to be considered for the fine-scale comparison. This defeats the purpose of developing a coarse measure. However, observe that extremely high precision can be achieved if a reduction in accuracy in the matching is accompanied by a grouping of each shape with a set of neighboring shapes, leading to the concept of categories. A second step to making indexing practical involves the use of exemplars to represent a category. The key observation is that shapes in a category appear tightly clustered when viewed in the context of all shapes, and a category can thus be represented by a few exemplars. The membership of a query in a category is then determined by the set of distances to exemplars, but not all members of the category. The experiments in this paper indicate that the use of exemplars leads to significant savings in computational time without significantly affecting the recognition rate. Note that the use of exemplars is analogous to the use of pivots in distance-based indexing schemes. A third step is required to evaluate indexing into large databases, namely the formation of a large shape database. A key contribution of this paper is the development of such a database with 1032 shapes organized into forty categories,
Fig. 1. The database of shapes used in the experiments. There are a total of 1032 shapes from 40 categories. The shapes were collected from a variety of sources. We have included shapes from the MPEG-7 test database, Farzin Mokhtarian (fish, sea animals), Mike Tarr (greebles), and Stan Sclaroff (tools).
Figure 1. The rest of the paper is organized as follows. Section 2 reviews the edit-distance shock graph approach to shape comparison, focusing on the details which are needed to devise a coarse-scale measure. Section 3 discusses how this measure is defined, and describes the results of experiments with it which lead to the need for a notion of categories. Section 4 discusses the use of exemplars for further increasing the efficiency of indexing.
2 Recognition of Shapes by Matching Shock Graphs
In this section we review a framework for matching shapes by editing their shock graphs [25,16,15]. The basic idea is to view a shape as a point in a shape space where the distance is defined as the minimum-cost deformation path connecting one shape to the other. It is immediately clear, however, that there are an infinite number of ways a shape can deform. The practical search for an optimal path requires a discretization of the space of shapes and their deformations. Observe that generally as a shape deforms, the shock graph topology does not change, but only the attributes of the shock graph edges change. The identification of shapes by shock graph topology leads to a notion of a shape cell, the equivalence class of shapes with the same shock graph topology. This results in a discretization of the shape space. However, near the “boundaries” of each shape cell, or “transition” shapes, an infinitesimal change in the shape causes an abrupt change in the shock graph topology. These shapes are the instabilities of the medial axis/shock graph, which any recognition strategy must deal with. We have previously enumerated and classified these transitions [13]. As the shock graph topology changes at these transitions, matching strategies must explicitly identify shapes on either side of the transition as being close. Towards this end, we use the shock transitions to annotate deformation paths. Thus, as a second discretization, we define a shape deformation bundle which is the set of one-parameter families of deformation paths passing through the same set of shock transitions. While the discretization of the shape space and their deformation paths drastically reduce the underlying dimensions of search, they do not avoid considering paths that venture into completely unrelated areas of the shape space: a shape can first be completely changed and these changes can be undone in a different manner to reach the second shape. Thus, to avoid unnecessarily complicating the shape, as a third measure, we consider each deformation path as a pair of simplifying paths, one for each shape, reaching a common third shape which is “simpler” than both. With these three measures, the search space is significantly reduced, but it is still large. To efficiently search the discretized space we have developed an edit distance approach [16,15]. In the graph domain, each shock transition is represented by an “edit” operation, and four types of edit operations are defined based on shock transitions [13]: (i) the splice operation deletes a shock branch and merges the remaining two; (ii) the contract operation deletes a shock branch connecting two degree-three nodes; (iii) the merge operation combines two branches at a degree-two node; (iv) we also define a deform edit to relate two shapes with the same shock graph topology but with different attributes. We find the minimum-cost sequence of edits by an efficient polynomial-time tree edit distance algorithm developed in [16,15]. The costs of edit operations are assigned so that they are consistent with each other and also with our perceptual similarity metrics. Our approach is to first define the deform cost, and then derive all other edit costs as limits of the deform cost when the shape moves to the boundary of the shape cell (transition shape) [25]. The deform cost is defined by summing over local shape differences, and is based on an extension of an intrinsic curve comparison metric [24].
Fig. 2. The shock S and the corresponding shape boundary segments B+ and B−.
(a)
739 748 753 756 756 777 788 811 812 836 322 507 572 574 578 589 649 649 704 911
(b)
209 255 265 268 268 273 276 289 299 334
300 600 607 617 622 628 634 637 641 642 535 556 558 614 628 637 646 652 685 693
(c) Fig. 3. Left: This figure from [25] illustrates that the shock graph matching algorithm works well in presence of articulation and deformation of parts (a), view-point variation (b), and occlusion (c). Same colors indicate matching shock branches, and grey colored branches have been pruned. The matching is intuitive in all cases. Observe, in particular, how the rider on the horse is pruned when matched against horse by itself. Right: Indexing results for a database of 99 shapes (9 categories and 11 shapes in each category) [25]. The 10 nearest neighbors for a few shapes in the database ordered by edit distance are shown. We rate the performance based on the number of times the 10 nearest neighbors are from the same category as the query. The results in proportions of 99 are: (99,99,99,98,98,98,96,95,94,86).
Observe that each edge in the shock graph corresponds to two boundary segments, Figure 2a, and the problem of deforming a shock edge S to another shock edge can be viewed as one of deforming the corresponding boundary segments, B+ and B−, representing a “joint curve matching problem”. Let the edges of the shock graphs being compared, S and Ŝ, be parameterized by s and ŝ, respectively. Let the boundary segments corresponding to S, B+ and B−, be parameterized by s+ and s−, and let θ+(s+) and θ−(s−) be their orientations. Let r(s) and φ(s) denote the width and relative orientation of B+ and B−. (See Figure 2b.) The properties of Ŝ are similarly defined. The local shape differences are measured in terms of differences in boundary lengths, boundary curvatures, width and relative orientation, and are expressed as the following functional
Shock-Based Indexing into Large Shape Databases
ˆ α] = dˆs± − ds± +R dθˆ± − dθ± +2 dˆr − dr dξ µ[S, S; dξ dξ dξ dξ dξ dξ ˆ ˆ +2|ˆ r0 − r0 | + 2 R ddξφ − dφ dξ dξ + 2R|φ0 − φ0 |.
737
(1)
The alignment is expressed in terms of an alignment curve α to ensure symmetric treatment of the two shock edges [24]. Finally, the deform cost of the shock edges S and Ŝ is defined in terms of the alignment that minimizes the above functional, which is also found by a dynamic-programming method [24]. The shock graph matching technique works well in the presence of a variety of visual transformations like articulation, deformation of parts, boundary noise, viewpoint variation and occlusion, and gives excellent recognition rates for two distinct shape databases of roughly 100 and 200 shapes each, Figure 3. However, the running time of each match, which is 3–5 minutes on a typical workstation, prohibits its use in organizing and querying databases with a large number of shapes. We now detail two ideas to render this task practical.
3 Coarse-Scale Matching of Shock Graphs
Observe that in a large shape database, most of the shapes are typically very dissimilar to the query shape, and can possibly be eliminated as potential matches without using detailed and expensive comparisons. A close comparison of two shapes is warranted only if they are fairly similar. For example, if the query is a “camel”, it is desirable to eliminate all “fish” shapes without using the computationally expensive comparison of all shock graph attributes. A detailed match at a later stage will make fine distinctions among the remaining shapes. This coarse-to-fine matching strategy can significantly reduce the computational time for indexing into databases. In this section, we develop a coarse-scale version of the similarity metric, and show that it is effective in identifying the top matches of the query efficiently. Note that in matching two shock graphs as described in Section 2, most of the computational effort goes towards computing the cost of the edit operations, in particular, computing the deform costs for each pair of paths in the two shock graphs. This requires the computation of O(n^2 m^2) optimal alignments, where n and m are the numbers of nodes in the shock graphs. Furthermore, finding each optimal alignment uses dynamic programming, which is quadratic in the number of samples along the shock graph paths. Thus, reducing the cost of each alignment, e.g., by reducing the number of samples along the shock paths, would significantly reduce the overall computational expense. One idea, therefore, is to completely avoid computing the optimal alignment, and instead compute the deform cost in terms of the end points of the shock graph paths. While this is clearly not always accurate, it does provide a first-order approximation to the similarity metric. As before, we compare differences in boundary lengths, total curvatures, width and relative orientation for each shock path. Specifically, we define the deform cost between shock edges S and Ŝ as
Fig. 4. Left: Average recall as a function of N, the size of the top ranking coarse matches, for 200 randomly selected shapes from the database in Figure 1. Middle and right: Two instances of recall as a function of N for a fish and a dog (shown in Table 2), respectively.
\[
\mu[S, \hat{S}] = \big|d\hat{s}^{\pm} - ds^{\pm}\big| + R\,\big|d\hat{\theta}^{\pm} - d\theta^{\pm}\big|
 + 2\,\big|d\hat{r} - dr\big| + 2\,|\hat{r}_0 - r_0|
 + 2R\,\big|d\hat{\phi} - d\phi\big| + 2R\,|\hat{\phi}_0 - \phi_0|.
\tag{2}
\]
Note that this measure does not rely on finding an optimal alignment, and is thus very efficient to compute; it reduces the computational requirement of matching a pair of shock graphs from a few minutes to a few seconds. This is a significant speedup: the computation of pairwise distances and the distance from a query for a database of 1000 shapes is 500,000 and 1000 times the cost of computing the metric, respectively. For a database of 10,000 shapes these numbers are 50,000,000 and 10,000, respectively! The premise underlying the use of a coarse-scale metric is that it allows for the pruning of very dissimilar shapes, e.g., if we could safely discard 90% of the database using a coarse-scale match, the overall cost is reduced to 10% of the fine-scale match, plus the coarse-scale matching cost. Since the cost of the latter is negligible compared to the cost of fine-scale matching of 10% of the database, this would result in approximately an order of magnitude speedup. Therefore, the effectiveness of the coarse-scale matching is measured in terms of its ability to capture the correct match in a small fraction of the top rank-ordered matches. This is precisely the notion of recall. Figure 4 shows the average recall for 200 queries. Observe that well over 90% of similar shapes are captured in the top 200 matches. Table 1 shows the top 60 matches for a few query shapes using coarse-scale shock-graph matching. Observe that the top matches (i) contain a large number of shapes in the correct category, and (ii) the correct category is predominant in the first few matches, but (iii) not all shapes in the correct category are retrieved, unless a very large portion of the database is considered. Thus, while for several queries all shapes in the same category are returned, for some others one or more instances are missing. This situation will only get worse as the number of shapes in the database increases.
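A sketch of this coarse cost, assuming each shock edge has been summarized off-line by its total attribute changes and initial values (a representation we assume here purely for illustration), is:

    # Sketch of the coarse-scale deform cost of Eq. (2), computed from per-edge
    # aggregate attributes rather than an optimal alignment.  Each edge is
    # assumed to be a dict of total boundary-length changes (ds_p, ds_m), total
    # orientation changes (dth_p, dth_m), width/relative-orientation changes
    # (dr, dphi), and initial width and relative orientation (r0, phi0);
    # R is the weighting constant of Eq. (1).
    def coarse_deform_cost(e1, e2, R=1.0):
        cost = 0.0
        for key in ("ds_p", "ds_m"):                 # boundary-length differences
            cost += abs(e1[key] - e2[key])
        for key in ("dth_p", "dth_m"):               # total-curvature differences
            cost += R * abs(e1[key] - e2[key])
        cost += 2.0 * abs(e1["dr"] - e2["dr"])       # width change
        cost += 2.0 * abs(e1["r0"] - e2["r0"])       # initial width
        cost += 2.0 * R * abs(e1["dphi"] - e2["dphi"])   # relative-orientation change
        cost += 2.0 * R * abs(e1["phi0"] - e2["phi0"])   # initial relative orientation
        return cost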
Precision and recall have been used to evaluate the performance of a shape retrieval scheme [7,19]. Precision is the ratio of the number of similar shapes retrieved to the total number of shapes retrieved, while recall is the ratio of the number of similar shapes retrieved to the total number of similar shapes in the database.
Table 1. This table shows that coarse-scale match is a good first-order approximation of the detailed shock graph matching. The query shape (left column), top 60 matches (middle column) and number of shapes from the correct category (right column) are shown. The matches are ordered from left to right and top to bottom. Observe that in most cases the top matches contain many shapes that are similar to the query.
Thus, if the correct ranking is required without much tolerance for error, an increasing number of top-ranking coarse-scale matches must be presented for fine-scale comparison. This clearly defeats the purpose of employing a two-stage coarse-to-fine strategy. See Table 2 and Figure 4 for the interaction between the coarse and fine stages. High precision can be achieved by observing that as the metric's ability to precisely locate one shape with respect to others diminishes, an implicit equivalence arises between a shape and its immediate neighbors. The explicit formulation of such an equivalence to improve precision implies connecting a shape to its neighboring shapes, which is the basis of forming categories. To illustrate how greatly the neighborhood topology improves precision, consider the extreme case of the dog example in Table 1, where only 29/49 are retrieved in the top 60 coarse matches, and where the complete retrieval of all shapes requires considering the top 260 matches. We observe that the missing 20/49 at the coarse scale can be retrieved by association, i.e., membership in a
Table 2. This table illustrates the coarse-to-fine strategy for two examples. In the first example, using the coarse-scale match, 158 shapes (first row) out of 1032 are selected for fine-scale matching. Note that 54/58 fish shapes are retrieved. These 158 shapes are compared to the query using fine-scale shock graph matching, and the top 61 matches contain the 54 fish shapes (second row). Also, note that the top 53 matches are fish shapes. In the second example, the top 150 matches are needed to retrieve 47/49 dog shapes using coarse-scale matching (third row). In this case also, fine-scale matching (last row) does better than coarse-scale matching, but represents less of an improvement compared to most other cases. Most of the errors in fine-scale matching are due to the confusion between dogs and other quadrupeds. In fact, the top 82 matches for fine-scale shock matching are quadrupeds. The recall in both cases is compared in Figure 4.
category. Thus, if the top-ranking coarse-scale match is a dog, all shapes closely connected to it (all dogs, or some subset of dogs) are automatically included as retrieved results. More precisely, define a category-based ranking of retrieval as the relative ranking of each category by the order of appearance of its first exemplar relative to those of other categories. For example, in the case of the fish in the top row of Table 2, the categories are “fish”, “hammer”, “child”, etc. Table 3 shows that in the category-based ranking, the top category is correct 96% of the time, and the top five categories are correct 100% of the time (see the bottom of Table 3).
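The category-based ranking just defined can be computed directly from a rank-ordered list of coarse-scale matches; the sketch below assumes a hypothetical list of (shape, category) pairs and is only meant to illustrate the definition.

```python
# Sketch: rank categories by the order of appearance of their first
# exemplar in a rank-ordered list of coarse-scale matches. `ranked_matches`
# is a hypothetical list of (shape_id, category) pairs, best match first.

def category_ranking(ranked_matches):
    ordered = []
    for _, category in ranked_matches:
        if category not in ordered:
            ordered.append(category)
    return ordered  # categories ordered by the rank of their first exemplar

# Example: category_ranking(matches)[:5] gives the top five categories.
```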
Table 3. (Top) This table shows the ranking by category, where each category is ranked by its first sample in the coarse-scale match. The top category for 200 randomly selected queries is correct 96% of the time. (Bottom) A summary of correctness for top N categories.
Fig. 5. This figure shows the category-based recall for 200 queries based on categories returned in top N matches.
A second observation is required before claiming that examining the categories of the top-ranking coarse-scale matches at the fine scale is both sufficient and efficient. For this to be true, the top few categories must be packed into the top few matches by the coarse-scale measure. Figure 5, which plots category-based recall, shows that the first match gives the correct category 96% of the time, while the top six matches give 99.5% and the top twelve matches give 100% correct recall. Thus, the top dozen matches yield the top five categories required for a fine-scale comparison with zero-error recognition. The selection of N, the number of top matches to be considered, is a parameter of the system.
4 Exemplars for Indexing Into Databases
The approach developed in Section 3 reduces the number of fine-scale computations significantly. However, the coarse-scale comparison of a query is done against all shapes in the database. This also becomes unrealistic for significantly larger databases. For example, if we have a database of 10,000 shapes, and if the cost of a coarse-scale match is 2 seconds, the coarse-scale matching of each query would take about 20,000 seconds, i.e., more than 5 hours! To alleviate this difficulty, observe that shapes are not randomly organized, but rather enjoy a neighborhood topology. Intuitively, when considering the collection of “fish” in the context of other shapes, they seem to be tightly clustered and densely packed around each other, and at some distance from other shapes. Thus, if we establish that our query, e.g., a “camel”, is highly distant from one instance of a fish, “fish 1”, it is also likely to be distant from nearby neighbors, say “fish 2”. Formally, using the triangle inequality, d(camel, fish 2) ≥ |d(camel, fish 1) − d(fish 1, fish 2)|, which establishes a lower bound for d(camel, fish 2) without having to compute it. This suggests that for a cluster of shapes distant from the query the actual distance needs to be computed only for a few samples. In other words, each category can be represented by a few representative samples, or exemplars. This representation of a tightly packed, dense cluster of shapes by a few samples is consistent with the psychophysics data [29]. There has also been ample use of exemplars or prototypes in computer vision, e.g. [2]. From a computational perspective, with the use of exemplars the number of matches necessary is proportional to the number of categories, and not to the total number of shapes. The process of selecting optimal exemplars is a challenging task in itself, and is not discussed in this paper. For the experiments reported in this paper, we initially selected several representative exemplars to represent each category. We now describe our experiments in using exemplars to further reduce computational requirements in indexing a large database. In our experiments the triangle inequality turns out to be a strong condition. An alternative approach is to consider the similarity of the query to each exemplar, and use it to define a fuzzy notion of membership in each category. Let C_i, i = 1, . . . , N be the categories in the database. Let E_i^k represent the k-th exemplar of category C_i, and denote E_i = {E_i^k | k = 1, . . . , N_i} as the set of exemplars for category C_i. Let Q be the query shape, and let d(Q, E_i^k) be the distance between Q and E_i^k. The similarity of the query to an exemplar is then defined as

S(Q, E_i^k) = exp( − d(Q, E_i^k) / min_{i=1,...,N; k=1,...,N_i} d(Q, E_i^k) ).   (3)
Shepard [27] has shown that the use of an exponential in relating perceptual distance and similarity is universal. We do not directly use the closest exemplar's category because categories tend to have different variations; some tightly clustered,
Table 4. This table illustrates the top 5 matching categories ranked by the membership function using the coarse-scale distance measure for some representative query shapes. The closest matching exemplar is used to represent each category. greeble
turtle
fish
bone
bird
dinosaur
bird
horse
dog
elephant
bone
bone
bottle
turtle
classic
child
car
car
classic
child
flightbird
fish
bottle
bottle
ray
face
flatfish
bone
camel
camel
dinosaur
bird
horse
dog
rat
bird
camel
dude
plane
carriage carriage
dog
textbox
child
fish
bird
cattle
dog
horse
key
bottle
ray
brick
key
glass
rabbit
turtle
plane
bird
cattle
dude
dinosaur
dog
cattle
camel
cat
turtle
Heart
dinosaur dinosaur
glass
greeble flatfish
glass
hammer hammer
hammer flatfish
cat
horse
dog
cat
cattle
bird
horse
horse
cat
cat
cattle
horse
dog
bird
rabbit
rabbit
cat
cat
bird
cattle
turtle
dog
key
key
child
brick
bottle
dog
child
child
classic
flatfish
bone
fish
fish
fish
hammer
bone
flatfish
brick
bird
key
textbox
dog
ray
ray
flatfish textbox
face
Heart
textbox
ray
textbox
hand
bird
bottle
turtle
turtle
cattle
dog
bone
flatfish
tool
tool
hammer
brick
bone
glass
fish
fish
child
bone
chopper chopper classic
classic
child
bottle
bone
flatfish
dog
horse
dog
dinosaur
camel
cat
bird
horse
dog
elephant elephant dinosaur plane
plane
bird
elephant dinosaur
dog
textbox greeble
hammer flatfish
others highly variable. Consider a query that is slightly closer to an exemplar in a tightly clustered category than to an exemplar in a largely variable category. It is intuitively clear that it is more likely that this query is a member of the second category. The statistical analysis of an appropriate measure of membership is beyond the scope of this paper. As a first approximation, we define a membership measure ν for the query in each category C_i as the sum of similarities of the query to all exemplars of C_i:

ν(Q, C_i) = Σ_{k=1,...,N_i} S(Q, E_i^k).   (4)
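A minimal sketch of the exemplar-based ranking implied by Eqs. (3) and (4) is given below; the nested-list layout of the distances and the guard against a zero minimum distance are assumptions made for illustration.

```python
import math

# Sketch of the exemplar-based category ranking of Eqs. (3) and (4).
# distances[i][k] holds d(Q, E_i^k), the coarse-scale distance from the
# query Q to the k-th exemplar of category C_i; the layout and the guard
# against a zero minimum are illustrative assumptions.

def rank_categories(distances):
    d_min = max(min(d for cat in distances for d in cat), 1e-12)
    memberships = []
    for cat in distances:
        sims = [math.exp(-d / d_min) for d in cat]   # Eq. (3)
        memberships.append(sum(sims))                # Eq. (4)
    # Categories sorted by decreasing membership; fine-scale matching is
    # then restricted to the first few of them.
    return sorted(range(len(distances)), key=lambda i: -memberships[i])
```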
This membership measure is used to identify the closest categories to each query. The overall strategy is as follows. For each query Q, we compute d(Q, E_i^k) for each exemplar, k = 1, . . . , N_i, i = 1, . . . , N. This is of significantly lower cost than computing d(Q, S) for all shapes S in the database. In our experiments we initially selected three to six exemplars per category, for a total of 204 exemplars. Thus, the cost of computing the d(Q, E_i^k) is five times lower than that of computing the d(Q, S). We then used ν(Q, C_i) to rank-order likely categories, and for these we performed fine-scale matching. Table 4 shows the top matching categories for a few representative
Fig. 6. This figure illustrates the percent correct match found in the top N rank-ordered categories for a range of exemplars per category. Observe that for about six exemplars the first match is right 78% of the time, while for one exemplar it is right only 62% of the time. Also, note that the lower the number of exemplars used, the more categories must be considered for each level of error tolerance. For example, allowing for only a 3% error tolerance, we must consider (15, 8, 5, 4, 4) categories for (1, 2, 3, 4, 6) exemplars per category, respectively. This suggests a hierarchical scheme: use one exemplar and 15 categories to rule out 25/40 categories; then use a few, say four, exemplars to prune further. We expect that these curves will become significantly more distinct as the total number of shapes per database increases, thereby increasing the utility of a hierarchical scheme in larger databases.
query shapes using exemplar-based matching. The top one, five, and ten categories are correct 78.5%, 98.5%, and 99.5% of the time, respectively (Figure 6). We have experimented with the use of fewer exemplars to represent each category. Figure 6 shows the result of discarding exemplars whose intra-category distance to all other exemplars in that category is the smallest. This process is repeated until a pre-defined number of exemplars remains. The results show that, as expected, fewer exemplars generally imply that additional shapes must be considered for the fine-scale match; there is a trade-off between the two. Note that the use of approximately three exemplars per category, for a total of 120 to represent the database of 1032 shapes, results in an order of magnitude speedup in determining an object's category (Figure 6). The conclusion of these experiments is to follow a hierarchical representation at the exemplar level as well: with two or three principal exemplars we can safely rule out 75% of the categories, while with three auxiliary exemplars we can rule out 50% of the remaining categories, thus reducing the expensive fine-scale matching to only very few categories. A goal of our future work is to examine this type of hierarchical exemplar representation.
5 Conclusion
We have gathered a database of over 1000 shapes to examine the issues of computational cost and the need for database organization for indexing into large databases. We have presented two approaches which together significantly reduce computational requirements. First, we developed a coarse-scale distance measure which leads to 50-100 times speedup in distance computations. We showed that the need for high precision at the coarse-scale matching level naturally gives rise to the need to organize the database into categories. Second, we showed that a coarse-scale sampling of each category using exemplars does not significantly reduce categorization and indexing performance if a membership function is defined based on the similarity measure. Together, the coarse-scale measure and the coarse-scale sampling lead to significant reduction in computational requirements. We believe that together with a hierarchical organization of both categories and exemplars we can apply this approach to indexing into databases of tens of thousands of shapes. Acknowledgments. Philip Klein acknowledges the support of NSF grant CCR9700146 and Benjamin Kimia acknowledges the support of NSF grant BCS9980091.
References 1. A.Lanitis, C.J.Taylor, and T.F.Cootes. Automatic interpretation and coding of face images using flexible models. PAMI, 19(7):743–756, 1997. 2. R. Basri. Recognition by prototypes. IJCV, 19(2):147–167, August 1996. 3. R. Basri, L. Costa, D. Geiger, and D. Jacobs. Determining the similarity of deformable shapes. Vision Research, 38:2365–2385, 1998. 4. S. Belongie, J. Malik, and J. Puzicha. Matching shapes. ICCV, pages 454–461, 2001. 5. J. L. Bentley and J. H. Friedman. Data structures for range searching. ACM Computing Surveys, 11(4):397–409, 1979. 6. S. Berchtold, D. A. Keim, and H.-P. Kriegel. The X-tree: An index structure for high-dimensional data. VLDB, pages 28–39, 1996. 7. L. Bergman and V. Castelli, editors. Image Databases, Search and Retrieval of Digital Imagery. John Wiley and Sons, 2002. 8. I. Biederman and G. Ju. Surface versus edge-based determinants of visual recognition. Cognitive Psychology, 20:38–64, 1988. 9. S. Brin. Near neighbor search in large metric spaces. VLDB, pages 574–584, 1995. 10. E. Chavez, G. Navarro, R. Baeza-Yates, and J. L. Marroqu´ın. Searching in metric spaces. ACM Computing Surveys, 33(3):273 – 321, 2001. 11. C. Faloutsos, R. Barber, M. Flickner, J. Hafner, W. Niblack, D. Petrkovic, and W. Equitz. Efficient and effective querying by image content. J. Intelligent Information Systems, 3:231–262, 1994. 12. Y. Gdalyahu and D. Weinshall. Flexible syntactic matching of curves and its application to automatic hierarchical classification of silhouettes. PAMI, 21(12):1312– 1328, 1999.
13. P. J. Giblin and B. B. Kimia. On the local form and transitions of symmetry sets, and medial axes, and shocks in 2D. ICCV, pages 385–391, 1999. 14. W. I. Groski and R. Mehrota. Index-based object recognition in pictorial data management. CVGIP, 52:416–436, 1990. 15. P. Klein, T. Sebastian, and B. Kimia. Shape matching using edit-distance: an implementation. SODA, pages 781–790, 2001. 16. P. Klein, S. Tirthapura, D. Sharvit, and B. Kimia. A tree-edit distance algorithm for comparing simple, closed shapes. SODA, pages 696–704, 2000. 17. C. Lin and R. Chellappa. Classification of partial 2-D shapes using Fourier descriptors. PAMI, 9(5):686–690, 1987. 18. T. Liu and D. Geiger. Approximate tree matching and shape similarity. ICCV, pages 456–462, 1999. 19. E. Milios and E. Petrakis. Shape retrieval based on dynamic programming. IEEE Trans. Image Processing, 9(1):141–146, 2000. 20. G. Mori, S. Belongie, and J. Malik. Shape contexts enable efficient retrieval of similar shapes. CVPR, pages I:723–730, 2001. 21. M. Pelillo, K. Siddiqi, and S. Zucker. Matching hierarchical structures using association graphs. PAMI, 21(11):1105–1120, 1999. 22. E. Rivlin and I. Weiss. Local invariants for recognition. PAMI, 17(3):226–238, 1995. 23. H. Samet. The quadtree and related hierarchical data structures. ACM Computing Surveys, 16(2):187–260, 1984. 24. T. B. Sebastian, P. N. Klein, and B. B. Kimia. Alignment-based recognition of shape outlines. IWVF, pages 606–618, 2001. Springer. 25. T. B. Sebastian, P. N. Klein, and B. B. Kimia. Recognition of shapes by editing shock graphs. ICCV, pages 755–762, 2001. 26. D. Sharvit, J. Chan, H. Tek, and B. B. Kimia. Symmetry-based indexing of image databases. JVCIR, 9(4):366–380, 1998. 27. R. N. Shepard. Toward a universal law of generalization for psychological science. Science, pages 1317–1323, 1987. 28. K. Siddiqi, A. Shokoufandeh, S. Dickinson, and S. Zucker. Shock graphs and shape matching. IJCV, 35(1):13–32, November 1999. 29. M. J. Tarr and H. H. Bulthoff, editors. Object Recognition in Man, Monkey, and Machine. MIT Press/Elsevier, 1999. 30. A. Torsello and E. R. Hancock. Computing approximate tree edit-distance using relaxation labelling. Worksop on Graph-based Representations in Pattern Recognition, pages 125–136, 2001. 31. J. Uhlmann. Satisfying general proximity/similarity queries with metric trees. Information Processing Letters, 40:175–179, 1991. 32. P. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. SODA, pages 311–321, 1993. 33. L. Younes. Computable elastic distance between shapes. SIAM J. Appl. Math., 58:565–586, 1998. 34. S. C. Zhu and A. L. Yuille. FORMS: A flexible object recognition and modeling system. IJCV, 20(3):187–212, 1996.
EigenSegments: A Spatio-Temporal Decomposition of an Ensemble of Images Shai Avidan The Interdisciplinary Center Kanfei Nesharim, Herzliya, Israel
[email protected]
Abstract. Eigensegments combine image segmentation and Principal Component Analysis (PCA) to obtain a spatio-temporal decomposition of an ensemble of images. The image plane is spatially decomposed into temporally correlated regions. Each region is independently decomposed temporally using PCA. Thus, each image is modeled by several low-dimensional segment-spaces, instead of a single high-dimensional image-space. Experiments show that the proposed method gives better classification results and smaller reconstruction errors, can handle local changes in appearance, and is faster to compute. Results for faces and vehicles are shown.
1 Introduction
View-based representation is often used in problems such as object recognition, object detection or image coding. In such an approach 3D objects are represented by a collection of images, and therefore a compact representation of ensembles of images is crucial. In this scheme, images are considered as points in the high-dimensional image-space, and an ensemble of images of a class of objects forms a non-linear image-manifold in this space. The question is what is the best representation of the manifold for the above mentioned problems. The first possibility is to approximate the image-manifold with a linear subspace, as was suggested for the case of upright, frontal faces. This subspace is taken to be the leading principal components of the ensemble of images, and the key finding of [11,10] was that the intrinsic dimensionality of this linear PCA-space is much lower than the dimensionality of the image-space. This gives a very compact approximation to a large number of images in terms of a small number of orthogonal basis images, termed “eigenimages”. There exists a constant tension between the desire to increase the number of principal components, to improve reconstruction quality, as well as object recognition and detection capabilities, and the desire to keep that number low to avoid modelling noise and to maintain the computational efficiency of the algorithm. To overcome this hurdle several authors proposed to approximate
This work was done while the author was with MobilEye Vision Technologies
the manifold with several PCA-spaces [6,5,4,1,2]. This means that instead of having a single fixed PCA-space to represent all the images in the manifold, the manifold is broken into several regions and each region is approximated with a different PCA-space. Note that the high-dimensional image-manifold is broken into regions, not the 2D image-plane. This is where we come in. Our approach is orthogonal to the above mentioned methods. Instead of taking an image to be a point in the image-space, we spatially segment the image into several segments and work in each segment-space independently. The spatial segmentation is carried out once, during the training phase, and is based on clustering together pixels that exhibit similar temporal behavior. All the methods developed to work in the image-space should work, as is, in each of the segment-spaces as well. Moreover, since our decomposition breaks the high-dimensional image-space into several lower-dimensional segment-spaces, the number of samples (per space) is denser and a better approximation, to each of the segment-manifolds, can be obtained. Our approach is somewhat related to the work on eigenfeatures [8], where facial features such as eyes, nose and mouth were manually selected for face detection and recognition. Instead of manually characterizing special features, such as eyes or nose, we define the special features to be a collection of pixels that exhibit similar temporal behavior. Our work is also related to the work of [4,5], who used Factor Analysis to approximate the image-manifold by several PCAs. We, on the other hand, focus on the spatial decomposition of the image-plane, as well as the representation of an image by several low-dimensional segment-spaces, instead of a single high-dimensional image-space. This spatio-temporal decomposition has several advantages. It is as fast and simple to compute as the PCA approach, but yet it offers some of the advantages of the mixture of PCA-spaces. Each segment-space can be viewed as an approximation to the original image-space where all the pixels outside the segment are ignored. We also found empirically that our method gives better classification results, and can only speculate that the additional spatial information helps improve the classification results. Our approach is also better than PCA and PCA-mixture approaches in handling occlusions. Occlusions will only affect some of the segments, and we will be able to determine if a mis-fit between an image and our model is local (and hence can be classified as an occlusion and ignored) or is global, and act accordingly. Finally, it is natural to extend our approach to handle Level-of-Details PCA, where each segment is approximated by a different number of principal components, say to maintain a predefined reconstruction error.
2 Principal Component Analysis
Principal Component Analysis (PCA) is a statistical method for dimensionality reduction. Let A = [A1 . . . AN ] be the so-called design matrix of M rows and N columns, where each column Ai represents an image of M pixels (in vector form). Further assume that all the columns of A are zero-mean (or else subtract
the mean image from all the columns of A). Then the principal components are obtained by solving the following eigenvalue problem:

C = U D U^T   (1)
where C is the covariance matrix C = (1/N) A A^T, U is the eigenvectors matrix and D is the diagonal matrix of eigenvalues. PCA maximizes the variance of the input data samples, and the eigenvalues measure the variance of the data in the directions of the principal components. The principal components form an orthonormal basis of the columns of A, and if the effective rank of A is much smaller than N then we can approximate the column space of A with K << N principal components. Formally, A_i^rec = U_K U_K^T A_i is an approximate reconstruction of A_i, where U_K are the first K eigenvectors, and the coefficients x_i = U_K^T A_i are obtained by projecting the column A_i on the principal components.
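The following sketch is a direct transcription of these equations using NumPy; it forms the M × M covariance explicitly, which is only practical for small images (in practice one would use the SVD or the smaller N × N eigenproblem), and the function name and interface are illustrative.

```python
import numpy as np

# Direct transcription of the PCA equations above: the columns of the
# design matrix A (M pixels x N images) are zero-meaned, the covariance
# C = (1/N) A A^T is diagonalized, and each image is projected onto the
# K leading principal components and reconstructed from them.

def pca(A, K):
    A0 = A - A.mean(axis=1, keepdims=True)       # subtract the mean image
    C = (A0 @ A0.T) / A0.shape[1]                # covariance matrix
    _, U = np.linalg.eigh(C)                     # C = U D U^T, eigenvalues ascending
    U_K = U[:, ::-1][:, :K]                      # first K eigenvectors
    X = U_K.T @ A0                               # coefficients x_i = U_K^T A_i
    A_rec = U_K @ X                              # reconstructions U_K U_K^T A_i
    return U_K, X, A_rec
```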
3 EigenSegments
An ensemble of images can be approximated by its leading principal components. This is done by stacking the images (in vector form) in a design matrix A and taking the leading eigenvectors of the covariance matrix C = (1/N) A A^T. The leading principal components are the leading eigenvectors of the covariance matrix C, and they form a basis that approximates the space of all the columns of the design matrix A. But instead of looking at the columns of A, we look at the rows of A. Each row in A gives the intensity profile of a particular pixel, i.e., each row represents the intensity values that a particular pixel takes in the different images in the ensemble. If two pixels come from the same region of the face they are likely to have the same intensity values and hence have a strong temporal correlation. We wish to find these correlations and segment the image plane into regions of pixels that have similar temporal behavior. This approach broadly falls under the category of Factor Analysis [3], which seeks to find a low-dimensional representation that captures the correlations between features. Let Ax be the x-th row of the design matrix A. Then Ax is the intensity profile of pixel x (we address pixels with a single number because the images are represented in scan-line vector form). That is, Ax is an N-dimensional vector (where N is the number of images) that holds the intensity values of pixel x in each image in the ensemble. Pixels x and y are temporally correlated if the dot product of rows Ax and Ay approaches 1, and are temporally uncorrelated if the dot product approaches 0. Thus, to find temporally correlated pixels all we need to do is run a clustering algorithm on the rows of the design matrix A. In particular, we used the k-means algorithm on the rows of the matrix A, but any method of Factor Analysis can be used. As a result, the image-plane is segmented into several (possibly non-continuous) segments of temporally correlated pixels. Each segment can then be approximated by its own mean segment and set of leading principal components. Put formally, let s = {s1 , . . . , sR } be a collection of R rows of the matrix A that form the segment s. Next, define A_i^s as the pixels in image i
that belong to segment s, and the design matrix of this segment is given by A^s = [A_1^s, A_2^s, . . . , A_N^s]. All the variants of PCA can be applied to each A^s independently. The eigensegment approach has nice computational and memory properties. Observe that the collection of the first K principal components of all the S segments occupies the same memory as the first K principal components of the entire image. This is because the number of pixels in all the segments is equal to the number of pixels in the image. The only additional memory requirement is to store the segmentation map that assigns each pixel to a different segment. If we assume we have S segments, each with K principal components, then the eigensegment representation is a collection of S K-dimensional points. If we stack them together, we have that each image is represented by an SK-dimensional feature vector. This SK-dimensional feature vector is not as optimal as the one obtained by taking the first SK principal components. But it is S times faster to compute, because of the additional dot-products required in traditional PCA. This property becomes increasingly important in applications of object detection, where exhaustive search over the entire image is often performed and hence speed-ups are crucial. Another interesting property we found empirically is that the spatial decomposition is often as powerful, if not more powerful, than the temporal decomposition for the purpose of object classification. In the experimental section we demonstrate that good classification scores are obtained if an image ensemble is decomposed spatially and each segment is approximated by its mean intensity value (this can be thought of as the 0-th principal component). This is particularly encouraging because this spatial decomposition can be achieved in a single pass over the test image, compared to the multiple dot-products needed in the case of using several principal components.
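A minimal sketch of the eigensegment decomposition is given below. It clusters the (normalized) pixel intensity profiles with k-means, as described above, and then applies the pca() helper from the previous sketch to each segment; the row normalization, the scikit-learn dependency, and the interface are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Sketch of the eigensegment decomposition: the rows of the design matrix
# (one intensity profile per pixel) are clustered into S temporally
# correlated segments, and PCA is applied to each segment independently.
# Reuses the pca() helper from the previous sketch; the normalization and
# the parameters are illustrative.

def eigensegments(A, S, K):
    profiles = A - A.mean(axis=1, keepdims=True)
    profiles = profiles / (np.linalg.norm(profiles, axis=1, keepdims=True) + 1e-12)
    labels = KMeans(n_clusters=S, n_init=10).fit_predict(profiles)
    model = []
    for s in range(S):
        rows = np.flatnonzero(labels == s)       # pixels forming segment s
        U_K, X, _ = pca(A[rows, :], K)           # per-segment PCA
        model.append((rows, U_K, X))
    return labels, model
```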
4 Experiments
We performed a number of experiments to demonstrate the potential of the spatio-temporal decomposition. We show results on faces and vehicles and address issues of image reconstruction and object classification.

4.1 Experiments with Face Images
We demonstrate our results on both the Olivetti [9] and the Weizmann [7] face databases. The Olivetti database contains 10 images of 40 subjects, taken over a period of about two years (see Figure 1). The size of the original images is 112×92 pixels but we reduced their size by half, to increase the speed of computations. The Weizmann database contains about 28 subjects under various pose, lighting and facial expressions. There is a total of 840 images. The original size of the images is 512 × 352 pixels, but we reduced their size to be 32 × 22.
Fig. 1. Some sample images from the Olivetti database.
EigenSegment representation. In this experiment we measured the reconstruction error of eigensegment representation vs. the usual eigenimage representation. We tested it on both databases by exhaustively computing the reconstruction error for all combinations of up to 5 segments and up to 30 eigenvectors per segment. Figure 2 shows the contour maps for both databases. Each contour in the figure represent an iso-reconstruction error contour in the spatio-temporal space. As can be seen, there is a tradeoff between the number of eigenvectors and the number of segments. For example, in the Olivetti database one segment with 21 eigenvectors has the same reconstruction error (in intensity values), for this image ensemble, as a 5-segment 5-eigenvector combination. The advantage of the latter is that it takes less memory to store the principal components and it takes less time (5 scans of the image, compared to 21 scans in the usual PCA approach) to compute the projection of an image into the eigensegment space, albeit at the price of representing an image as a 25-dimensional vector in eigensegment representation compared to a 21-dimensional vector in PCA. The Weizmann database is much more diverse than the Olivetti database (in terms of facial expressions, illumination and head pose) and hence the tradeoff between the number of segments and the number of principal components is lower, but still a factor of 2 in speedup can be achieved.
(a) Olivetti database
(b) Weizmann database
Fig. 2. Tradeoff between number of segments (the y-axis) and number of eigenvectors (the x-axis). The contour labels measure the reconstruction error (in intensity values) per pixel. For example, In the Olivetti database one segment with 21 eigenvectors has the same reconstruction error, for this image ensemble, as a 5-segment 5-eigenvector combination. See text for further details.
Face segmentation. We segmented the Olivetti database into five segments. Figure 3 shows the result of the segmentation. As can be seen, the segments correspond to natural facial features such as eyes, nose, forehead and hair. This was achieved without enforcing spatial continuity on the various segments. We did note however that as the image ensemble becomes more diverse (in terms of facial expressions, illumination and head pose) the quality of the spatial segmentation deteriorates.
Fig. 3. Segmentation of the Olivetti image ensemble into 5 segments. Each color represents a segment, that is, a collection of pixels that exhibit a similar temporal behavior.
Handling local changes in appearance. In another experiment we analyzed the effect of local changes in appearance on our representation. Our learning set consisted of 40 images, one per subject, from the Olivetti database. We then focused on one of the subjects that had glasses in the learning set and took a different image of him, without the glasses. We then measured the reconstruction quality of the eigensegment decomposition vs. the reconstruction quality of the eigenimage decomposition. Figure 4 shows the results. One interesting observation we can make is that the eigensegment representation can go beyond PCA in terms of representation power. In our case we have only 40 images to learn from, and so we are limited to only 40 principal components to represent every new face (even of the same subject). However, by breaking the image into segments we can approximate each segment with its own set of 40 principal components. At the limit, if the number of pixels in the segment goes below 40, we can have a perfect reconstruction. We found that in such a small database the advantage of increasing the number of segments instead of the number of principal components is considerable. Consider Figure 4(c), which shows the reconstruction error as a function of the number of segments and principal components. It can be seen that the test image can be reconstructed up to an average error of 16 intensity values (in the 0-255 range) using 5 segments and 7 principal components per segment, or using a single segment (that corresponds to the entire image) and 30 principal components. The advantage of the former is that it takes less memory to store the principal components (7 instead of 30) and is faster to compute (7 scans of the image instead of 30). Eigensegments for recognition. In this experiment we tested the classification power of the eigensegment representation, compared to that of the usual eigenspace approach. The Olivetti database was divided into a learning set that contains the first 5 images of every subject and a testing set that contains the last 5 images of every subject. The learning database of 200 images was then decomposed using various combinations of segments and principal components. Both learned images and test images were projected into feature space, according to the particular spatio-temporal decomposition we chose, and a nearest-neighbor algorithm was used to classify each test image. Table 1 compares the results using the vanilla eigenspace vs. the eigensegments approach with either 3, 5 or 20 segments. As can be seen, the best results (189 correct classifications out of 200) are obtained using the segment-mean decomposition, that is, the feature vector of the image is taken to be the mean intensity value of each segment. In addition to being the most accurate, it is the fastest to compute, since only a single scan of the image is needed to compute the mean intensity value of each segment, as opposed to the multiple dot-products required in the usual PCA approach.

4.2 Experiments with Vehicle Images
This experiment demonstrates the use of segments for the purpose of vehicle recognition using Support Vector Machine (SVM) [12]. We collected a set of 2832
(f)
Fig. 4. Reconstruction quality in the presence of local appearance changes. The subject has glasses in the database image (a) but not in the test image (b). Image (c) shows a contour map of reconstruction error for a various number of segments and PCs. Images (d-f) show the reconstructed test image using different combinations of segments and PCs. Image (d) uses 1 segment and 40 PCs, Image (e) uses 10 segments and 20 PCs and Image (f) uses 20 segments and 40 PCs. Image (d) is the best PCA can do because there are only 40 images in the database. Image (e) is as good as image (d) but takes half the time to compute and image (f) is better than the best reconstruction PCA can do. We can keep improving the reconstruction by increasing the number of segments.
Table 1. The table on the left compares PCA vs. eigensegments for different numbers of principal components (PCs). Classification rates (for 200 test images) for different numbers of segments and eigenvectors are shown. The case of 1 segment is equivalent to the usual eigenimage approach. The table on the right shows the classification results where each segment is approximated by its mean intensity value (which can be seen as the 0-th principal component). Note that the spatial decomposition where every segment is approximated by its mean intensity value gives the best results. In both cases, the Olivetti database was used with 200 images for learning and 200 images for testing. See text for further details.

# PCs | 1 Segment | 3 Segments | 5 Segments
    1 |       117 |        158 |        169
    3 |       169 |        189 |        185
    5 |       176 |        182 |        186
   10 |       184 |        183 |        187
   20 |       185 |        187 |        187

# Segments | Correct classification
         1 |                    126
         3 |                    158
         5 |                    164
        10 |                    178
        20 |                    189
images of vehicles and 7757 images of non-vehicles, see Figure 5. All the images are rescaled to be 20 × 20 pixels in size and the mean intensity value is set to 127 (in the range [0..255]) to compensate for variations in illumination and color. We compared three spatio-temporal decompositions of the image ensemble. In the first case we decomposed the images into 4 segments and approximated each segment with 30 principal components, resulting in a 120-dimensional feature vector. In the second case we decomposed the images into 6 segments and approximated each segment with 8 principal components, resulting in a 48-dimensional feature vector. In the third case we decomposed the image into 50 segments and took the mean intensity value of each segment, resulting in a 50-dimensional feature vector. The first case requires 30 dot-products of the test image, the second case requires 8 dot-products, and the last case requires a single scan of the test image to compute the mean intensity value of each segment. In every case we fed the feature vectors to a second-order homogeneous polynomial SVM. The trained SVM was then tested on a set of 4292 images of vehicles and 9589 images of non-vehicles. The Receiver Operator Characteristics (ROC) curve is shown in Figure 6. Each curve represents the percentage of correct classifications of vehicles against false classifications. So, for example, in the third method there is correct classification on 99.29% of the vehicles at the price of wrongly classifying 15.26% of the non-vehicles as vehicles. By changing the SVM threshold we can achieve a classification score of 95.90% on vehicles at the price of wrongly classifying only 3.15% of the non-vehicles as vehicles. The first two methods, which involve a combination of eigenvectors and segmentation, perform far worse. The third method, which takes the mean intensity value of the different segments, produced the best results, in terms of classification power, at the lowest CPU time, as it requires only a single scan of the test image. The results of using the usual PCA approach on this database were much worse. We projected the images on the first 50 principal components of the image
ensemble and fed these 50-dimensional vectors to the same SVM classifier. The results were considerably worse than those reported above. The classifier was correct on 83.71% of the test vehicles and 89.97% correct on the non-vehicles, compared to 94.92% and 90.62%, respectively, achieved by the third method (taking the means of 50 segments) reported above.
Fig. 5. Some sample images from the vehicle database.
5 Conclusions
We have shown a method for a spatio-temporal decomposition of an ensemble of images. The ensemble is first decomposed spatially into regions of pixels that exhibit a similar temporal behavior and each region is then approximated by a number of principal components. In the experiments we conducted we found, empirically, that this spatio-temporal decomposition can give the same reconstruction errors as the usual eigenimages approach, at a lower computational and memory price, compared to the usual PCA approach. For the purpose of object classification we found that the mean-segment decomposition, where every segment is approximated by its mean intensity value gives the best result both for face and vehicle classification.
Fig. 6. Comparison of Receiver Operator Characteristics (ROC) for vehicle detection. The image ensemble was decomposed in three different ways. (a) 6 segments and 8 PCs. (b) 4 segments and 30 PCs. (c) 50 segments where each segment is approximated by its mean intensity value. As can be seen, the mean-segment decomposition is far better than the other two methods that involve both spatial decomposition and temporal decomposition. Moreover, method (c) is the fastest to compute as it requires a single scan of the test image to compute the mean intensity value of each segment. See text for further details.
Acknowledgment. I would like to thank the research members of MobilEye Vision Technologies for their good advice and for allowing me to use some of their data in my experiments.
References 1. C. M. Bishop and J. M. Winn. Non-linear Bayesian Image Modelling. In European Conference on Computer Vision, Dublin, 2000. 2. C. Bregler and S. M. Omohundro. Nonlinear manifold learning for visual speech recognition. In Fifth International Conference on Computer Vision, pages 494-499, Boston, June 1995. 3. R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley-Interscience publication, 1973. 4. B. J. Frey, A. Colmenarez and T. S. Huang. Mixtures of Local Linear Subspaces for Face Recognition. In IEEE Conference on Computer Vision and Pattern Recognition, Santa Barbara, CA, June 1998. 5. G. E. Hinton, M. Revow and P. Dayan. Recognizing handwritten digits using mixtures of linear models. In Advances in Neural Information Processing Systems 7, G. Tesauro, D. Touretzky and T. Leen, Eds., 1995, pp. 1015-1022, MIT Press. 6. B. Moghaddam and A. Pentland. Probabilistic visual learning for object recognition. In IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):696-710, 1997. 7. Y. Moses. The Weizmann facebase, ftp.idc.ac.il/Pub/Users/cs/yael/Facebase/
8. A. Pentland, B. Moghaddam and T. Starner. View-based and Modular Eigenspaces for Face Recognition. In IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, June 1994. 9. F. Samaria and A. Harter. Parameterisation of a stochastic model for human face identification. In 2nd IEEE Workshop on Applications of Computer Vision, December 1994, Sarasota (Florida). 10. L. Sirovich and M. Kirby. Low-dimensional procedure for the characterization of human faces. In Journal of the Optical Society of America 4, 510-524. 11. M. Turk and A. Pentland. Eigenfaces for recognition. In Journal of Cognitive Neuroscience, vol. 3, no. 1, 1991. 12. V. Vapnik. The Nature of Statistical Learning Theory. Springer, N.Y., 1995.
On the Representation and Matching of Qualitative Shape at Multiple Scales

Ali Shokoufandeh¹, Sven Dickinson², Clas Jönsson³, Lars Bretzner³, and Tony Lindeberg³

¹ Department of Mathematics and Computer Science, Drexel University, USA
² Department of Computer Science, University of Toronto, Canada
³ Computational Vision and Active Perception Laboratory, Department of Numerical Analysis and Computer Science, KTH, Stockholm, Sweden
Abstract. We present a framework for representing and matching multi-scale, qualitative feature hierarchies. The coarse shape of an object is captured by a set of blobs and ridges, representing compact and elongated parts of an object. These parts, in turn, map to nodes in a directed acyclic graph, in which parent/child edges represent feature overlap, sibling edges join nodes with shared parents, and all edges encode geometric relations between the features. Given two feature hierarchies, represented as directed acyclic graphs, we present an algorithm for computing both similarity and node correspondence in the presence of noise and occlusion. Similarity, in turn, is a function of structural similarity, contextual similarity (geometric relations among neighboring nodes), and node contents similarity. Moreover, the weights of these components can be varied on a node by node basis, allowing a graph-based model to effectively parameterize the saliency of its constraints. We demonstrate the approach on two domains: gesture recognition and face detection.
1 Introduction
Matching is a fundamental problem in computer vision, and plays an important role in many tasks, ranging from stereo reconstruction to object recognition. Given two images, matching can be simply defined as establishing a correspondence between features in one image and similar features in the other image. The representation of image features at multiple scales has long been a powerful paradigm in computer vision, offering a number of attractive properties. From a representational standpoint, scale spaces support the processing of information at appropriate (even variable) levels of resolution. Moreover, the computational complexity of many tasks can be reduced by using the results of processing at coarser scales to constrain processing at finer scales. Matching multi-scale feature hierarchies therefore raises a number of important challenges, including: 1) what
Dr. Shokoufandeh gratefully acknowledges the support of the US National Science Foundation (NSF-IDM0136337). Dr. Dickinson gratefully acknowledges the support of NSERC Canada.
features make up the hierarchy?; and 2) how do we effectively and efficiently match two feature hierarchies in the presence of noise and occlusion? In this paper, we address the problem of matching two image structures based on qualitative shape. We describe the construction of a qualitative feature hierarchy, based on the detection of ridges and blobs with automatic scale selection. These features, and the relations between them, attempt to capture the coarse shape of an object in terms of a set of parts. Given two such feature hierarchies, each represented by a directed acyclic graph (DAG), we describe a matching algorithm that computes both an overall measure of similarity (or distance) as well as node correspondence. The matching algorithm takes into account the structure of the feature hierarchy, the contents of the nodes, and the attributed relations (i.e., context) among nodes. Moreover, the weights of these components can be varied on a node by node basis, allowing a graph-based model to effectively parameterize the saliency of its constraints. To demonstrate the approach, we apply it to two separate domains: gesture recognition and face detection.
2 Related Work
There has been considerable effort devoted to both scale space theory and hierarchical structure matching, although much less effort has been devoted to combining the two paradigms. Although space prohibits an extensive review, a few examples are worth noting. Coarse-to-fine image descriptions are plentiful, including work by Burt [2] and Lindeberg [6]. Some have applied such models to directing visual attention, e.g., Tsotsos [19], while others have applied multiscale structures to matching, including Crowley and Sanderson [3], Rao et al. [12], Bretzner and Lindeberg [1], and Pizer et al. [11]. Although graph matching is a popular topic in the literature, including both inexact and exact graph matching algorithms, there is far less work on dealing with hierarchical graphs, i.e., DAGs, in which lower levels reflect less saliency. The work most relevant to the work reported here is the work of Shokoufandeh et al. [15], which matched multi-scale blob representations represented as directed acyclic graphs. Our framework differs in that: 1) it includes geometric relations among features; 2) the importance of structure and geometry as a matching criteria can be varied continuously within a single framework (as opposed to two separate algorithms); and 3) our matching framework offers several orders of magnitude less complexity. Another related approach is due to Wiskott et al., who apply elastic graph matching to a planar graphs whose nodes represent collections of wavelet jets. Although their features are multi-scale, their representation is not hierarchical, and matching requires that the graphs be coarsely aligned in scale and image rotation [21]. A similar approach was applied to hand posture recognition by Triesch and von der Malsburg [18].
Fig. 1. Feature Extraction: (a) extracted blobs and ridges at locally adapted scales; (b) extracted features after removing multiple responses and ridge linking.
3 Building a Qualitative Shape Feature Hierarchy
As mentioned in Section 1, our qualitative feature hierarchy represents image structures in terms of blob and ridge features with a set of attribute relations. The representation is an extension of the work presented in [1]. Blob and ridge extraction is performed using automatic scale selection, as described in previous work (see [8] and [7]). It should be noted, however, that the resulting representation is not intended as a complete representation. In a more general setting, extensions to other types of image features and feature relations should, of course, be considered.
3.1 Extracting Qualitative Shape Features
A scale-space representation of the image signal f is computed by convolution with Gaussian kernels g(·; t) of different variance t, giving L(·; t) = g(·; t) ∗ f (·). Blob detection aims at locating compact objects or parts in the image. The entity used to detect blobs is the square of the normalized Laplacian operator,

∇²_norm L = t (L_xx + L_yy).   (1)
Blobs are detected as local maxima in scale-space. Figure 1(a) shows an image of a hand with the extracted blobs superimposed. A blob is graphically represented by a circle defining a support region, whose radius is proportional to the scale. Elongated structures are localized where the multi-scale ridge detector

R_norm L = t^{3/2} (L_pp − L_qq)² = t^{3/2} ((L_xx − L_yy)² + 4 L_xy²)   (2)
assumes local maxima in scale-space. Figure 1(a) also shows the extracted ridges, represented as superimposed ellipses, each defining a support region, with width proportional to the scale. To represent the spatial extent of a detected image structure, a windowed second moment matrix

Σ = ∫_{η∈ℝ²} [ L_x²  L_x L_y ; L_x L_y  L_y² ] g(η; t_int) dη   (3)
is computed at the detected feature position and at an integration scale t_int proportional to the scale of the detected image feature. There are two parameters of the directional statistics that we make use of here: the orientation and the anisotropy, given from the eigenvalues λ1 and λ2 (λ1 > λ2) and their corresponding eigenvectors e_λ1 and e_λ2 of Σ. The anisotropy is defined as Q̃ = (1 − λ2/λ1) / (1 + λ2/λ1), while the orientation is given by the direction of e_λ1. To improve feature detection in scenes with poor intensity contrast between the image object and background, we utilize color information. This is done by extracting features in the R, G and B color bands separately, along with the intensity image. Re-occurring features are awarded with respect to significance. Furthermore, if we have advance information on the color of the object, improvements can be achieved by weighting the significance of the features in the color bands differently. When constructing a feature hierarchy, we extract the N most salient image features, ranked according to the response of the scale-space descriptors used in the feature extraction process. From these features, a Feature Map is built according to the following steps: Merging multiple feature responses. This step removes multiple overlapping responses originating from the same image structure, the effect of which can be seen in Figure 1(a). To be able to detect overlapping features, a measure of inter-feature similarity is needed. For this purpose, each feature is associated with a 2-D Gaussian kernel g(x, Σ), where the covariance is given from the second moment matrix. When two features are positioned near each other, their Gaussian functions will intersect. The similarity measure between two such features is based on the disjunct volume D of the two Gaussians [5]. This volume is calculated by integrating the square of the difference between the two Gaussian functions (g_A, g_B) corresponding to the two intersecting features A and B: D(A, B) = ((|Σ_A| + |Σ_B|)/2) ∫_{ℝ²} (g_A − g_B)² dx. The disjunct volume depends on the differences in position, variance, scale and orientation of the two Gaussians, and for ridges is more sensitive to translations in the direction perpendicular to the ridge. Ridge linking. The ridge detection will produce multiple responses on a ridge structure that is long compared to its width. These ridges are linked together to form one long ridge, as illustrated in Figure 1(b). The criteria for when to link two ridges are based on two conditions: 1) they must be aligned, and 2) their support regions must overlap. After the linking is performed, the anisotropy and support region for the resulting ridge are re-calculated. The anisotropy is re-calculated from the new length/width relationship as 1 − (width of structure)/(length of structure).
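For illustration, the sketch below evaluates the scale-normalized blob and ridge strength measures of Eqs. (1) and (2) at a single scale using Gaussian-derivative filtering; the search for local maxima over space and scale, the color-band handling, and the response merging described above are omitted, and the SciPy-based implementation is an assumption rather than the authors' code.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Sketch of the scale-normalized blob and ridge strength measures of
# Eqs. (1) and (2) at a single scale t, computed with Gaussian-derivative
# filtering. Feature detection would search for local maxima of these
# responses over both space and scale; that search is omitted here.

def blob_and_ridge_strength(image, t):
    sigma = np.sqrt(t)
    Lxx = gaussian_filter(image.astype(float), sigma, order=(0, 2))
    Lyy = gaussian_filter(image.astype(float), sigma, order=(2, 0))
    Lxy = gaussian_filter(image.astype(float), sigma, order=(1, 1))
    blob = (t * (Lxx + Lyy)) ** 2                              # Eq. (1)
    ridge = t ** 1.5 * ((Lxx - Lyy) ** 2 + 4.0 * Lxy ** 2)     # Eq. (2)
    return blob, ridge
```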
3.2 Assembling the Features into a Graph
Once the Feature Map is constructed, the component features are assembled into a directed acyclic graph. Associated with each node (blob/ridge) are a number of
Fig. 2. The four edge relations: two normalized distance measures, relative orientation, and bearing
attributes, including position, orientation, and support region. A feature at the coarsest scale of the Feature Map is chosen as the root. Next, finer-scale features that overlap with the root become its children through hierarchical edges. These children, in turn, select overlapping features at any scale to be their children, etc. From the unassigned features, a new root is chosen and the process is repeated until all features are assigned to the graph. Two nodes that share a parent have a sibling edge between them.¹ In cases where a feature could have two parents, the feature at the coarsest scale is chosen. Associated with each sibling or hierarchical edge are a number of important geometric attributes. For an edge E, directed from a vertex VA representing feature FA, to a vertex VB representing feature FB, we define the following attributes, as shown in Figure 2 (a sketch of their computation follows the list):
– Distance. Two measures of inter-feature distance are associated with the edge: 1) the smallest distance d from the support region of FA to the support region of FB, normalized to the largest of the radii rA and rB; and 2) the distance between their centers normalized to the radius rA of FA in the direction of the distance vector between their centers.
– Relative orientation. The relative orientation between FA and FB.
– Bearing. The bearing of a feature FB, as seen from a feature FA, is defined as the angle of the distance vector xB − xA with respect to the orientation of FA, measured counter-clockwise.
– Scale ratio. The scale-invariant relation between FA and FB is a ratio between the scales tFA and tFB.
– Type. Classifies the edge as parental or sibling.
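The sketch below computes these edge attributes for a pair of features, each reduced to a center, an orientation, a single support radius, and a scale; the scalar-radius simplification (an elongated ridge would use the radius in the direction of the distance vector) and the dictionary-based interface are assumptions made for illustration.

```python
import math

# Sketch of the geometric edge attributes listed above, for an edge from
# feature A to feature B. Each feature is reduced to a center (x, y), an
# orientation theta (radians), a single support radius r and a scale t;
# this simplified representation is illustrative.

def edge_attributes(A, B):
    dx, dy = B["x"] - A["x"], B["y"] - A["y"]
    center_dist = math.hypot(dx, dy)
    gap = max(center_dist - A["r"] - B["r"], 0.0)   # zero if support regions overlap
    return {
        "dist_support": gap / max(A["r"], B["r"]),   # smallest inter-region distance, normalized
        "dist_center": center_dist / A["r"],         # center distance, normalized to r_A
        "rel_orientation": B["theta"] - A["theta"],  # relative orientation
        "bearing": math.atan2(dy, dx) - A["theta"],  # angle of x_B - x_A w.r.t. A's orientation
        "scale_ratio": B["t"] / A["t"],              # scale-invariant relation
    }
```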
4 Matching Problem Formulation
Given two images and their corresponding feature map graphs, G1 = (V1, E1) and G2 = (V2, E2), with |V1| = n1 and |V2| = n2, we seek a method for computing their similarity. In the absence of noise, segmentation errors, occlusion, and clutter, computing the similarity of G1 and G2 could be formulated as a label-consistent graph isomorphism problem. However, under normal imaging
1 Sibling edges are not included in a structural description of the graph, thereby ensuring that it is a DAG.
conditions, there may not exist significant subgraphs common to G1 and G2. We therefore seek an approximate solution that captures both the topological and geometrical similarity of G1 and G2 as well as corresponding node similarity. Topological similarity is a domain-independent measure that accounts for similarity in the "shapes" of two graphs, in terms of numbers of nodes, branching factor distributions, etc. Geometrical similarity accounts for consistency in the relative positions, orientations, and scales of nodes in the two graphs, while node similarity is a domain-dependent function that accounts for the similarity of the contents of corresponding nodes.
In previous work, we mapped the structure of a unique rooted tree to a low-dimensional vector [16]; two vectors that were close together represented two trees that were nearly isomorphic in terms of topology, or structure. In this paper, we extend the approach to cover directed acyclic graphs, a more powerful characterization of hierarchical structure, of which trees are a special case. Moreover, whereas our previous work examined the stability of this mapping under a restricted class of structural perturbations (the perturbed tree was a subtree of the original tree or vice versa), we now extend this analysis to directed acyclic graphs and strengthen it to include any perturbation. Our previous work focused primarily on structural similarity, with node information (restricted to label information) playing a minor role. In this paper, we describe a method for mapping the geometry of a directed acyclic graph into a low-dimensional vector; in this case, two vectors that are close together represent two graphs whose nodes are similar in terms of feature type, scale, position, orientation, and support region. Finally, we combine these functions in an extension of our original algorithm (originally intended for rooted trees) that will take two directed acyclic graphs and compute their similarity.
4.1 Encoding Graph Structure
To describe the topology of a DAG, we turn to the domain of eigenspaces of graphs, first noting that any graph can be represented as an antisymmetric {0, 1, −1} adjacency matrix, with 1’s (-1’s) indicating a forward (backward) edge between adjacent nodes in the graph (and 0’s on the diagonal). The eigenvalues of a graph’s adjacency matrix encode important structural properties of the graph. Furthermore, the eigenvalues of an antisymmetric (or Hermitian) matrix A are invariant to any orthonormal transformation of the form P t AP . Since a permutation matrix is orthonormal, the eigenvalues of a graph are invariant to any consistent re-ordering of the graph’s branches. However, before we can exploit a graph’s eigenvalues for matching purposes, we must establish their stability under minor topological perturbation, due to noise, occlusion, or deformation. We will begin by showing that any structural change to a graph can be modeled as a two-step transformation of its original adjacency matrix. The first step transforms the graph’s original adjacency matrix to a new matrix having the same spectral properties as the original matrix. The second step adds a noise matrix to this new matrix, representing the structural changes due to noise and/or occlusion. These changes take the form of the addition/deletion of
nodes/arcs to/from the original graph. We will then draw on an important result that relates the distortion of the magnitudes of the eigenvalues of the matrix resulting from the first step to the magnitude of the noise added in the second step. Since the eigenvalues of the original matrix are the same as those of the transformed matrix (first step), the noise-dependent bounds on the magnitudes of the eigenvalues therefore apply to the original matrix. The result will establish the insensitivity of a graph's spectral properties to minor topological changes.
Let's begin with some definitions. Let AG ∈ {0, 1, −1}^(m×m) denote the adjacency matrix of the graph G on m vertices, and assume H is an n-vertex graph obtained by adding n − m new vertices and a set of edges to the graph G. Let Ψ : {0, 1, −1}^(m×m) → {0, 1, −1}^(n×n) be a lifting operator which transforms a subspace of R^(m×m) to a subspace of R^(n×n), with n ≥ m. We will call this operator spectrum preserving if the eigenvalues of any matrix A ∈ {0, 1, −1}^(m×m) and its image with respect to the operator (Ψ(A)) are the same up to degeneracy, i.e., the only difference between the spectra of A and Ψ(A) is the number of zero eigenvalues (Ψ(A) has n − m more zero eigenvalues than A). As stated above, our goal is to show that any structural change in graph G can be represented in terms of a spectrum preserving operator and a noise matrix. Specifically, if AH denotes the adjacency matrix of the graph H, then there exists a spectrum preserving operator Ψ() and a noise matrix EH ∈ {0, 1, −1}^(n×n) such that:
AH = Ψ(AG) + EH.    (4)
We will define Ψ() as a lifting operator consisting of two steps. First, we will add n − m zero rows and columns to the matrix AG, and denote the resulting matrix by ĀG. Next, ĀG will be pre- and post-multiplied by a permutation matrix P and its transpose P^t, respectively, aligning the rows and columns corresponding to the same vertices in AH and Ψ(AG). Since the only difference between the eigenvalues of AG and ĀG is the number of zero eigenvalues, and P ĀG P^t has the same set of eigenvalues as the matrix ĀG, Ψ() is a spectrum preserving operator. As a result, the noise matrix EH can be represented as AH − Ψ(AG) ∈ {0, 1, −1}^(n×n).
Armed with a spectrum-preserving lifting operator and a noise matrix, we can now proceed to quantify the impact of the noise on the magnitudes of the original graph's eigenvalues. Specifically, let λk for k ∈ {1, ..., n} denote the magnitude of the k-th largest eigenvalue of the matrix A. A seminal result of Wilkinson [20] (see also Stewart and Sun [17]) states that:
Theorem 1. If A and A + E are n × n antisymmetric matrices, then:
λk(A) + λk(E) ≤ λk(A + E) ≤ λk(A) + λ1(E),  for all k ∈ {1, ..., n}.    (5)
For H (perturbed graph) and G (original graph), the above theorem yields, for all k ∈ {1, ..., n}:
λk(Ψ(AG)) + λk(EH) ≤ λk(AH) ≤ λk(Ψ(AG)) + λ1(EH)
⇐⇒ λk(EH) ≤ λk(AH) − λk(Ψ(AG)) ≤ λ1(EH)    (6)
⇐⇒ |λk(AH) − λk(Ψ(AG))| ≤ max{|λ1(EH)|, |λk(EH)|},
and since |λ1(EH)| = ||EH||2 (spectral radius), we have:
|λk(AH) − λk(Ψ(AG))| ≤ ||EH||2.    (7)
The above chain of inequalities gives a precise bound on the distortion of the magnitudes of the eigenvalues of Ψ(AG) in terms of the largest eigenvalue of the noise matrix EH. Since Ψ() is a spectrum preserving operator, the magnitudes of the eigenvalues of AG follow the same bound in their distortions.
The above result has several important consequences for our application of a graph's eigenvalues to graph matching. Namely, if the perturbation EH is small in terms of its complexity, then the magnitudes of the eigenvalues of the new graph H (e.g., the query graph) will remain close to their corresponding nonzero eigenvalues of the original graph G (e.g., the model graph), independent of where the perturbation is applied to G. The magnitude of the eigenvalue distortion is a function of the number of vertices added to the graph due to the noise or occlusion. Specifically, if the noise matrix EH introduces k new vertices to G, then the distortion of every eigenvalue can be bounded by √(k − 1) (Neumaier [10]). This bound can be further tightened if the noise matrix has simple structure. For example, if EH represents a simple path on k vertices, then its norm can be bounded by 2 cos(π/(k + 1)) (Lovász and Pelikán [9]). In short, large distortions are due to the introduction/deletion of large, complex subgraphs to/from G, while small structural changes will have little impact on the higher order eigenvalues of G. The magnitudes of the eigenvalues of a graph are therefore stable under minor perturbations in graph structure.
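The bound in (7) is easy to check numerically. The sketch below builds the {0, 1, −1} antisymmetric adjacency matrices of a small DAG G and a perturbed DAG H, forms the zero-padded lifting and the noise matrix EH, and verifies that every eigenvalue magnitude moves by at most ||EH||2; the example graphs and helper names are ours.

import numpy as np

def dag_adjacency(n, edges):
    # Antisymmetric {0, 1, -1} adjacency matrix: +1 for a forward edge, -1 for its reverse.
    A = np.zeros((n, n))
    for u, v in edges:                       # directed edge u -> v
        A[u, v], A[v, u] = 1.0, -1.0
    return A

def eig_magnitudes(A):
    # Sorted magnitudes of the (purely imaginary) eigenvalues of an antisymmetric matrix.
    return np.sort(np.abs(np.linalg.eigvals(A)))[::-1]

AG = dag_adjacency(4, [(0, 1), (0, 2), (1, 3)])              # original graph G
AH = dag_adjacency(5, [(0, 1), (0, 2), (1, 3), (2, 4)])      # perturbed graph H

lifted = np.zeros((5, 5))                    # Psi(AG): zero-pad (the aligning permutation
lifted[:4, :4] = AG                          # is the identity for this vertex numbering)
EH = AH - lifted                             # noise matrix

distortion = np.abs(eig_magnitudes(AH) - eig_magnitudes(lifted))
bound = np.max(np.abs(np.linalg.eigvals(EH)))                # ||E_H||_2 for a normal matrix
print(np.all(distortion <= bound + 1e-9))                    # True: eq. (7) holds here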
Fig. 3. Forming the structural and geometric signatures.
Having established the stability of a DAG's eigenvalues under minor perturbation of the graph, we can now proceed with our task of capturing a graph's structure with a low-dimensional vector. We could, for example, define a vector to be the sorted eigenvalues of a DAG. However, for large DAGs, the dimensionality of the index (and model DAG database) would be prohibitively large. Moreover, since the graph's eigenvalues reflect global structure, they cannot accommodate significant occlusion. Our solution to these problems will be based on: 1) computing eigenvalue sums to reduce dimensionality, and 2) computing a vector at each node that reflects the node's local structure in the hierarchy.
We will now briefly review our encoding of directed acyclic graph structure, originally used to encode undirected, rooted tree structure [16]. For brevity, subsequent references to eigenvalues imply magnitudes of eigenvalues. Specifically, let D be a DAG whose maximum branching factor is ∆(D), and let the subgraphs of its root be D1, D2, . . . , DS, as shown in Figure 3(a). For each subgraph, Di, whose root degree is δ(Di), we compute the eigenvalues of Di's submatrix, sort the eigenvalues in decreasing order by absolute value, and let Si be the sum of the δ(Di) − 1 largest absolute values. The sorted Si's become the components of a ∆(D)-dimensional vector assigned to the DAG's root. If the number of Si's is less than ∆(D), then the vector is padded with zeroes. We can recursively repeat this procedure, assigning a vector to each nonterminal node in the DAG, computed over the subgraph rooted at that node. We call each such vector a topological signature vector, or TSV. Although the eigenvalue sums are invariant to any consistent re-ordering of the DAG's branches, we have given up some uniqueness (due to the summing operation) in order to reduce dimensionality. We could have elevated only the largest eigenvalue from each subgraph (non-unique but less ambiguous), but this would be less representative of the subgraph's structure. We choose the δ(Di) − 1 largest eigenvalues for two reasons: 1) the largest eigenvalues are more informative of subgraph structure, and 2) by summing δ(Di) − 1 elements, we effectively normalize the sum according to the local complexity of the subgraph root. In [16], we describe an efficient method for computing the sum of the k largest eigenvalues directly (without computing the individual eigenvalues).
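A direct (if inefficient) sketch of the TSV construction is given below: for each child subgraph of a node it sums the δ(Di) − 1 largest eigenvalue magnitudes of the child's adjacency submatrix, sorts the sums, and zero-pads to the maximum branching factor. Here A is the antisymmetric adjacency matrix of Section 4.1, `children` maps every vertex to its (possibly empty) child list, and leaves are taken to contribute zero; these conventions, and the use of a full eigendecomposition rather than the efficient method of [16], are our assumptions.

import numpy as np

def reachable(children, root):
    # All vertices of the subgraph rooted at `root` (root included).
    seen, stack = set(), [root]
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(children[u])
    return sorted(seen)

def tsv(A, node, children, max_branch):
    # Topological signature vector of `node`, padded with zeros up to Delta(D).
    sums = []
    for c in children[node]:
        sub = reachable(children, c)
        mags = np.sort(np.abs(np.linalg.eigvals(A[np.ix_(sub, sub)])))[::-1]
        k = max(len(children[c]) - 1, 0)     # delta(D_i) - 1 terms (leaves contribute 0)
        sums.append(mags[:k].sum())
    v = np.zeros(max_branch)
    v[:len(sums)] = np.sort(sums)[::-1]
    return v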
4.2 Encoding Graph Geometry
The above encoding of structure suffers from the drawback that it does not capture the geometry of the nodes. For example, two graphs with identical structure may differ in terms of the relative positions of their nodes, the relative orientations of their nodes (for elongated nodes), and the relative scales of their nodes. Just as we derived a topological signature vector, which encodes the “neighborhood” structure of a node, we now seek an analogous “geometrical signature vector”, which encodes the neighborhood geometry of a node. Below, we derive a geometrical signature of a node’s local context. This geometrical signature will be combined with our topological signature in a new algorithm that computes the distance between two directed acyclic graphs. Let G = (V, E) be a graph to be recognized (input image). For every pair of vertices u, v ∈ V , if there is an edge E = (u, v) between them, we let Ru,v denote the attribute vector associated with edge E. The entries of each such vector represent the set of relations R = {distance, relative orientation, bearing,
scale ratio} between u and v, as shown in Figure 3(b). For a vertex u ∈ V, we let N(u) denote the set of vertices v ∈ V such that the directed edge (u, v) corresponds to a sibling relation. For a relation p ∈ R, we will denote P(u, p) as the set of values of relation p between vertex u and all the vertices in the set N(u), i.e., P(u, p) corresponds to the entry p of the vectors Ru,v for v ∈ N(u).2
Given two graphs G = (V, E) and G' = (V', E') with vertices u ∈ V and u' ∈ V', we compute the similarity between u and u' in terms of their respective values in the sets P(u, p) and P(u', p), for all p ∈ R. Now, let dp(u, u') denote the Hausdorff distance between sets P(u, p) and P(u', p), i.e.
dp(u, u') = max_{a ∈ P(u,p)} min_{b ∈ P(u',p)} |a − b|.
In other words, the distance between sets P(u, p) and P(u', p) is the smallest value d such that every element of P(u, p) has an element of P(u', p) within distance d. If the relation is not normalized, we can introduce a translation, t, and compute the distance wp(u, u') between sets P(u, p) and P(u', p) by finding the value of t for which the distance between points of P(u, p) and P(u', p) is minimized, i.e.,
wp(u, u') = min_{t ∈ R} dp(u, u') = min_{t ∈ R} max_{a ∈ P(u,p)} min_{b ∈ P(u',p)} |a + t − b|.
wp(u, u') can be computed using the algorithm of Rote [14], in time O((|N(u)| + |N(u')|) log(|N(u)| + |N(u')|)). Given the values of wp(u, u') for all p ∈ R, we arrive at a final node similarity function for vertices u and u':
W(u, u') = e^(−Σ_{p∈R} wp(u, u')).
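The sketch below computes these quantities for 1-D relation sets. The directed Hausdorff distance follows the definition above; the translated variant is approximated by a simple grid scan rather than the exact O(n log n) algorithm of Rote [14]; and the node similarity exponentiates the negated sum over relations. Function names and the grid resolution are our own choices.

import numpy as np

def directed_hausdorff_1d(P, Q):
    # d_p: max over a in P of the distance from a to its nearest element of Q.
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    return max(np.min(np.abs(Q - a)) for a in P)

def translated_hausdorff_1d(P, Q, steps=400):
    # w_p: min over a translation t of the directed Hausdorff distance (grid-scan approximation).
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    offsets = [b - a for a in P for b in Q]
    ts = np.linspace(min(offsets), max(offsets), steps)
    return min(directed_hausdorff_1d(P + t, Q) for t in ts)

def node_similarity(rel_u, rel_v, translated=()):
    # W(u, u') = exp(-sum_p w_p); rel_* map a relation name to its value set over siblings.
    total = 0.0
    for p in rel_u:
        if p in translated:
            total += translated_hausdorff_1d(rel_u[p], rel_v[p])
        else:
            total += directed_hausdorff_1d(rel_u[p], rel_v[p])
    return float(np.exp(-total))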
4.3 Matching Algorithm
In previous work, we developed an algorithm for computing the distance (and correspondence) between two unique rooted undirected trees [16]. We now extend our algorithm to compute the distance between two directed acyclic graphs. As mentioned earlier, our major challenge is in computing an approximate isomorphism when, due to noise, occlusion, etc., a nontrivial subgraph isomorphism may not exist.
Let's consider relaxing the isomorphism constraint for a moment. Each node in our graph (query or model) is assigned a TSV, which reflects the underlying structure in the subgraph rooted at that node. In addition, each node is assigned a set of relation histograms, which capture the geometry of the graph. If we simply removed all the edges in our two graphs, we would be faced with the problem of finding the best correspondence between the nodes in the query and the nodes in the model; two nodes could be said to be in close correspondence
2 The exception to this rule is the orientation relation. Rather than use absolute orientation, measured with respect to a reference direction, we instead use the angle from the previous edge in a clockwise ordering of the edges emanating from a vertex.
if the distance between their TSVs and the distances between their geometric feature histograms were small. In fact, such a formulation amounts to finding the maximum cardinality, maximum weight matching in a bipartite graph spanning the two sets of nodes. At first glance, such a formulation might seem like a bad idea (by throwing away all that important graph structure!) until one recalls that the graph structure is really encoded in the node's TSV, and the graph geometry is encoded in the node's relation histograms. Is it then possible to reformulate a noisy, largest isomorphic subgraph problem as a simple bipartite matching problem?
Unfortunately, in discarding all the graph structure, we have also discarded the underlying hierarchical structure. There is nothing in the bipartite graph matching formulation that ensures that hierarchical constraints among corresponding nodes are obeyed, i.e., that parent/child nodes in one graph don't match child/parent nodes in the other. This reformulation, although softening the overly strict constraints imposed by the largest isomorphic subgraph formulation, is perhaps too weak. We could try to enforce the hierarchical constraints in our bipartite matching formulation, but no polynomial-time solution is known to exist for the resulting formulation. Clearly, we seek an efficient approximation method that will find corresponding nodes between two noisy, occluded DAGs, subject to hierarchical constraints.
Our algorithm, a modification of Reyner's algorithm [13], combines the above bipartite matching formulation with a greedy, best-first search in a recursive procedure to compute the corresponding nodes in two rooted DAGs. As in the above bipartite matching formulation, we compute the maximum cardinality, maximum weight matching in the bipartite graph spanning the two sets of nodes. Edge weight will encode a function of the topological similarity, the geometric similarity, and a domain-dependent node similarity. Moreover, the relative weights of these components can be specified in the model on a node-by-node basis, allowing certain geometrical constraints to be relaxed while others are strengthened. The result will be a selection of edges yielding a mapping between query and model nodes. As mentioned above, the computed mapping may not obey hierarchical constraints. We therefore greedily choose only the best edge (the two most similar nodes in the two graphs, representing in some sense the two most similar subgraphs), add it to the solution set, and recursively apply the procedure to the subgraphs defined by these two nodes. Unlike a traditional depth-first search, which backtracks to the next statically-determined branch, our algorithm effectively recomputes the branches at each node, always choosing the next branch to descend in a best-first manner. In this way, the search for corresponding nodes is focused in corresponding subgraphs (rooted DAGs) in a top-down manner, thereby ensuring that hierarchical constraints are obeyed.
In moving from rooted trees to DAGs, we have to deal with multiple reachability of a vertex in a DAG. Specifically, there may exist two vertices u and v, such that one is not an ancestor of the other, that share a subgraph rooted at vertex w. If the vertices in w are matched, for example, while recursively descending from u, they must not be made available for matching while descending v.
procedure isomorphism(G, H)
    Φ(G, H) ← ∅                              ; solution set
    d ← max(δ(G), δ(H))                      ; TSV degree
    for u ∈ VG {                             ; compute TSV at each node and unmark all nodes in G
        compute χ(u) ∈ R^(d−1) (see Section 4.1)
        unmark u
    }
    for v ∈ VH {                             ; compute TSV at each node and unmark all nodes in H
        compute χ(v) ∈ R^(d−1) (see Section 4.1)
        unmark v
    }
    call match(root(G), root(H))
    return cost(Φ(G, H))
end

procedure match(u, v)
    do {
        let Gu ← rooted unmarked subgraph of G at u
        let Hv ← rooted subgraph of H at v
        compute the |VGu| × |VHv| weight matrix Π(Gu, Hv)
        M ← max cardinality, max weight bipartite matching in G(VGu, VHv)
            with weights from Π(Gu, Hv) (see [4])
        (u', v') ← max weight pair in M
        Φ(G, H) ← Φ(G, H) ∪ {(u', v')}
        call match(u', v')
        mark Gu
        mark Hv
    } while (Gu ≠ ∅ and Hv ≠ ∅)
Fig. 4. Algorithm for Matching Two Hierarchical Structures (DAG’s) (modified from our previous algorithm for matching undirected rooted trees [16])
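The bipartite step inside match() can be sketched with an off-the-shelf assignment solver. Below, each node carries a TSV and its relation sets; the edge weight mixes TSV closeness with the geometric node similarity W(u, u') of Section 4.2 (reusing node_similarity from the earlier sketch), and SciPy's linear_sum_assignment stands in for the matching algorithm of [4]. The mixing weight alpha, the exponential TSV term, and the fact that the solver always returns a maximum-cardinality assignment are simplifications of ours.

import numpy as np
from scipy.optimize import linear_sum_assignment

def edge_weight(u, v, alpha=0.5):
    # Convex mix of topological (TSV) and geometric node similarity.
    tsv_term = np.exp(-np.linalg.norm(u["tsv"] - v["tsv"]))
    return alpha * tsv_term + (1.0 - alpha) * node_similarity(u["rel"], v["rel"])

def best_pair(query_nodes, model_nodes):
    # Max-weight bipartite matching, then the single most similar matched pair (u', v').
    W = np.array([[edge_weight(u, v) for v in model_nodes] for u in query_nodes])
    rows, cols = linear_sum_assignment(-W)       # SciPy minimizes cost, so negate the weights
    k = int(np.argmax(W[rows, cols]))
    return rows[k], cols[k], W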
We therefore need to keep track of matched vertices in our top-down matching process.
As mentioned above, the algorithm computes, at each step, a maximum cardinality, maximum weight matching in the bipartite graph. This requires that we define an objective function that can evaluate the quality of a matching. Given feature map graphs, G1 = (V1, E1) and G2 = (V2, E2), representing query and model images, respectively, along with a partial matching function, Φ : V1 → V2, between the regions of G1 and G2, we define the mapping matrix, M(Φ), between G1 and G2, to be a |V1| × |V2|, {0, 1} matrix as follows:
Mu,v = 1 if u ∈ V1, v ∈ V2, v = Φ(u), and Mu,v = 0 otherwise.
Since Φ() is a bijective mapping, M(Φ) will satisfy the following conditions:
Σ_{v∈V2} Mu,v ≤ 1 ∀u ∈ V1,   Σ_{u∈V1} Mu,v ≤ 1 ∀v ∈ V2.
Given this formulation of the mapping, Φ(), we define the similarity of G1 and G2 to be:
S(Φ) = ε Σ_{u∈V1} Σ_{v∈V2} Mu,v |Φ(u, v)| + (1 − ε) Σ_{v∈V2} γ(v).    (8)
The terms of the convex combination S(Φ) reflect the similarity between the query and model graphs. Specifically, the first term represents the quality of
correspondence between the matched regions, while the second term penalizes unmatched regions in the model graph. By definition, the value of γ(v) is one if the region v in G2 is matched to a region u in G1; otherwise, it will have a value strictly less than one if the vertex v is unmatched under Φ(). Specifically, for a vertex v we will define an error function E(v) which grows proportionally to the complexity of its topological signature, the relative scale of v with respect to its neighbors, and the normalized support size. The penalty for an unmatched region v is defined as:
γ(v) = e^(−E(v)).    (9)
Finally, the parameter of convexity ε will be computed through a least-squares approximation, using a set of sample training query and model correspondences. Note that penalizing unmatched model nodes (regions) is a domain-dependent assumption for domains in which the target object is unoccluded, i.e., all model regions should be visible. For domains in which target occlusion can be expected, ε can be set to 1.0.
In terms of algorithmic complexity, observe that during the depth-first construction of the matching chains, each vertex in G or H will be matched at most once in the forward procedure. Once a vertex is mapped, it will never participate in another mapping again. The total time complexity of constructing the matching chains is therefore bounded by O(n²√(n log log n)), for n = max(n1, n2) [4]. Moreover, the construction of the χ(v) vectors will take O(n√(nL)) time, implying that the overall complexity of the algorithm is max(O(n²√(n log log n)), O(n²√(nL))). The above algorithm therefore provides, in polynomial time better than O(n³), an approximate optimal solution to the largest isomorphic subgraph problem in the presence of noise.
It is important to note that since TSV's are computed at each node, thereby providing both global structural descriptions (at coarser scale nodes) and local structural descriptions (at finer scale nodes), the algorithm can exploit locality of representation to accommodate occlusion and clutter. Moreover, the matching algorithm does not require connectedness, allowing multiple roots to spawn disjoint DAG's. However, there is a potential problem when there are, for example, two candidate roots. If there are (finer scale) nodes overlapping the two roots, then the underlying DAG structure will depend on the choice of root, affecting the stability of the description. Under these conditions (and more generally), we allow a node to be a child of both (multiple) roots, thereby providing invariance to root selection order.
5 Demonstration
We demonstrate our representation and matching framework on a gesture recognition problem and a face detection problem. Figure 5(a) shows an example gesture image, along with the extracted graph superimposed. Hierarchical edges are shown as red lines, so we see that the ridges corresponding to the fingers are all children of the blob corresponding to the palm. This means that all fingers are siblings, connected with green lines.
Fig. 5. Gesture Recognition. On the left are detected features as vertices superimposed on the image with the hierarchical relations/edges (red-parent, green-sibling), while on the right is an example computed correspondence between two Feature Maps.
We conducted experiments with query gestures against a uniform background and with query gestures against complex backgrounds. For each of seven gesture classes, we had five different exemplar queries, varying slightly in shape, finger articulation, scale, etc. That exemplar whose sum distance to all other members of the same class was minimum was designated as the model and removed from the database. All other queries were then compared to each model, yielding 100% recognition over the 28 trials. The first three rows of Table 1 illustrate three of the 28 trials, showing one example query from each of the first three classes, while Figure 5(b) illustrates the computed correspondence between an example pair of gesture Feature Maps. The values in the matrix reflect similarity, while the boxed entry is that prototype (with maximum similarity) chosen to be the label of the query on the left.
In the next set of experiments, the uniform query background was replaced by a complex background, varying in degree of clutter. Five cluttered queries from each of seven classes were compared to the same model graphs, yielding a 57% (20/35) recognition rate. Of the 15 queries that were most similar to another class, 10 were second closest to the correct class. The next three rows of Table 1 illustrate two correctly matched queries and one incorrectly matched query. Upon examination, it was found that the abundance of blobs and ridges, due to clutter, tended to favour the more complex models whose additional fingers could be accounted for by the additional "accidental" features. We are currently conducting experiments in which the geometric signature weight is strengthened, to see if that filters out extraneous features (at the possible cost of allowing less within-class deformation/articulation).
Table 1. Similarity Between Query and Each Prototypical Model View (most similar model view, boxed in the original, is marked here with [ ]; the query images themselves are omitted)

Query     Similarity to Model
Query 1   [12.3519]   5.7314    6.0184    7.3514   10.2547   11.1544    9.1162
Query 2     4.0512   [6.8804]   5.2172    3.2655    2.8425    4.1812    5.9484
Query 3     5.3701    3.1344   [8.3965]   4.4714    7.5609    4.2079    2.7155
Query 4    15.1816    9.0173    5.4417   13.1903   10.1817  [15.9527]  13.2277
Query 5   [21.8351]  11.0117   12.1684   15.8837    9.2105   17.7449   16.3712
Query 6   [10.4261]   3.4102    4.1917    4.0021    7.2581    5.6933    4.9601
In the final experiment, we attempt to find a face in cluttered scenes containing a face. Five cluttered scenes, each containing a face, were matched to a face model. In each of the five trials, the scene object most similar to the model face was the scene face (or some portion thereof). Figure 6 shows two such examples, with the face feature correspondences overlaid. The results illustrate the algorithm’s ability to deal with cluttered backgrounds, missing features, and deformed objects. It is important to note that we are not claiming a superior framework for either gesture recognition or face detection; there are specific solutions to these problems that will yield better results. Rather, we are simply demonstrating that our framework for representing and matching blob decompositions is applicable to a variety of domains (we are currently exploring its application to other domains) and under a variety of imaging conditions.
Fig. 6. Face Detection Experiment. On the left is the model face image with extracted blobs, while on the right is the closest match, with corresponding features shown. In the top example, the face, left eye, and mouth regions were detected, while in the bottom example, the face and both eyes were detected. The threshold for blob detection is low to generate more clutter.
6 Conclusions
Matching two images whose similarity exists at a qualitative level and at different scales requires a hierarchical representation in terms of a set of qualitative parts. Blob and ridge features, together with an array of relational attributes, provide a hierarchical characterization of the coarse shape of an object. The matching framework presented attempts to exploit both topological and geometrical properties of multi-scale feature hierarchies. Preliminary testing on two separate domains (without domain-specific tuning) suggests that the approach may offer general applicability. However, much further work and experimentation remains, including the exploration of alternative graph formulations of the blob decomposition, enriching the description of a blob, and comprehensively testing the framework in the presence of occlusion, background clutter, and lighting variation. Finally, we plan to test the framework on other generic recognition domains, including view-based 3-D object recognition.
References
1. L. Bretzner and T. Lindeberg. Qualitative multi-scale feature hierarchies for object tracking. Journal of Visual Communication and Image Representation, 11:115–129, 2000.
2. P. J. Burt. Attention Mechanisms for Vision in a Dynamic World. In Proceedings of the International Conference on Pattern Recognition, Vol. 1, pages 977–987, The Hague, The Netherlands, 1988.
3. J. L. Crowley and A. C. Sanderson. Multiple Resolution Representation and Probabilistic Matching of 2–D Gray–Scale Shape. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(1):113–121, January 1987.
4. H. Gabow, M. Goemans, and D. Williamson. An efficient approximate algorithm for survivable network design problems. Proc. of the Third MPS Conference on Integer Programming and Combinatorial Optimization, pages 57–74, 1993.
5. I. Laptev and T. Lindeberg. Tracking of multi-state hand models using particle filtering and a hierarchy of multi-scale image features. In Proc. Scale-Space'01, Vancouver, Canada, Jul. 2001.
6. T. Lindeberg. Detecting Salient Blob–Like Image Structures and Their Scales With a Scale–Space Primal Sketch: A Method for Focus–of–Attention. International Journal of Computer Vision, 11(3):283–318, December 1993.
7. T. Lindeberg. Edge detection and ridge detection with automatic scale selection. International Journal of Computer Vision, 30:117–154, 1998.
8. T. Lindeberg. Feature detection with automatic scale selection. International Journal of Computer Vision, 30:77–116, 1998.
9. L. Lovász and J. Pelikán. On the eigenvalues of a tree. Periodica Math. Hung., 3:1082–1096, 1970.
10. A. Neumaier. Second largest eigenvalue of a tree. Linear Algebra and its Applications, 46:9–25, 1982.
11. S. Pizer, C. Sarang, P. Joshi, P. Fletcher, M. Styner, G. Tracton, and J. Chen. Segmentation of single-figure objects by deformable m-reps. In Proceedings, MICCAI, pages 862–871, 2001.
12. R. P. N. Rao, G. J. Zelinsky, M. M. Hayhoe, and D. H. Ballard. Modeling Saccadic Targeting in Visual Search. In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 830–836. MIT Press, Cambridge, MA, 1996.
13. S. W. Reyner. An analysis of a good algorithm for the subtree problem. SIAM J. Comput., 6:730–732, 1977.
14. G. Rote. Computing the minimum Hausdorff distance between two point sets on a line under translation. Information Processing Letters, 38(3):123–127, 1991.
15. A. Shokoufandeh, I. Marsic, and S. Dickinson. View-based object recognition using saliency maps. Image and Vision Computing, 17:445–460, 1999.
16. K. Siddiqi, A. Shokoufandeh, S. Dickinson, and S. Zucker. Shock graphs and shape matching. International Journal of Computer Vision, 30:1–24, 1999.
17. G. Stewart and J.-G. Sun. Matrix Perturbation Theory. Academic Press, San Diego, 1990.
18. J. Triesch and C. von der Malsburg. Robust classification of hand postures against complex background. In Proc. Int. Conf. on Face and Gesture Recognition, pages 170–175, Killington, Vermont, Oct. 1996.
19. J. Tsotsos. An inhibitory beam for attentional selection. In L. Harris and M. Jenkin, editors, Spatial Vision in Humans and Robots. Cambridge University Press, 1993.
20. J. Wilkinson. The Algebraic Eigenvalue Problem. Clarendon Press, Oxford, England, 1965.
21. L. Wiskott, J.-M. Fellous, N. Krüger, and C. von der Malsburg. Face Recognition by Elastic Bunch Graph Matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):775–779, July 1997.
Combining Simple Discriminators for Object Discrimination
Shyjan Mahamud, Martial Hebert, and John Lafferty
Dept. of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213
Abstract. We propose to combine simple discriminators for object discrimination under the maximum entropy framework or equivalently under the maximum likelihood framework for the exponential family. The duality between the maximum entropy framework and maximum likelihood framework allows us to relate two selection criteria for the discriminators that were proposed in the literature. We illustrate our approach by combining nearest prototype discriminators that are simple to implement and widely applicable as they can be constructed in any feature space with a distance function. For efficient run-time performance we adapt the work on “alternating trees” for multi-class discrimination tasks. We report results on a multi-class discrimination task in which significant gains in performance are seen by combining discriminators under our framework from a variety of easy to construct feature spaces.
1 Introduction
An object discriminator can in general be thought of as any function that induces a partition of the space of images X. For example, a very crude example of a discriminator would be a function that tests whether the average intensity or some other simple statistic of a fixed subwindow of the input image crosses a threshold; in this case the image space is partitioned into two. Examples of more complex discriminators would be decision trees, where each decision node is a simple discriminator as in the example above and the set of leaf nodes of the tree corresponds to the partition of the image space. Yet another example is a nearest prototype discriminator, in which a set of prototypes in some feature space (like color, shape, etc.) induces a partition, where each subset of image space in the partition contains the images that are closest to one of the prototypes (in other words, the partition is the voronoi diagram induced by the set of prototypes, see Figure 1).
An ideal discriminator will induce a partition in which each subset of image space in the partition contains images of objects from a single class. Such an ideal discriminator might be hard to construct if the partition required is complex. In practice, it might be easier to construct discriminators that induce simple partitions but which are far from ideal. We can then think of combining such simple discriminators to create a more powerful discriminator.
This work was supported by NSF Grant IIS-9907142 and DARPA HumanID ONR N00014-00-1-0915
Fig. 1. The partition of image space induced by discriminators. Three classes of objects are shown within the image space (depicted as an ellipse). The discriminator on the left (a nearest 3-prototype discriminator for illustration, constructed in some feature space like color, edge, shape, etc.; the prototypes are marked by ×) discriminates the three classes quite well but is complex. The discriminator on the right (with 2 prototypes) does not discriminate as well as the one on the left, but is simpler to implement. The partition boundaries in each case are given by the voronoi diagram induced by the prototypes, which are at the center of each cell.
In this paper, we show how the maximum entropy (ME) framework [6,8] can be used to combine simple discriminators. Under the ME framework, we seek a combined discriminator that is constrained to be consistent (in a manner that will be made precise shortly) with each of the simple discriminators, but is otherwise least committed. The resulting scheme is very similar to boosting [2], which was initially derived from computational learning principles (see § 3 for a discussion of the similarities and differences). The ME framework can also be generalized to handle regularization to avoid over-fitting in a principled manner [1] compared with standard boosting. In related work in computer vision, the framework has been used in the context of selecting good features for texture synthesis [10]. The framework is also known to be dual to the more familiar maximum likelihood (ML) framework for exponential distributions, but the justification for our approach is more natural under the ME framework.
A second issue we address is that of selecting which simple discriminators to combine. Given a large collection of N simple discriminators, we want to select a fixed number K ≪ N from this collection which when combined in the above framework gives the best discriminator possible. In practice, K might be constrained by the run time performance required. Two schemes have been proposed in the literature, one under the ME framework [10], and another under the ML framework [8]. Using the known duality between the two frameworks,
we show that these two seemingly different selection schemes are in fact the same (see § 4). For good run-time performance we would like to combine the simple discriminators in an efficient structure like a tree. For this purpose, we adapt the work on alternating trees [3] - a generalization of the more familiar decision trees for multi-class discrimination (see § 6). In § 7 we illustrate our scheme on a multi-class discrimination task where significant gains in performance are seen by combining simple discriminators based on different types of features.
2 Formulation
For simplicity of implementation and wide applicability, in our work we use the nearest prototype (NP) discriminators as the simple discriminators that we choose to combine. As described earlier, the partition in image space induced by such a discriminator corresponds to the voronoi diagram (see Figure 1) with prototypes at the center of each cell. An NP discriminator has wide applicability since it can be constructed in any feature space that has a distance function. For example, the distance function for intensities in an image patch is typically just Euclidean, the distance for histograms is the non-linear χ2 distance, the distance for edge maps is the hausdorff distance, etc. An NP discriminator is specified by the feature space that it is constructed in and the number and locations of the prototypes in that feature space. A particularly simple implementation that we use in our work chooses the location of the prototypes from those derived from the set of training images rather than searching for the best prototypes in the whole feature space. This has the benefit of easy implementation even in feature spaces like histograms or edge maps where the distance function is non-linear and a search over the whole feature space may be infeasible. See § 5 for the details of our implementation of NP discriminators.
Our task is to learn a multi-class discriminator for objects of interest, given a training set S = {(x1, c1), (x2, c2), . . . , (xm, cm)}, where xi ∈ X are the training images and ci ∈ {1, . . . , l} are class labels. We assume that the objects are centered in the training images (see Fig. 4 for training examples from a discrimination task that we report in § 7). Let us assume that a discriminator h has been trained on S using some procedure. Recall that a discriminator partitions the image space X. In the usual formulation, at run-time on input x ∈ X a discriminator outputs the class labels cj of all images xj in the training set S that fall in the same subset of the image space partition induced by h that contains the input x. These class labels cj can then be post-processed and weighted by the relative number wj of training examples from each class that fall in the same partition as x to give a confidence value for each label. Alternatively, a simpler implementation would report the class label(s) associated with the largest number of training images that fall in the same partition as x.
In our work, we use an alternate formulation where the original multi-class discrimination task is reduced to an equivalent two-class problem as follows. This
formulation simplifies the development of the framework as well as allows a finer-grained selection of "good" discriminators compared with the original multi-class framework. Instead of thinking of a discriminator as outputting the class label(s) given an input image, we can think of a discriminator that determines if a given pair of input images belongs to the same class or different classes of objects. Formally, given a pair of images (xi, xj), the discriminating function h : X × X → {−1, +1} outputs the label h(xi, xj) = +1 if the pair belongs to the same partition induced by h, otherwise it outputs h(xi, xj) = −1. Under this formulation, the training set S gets transformed into a corresponding training set (which we will continue to denote as S) for the two-class problem:
S ≡ {((xi, xj), yij) | i, j = 1, . . . , m, yij ∈ {−1, +1}}    (1)
where the new class label yij is assigned +1 if the condition ci = cj holds for the original class labels, and −1 otherwise. After learning a discriminator h from this set, at run-time on input x, just as for the original multi-class formulation described above, the discriminator outputs the class labels cj corresponding to all the training examples xj for which h(x, xj) = +1, i.e. xj falls in the same partition as x. Again, the class labels can be weighted by the relative number wj of training examples from each class to give a confidence value for each label, or the class label(s) associated with the largest number of training examples xj can be reported. In the next section we consider the task of combining the outputs of a set of such simple discriminators under the maximum entropy framework.
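A short sketch of this reduction: the first helper builds the two-class set of image-pair assertions, and the second wraps a set of prototypes (in any feature space with a distance function) into a pair discriminator that answers whether two images fall in the same voronoi cell. Function names and the dictionary-style interfaces are ours.

import numpy as np
from itertools import combinations

def pairwise_training_set(images, labels):
    # Turn {(x_i, c_i)} into the two-class set {((x_i, x_j), y_ij)}: y_ij = +1 iff c_i = c_j.
    return [((images[i], images[j]), 1 if labels[i] == labels[j] else -1)
            for i, j in combinations(range(len(images)), 2)]

def np_pair_discriminator(prototypes, feature, dist):
    # h(x_i, x_j) = +1 if both images fall in the same voronoi cell of the prototypes, else -1.
    def cell(x):
        return int(np.argmin([dist(feature(x), p) for p in prototypes]))
    return lambda xi, xj: 1 if cell(xi) == cell(xj) else -1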
3 Maximum Entropy Framework for Combining Discriminators
We can think of the tuple ((xi , xj ), yij ) ∈ S as an assertion which states that the pair of images (xi , xj ) belongs to the same class (i.e., yij = +1) or in different classes (yij = −1). For a discriminator h, the expression f (xi , xj , yij ) = yij h(xi , xj ) can be considered as a test which determines if the assertion is true (f = +1) or not (f = −1) for the discriminator h. We will call f the testing function corresponding to the discriminator h. Suppose we wish to combine a given set of simple discriminators hk , k = 1, . . . , K. As discussed above, for each of the discriminators hk , we can derive a testing function fk that determines whether an assertion is true or not for the discriminator hk . Similarly, we would like to derive a testing function F for the combined discriminator H. Intuitively, F should combine the outputs of each of the fk on an input assertion to determine the truth of that assertion. Rather than derive a binary testing function F (with outputs in {−1, +1} as is the case for each of the fk ), we will seek a more general probability distribution p over the possible output values {−1, +1} that gives the likelihood that an input assertion is true or not given the outputs of the testing functions fk on that assertion. The likelihood will give a measure of the confidence in a prediction (truth or falsity of an assertion) made by the combined discriminator. Formally
for our purposes, we would like to derive a conditional probability distribution p(yij | (xi, xj)) that measures the likelihood that yij = +1 or yij = −1 given the outputs of fk, k = 1, . . . , K on any assertion ((xi, xj), yij).
What should the constraints be in selecting a probability measure p from the space of all probability measures? The outputs of the testing functions fk on all the assertions in the training set S provide one source of constraints. Obviously since S is finite and the space of probability measures is infinite, we need additional a priori constraints to properly constrain the selection of p. The maximum entropy (ME) framework [6,8] is based on the principle that the only constraint that we can reasonably impose on the probability measure p a priori is that it be consistent with the outputs of the tests fk over all training assertions for each k, but is otherwise "least-committed". The probability measure p is defined to be consistent with a test fk for a given k if a set of chosen statistics of fk over all possible assertions z ∈ X × X × {−1, +1} under the probability measure p matches the empirical statistics over the training set S. Intuitively, we want to select the probability measure that is at least in agreement with the training data. The simplest statistic is the mean. The empirical mean is defined by:
f̄k ≡ (1/|S|) Σ_{((xi,xj),yij)∈S} fk(xi, xj, yij).
Note that in principle one can impose additional constraints corresponding to higher order moments. But in practice, typically only the constraint for the mean is imposed since estimating the higher-order moments reliably requires more and more training data as the order of the moment increases.
Given these constraints, the ME framework searches for the "least committed" probability measure. Intuitively, the least committed measure under no constraints is the uniform probability measure. As we add constraints, we would like to keep the measure as close to uniform as possible while satisfying the constraints. More generally, we would like to be as close to a prior measure q0 that may not be uniform, and is task dependent but data independent. For our task, the distance between two conditional probability measures p and q can be measured by the following conditional Kullback-Leibler (KL) divergence [8]:
D(p, q) = (1/|S|) Σ_{(xi,xj)∈S} Σ_{yij∈{−1,+1}} p(yij | (xi, xj)) log [ p(yij | (xi, xj)) / q(yij | (xi, xj)) ]
which is non-negative and 0 iff p = q. Let M be the space of all conditional probability measures. Define the feasible set F ⊂ M as:
F ≡ {p ∈ M | Ep[fk] = f̄k for all k}
where Ep[·] denotes the expected value operator under p. Then the ME framework entails the solution of the following problem: minimize D(p, q0) subject to
p ∈ F and a fixed prior measure q0. In our task, we assume a uniform prior for q0. In this case it can be shown [7] that by setting up an appropriate Lagrangian, the optimal solution denoted by pME is the logistic function:
pME(yij | (xi, xj)) = 1 / (1 + exp(−Σk λk fk(xi, xj, yij)))    (2)
where the λk are the Lagrange multipliers, one for each of the constraints Ep[fk] = f̄k. These multipliers can be determined by minimizing the following objective function:
J(λ, f) = Σ_{z∈S} log(1 + exp(−⟨λ, f(z)⟩))    (3)
where ⟨·, ·⟩ denotes the vector dot product. The dual objective looks very similar to the exponential cost function that is minimized in boosting [2]. In fact it has been recently shown [7] how boosting can be derived from the ME framework by using slightly different constraints than those used here. The boosting framework was originally inspired by computational learning principles where the exponential cost function is a continuous upper bound to the true discrete error function for the boosting classifier. Here we have derived the objective to be minimized from first principles under the ME framework. Furthermore, the ME framework allows us to generalize the scheme in a principled manner to avoid over-fitting [1] if needed.
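The combined discriminator of Eqs. (2)-(3) is straightforward to fit numerically. The sketch below evaluates the testing functions on the pairwise training set, minimizes the convex objective J with a quasi-Newton solver, and evaluates pME for a new assertion; SciPy's L-BFGS-B is our stand-in for the unspecified optimizer, and the helper names are ours.

import numpy as np
from scipy.optimize import minimize

def assertion_features(S, discriminators):
    # f(z) for every assertion z = ((x_i, x_j), y_ij): f_k(z) = y_ij * h_k(x_i, x_j).
    return np.array([[y * h(xi, xj) for h in discriminators] for (xi, xj), y in S], float)

def objective_J(lam, F):
    # J(lambda) = sum_z log(1 + exp(-<lambda, f(z)>)), Eq. (3); logaddexp keeps it stable.
    return float(np.sum(np.logaddexp(0.0, -(F @ lam))))

def fit_lambdas(F):
    # Minimize the convex dual objective over the Lagrange multipliers.
    res = minimize(objective_J, x0=np.zeros(F.shape[1]), args=(F,), method="L-BFGS-B")
    return res.x

def p_me(lam, fk_values):
    # Eq. (2): probability that an assertion is true given the test outputs f_k.
    return 1.0 / (1.0 + np.exp(-float(np.dot(lam, fk_values))))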
4 Selecting Good Features
First we note a duality between the ME framework and the maximum likelihood (ML) framework over exponential measures. Consider the family of conditional exponential probability measures:
Q ≡ { p ∈ M | p(yij | (xi, xj)) ∝ q0(yij | (xi, xj)) exp(⟨λ, f(xi, xj, yij)⟩), λ ∈ R^K }.
Let p̃((xi, xj), yij) be the empirical distribution over the assertions determined by the training set S (since by construction each assertion in S is unique, p̃ will simply take the value 1/|S| over each assertion in S and 0 elsewhere). The log-likelihood L of a probability measure p with respect to p̃ is defined as:
L(p̃, p) ≡ −D(p̃, p).
It has been shown [8] that the probability measure pML that maximizes the likelihood over the exponential family Q is the same as pME. Thus the two optimization problems are dual to each other.
Now we consider the problem of selecting good discriminators. Assume that we are given a large but finite collection C of discriminators hk, k = 1, . . . , N. This is the case in our work where we use nearest prototype discriminators for which the prototypes are chosen from the finite set of prototypes in the training
set (see § 5). We wish to choose K ≪ N discriminators from this collection that give the best combined discriminator. In practice, K will be determined for example from run-time performance considerations. We can formulate two criteria for choosing the best K discriminators, one for each of the two frameworks.
Under the more familiar ML framework [8], for a fixed family of probability measures Q, the criterion is to choose the K discriminators that maximize the likelihood function L(p̃, p), p ∈ Q. Formally the ML criterion can be stated as follows:
Criterion ML. For a fixed choice of K discriminators {h1, . . . , hK} ⊂ C, let p∗ be the probability measure that maximizes the likelihood, p∗ = argmax_{p∈Q} L(p̃, p). Choose the K discriminators for which L(p̃, p∗) is maximum over all choices of K discriminators from C.
Under the ME framework on the other hand, the authors in [10] propose the use of the mini-max entropy criterion. The context of their work was selection of good features for texture synthesis. In their original formulation, the criterion assumes a uniform prior model q0 and chooses the K features such that the resulting maximum entropy probability measure pME has minimum entropy over all choices of K discriminators. This criterion might seem less intuitive at first than the ML criterion. It is based on the notion that the entropy of the probability measure determined by a given choice of K discriminators indicates how "informative" the discriminators are, the discriminators being more informative the lower the entropy. Thus the mini-max entropy criterion chooses the K most informative discriminators. Since minimizing (maximizing) the entropy of a distribution p is the same as maximizing (minimizing) the KL divergence D(p, q0) where q0 is set to the uniform distribution, the original mini-max entropy criterion can be generalized for arbitrary priors q0 and formally stated as follows:
Criterion ME. For a fixed choice of K discriminators {h1, . . . , hK} ⊂ C, let p∗ be the maximum entropy probability measure with constraints determined by the corresponding testing functions f1, . . . , fK, i.e. p∗ = argmin_{p∈F} D(p, q0). Choose the K discriminators for which D(p∗, q0) is maximum over all choices of K discriminators from C.
We show in the following theorem that due to the duality between the ME and ML frameworks, these two seemingly different criteria are in fact the same when the ML criterion is applied to the exponential family Q:
Theorem 1. A set of K features optimizes the ME criterion iff it also optimizes the ML criterion for the exponential family.
Proof. We first state an analogue of the Pythagorean theorem for the KL divergence [8]:
D(p, q) = D(p, p∗) + D(p∗, q), for all p ∈ F, q ∈ Q̄,
where Q̄ is the closure of Q, and by the duality theorem [8],
pML = argmin_{p∈Q̄} D(p̃, p) = p∗ = argmin_{p∈F} D(p, q0) = pME
We set p = p̃, the empirical distribution from the training set S, and q = q0, a prior measure, both of which are fixed for a given learning task, and thus D(p̃, q0) is constant. Also, since the log-likelihood is given by L(p̃, p) = −D(p̃, p), we have:
L(p̃, p∗) = D(p∗, q0) + const.
Thus choosing the K discriminators that maximize the likelihood L(p̃, p∗) also maximizes the KL distance D(p∗, q0), and thus the K discriminators that optimize the ML criterion also optimize the ME criterion and vice-versa.
In practice, instead of searching for the best K discriminants all at once, we select each discriminant in a greedy fashion, one at a time. At iteration k, we have k − 1 < K discriminants that have been chosen in the previous iterations. We then choose to add the discriminant that, along with the previous k − 1 discriminants, gives the biggest gain in the ML or ME criteria. Concretely, this means we choose the discriminant that along with the previous discriminants minimizes the objective function J in (3). Optimization of J is a convex optimization problem in λ1, . . . , λk that can be done using standard numerical techniques [9].
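Reusing assertion_features, fit_lambdas, and objective_J from the earlier sketch, the greedy selection just described can be written as the following loop; exhaustively scoring every remaining candidate at each step is the naive version (the sampling shortcut of Section 5 replaces the inner scan), and the function name is ours.

def greedy_select(S, candidates, K):
    # At each step, add the candidate that (refit jointly with those already chosen)
    # yields the smallest value of the objective J from Eq. (3).
    chosen = []
    for _ in range(K):
        best_score, best_h = None, None
        for h in candidates:
            if h in chosen:
                continue
            F = assertion_features(S, chosen + [h])
            score = objective_J(fit_lambdas(F), F)
            if best_score is None or score < best_score:
                best_score, best_h = score, h
        chosen.append(best_h)
    return chosen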
5 Nearest Prototype Discriminators
In this section, we describe in more detail the particular type of discriminators that we use in our work. As mentioned in the introduction, nearest prototype discriminators are easy to implement and widely applicable. Such a discriminator is specified by the feature space the prototypes come from (color, histograms, shape, etc.), as well as the number and locations of the prototypes. For example, in the experiments that we report later, we use edge maps of images as one of the features. In this case, a prototype in the discriminator will be an edge map. The image space will be partitioned by the set of prototypes in the discriminator, where each partition corresponds to the subset of the image space that is closest to one of the prototypes. The distance function that determines this partition is the hausdorff distance between edge maps.
If we restrict the choice of the locations of the prototypes to be those in the training images, we get a finite collection of discriminators C to choose from: for each feature space, given m training images, there are a total of O(rm) discriminators with r prototypes. At each iteration of the greedy search, we can in principle exhaustively search C for the best discriminator (that which minimizes J the most along with the other discriminators that were chosen in the previous iterations). In practice, this does not scale well when m is large. Instead, we use a simple sampling technique that trades off the quality of the discriminator found for a speed-up in the search process. Rather than searching for the best discriminator at each iteration, we will search for a discriminator in the top s percentile (for example s = 0.1%), where the discriminators are ranked according to how much they minimize J. Under this search strategy, we can show that with high probability we can find a discriminator in the top s percentile by uniformly
sampling the finite set of discriminators C a fixed number of times n that is independent of the total number m of training examples and the number of prototypes r required. More precisely, let 0 < δ < 1; then it can be shown that we need to sample

n > log(1/δ) / s

discriminators such that at least one of them is in the top s percentile with probability at least 1 − δ.
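As a concrete illustration of this sampling scheme, the sketch below computes the required number of draws and keeps the best of the sampled discriminators under the objective J. The function names and data structures are assumptions made for the example, not the authors' implementation.

```python
import math
import random

def required_samples(s, delta):
    """Number of uniform draws so that, with probability >= 1 - delta,
    at least one sampled discriminator lies in the top-s fraction."""
    return math.ceil(math.log(1.0 / delta) / s)

def sample_good_discriminator(candidates, objective_J, s=0.001, delta=0.05):
    """Draw n candidates uniformly from the finite collection C and keep the
    one with the smallest J; objective_J scores a candidate (lower is better),
    e.g. the convex objective evaluated with the discriminators already chosen."""
    n = required_samples(s, delta)
    drawn = [random.choice(candidates) for _ in range(n)]
    return min(drawn, key=objective_J)
```

For s = 0.1% and δ = 0.05 this gives roughly 3,000 draws regardless of how many training images or prototypes there are, which is the point of the bound.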
6 Efficient Organization of Discriminators
For run-time performance, we would like to combine the discriminators that we choose in an efficient structure. For this purpose, we adapt the work on "alternating trees", which is a generalization of decision trees (see Fig. (2)). The salient feature that distinguishes alternating trees from regular decision trees is that a node in an alternating tree can have multiple decision nodes as children. The term "alternating" refers to alternating levels of two types of nodes: prediction nodes, which in our task predict whether two images belong to the same class or not, and discriminator nodes, which correspond to the discriminators that we choose. A predictor node is associated with a subset U of the image space X that passes through the sequence of discriminators from the root to the predictor node. In other words, the subset U is determined by the conjunction of discriminators from the root to the predictor node. We can think of the rest of the image space X − U as the subset that the predictor node "abstains" from. The root node of the alternating tree is a predictor node associated with the entire image space X. A predictor node can have multiple discriminator nodes as children. Each discriminator node partitions the image space subset U associated with its parent predictor node. In turn, a discriminator node has predictor nodes as children, each of which corresponds to a subset of the image space in the partition induced by the parent discriminator node. The possibility of partitioning the image space in multiple ways gives the alternating tree more flexibility compared with standard decision trees. In fact, by constraining each predictor node to have at most one discriminator node as a child, we get a standard decision tree after combining each predictor node with its sole discriminator child (if any). In order for an alternating tree of discriminators to be incorporated under the maximum entropy framework, we need to derive the testing functions fk for each discriminant hk in the tree. Recall from § 2 that the testing function f is derived from the partition induced by a discriminator h: if h(xi, xj) denotes whether two images are in the same (h(xi, xj) = +1) or in different (h(xi, xj) = −1) partitions induced by the discriminator, then f(xi, xj, yij) = yij h(xi, xj) tests whether the assertion ((xi, xj), yij) is true for discriminator h. The partition associated with a discriminator hk in an alternating tree is the partition induced on the subset U of image space associated with its parent predictor node, together with the complement subset X − U that the parent abstains from. Concretely, if
Fig. 2. Alternating Trees. The tree alternates between prediction nodes (ellipses) and discriminator nodes (boxes). Each prediction node is associated with a subset U of the image space (marked by ×), which is partitioned by any discriminator child of the prediction node. A prediction node outputs +1 when a pair of images are both in the subspace U associated with the node or both in the complement X − U; otherwise it outputs −1. Each prediction node can have multiple discriminator nodes as children, each of which partitions the subset U of image space associated with its parent predictor node.
the discriminator hk partitions U into U1 and U2 for example, then the partition {X − U, U1, U2} is associated with hk, from which the corresponding testing function fk is derived.
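To make the structure concrete, one possible representation of the two node types is sketched below; the class names and methods are illustrative assumptions rather than the authors' code. A prediction node outputs +1 for a pair of images exactly when both fall on the same side of the U / X − U split, matching the description in Fig. 2.

```python
class PredictionNode:
    """Associated with a subset U of image space; children are discriminator nodes."""
    def __init__(self, membership_fn):
        self.in_U = membership_fn          # maps an image x to True iff x lies in U
        self.children = []                 # zero or more DiscriminatorNode objects

    def predict(self, x_i, x_j):
        # +1 if both images are in U or both are in the complement X - U
        return +1 if self.in_U(x_i) == self.in_U(x_j) else -1


class DiscriminatorNode:
    """Partitions the parent's subset U; children are prediction nodes, one per cell."""
    def __init__(self, discriminator, weight):
        self.h = discriminator             # h(x_i, x_j) in {+1, -1}
        self.weight = weight               # Lagrange multiplier lambda_k
        self.children = []                 # PredictionNode objects for the cells of the partition
```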
6.1 Summary
We now outline the iterative scheme for selecting nearest prototype discriminators in a greedy manner while composing them in an alternating tree. Assume that we can extract a variety of features from images (intensity, edge maps, histograms, etc.). Thus each image belongs to a set of feature spaces g ∈ G. Denote by C(g, S) the finite collection of nearest prototype discriminators constructed in feature space g, where the locations of the prototypes are chosen from those in the training set S. Let h∗ (g, S) ∈ C(g, S) be the nearest prototype discriminant optimizing the objective function J by the sampling scheme described in § 5. The alternating tree is initialized to a predictor node that corresponds to the whole image space X. At the start of iteration k, let T be the alternating tree constructed so far in the previous iterations. Let Si ⊂ S denote the subset of training examples that reach the predictor node Pi in T . At iteration k, we choose the discriminant h∗ that minimizes the objective function J from among the set of all h∗ (g, Si ) over all choices of feature space g ∈ G and training sets Si
associated with each predictor node Pi in the tree. Note that since a predictor node Pi can have multiple children, each predictor node will participate in all iterations, unlike the case for a standard decision tree where only the current leaf nodes are considered. At the end of a fixed number of iterations, we output the final alternating tree with discriminators hk along with the optimal Lagrange multipliers λk. The corresponding testing functions fk along with λk determine the maximum entropy conditional probability distribution pME(yij | (xi, xj)) given by (2). pME can be used to determine if an input image xi belongs to the same class of objects as that of a training example xj as follows. Define the log-ratio for a pair of images (xi, xj) with respect to the probability distribution pME as:

ρ(xi, xj) ≡ (1/2) log [ pME(yij = +1 | (xi, xj)) / pME(yij = −1 | (xi, xj)) ] = Σ_k λk hk(xi, xj)        (4)
The input image xi is predicted to be in the same class as the training image xj if ρ(xi, xj) > 0; otherwise they are predicted to belong to different classes. The magnitude of the log-ratio can be thought of as the margin or confidence in making the prediction. Using this observation, on input xi we output the class label cj of the training example xj that is predicted to be in the same class as xi (ρ(xi, xj) > 0) and whose margin is largest (|ρ(xi, xj)| is maximum over all xj). Figure 3 gives the pseudo-code for the whole scheme.

Initialize:
1. Initialize the alternating tree T with a root predictor node.
2. Let G be the collection of feature spaces in which discriminators will be constructed.
3. Let Si ⊂ S denote the training set that reaches a predictor node Pi ∈ T from the root.

do for K iterations
1. For each g ∈ G and each Si corresponding to predictor node Pi ∈ T, find the optimal discriminant h∗(g, Si) using the sampling scheme in § 5.
2. Add the best discriminator h∗ that minimizes J from among all h∗(g, Si) to the alternating tree T.

Output: The discriminating rule (4) with the Lagrange multipliers λk that optimize J.

Fig. 3. Pseudo-code for the iterative greedy scheme.
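As an illustration of the decision rule in (4), the following sketch scores an input image against each training example and returns the label of the most confident same-class match. The container formats (a list of (hk, λk) pairs and a list of labeled training examples) are assumptions for the example only.

```python
def log_ratio(weighted_discriminators, x_i, x_j):
    """rho(x_i, x_j) = sum_k lambda_k * h_k(x_i, x_j), cf. (4)."""
    return sum(lam * h(x_i, x_j) for h, lam in weighted_discriminators)

def classify(weighted_discriminators, x_input, labeled_training_examples):
    """Return the label of the training example predicted to share x_input's class
    with the largest margin, or None if no positive match exists."""
    best_label, best_margin = None, 0.0
    for x_j, label_j in labeled_training_examples:
        rho = log_ratio(weighted_discriminators, x_input, x_j)
        if rho > best_margin:           # same-class prediction with larger confidence
            best_label, best_margin = label_j, rho
    return best_label
```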
Fig. 4. Example training (top row) and testing (middle row) views of 5 objects. For each object 12 training views were taken in a circle around the object, and 24 testing views were taken with varying clutter and occlusion. The bottom row shows more testing examples for one of the objects.
7 Results
We illustrate our approach on a multi-class discrimination task, where we show the benefits of combining simple discriminants from different feature spaces (edge maps and intensity histograms). In our task, we have five objects that need to be discriminated from each other (see Fig. (4)). For each object, we collected 12 grey-value training images spanning the viewpoints around the object, which was placed on a desk. The training images have the objects at the center and are taken at a fixed distance from the camera. For our experiments, nearest prototype discriminators constructed from the following two types of feature spaces are combined to construct the alternating tree during training:

Edge Maps. Edge contours are quite insensitive to some illumination and viewpoint changes. Two edge maps can be compared using the robust Hausdorff measure [5]. The Hausdorff distance h(A, B) between two edge maps A and B is given by:

h(A, B) ≡ max ( max_{a∈A} min_{b∈B} ||a − b|| ,  max_{b∈B} min_{a∈A} ||a − b|| )
The distance is made robust to occlusion and clutter by replacing, in the above expression, the maximum distance over all points in one set to the closest neighbour in the other set by the distance at some percentile. In our experiments we use the 75th percentile.

Second-Order Histograms. A second-order or correlation histogram [4] measures the distribution of grey-value pixel intensities while taking into account the relative positions of the pixels. Formally, the histogram Cτ(g, g′) counts the number of pixel pairs in a subwindow centered around the object (80×80 in our experiments) that are a certain distance τ apart and have intensities g and g′, respectively. The intensities in our experiments are quantized to 15 levels. A second-order histogram is more informative than a simple histogram of pixel intensities due to the inclusion of the relative positions of the pixels. In our experiments τ is set to 5 or 10 pixels, either horizontally or vertically. Two histograms can be compared by the χ² distance:

χ²(C1, C2) = Σ_{b∈bins} (C1(b) − C2(b))² / (C1(b) + C2(b)) ,

where b runs over the set of bins in a histogram.
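For concreteness, the sketch below implements the two comparisons just described, the percentile (robust) Hausdorff distance between edge-point sets and the χ² distance between histograms, using NumPy. It is an illustrative sketch following the definitions above, not the code used in the experiments.

```python
import numpy as np

def partial_hausdorff(A, B, percentile=75):
    """Robust (percentile) Hausdorff distance between two edge maps,
    each given as an (n, 2) array of edge-point coordinates."""
    def directed(P, Q):
        # for each point in P, distance to its nearest neighbour in Q,
        # then take the given percentile instead of the maximum
        d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2).min(axis=1)
        return np.percentile(d, percentile)
    return max(directed(A, B), directed(B, A))

def chi_square(C1, C2, eps=1e-12):
    """Chi-square distance between two (flattened) histograms."""
    C1, C2 = np.asarray(C1, float).ravel(), np.asarray(C2, float).ravel()
    return float(np.sum((C1 - C2) ** 2 / (C1 + C2 + eps)))
```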
In general, we can consider other types of feature spaces such as color, shape from shading, etc.

For testing, we took 24 images of each object under varying levels of occlusion, for a total of 120 testing images (see Fig. (4) for examples). In general, a complete recognition system would typically be divided into two stages: an object detection stage, which might be based on grouping principles or other pattern recognition techniques, would first identify regions of the image that might possibly contain objects of interest, and a second stage in which these regions are passed to a discriminator for the identification of a specific object. This is the approach, for example, in face recognition, where a general face detector can first identify regions in the input image that contain a face, which can then be recognized by a separate module. An object detection stage is beyond the scope of this paper, and so in our work we assume that the objects of interest are roughly centered in the test images. Note that the testing images still have significant clutter and occlusion.

Fig. (5) shows the training and testing error versus the number of discriminators in the alternating tree (or, equivalently, the iteration number in the greedy selection scheme). In practice, the number of discriminators in the tree will be dictated by the trade-off between the run-time performance and the desired testing error. For both testing and training, three curves are shown, based on using discriminators from both feature spaces described above ("hausdorff+histogram" in the plot) or using discriminators from just one of the feature spaces ("hausdorff" and "histogram" in the plot). As expected, using discriminators from multiple feature spaces can significantly enhance the performance of the algorithm. In our experiment, at the end of 50 iterations the testing error was reduced from 50.42% when using only edge maps, or 30.25% when using only intensity histograms, to 13.45% when using both. Most of the error is due to one of the objects (the phone), which does poorly with discriminators from either of the feature spaces.
Fig. 5. Training (a) and testing (b) error vs. number of discriminators in the alternating tree. The testing error when using discriminators from both edge maps and histograms (“hausdorff+histogram”) is significantly better (13.45% error for 50 nodes in the tree) than when using discriminators from either feature space alone (“hausdorff” at 50.42% error and “histogram” at 30.25% error).
It is quite possible that other feature spaces that we have not considered could bring the error down even further. We note that in our approach it is conceptually no more difficult to combine discriminants from multiple feature spaces than it is to combine discriminants from just one feature space. This is one of the advantages inherent in our approach compared with other approaches, such as neural networks, that have difficulty working with disparate feature spaces.
8 Conclusion
We have shown how the maximum entropy framework can be used to combine simple discriminants from a variety of feature spaces in a principled manner. The nearest prototype discriminator is simple to implement and can be constructed in any feature space with a corresponding distance function. We have shown that two criteria for selecting discriminants - one in the maximum entropy framework and the other in the dual maximum likelihood framework - are in fact equivalent. For run-time performance, we have adapted the work on alternating trees for combining simple discriminators into an efficient structure. Experiments on a multi-class discrimination task validated the approach and showed significant gains by combining simple discriminants constructed in different feature spaces.
References

1. Chen, S., Rosenfeld, R.: A survey of smoothing techniques for ME models. IEEE Transactions on Speech and Audio Processing 8(1) (2000)
2. Freund, Y., Schapire, R.E.: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. J. of Computer and System Sciences 55(1) (1997) 119–139
3. Freund, Y., Mason, L.: The Alternating Decision Tree Algorithm. ICML99 (1999) 124–133
4. Haralick, R.M.: Statistical and Structural Approaches to Texture. Proc. 4th Intl. Joint Conf. Pattern Recognition (1979) 45–60
5. Huttenlocher, D., Klanderman, G., Rucklidge, W.: Comparing Images Using the Hausdorff Distance. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(9) (1993) 850–863
6. Jaynes, E.T.: Information theory and statistical mechanics. Physical Review 106 (1957) 620–630
7. Lebanon, G., Lafferty, J.: Boosting and Maximum Likelihood for Exponential Models. Advances in Neural Information Processing Systems 14 (2001)
8. Della Pietra, S., Della Pietra, V., Lafferty, J.: Inducing features of Random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(4) (1997)
9. Schapire, R. E., Singer, Y.: Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3) (1999) 297–336
10. Zhu, S. C., Wu, Y., Mumford, D.: Filters, Random Fields and Maximum Entropy (FRAME). IJCV 27(2) (1998) 1–20
Probabilistic Search for Object Segmentation and Recognition

Ulrich Hillenbrand and Gerd Hirzinger

Institute of Robotics and Mechatronics, German Aerospace Center/DLR
P.O. Box 1116, 82230 Wessling, Germany
[email protected]
Abstract. The problem of searching for a model-based scene interpretation is analyzed within a probabilistic framework. Object models are formulated as generative models for range data of the scene. A new statistical criterion, the truncated object probability, is introduced to infer an optimal sequence of object hypotheses to be evaluated for their match to the data. The truncated probability is partly determined by prior knowledge of the objects and partly learned from data. Some experiments on sequence quality and object segmentation and recognition from stereo data are presented. The article recovers classic concepts from object recognition (grouping, geometric hashing, alignment) from the probabilistic perspective and adds insight into the optimal ordering of object hypotheses for evaluation. Moreover, it introduces point-relation densities, a key component of the truncated probability, as statistical models of local surface shape.
1 Introduction
Model-based object recognition or, more generally, scene interpretation can be conceptualized as a two-part process: one that generates a sequence of hypotheses on object identities and poses, the other that evaluates them based on the object models. Viewed as an optimization problem, the former is concerned with the search sequence, the latter with the objective function. Usually the evaluation of the objective function is computationally expensive. A reasonable search algorithm will thus arrive at an acceptable hypothesis within a small number of such evaluations. In this article, we will analyze the search for a scene interpretation from a probabilistic perspective. The object models will be formulated as generative models for range data. For visual analysis of natural scenes, that is, scenes that are cluttered with multiple, non-completely visible objects in an uncontrolled context, it is a highly non-trivial task to optimize the match of a generative model to the data. Local optimization techniques will usually get stuck in meaningless, local optima, while techniques akin to exhaustive search are precluded by time constraints. The critical aspect of many object recognition problems hence concerns the generation of a clever search sequence.
In the probabilistic framework explored here, the problem of optimizing the match of generative object models to data is alleviated by the introduction of another statistical match criterion that is more easily optimized, albeit less reliable in its estimates of object parameters. The new criterion is used to define a sequence of hypotheses for object parameters of decreasing probability, starting with the most probable, while the generative model remains the measure for their evaluation. For an efficient generation of the search sequence, it is desirable to make the new criterion as simple as possible. On the other hand, for obtaining a short search sequence for acceptable object parameters, it is necessary to make the criterion as informative as is feasible. As a key quantity for sequence optimization, an estimate of features’ posterior probability enters the new criterion. The method is an alternative to other, often more heuristic strategies aimed at producing a short search sequence, such as checking for model features in the order of the features’ prior probability [7,12], in a coarse-to-fine hierarchy [6], or as predicted from the current scene interpretation [4,8]. Classic hypothesize-and-test paradigms have been demonstrated in RANSAC-like [5,19] and alignment techniques [9,10]. The method proposed here is more similar to the latter in that testing of hypotheses is done with respect to the full data rather than on a sparse feature representation. It significantly differs from both, however, in that it recommends a certain order of hypotheses to be tested, which is indeed the main point of the present study. Other classic concepts that are recovered naturally from the probabilistic perspective are feature-based indexing and geometric hashing [13,20], and feature grouping and perceptual organization [14,15,11]. To keep the notational load in this article to a minimum and to support ease of reading, we will denote all probability densities by the letter p and indicate each type of density by its arguments. Moreover, we will not introduce symbols for random variables but only for the values they take. It is understood that probability densities become probabilities whenever these values are discrete.
2 Generative Object Models for Range Data
Consider the task of recognizing and localizing a number of objects in a scene. More precisely, we want to estimate object parameters (c, p) from data, where c ∈ IN is the object's discrete class label and p ∈ IR6 are its pose parameters (rotation and translation). Suppose we use a sensor like a set of stereo cameras or a laser scanner to obtain range-data points d ∈ IR3 of one view of the scene, that is, 2 + 1/2 dimensional (2 + 1/2D) range data. A reasonable generative model of the range-data points within the volume V(c, p) of the object c with pose p is given by the conditional probability density

p(d|c, p) = { N(c) f[φ(d; c, p)]   for d ∈ S(c, p),
            { N(c) b               for d ∈ V(c, p)\S(c, p).        (1)

The condition is on the object parameters (c, p). The set S(c, p) ⊂ V(c, p) is the region close to visible surfaces of the object c with pose p, b > 0 is a constant that
describes the background of spurious data points, and N(c) is a normalization constant. The surface density of points is described by a function f of the angle φ(d; c, p) between the direction of gaze of the sensor and the object's inward surface normal at the (close) surface point d ∈ S(c, p). The function f takes its maximal value, which we may set to 1, at φ = 0, i.e., on surfaces orthogonal to the sensor's gazing direction.

Four comments are in order. First, the normalization constant N(c) generally depends upon both c and p. However, we here neglect its dependence on the object pose p. The consequence is that objects will be harder to detect if they expose less surface area to the sensor. Second, the extension of the set S(c, p) has to be adapted to the level of noise in the range data. Third, points d ∈ S(c, p) that are close to but not on the surface are assigned the surface normals of close surface points. Fourth, given the object parameters (c, p), the data points d ∈ V(c, p) can be assumed statistically independent with reasonable accuracy.

Our task is to estimate the parameters c and p by optimizing the match of the generative model (1) to the range data D. Because of the conditional independence of the data points in V(c, p), this means maximizing the logarithmic likelihood

L(c, p; D) := Σ_{d ∈ D∩V(c,p)} L(c, p; d) := Σ_{d ∈ D∩V(c,p)} ln p(d|c, p)        (2)

with respect to (c, p) ∈ Ω ⊂ IN × IR6. A computationally efficient version is obtained by assuming that φ ≪ 1, which is true for patches of surface that are approximately orthogonal to the direction of gaze. Fortunately, such parts of the surface contribute most data points; cf. (1). Observing that

ln f(φ) = −a φ² + O(φ⁴) = 2a (cos φ − 1) + O(φ⁴) ,        (3)

with some constant a > 0, we may neglect terms of order O(φ⁴) to obtain

L(c, p; d) ≈ { 2a [n(d; c, p) · g − 1] + ln N(c)   for d ∈ S(c, p),
             { ln b + ln N(c)                      for d ∈ V(c, p)\S(c, p),        (4)
where n(d; c, p) is the object's inward surface-normal vector at the (close) surface point d ∈ S(c, p) and g is the sensor's direction of gaze; both are unit vectors. The constant ln b < 0 is effectively a penalty term for data points that come to lie in the object's interior under the hypothesis (c, p). Again, discarding the terms O(φ⁴) in (3) is acceptable as data points producing a large error will be rare; cf. (1).
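To illustrate how the approximate likelihood (4) would be evaluated for a single object hypothesis, here is a minimal sketch. The object-model interface (volume test, surface test, normal lookup) is an assumption introduced for the example and is not specified in the paper.

```python
import numpy as np

def log_likelihood(points, gaze, model, a=1.0, b=0.01, log_N=0.0):
    """Approximate log-likelihood L(c, p; D) of an object hypothesis, cf. (2) and (4).
    'model' is assumed to provide in_volume(d), near_surface(d), inward_normal(d)
    for the hypothesized object (c, p); 'gaze' is the sensor's unit gaze direction g."""
    L = 0.0
    for d in points:
        if not model.in_volume(d):
            continue                      # only points inside V(c, p) contribute
        if model.near_surface(d):
            n = model.inward_normal(d)    # unit inward surface normal at (close) point d
            L += 2.0 * a * (np.dot(n, gaze) - 1.0) + log_N
        else:
            L += np.log(b) + log_N        # penalty for points in the object's interior
    return L
```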
3 Probabilistic Search for Model Matches
In this section, we derive the probabilistic search algorithm. We first introduce another statistical criterion function of object parameters (c, p). The new criterion, to be called truncated probability (TP), defines a sequence of (c, p)hypotheses. The hypotheses are evaluated by the likelihood (2), starting with
the most probable (c, p)-value and proceeding to gradually less probable values, as estimated by the TP. The search terminates as soon as L(c, p; D) is large enough. We will obtain the TP from a series of steps that simplify from the ideal criterion, the posterior probability of object parameters (c, p). Although this derivation cannot be regarded as a thoroughly controlled approximation, it will make explicit the underlying assumptions and give some feeling for how far we have to depart from the ideal to arrive at a feasible criterion. The TP makes use of the probability of the presence of feature values consistent with object parameters (c, p). It is thus an essential property of the approach to treat features as random variables.
3.1 The Search Sequence
Let us define point features as surface points centered on certain local, isolated surface shapes. Examples of such point features are corners, saddle points, points of locally extreme surface curvature, etc. They are characterized by a pair (s, f), where s ∈ IN is the feature's shape-class label and f ∈ IR3 is its location. Consider now a random variable that takes values (s, f) of point features that are related to the objects sought in the data. Its values are statistically dependent upon the range data D. Let us restrict the possible feature values (s, f) to s ∈ {1, 2, . . . , m} and f ∈ D, such that only data points can be feature locations. This is equivalent to setting the feature-value probability to 0 for values (s, f) with f ∉ D. The restriction is not really correct, as we will lose true feature locations between data points. However, it will greatly facilitate the search by limiting features to discrete values that lie close to the true object surfaces and, hence, include highly probable candidates. Let us introduce the concept of groups of point features. A feature group is a set Gg = {(s1, f1), (s2, f2), . . . , (sg, fg)} of feature values with fi ≠ fj for i ≠ j. A group Gg hence contains simultaneously possible values of g features. The best knowledge we could have about the true values of the object parameters (c, p) is encapsulated in their posterior-probability density given the data D, p(c, p|D). However, we usually do not know this density, and if we knew it, its maximization would pose a problem similar to our initial one of maximizing the generative model (2). We can nonetheless expand it using feature groups,

p(c, p|D) = Σ_{Gg} p(c, p|D, Gg) p(Gg|D) ,     Σ_{Gg} p(Gg|D) = 1 .        (5)
The summations are over all possible groups Gg of fixed size g. Their enumeration is straightforward but tedious to explicate, so we omit it here. Note that the expansion (5) is only valid if we can be sure to find g true feature locations among the data points. Let us now simplify the density (5) to the density

pg(c, p; D) := Σ_{Gg} p(c, p|Gg) p(Gg|D) .        (6)
Unlike the posterior density (5), pg depends upon the feature-group size g. Formally, pg is a Bayesian belief network with a g-feature probability distribution at the intermediate node. Maximization of pg with respect to (c, p) is still too difficult a task, as all possible feature-group values contribute to all values of (c, p)-density. A radically simplifying step thus takes us naturally to the density

qg(c, p; D) := max_{Gg} p(c, p|Gg) p(Gg|D) ,        (7)
where for each (c, p) only the largest term in the sum of (6) contributes. Note that qg(c, p; D) ≤ pg(c, p; D) for all (c, p), such that qg is not normalized on the set Ω of object parameters (c, p). This density will nonetheless be useful for our purpose of guiding a search through Ω, as there only relative densities matter. As pointed out above, for all this to make sense, we need g true feature locations among the data points. To be safe, one could be tempted to set g = 1. However, the simplifying step from density pg to density qg suggests that the latter will be more informative as to the true value of (c, p) if the sum in (6) is dominated for high-density values of (c, p) by only few terms. Now, fewer than three point features do not generally define a finite set of consistent object parameters (c, p). The density p(c, p|Gg) will hence not exhibit a pronounced maximum for g < 3. High-density points of pg arise then from accumulation over many g-feature values, i.e., terms in (6), and are necessarily lost in qg. Groups of size g ≥ 3, on the other hand, do define finite sets of consistent (c, p)-values, and p(c, p|Gg) has infinite density there. Altogether, feature groups of size g = 3 seem to be a good choice for our search procedure; see, however, the discussion in Sect. 5. Let us now introduce the logarithmic truncated probability (TP),

TP(c, p; D) := lim_{ε→0} ln ∫_{Sε(p)} dp′ q3(c, p′; D) ,        (8)

where Sε(p) is a sphere in pose space centered on p with radius ε > 0. The integral and limit are needed to pick up infinities in the density q3(c, p; D); see below. The TP is truncated in a double sense: finite contributions from q3 are truncated and q3 itself is obtained from truncating the sum in (6). According to the discussion above, it is expected that (c, p)-values of high likelihood (2) will mostly yield a high TP (8). Our search thus proceeds by evaluating, in that order,

L(c1, p1; D), L(c2, p2; D), . . .    with    TP(c1, p1; D) ≥ TP(c2, p2; D) ≥ . . . ,    (c1, p1) = argmax_{(c,p)∈Ω} TP(c, p; D) .        (9)
The search stops as soon as L(ck, pk; D) > Θ or all (c, p)-candidates have been evaluated. In the former case, the object identity ck and object pose pk are returned as the estimates. In the latter case, it is inferred that none of the objects sought is present in the scene. Alternatively, if we know a priori that one of the objects must be present, we may pick the object parameters that have
scored highest under L in the whole sequence. The algorithm will be formulated in more detail in Sect. 3.3.

In the density (7) that guides the search, one of the factors is the density of the object parameters conditioned on the values G3 of a triple of point features. This is explicitly

p(c, p|G3) = { Σ_{i=1}^{h(G3)} γi(G3) δ_{c,Ci(G3)} δ[p − Pi(G3)] + [1 − Σ_{i=1}^{h(G3)} γi(G3)] ρ1(c, p; G3)   for G3 ∈ C,
             { ρ2(c, p; G3)                                                                                  for G3 ∉ C.        (10)

Here δ_{c,c′} is the Kronecker delta (elements of the unit matrix) and δ(p − p′) is the Dirac delta distribution on pose space; C is the set of feature-triple values consistent with any object hypotheses; h(G3) is the number of object hypotheses consistent with the feature triple G3; (Ci(G3), Pi(G3)) are the consistent (c, p)-hypotheses; γi(G3) ∈ (0, 1) are probability-weighting factors for the hypotheses. Generally we have

Σ_{i=1}^{h(G3)} γi(G3) < 1 ,        (11)

which leaves a non-vanishing probability 1 − Σi γi > 0 that three consistent features do not all belong to one of the objects sought. Accordingly, the density p(c, p|G3) on Ω contains some finite, spread contribution ∝ ρ1(c, p; G3), a "background" density, in the consistent-features case. In the inconsistent-features case, the density is similarly spread on Ω and given by some finite term ρ2(c, p; G3). Clearly, only the values (Ci, Pi) of object parameters where the density (10) is infinite will be visited during the search (9). The functions γi(G3), Ci(G3), Pi(G3) are computed by generalized geometric hashing. Let G3 = {(s1, f1), (s2, f2), (s3, f3)}. As key into the hash table we use the pose invariants

(s1, s2, s3, ||f1 − f2||, ||f2 − f3||, ||f3 − f1||) ,        (12)
appropriately shifted, scaled, and quantized. Here || · || denotes the Euclidean vector norm. The poses Pi (G3 ), i = 1, 2, . . . , h(G3 ), are computed from matching the triple G3 to h(G3 ) model triples from the hash table. Small model distortions that result from an error-tolerant match are removed by orthogonalization of the match-related pose transforms. The weight γi (G3 ) and the object class Ci (G3 ) are drawn along with each matched model triple. The weights γi (G3 ) can be adapted during a training period and during operation of the system in an unsupervised manner. One simply needs to count the number of instances where a feature triple G3 draws a correct set of object parameters (Ci , Pi ), i.e., one with L(Ci , Pi ; D) > Θ. The adapted γi (G3 ) represent empirical grouping laws for the point features. Since it will often be the case that γi (G3 ) ∝ 1/h(G3 ) for all i = 1, 2, . . . , h(G3 ), it turns out that hypotheses drawn by feature triples less common among the sought objects tend to be visited before the ones drawn by more common triples during the search (9).
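A sketch of the hash-key computation (12) is given below; the quantization step and the key layout are illustrative assumptions rather than values used by the authors, and in practice a canonical ordering of the triple would also be needed so that equivalent triples map to the same key.

```python
import numpy as np

def hash_key(triple, bin_size=5.0):
    """Pose-invariant hash key for a feature triple G3 = {(s1,f1),(s2,f2),(s3,f3)},
    cf. (12): the three shape labels plus the quantized pairwise distances."""
    (s1, f1), (s2, f2), (s3, f3) = triple
    d12 = np.linalg.norm(np.subtract(f1, f2))
    d23 = np.linalg.norm(np.subtract(f2, f3))
    d31 = np.linalg.norm(np.subtract(f3, f1))
    q = lambda d: int(round(d / bin_size))   # quantize distances to make the key discrete
    return (s1, s2, s3, q(d12), q(d23), q(d31))
```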
In order to fully specify the TP (8) and formulate our search algorithm, it remains to establish a model for the feature-value probabilities p(G3|D) given the data D; cf. (7). This is the subject of the next section.
3.2 Inferring Local Surface Shape from Point-Relation Densities
In principle, a generative model of range data on local surface shapes s can be built analogous to (1). Thus, we may construct a conditional probability density p(D|s, f, r), where r ∈ IR3 are the rotational parameters of the local surface patch represented by the point feature (s, f). During recognition, however, it would then be necessary to optimize the parameters r for each s ∈ {1, 2, . . . , m} and f ∈ D, in order to obtain the likelihood of each feature value (s, f). Besides the huge overhead of doing so without being interested in an estimate for r, this procedure would confront us with a problem similar to our original one of optimizing p(D|c, p). On the other hand, the density p(D|s, f) is an infeasible model because of high-order correlations between the data points D. Neglecting these correlations would throw out all information on surface shape s. The strategy we pursue here is to capture some of the informative correlations between data points D in a family of new representations T(c) = {∆1, ∆2, . . . , ∆N}, c ∈ IR3. Each T(c) represents the geometry of the data D relative to the point c, and each ∆i represents geometric relations between multiple data points. A reasonable statistical model is then obtained by neglecting correlations between the ∆i.

Learning Point-Relation Densities. The point relations we are going to exploit here are between four data points, that is, tetrahedron geometries, where three of the points are selected from a spherical neighborhood of the fourth. For four points x1, x2, x3, c ∈ IR3 we define the map

(x1, x2, x3) ↦ ∆(c; x1, x2, x3) := ( r/R ,  d / √(r² − 4a/(3√3)) ,  (4a/3) / (√3 (r² − d²)) ) ∈ [0, 1] × [−1, 1] × [0, 1] ,        (13)

where r is the mean distance of the center c to x1, x2, x3, R is the radius of the spherical neighborhood of c, d is the (signed) distance of c to the plane defined by x1, x2, x3, and a is the area of the triangle they define¹. Explicitly,

r = (1/3) Σ_{i=1}^{3} ||xi − c|| ,        (14)

d = sgn[g · (x1 − c)] |(x1 − c) · [(x2 − x1) × (x3 − x1)]| / ||(x2 − x1) × (x3 − x1)|| ,        (15)

a = (1/2) ||(x2 − x1) × (x3 − x1)|| .        (16)
¹ The quantity introduced in (13) will be denoted by ∆(c; x1, x2, x3), ∆(c), or simply ∆ to indicate its dependence on points as needed.
As seen in (15), the sign of d is determined by the direction of gaze g. We can now define the family of tetrahedron representations T(c) by

T(c) := { ∆(c; d1, d2, d3) | {d1, d2, d3} ⊆ D  ∧  max_{i=1,2,3} ||di − c|| < R  ∧  max_{i=1,2,3} ||di − c|| − min_{i=1,2,3} ||di − c|| < ε } .        (17)
The parameter ε > 0 sets a small tolerance for differences in distance to c among each data-point triple d1, d2, d3; that is, ideally all three points have the same distance. In practice, the tolerance has to be adapted to the density of range-data points D to obtain enough ∆-samples in each T(c). Figure 1 depicts the geometry of the tetrahedron representations.
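The construction (13)-(16) can be transcribed directly; the sketch below follows the reconstructed normalization of (13) given above, which should be treated as an assumption, and omits numerical safeguards for degenerate (near-collinear) point triples.

```python
import numpy as np

def delta(c, x1, x2, x3, gaze, R):
    """Tetrahedron descriptor Delta(c; x1, x2, x3), cf. (13)-(16)."""
    c, x1, x2, x3 = map(np.asarray, (c, x1, x2, x3))
    r = np.mean([np.linalg.norm(x - c) for x in (x1, x2, x3)])      # (14): mean distance to c
    n = np.cross(x2 - x1, x3 - x1)                                  # unnormalized triangle normal
    d = np.sign(np.dot(gaze, x1 - c)) * abs(np.dot(x1 - c, n)) / np.linalg.norm(n)   # (15)
    a = 0.5 * np.linalg.norm(n)                                     # (16): triangle area
    return (r / R,
            d / np.sqrt(r**2 - 4*a / (3*np.sqrt(3))),
            (4*a/3) / (np.sqrt(3) * (r**2 - d**2)))
```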
Fig. 1. Transformation of range data points to a tetrahedron representation. For an inspection point c, the 3-tuples (r, d, a) are collected and normalized [cf. (13)] for each triple of data points d1 , d2 , d3 ∈ D with (approximately) equal distance r ∈ (0, R) from c.
One of the nice properties of the tetrahedron representation (17) is that for any surface of revolution we get at its symmetry point c samples ∆ ∈ T(c) that are confined to a surface in ∆-space characteristic of the original surface shape. For surfaces of revolution, the distinctiveness and low entropy of the distribution of range data D sampled from a surface in 3D is thus completely preserved in the tetrahedron representation. For more general surfaces, it is therefore expected that a high mutual information between ∆-samples and shape classes can be achieved. Our goal is to estimate point-relation densities for each shape class s = 1, 2, . . . , m. Samples are taken from tetrahedron representations centered on feature locations f from a training set Fs, that is, from ∪_{f∈Fs} T(f). Moreover, we add a non-feature class s = 0 that lumps together all the shapes we are not interested in. The result is the set of shape-conditioned densities p(∆|s) for ∆ ∈ [0, 1] × [−1, 1] × [0, 1] and s = 0, 1, 2, . . . , m. In particular, p(∆|s) = p(∆|s, f) for ∆ ∈ T(f). Estimation of p(∆|s) is made simple by the fact that we get O(n³) samples of ∆, i.e., a lot, for each feature location f with n data points within a distance R. For our application, it is sufficient to count ∆-samples in bins to estimate probabilities p(∆|s) for discrete ∆-events as normalized frequencies.

Inferring Local Surface Shape. Let the features' prior probabilities be p(s) for s = 1, 2, . . . , m, that is, p(s) ∈ (0, 1) is the probability that any given data point from D is a feature location of type s. The feature priors are virtually impossible to know for general scenes. We do know, however, that p(s) ≪ 1 for
s = 1, 2, . . . , m, i.e., the overwhelming majority of data points are not feature locations. This must be so for a useful dictionary of point features. It thus makes sense to expand the logarithm of the shapes' posterior probabilities, given l samples ∆1(f), ∆2(f), . . . , ∆l(f) ∈ T(f), f ∈ D,

ln p[s|∆1(f), ∆2(f), . . . , ∆l(f)] = ln p(s) + Σ_{i=1}^{l} { ln p[∆i(f)|s] − ln p[∆i(f)|0] } + O[p(1), p(2), . . . , p(m)] ,        (18)
and neglect the terms O[p(1), p(2), . . . , p(m)]. Remember that p(∆|0) is the ∆-density of the non-feature class. The expression (18) neglects correlations between the ∆i(f) ∈ T(f). The sample length l for each representation T(f) will usually be l ≪ |T(f)|, the cardinality of the set T(f). It is in fact crucial that we do not have to generate the complete representations T(f) for recognition, as this would require O(n³) time for n data points within a distance R from f. Instead, it is possible to draw a random, unbiased sub-sample of T(f) in O(n) and O(l) time. The first term in (18) can be split into

ln p(s) = ln q(s) + ln Σ_{s′=1}^{m} p(s′) ,        (19)
where q(s) ∈ (0, 1) are the relative frequencies of shape classes s = 1, 2, . . . , m. Now, these frequencies do not differ between features over many orders of magnitude for a useful dictionary of point features². The dependence of (19) on the shape class s is thus negligible compared to the sum in (18) for reasonable sample lengths l ≫ 1 (several tens to hundreds). The first term in (18) may hence be disregarded, and we are left with the feature's log-probability

Φ(s, f; D) := ln p[s|∆1(f), ∆2(f), . . . , ∆l(f)] + const. ≈ Σ_{i=1}^{l} { ln p[∆i(f)|s] − ln p[∆i(f)|0] } ,        (20)
that is, a log-likelihood ratio.
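A direct transcription of (20) could look as follows; the binned densities are assumed to be available as functions of a discretized ∆, which is an implementation assumption rather than a detail fixed by the paper.

```python
import numpy as np

def phi_score(delta_samples, density_s, density_0, floor=1e-9):
    """Feature log-probability Phi(s, f; D), cf. (20): a log-likelihood ratio
    accumulated over l sampled Delta-values drawn from T(f).
    density_s and density_0 map a (discretized) Delta to p(Delta|s) and p(Delta|0)."""
    score = 0.0
    for delta in delta_samples:
        p_s = max(density_s(delta), floor)   # avoid log(0) for empty histogram bins
        p_0 = max(density_0(delta), floor)
        score += np.log(p_s) - np.log(p_0)
    return score
```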
3.3 The Search Algorithm
We still need to determine the TP (8) for guiding the search (9). Let again G3 = {(s1, f1), (s2, f2), (s3, f3)}. It is reasonable to assume statistical independence of the point features in G3, as long as there are many different objects in the world that have these features³. We hence model the feature-triples' posterior probability p(G3|D) in (7) as
² A feature that is more than a hundred times less frequent than the others should usually be dropped for computational efficiency.
³ These objects may or may not be in our modeled set.
p(G3|D) = Π_{i=1}^{3} p[si|∆1(fi), ∆2(fi), . . . , ∆l(fi)]  ∝  exp( Σ_{i=1}^{3} Φ(si, fi; D) ) .        (21)
The TP can then be expressed as

TP(c, p; D) = max_{G3∈C} { ln Σ_{i=1}^{h(G3)} γi(G3) δ_{c,Ci(G3)} δ_{p,Pi(G3)} + Σ_{i=1}^{3} Φ(si, fi; D) } ,        (22)

up to additive constants. The first term under the max-operation is the contribution from feature grouping⁴, the second from single features. In particular, we get

TP(c, p; D) = −∞    for    (c, p) ∉ { (Ci(G3), Pi(G3)) | G3 ∈ C, i ∈ {1, 2, . . . , h(G3)} } ,        (23)
that is, for (c, p)-values not suggested by any feature triple G3. We are now prepared to formulate the search algorithm. The following is not meant to specify a flow of processing, but rather a logical sequence of operations.

1. From all possible feature values (s, f) ∈ {1, 2, . . . , m} × D, select as feature candidates the values with log-probability Φ(s, f; D) > Ξ.
2. From the feature candidates, generate all triples G3. Use their keys (12) into a hash table to find the associated object hypotheses (Ci(G3), Pi(G3)) and their weights γi(G3) for i = 1, 2, . . . , h(G3).
3. For the object hypotheses obtained, compute the TPs.
4. Evaluate the object hypotheses by the likelihood L in the order of decreasing TP, starting with the hypothesis scoring highest under TP. Skip duplicate hypotheses. Stop when L(c, p; D) > Θ for a hypothesis (c, p).
5. Return the last evaluated object parameters (c, p) as the result if L(c, p; D) > Θ. Else conclude that none of the sought objects is present in the data.

Some comments are in order.
– Comment on step 1: The restriction to the most probable feature values by introduction of the threshold Ξ is optional. It is a powerful means, however, to radically speed up the algorithm if there are many range data points D. If the probability measure on the features is reliable and Ξ is adjusted conservatively, the risk of missing a good set of object parameters is very small; see Sect. 4.1 below.
– Comment on step 2: Feature triples G3 ∉ C, that is, inconsistent triples, are discarded by the hashing process.
⁴ The factor δ_{p,Pi(G3)} is an idealization. Since the pose parameters p are continuous, the hashing with the key (12) is necessarily error tolerant. The factor is then to be replaced by an integral ∫_{Pi(G3)} dp′ δ(p − p′) for a set Pi(G3) of pose parameters.
– Comment on step 3: Computation of the TP is cheap, as the hashed weights γi from step 2 and the features' log-probabilities Φ from step 1 can be reused; cf. (22).
– Comment on step 4: Duplicate hypotheses may be very unlikely, but this depends on when two hypotheses are considered equal. In any case, skipping them realizes the max_{G3∈C}-operation in the TP (22), ensuring that each hypothesis is drawn only by the feature triple which makes it most probable.
– Comment on step 5: If the likelihood L does not reach the threshold Θ but we know a priori that at least one of the objects must be present in the scene, the algorithm may return the object parameters (c, p) that have scored highest under L.

A complete scene interpretation often involves more than one recognized set of object parameters (c, p). A simultaneous search for several objects is possible within the current framework, although it will be computationally very costly. It requires evaluation of TPs and likelihoods

TP(c1, p1, c2, p2, . . . ; D) ,    L(c1, p1, c2, p2, . . . ; D)        (24)
for a combinatorially enlarged set of object parameters (c1 , p1 , c2 , p2 , . . . ). A much cheaper, although less reliable, alternative is sequential search, where the data belonging to a recognized object (c, p) are removed prior to the algorithm’s restart.
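Putting the pieces together, the top-level control flow of steps 1-5 could be organized as below. The helper functions (candidate scoring, hash lookup, TP computation, and the likelihood of Sect. 2) are assumed interfaces; this is an illustrative sketch of the search loop, not the authors' implementation.

```python
from itertools import combinations

def probabilistic_search(feature_scores, hash_lookup, truncated_prob,
                         likelihood, xi, theta):
    """Hypothesize-and-test search, cf. steps 1-5 of Sect. 3.3.
    feature_scores: list of ((s, f), Phi) pairs;  hash_lookup: G3 -> [(c, p, gamma), ...];
    truncated_prob: (hypothesis, scored triple) -> TP value;  likelihood: (c, p) -> L(c, p; D)."""
    # Step 1: keep only feature candidates with Phi > Xi
    candidates = [(sf, phi) for sf, phi in feature_scores if phi > xi]

    # Step 2: all triples of candidates, hashed to object hypotheses
    hypotheses = {}                              # (c, p) -> best TP that draws it
    for triple in combinations(candidates, 3):
        G3 = tuple(sf for sf, _ in triple)
        for (c, p, gamma) in hash_lookup(G3):    # empty for inconsistent triples
            tp = truncated_prob((c, p, gamma), triple)    # Step 3, cf. (22)
            key = (c, tuple(p))                  # pose assumed hashable as a tuple
            hypotheses[key] = max(hypotheses.get(key, float("-inf")), tp)

    # Step 4: evaluate by decreasing TP until the likelihood threshold is reached
    for (c, p), _ in sorted(hypotheses.items(), key=lambda kv: kv[1], reverse=True):
        if likelihood(c, p) > theta:
            return c, p                          # Step 5: accepted hypothesis
    return None                                  # none of the sought objects found
```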
4 Experiments
We have performed two types of experiment. One tests the quality of the measure (20) of the features' posterior probability. The other probes the capabilities of the complete algorithm for object segmentation and recognition. All experiments were run on stereo data obtained from a three-camera device (the ‘Triclops Stereo Vision System’, Point Grey Research Inc.). The stereo data were calculated from three 320 × 160-pixel images by a standard least-sum-of-pixel-differences algorithm for correspondence search. The data were accordingly of a rather low resolution. They were moreover noisy, contained a lot of artifacts and outliers, and even visible surfaces lacked large regions of data points, as is not uncommon for stereo data; see Figs. 4 and 5.
4.1 Local Shape Recognition
Only correct feature values are able to draw correct object hypotheses (c, p), except in accidental, improbable events. For an acceptable search length it is, therefore, crucial that correct feature values yield a high log-probability Φ; cf. (20) and (22). This is even more crucial if feature candidates are preselected by imposing a threshold Φ(s, f; D) > Ξ; cf. step 1 of the algorithm in Sect. 3.3. As point features, we chose convex and concave rectangular corners. The training set contained 221 feature locations that produced 466,485 ∆-samples for the densities p(∆|s), s ∈ {“convex corner”, “concave corner”}. For the non-feature class s = 0 we obtained 9,623,073 ∆-samples from several thousand
non-feature locations (mostly on planes, also some edges). The parameters for ∆ sampling were R = 15 mm and ε = 1 mm; cf. Sect. 3.2. The ∆-samples were counted in 15 × 20 × 10 bins, which corresponds to the discretization used for the ∆-space [0, 1] × [−1, 1] × [0, 1]. Feature-recognition performance was evaluated on 10 scenes that contained (i) an object with planar surface segments, edges, saddle points, and altogether 65 convex and concave corner points (feature locations); (ii) a piece of loosely folded cloth that contributed a lot of potential distractor shapes. The sample length was l = 50 ∆-samples; cf. Sect. 1. Since drawing ∆-samples is a stochastic process, we let the program run 10 times on each scene with different seeds for the random-number generator to obtain more significant statistics. Corner points are particularly interesting as test features, since they allow for comparison of the proposed log-probability measure (20) with the classic method of curvature estimation. Let c1(f; D) and c2(f; D) be the principal curvatures of a second-order polynomial that is least-squares fitted to the data points from D in the R-sphere around the point f. Because the corners are the maximally curved parabolic shapes in the scenes, it makes sense to use the curvature measures

C(s = “convex corner”, f; D) := min[c1(f; D), c2(f; D)] ,        (25)
C(s = “concave corner”, f; D) := min[−c1(f; D), −c2(f; D)] ,        (26)
as alternative measures of “cornerness” of the point f for convex and concave shapes, respectively. We thus compare Φ(s, f ; D) as defined in (20) to C(s, f ; D) on true and false feature values (s, f ) ∈ {“convex corner”, “concave corner”} × D .
(27)
In Fig. 2 we show the distribution of Φ- and C-scores for the 65 true feature values in relation to their distribution for all possible feature values, of which there are more than 100,000. As can be seen, the Φ- and C-distributions for the population of true values are both distinct from, but overlapping with, the Φ- and C-distributions for the entire population of feature values. A qualitative difference between Φ- and C-scoring of feature values is more obvious in the ranking they produce. In Fig. 3 we show the distribution of Φ- and C-ranks of the true feature values among all feature values, with 1 being the first rank, i.e., highest Φ/C-score, and 0 the last. Both Φ- and C-ranks of true values are most frequently found close to 1 and gradually less frequently at lower ranks. There are almost no true features Φ-ranked below 0.6. The C-ranks, in contrast, seem to scatter uniformly in [0, 1] for part of the true features. These features score under C as poorly as the non-features. Using the C-score, they would in practice not be available for object recognition, as they would only be selected to draw an object hypothesis after hundreds to thousands of false feature values. The result of the comparison does not come as a complete surprise. Some kind of surface-fitting procedure is necessary for estimation of the principal curvatures [2,3,18]. If not a lot of care is invested, fitting to data will always be very sensitive to outliers and artifacts, of which there are many in typical stereo data. A robust fitting procedure, however, has to incorporate in some way a statistical model of the data that accounts for its outliers and artifacts. We here propose to go all
Fig. 2. Score distribution of true (transparent bars, thick lines) and all possible (filled bars, thin lines) feature values as produced by the Φ-score (left) and the C-score (right). The horizontal axis in each plot is the score; the vertical axis is the normalized frequency of feature values.
Fig. 3. Rank distribution of true feature values among all possible feature values as produced by the Φ-score (left) and the C-score (right); rank 1 is the highest score in each data set, rank 0 the lowest score. The horizontal axis in each plot is the rank; the vertical axis is the normalized frequency of true feature values.
the way to a statistical model, the point-relation densities, and to do without any fitting for local shape recognition.
4.2 Object Segmentation and Recognition
We have tested segmentation and recognition performance of the proposed algorithm on two objects. One is a cube from which 8 smaller cubes are removed at its corners; the other is a subset of this object. These two objects can be assembled to create challenging segmentation and recognition scenarios. The test data were taken from 50 scenes that contained the two objects to be recognized, and additional distractor objects in some of them. A single data set consisted of roughly 5,000 to 10,000 3D points. As an acceptable accuracy for recognition of an object’s pose we considered what is needed for grasping with a hand (robot or human), i.e., less than 3 degrees of angular and less than 5 mm of translational misalignment. For a fixed set of parameters of the search algorithm, in all but 3 cases the result was acceptable in this sense. Processing times ranged roughly between 1 and 50 seconds, staying mostly well below 5 seconds, on a Sun UltraSPARC-III workstation (750 MHz). In Figs. 4 and 5 we show examples of scenes, camera images, range data, and recognition results. Edges drawn in the data outline the object pose recognized first. The other object can be recognized after deletion of data points belonging to the first object.
Fig. 4. Example scene with the two objects we used for evaluation of segmentation and recognition performance. The smaller object is stacked on the larger as drawn near the center of the figure. Shown are one of the camera images and three orthogonal views of the stereo-data set. The object pose recognized first is outlined in the data. The length of both objects’ longest edges is 13 cm. The recognition time was 1.2 seconds.
5 Conclusion
We have derived from a probabilistic perspective a new variant of hypothesize-and-test algorithm for object recognition. A critical element of any such algorithm is the ordered sequence of hypotheses that is tested. The ordered sequence that is generated by our algorithm is inferred from a statistical criterion, here called the truncated probability (TP) of object parameters. As a key component of the truncated probability, we have introduced point-relation densities, from which we obtain an estimate of posterior probability for surface shape given range data. One of the strengths of the algorithm, as demonstrated in experiments, is its very high degree of robustness to noise and artifacts in the data. Raw stereo-data points of low quality are sufficient for many recognition tasks. Such data are fast and cheap to obtain and are intrinsically invariant to changes in illumination. We have argued in Sect. 3.1 that feature groups of size g = 3, i.e., minimal groups, are a good choice for guiding the search for model matches. However, if we can be sure that more than three true object features are represented in the data, it may be worth considering larger groups. For g > 3, the density
Fig. 5. Example scene with the two objects we used for evaluation of segmentation and recognition performance; cf. Fig. 4. The recognition time was 1.5 seconds.
(10) that enters the TP (8) looks more complicated. In particular, hash tables of higher dimension are needed to accommodate the grouping laws, i.e., the weights γ(Gg). If we do without the grouping laws, using larger groups is equivalent to having more than just the largest term of the density (6) contributing to the TP; cf. (7). In the limit of large feature groups, the approach then resembles generalized-Hough-transform and pose-clustering techniques [1,17,16]. One avenue of research suggested here is the generalization of point-relation densities. Alternative representations of range data have to be explored for learning posterior-probability estimates for various surface shapes.
References

[1] Ballard, D. H. Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognit. 13 (1981), 111–122.
[2] Besl, P., and Jain, R. Intrinsic and extrinsic surface characteristics. In Proc. IEEE Conf. Comp. Vision Patt. Recogn. (1985), pp. 226–233.
[3] Besl, P., and Jain, R. Three-dimensional object recognition. ACM Comput. Surveys 17 (1985), 75–145.
[4] Faugeras, O. D., and Hebert, M. The representation, recognition, and positioning of 3-D shapes from range data. In Three-Dimensional Machine Vision, T. Kanade, Ed. Kluwer Academic Publishers, 1987, pp. 301–353.
[5] Fischler, M., and Bolles, R. Random sample consensus: A paradigm for model fitting with application to image analysis and automated cartography. Comm. Assoc. Comp. Mach. 24 (1981), 381–395.
[6] Fleuret, F., and Geman, D. Graded learning for object detection. In Proc. IEEE Workshop Stat. Comput. Theo. Vision (1999).
[7] Goad, C. Special purpose automatic programming for 3D model-based vision. In Proc. ARPA Image Understanding Workshop (1983).
[8] Grimson, W. E. L. Object Recognition by Computer: The Role of Geometric Constraints. MIT Press, Cambridge, MA, 1990.
[9] Huttenlocher, D. P., and Ullman, S. Object recognition using alignment. In Proc. Intern. Conf. Comput. Vision (1987), pp. 102–111.
[10] Huttenlocher, D. P., and Ullman, S. Recognizing solid objects by alignment with an image. Int. J. Computer Vision 5 (1990), 195–212.
[11] Jacobs, D., and Clemens, D. Space and time bounds on indexing 3-D models from 2-D images. IEEE Trans. Patt. Anal. Mach. Intell. 13 (1991), 1007–1017.
[12] Kuno, Y., Okamoto, Y., and Okada, S. Robot vision using a feature search strategy generated from a 3-D object model. IEEE Trans. Patt. Anal. Machine Intell. 13 (1991), 1085–1097.
[13] Lamdan, Y., and Wolfson, H. J. Geometric hashing: a general and efficient model-based recognition scheme. In Proc. Intern. Conf. Comput. Vision (1988), pp. 238–249.
[14] Lowe, D. G. Perceptual Organization and Visual Recognition. Kluwer, Boston, 1985.
[15] Lowe, D. G. Visual recognition as probabilistic inference from spatial relations. In AI and the Eye, A. Blake and T. Troscianko, Eds. John Wiley and Sons Ltd., New York, 1990, pp. 261–279.
[16] Moss, S., Wilson, R. C., and Hancock, E. R. A mixture model for pose clustering. Patt. Recogn. Let. 20 (1999), 1093–1101.
[17] Stockmann, G. Object recognition and localization via pose clustering. CVGIP 40 (1987), 361–387.
[18] Stokely, E. M., and Wu, S. Y. Surface parameterization and curvature measurement of arbitrary 3-D objects: Five practical methods. IEEE Trans. Patt. Anal. Mach. Intell. 14 (1992), 833–840.
[19] Torr, P. H. S., and Zisserman, A. MLESAC: A new robust estimator with application to estimating image geometry. Comput. Vision Image Understanding 78 (2000), 138–156.
[20] Wolfson, H. J. Model-based object recognition by geometric hashing. In Proc. Eur. Conf. Comput. Vision (1990), pp. 526–536.
Real-Time Interactive Path Extraction with On-the-Fly Adaptation of the External Forces

Olivier Gérard1, Thomas Deschamps1,2,3, Myriam Greff1, and Laurent D. Cohen3

1 Medical Imaging Systems Group, Philips Research France, 51 rue Carnot, 92156 Suresnes, France.
[email protected] 2 Now with Lawrence Berkeley National Laboratory and University of California, Berkeley, USA.
[email protected] 3 Ceremade, Université Paris 9-Dauphine, 75775 Paris Cedex 16, France.
[email protected]
Abstract. The aim of this work is to propose an adaptation of optimal-path-based interactive tools for image segmentation (related to the Live-Wire [12] and Intelligent Scissors [18] approaches). We efficiently use both discrete [10] and continuous [6] path search approaches. The segmentation relies on the notion of an energy function, and we introduce the possibility of complete on-the-fly adaptation of each individual energy term, as well as of their relative weights. Non-specialist users then have full control of the drawing process, which automatically selects the most relevant set of features to steer the path extraction. Tests have been performed on a large variety of medical images.
1 Introduction
Approaches to image segmentation are numerous, ranging from fully automatic methods to fully manual methods. The former avoid user interaction entirely but remain an unsolved problem: even if they are well adapted to specific cases, their success cannot be guaranteed in more general cases. The latter are time-consuming, hardly reproducible and inaccurate. To overcome these problems, interactive (or semi-automatic) methods are used. They combine the knowledge of the user and the capabilities of the computer to provide supervised segmentation, ranging from manual pixel-level drawing to minimal user intervention. The aim of this work is to develop a method that allows even a non-expert to quickly draw the boundary of an object of interest. The user could restrict his intervention to the setting of a start point in an image. A contour then has to be automatically found and drawn in real time between this start point and the current cursor position. This contour has to respond to certain constraints, such as internal and external forces and the actions of the user. The developed method has to let the user validate the result or not. Using this validation, the tool has to be able to learn on-line to better estimate the different parameters of the model.
In contour-oriented segmentation, one approach is to define a boundary as the minimum of an energy function that comprises many components such as internal and external forces. In the literature, there exist many techniques to perform this minimization. Classical active contours (also called Snakes or Deformable Models), introduced by Kass, Witkin and Terzopoulos [16], have received a lot of attention during the past decade. But this technique presents three main problems. First, deformable models are very sensitive to the initialization step and often get trapped in a local minimum [3]. Second, user control cannot be applied during the extraction but only during the initialization and the validation stages of the segmentation process. Third, the different parameters of the model are not meaningful enough from a user's viewpoint (especially for clinicians). The application of minimal path theory to image segmentation is a more recent technique. With this approach the image is defined as an oriented graph characterized by its cost function. The boundary segmentation problem becomes an optimal path search problem between two nodes in the graph. This approach overcomes the problem of local minima by using either dynamic programming (Dijkstra [10]) or a front propagation equation (Cohen and Kimmel [6]), mapping the non-convex cost function into a function with only one minimum. Falcao and Udupa with their Live-Wire [11,12] and Mortensen and Barrett with their Intelligent Scissors [18,19,20,21] introduced interactivity into the optimal path approach. Their method is based on Dijkstra's graph search algorithm and gives the user a large degree of control over the segmentation process. The idea is the following: a start point is selected by the user on the boundary to be extracted, and an optimal path is computed and drawn in real time between this start point and the current cursor position (see Fig. 1). Thus, user control is applied
Fig. 1. Principle of interactive contour extraction with optimal path method: the optimal trajectory is extracted in real time between a user defined starting point and the mouse cursor.
also during the extraction. Mortensen and Barrett [1,20] also introduced functionalities called Path-Cooling and On-the-Fly training, which respectively lead to the possibility of drawing a closed contour and to a partial adaptation of the graph cost function. This technique seems to be the best suited to our target. The application therefore consists first in the implementation of a method based on minimal path search, inspired by the Live-Wire and Intelligent Scissors tools, as presented in Section 2. Section 3 then explains several improvements brought to these interactive path extraction techniques, through the use of the Eikonal equation for extracting the minimal path. Section 4 details the implementation of our On-the-Fly training method. Section 5 contains a discussion of the proposed method and future work.
2 Live-Wire and Intelligent Scissors Methods
2.1 Cost Function
The optimal-path search is guaranteed to find the solution minimizing the energy function between two points. But if this function does not fit the object to extract well enough, the contour obtained by this method will not be suitable. That is why we propose to automatically tune the parameters of the cost function during the extraction process. Nevertheless, the cost function relies on salient features of the image. A feature is supposed to describe certain properties of the boundary and of its environment (gray levels of the contour, of the background, ...). These features are then mapped, through the use of a so-called cost assignment function (CAF), into cost functions which are similar to the "Potential" instances of active contour models. For instance, the following list records the cost functions Cx associated with some features quoted in the Live-Wire [12] and Intelligent Scissors [20] papers:
– Cg: Gradient magnitude;
– Cd: Direction of the gradient magnitude;
– CL: Laplacian feature;
– CI: Intensity on the positive (inside) side of the boundary;
– CO: Intensity on the negative (outside) side of the boundary;
– CE: Intensity on the boundary (edge).
The cost assignment process depends on how one wants to emphasize one or another value of the feature. For the gradient magnitude the inverse can for example be taken as a CAF in order to favor high contrasts, but a Gaussian function, centered on the gradient value one wants to highlight, could also be applied. For the Laplacian feature, the CAF is usually a zero-crossing detector. But many other types of CAF may be used. Once satisfactory individual cost functions are defined, they are combined into a total cost function. Let us call potential the weighted sum of all the individual cost functions. On the directed graph-arc from a pixel p to an adjacent pixel q, the potential used by Intelligent Scissors [20] is defined by:
P(p, q) = \omega_g C_g(q) + \omega_L C_L(q) + \omega_d C_d(p, q) + \omega_i C_i(q) + \omega_o C_o(q) + \omega_e C_e(q) \qquad (1)

where each \omega_x is the weight of the corresponding cost function in the previous list. The energy of a path to minimize with Dijkstra [10] is then defined by

E_{path} = \sum_{(p,q) \in path} P(p, q) \qquad (2)
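As an illustration of Eqs. (1) and (2) — not the authors' implementation — the following sketch combines pre-computed feature cost maps into a single potential and extracts the minimal-cost path with Dijkstra's algorithm on an 8-connected pixel grid; the cost maps, the weights and the omission of the directional term C_d(p, q) are simplifying assumptions.

```python
import heapq
import numpy as np

def total_potential(cost_maps, weights):
    """Weighted sum of individual feature cost maps, as in Eq. (1).
    The directional term C_d(p, q) is omitted in this sketch."""
    return sum(w * c for w, c in zip(weights, cost_maps))

def dijkstra_path(potential, seed, target):
    """Minimal-cost 8-connected path between two pixels, minimizing Eq. (2)."""
    h, w = potential.shape
    dist = np.full((h, w), np.inf)
    prev = {}
    dist[seed] = 0.0
    heap = [(0.0, seed)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if (r, c) == target:
            break
        if d > dist[r, c]:
            continue                      # stale heap entry
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr or dc) and 0 <= nr < h and 0 <= nc < w:
                    nd = d + potential[nr, nc]       # cost of entering pixel q
                    if nd < dist[nr, nc]:
                        dist[nr, nc] = nd
                        prev[(nr, nc)] = (r, c)
                        heapq.heappush(heap, (nd, (nr, nc)))
    path, node = [target], target
    while node != seed:                   # back-track the optimal path
        node = prev[node]
        path.append(node)
    return path[::-1]
```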
2.2 Path-Cooling
With the optimal path approach [10] it is not possible in general to extract a closed contour with only one seed point and one end point (a solution to this problem has been proposed in [4], based on saddle points). If the extracted path becomes too long, the optimal path search method will favor shorter paths cutting through the background: it is often necessary to set several points to draw the expected contour. Path-Cooling was introduced in Intelligent Scissors [1], as Boundary Cooling, and achieves automatic generation of seed points. When a new seed point is generated, the boundary segment between this new point and the previous seed point is "frozen". A new seed point is generated when a pixel in the contour is considered stable enough. The stability criterion is a function of both the time spent on the active boundary segment and the path coalescence (in other terms: how many times a point has been "drawn").
2.3 On-the-Fly Training
For some features it is hard to decide without prior knowledge which values are to be preferred in the cost function. The notion of On-the-Fly training is introduced in Intelligent Scissors [20] and consists in adapting, during the extraction, the cost functions to the specificities of the desired contour, so as to track a contour with slowly changing properties. This is done for each feature independently of the others. The idea is the following: assuming that the user has drawn a long enough and valid boundary segment, the cost function of the feature has to be modified in order to favor contours with the same feature values as those found on the segment, called the training path. In practice, the process modifies directly the cost assignment function and not the feature itself. The CAF is basically represented as a histogram weighting, which is very easy to compute. During extraction, the CAF is iteratively modified by removing from it the training histogram (built on the training path) and scaling the result between 0 and 1. Figures 2 and 3 illustrate this process with the edge intensity feature. The advantage of On-the-Fly training is that the potential can be adapted in real time during the extraction.
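A minimal sketch of this histogram-based update, under the assumption that the CAF is stored as a 256-bin lookup table (the helper name and bin count are ours, not the published implementation):

```python
import numpy as np

def train_caf(caf, feature_values_on_path, n_bins=256):
    """Classical On-the-Fly training step: subtract the (scaled) training
    histogram from the cost assignment function and rescale it to [0, 1]."""
    hist, _ = np.histogram(feature_values_on_path, bins=n_bins, range=(0, n_bins))
    hist = hist / max(hist.max(), 1)          # training histogram scaled to [0, 1]
    new_caf = caf - hist                      # favor feature values seen on the path
    new_caf -= new_caf.min()
    return new_caf / max(new_caf.max(), 1e-12)
```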
Fig. 2. Training initialization: (a) the feature trained; (b) its corresponding histogram at initialization; (c) its corresponding cost assignment function.
Fig. 3. Training iteration: (a) the trained path; (b) the histogram along the trained path; (c) the cost assignment function at iteration i; (d) the CAF minus the trained histogram; (e) the scaled CAF for iteration i + 1.
3 Adaptations and Improvements
In this section, the original contribution essentially lies in the introduction of a more general path search scheme and in the adaptation of the cooling speed.
3.1 Eikonal Minimal Path Extraction
We worked on using the path extraction method of Cohen and Kimmel [6] in the framework of 2D Live-Wire and Intelligent Scissors. This extraction method uses the Eikonal equation for propagation. In our implementation we use this continuous formulation and compare it with Dijkstra as used in Live-Wire [12] and Intelligent Scissors [20]. The main idea of the Cohen and Kimmel method [6] is that the potential and the graph are considered to be continuous, producing a sub-pixel path. In [6], contrary to the classical snake model [16] (but similarly to geodesic active contours [2]), the contour is parameterized by the arc-length parameter s, which means that \|C'(s)\| = 1, leading to a new energy form. Considering a simplified energy model without any second derivative term leads to the expression

E(C) = \int \{ w\,\|C'(s)\| + P(C) \}\, ds.

Assuming that \|C'(s)\| = 1 leads to

E(C) = \int_{\Omega} \{ w + P(C(s)) \}\, ds. \qquad (3)
We now have an expression in which the internal forces are included in the external potential. With this approach, detailed in [5], the energy to be minimized is defined as the integral of a strictly positive functional having low values close to the desired features. Some improvements on finding the minimal path in 2D and 3D images have been introduced in [7,8,9].
3.2 Comparison between Dijkstra and Eikonal Equation
As explained in [6], the main difference between Dijkstra and the Cohen and Kimmel design of the minimal path lies in the considered metric. In the first case, the minimal path minimizes the sum of the potential along the nodes of the graph (L1 path), and in the second case, it minimizes the integral of the potential on the image domain (L2 path). With Dijkstra, the image is considered as a graph in which each pixel is a node and the weights on the arcs are functions of the energy to minimize. This method uses dynamic programming to compute the optimal path [10]. The Cohen and Kimmel approach keeps a continuous framework for the problem and computes the path by solving the Eikonal equation with the Fast-Marching algorithm [5,22]. Even if the Eikonal method uses integrals to compute the optimal contour, it is not slow: the average computing-time ratio between both methods on synthetic and real images is near 0.9. And, because the Eikonal equation produces a sub-pixel path (L2 path), it is more accurate. The continuous formulation of the Cohen and Kimmel method has the advantage of keeping a more general framework for the energy definition, thereby allowing applications using other kinds of potentials. It also easily includes an offset term w (see (3)) to constrain the regularity of the path, which is more difficult with dynamic programming methods [17,13].
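For illustration, a sub-pixel path can be recovered from the arrival-time (action) map produced by a Fast-Marching solver by gradient descent towards the seed point. The sketch below assumes the map U is already computed; the step size and tolerance are arbitrary choices, and nearest-neighbour gradient sampling is used instead of interpolation for brevity.

```python
import numpy as np

def backtrack_path(U, end_point, seed_point, step=0.5, tol=1.0, max_iter=100000):
    """Sub-pixel minimal path: descend the gradient of the arrival-time map U
    from the end point back to the seed point."""
    gy, gx = np.gradient(U)
    p = np.asarray(end_point, dtype=float)
    seed = np.asarray(seed_point, dtype=float)
    path = [p.copy()]
    for _ in range(max_iter):
        if np.linalg.norm(p - seed) < tol:
            break
        iy = int(np.clip(np.round(p[0]), 0, U.shape[0] - 1))
        ix = int(np.clip(np.round(p[1]), 0, U.shape[1] - 1))
        g = np.array([gy[iy, ix], gx[iy, ix]])
        n = np.linalg.norm(g)
        if n < 1e-12:
            break                         # flat region: stop descending
        p = p - step * g / n              # move against the gradient of U
        path.append(p.copy())
    path.append(seed)
    return np.array(path)
```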
3.3 Path-Cooling Improvement
Path-Cooling methods are based on the idea that if a point in the active contour is stable enough, the corresponding boundary segment should be "frozen". Our approach uses the two counters described in [20], with an adaptation: the time history is updated by multiplying (and not adding) a scaled potential-driven factor instead of a simple gradient-driven factor. The multiplication has a weighting effect which gives more influence to pixels with a low potential. At pixel p and for iteration i of the cooling process, the time history TH_i is defined as

TH_i(p) = TH_{i-1}(p) + t_{i-1} \times [1 - P(p)]

where t_i is the consecutive time the path was active until iteration i and P the penalty in (1). The time history is useful if we consider that when the user does not move (redraw history locked, but time history incremented), he wants to emphasize the already drawn contour. But, as both thresholds are to be satisfied, even if the user moves very slowly (or not at all), the active path will freeze slowly. In
practice, the definition of the thresholds is not obvious and the main difficulty lies in the interpretation of the mouse motion, in terms of speed and acceleration. We can either increase the cooling speed with the mouse cursor speed or decrease it. An argument for increasing the cooling speed with the mouse speed is the following: if the user moves fast, it means there is no real difficulty with the drawing, whereas where the contour has lots of details, the user must act slowly to define little areas. An argument for decreasing the cooling speed is the following: the areas where the user goes slowly are the areas which are not very well defined. On the other hand, this implies that going too slowly in other areas (hesitating, for example) may create false paths. No choice is totally satisfying in a general case but, once one method is chosen depending on the application, Path-Cooling is a very satisfying tool to extract closed boundaries, as shown in Fig. 4.
Fig. 4. Path-Cooling on an ultrasound image of a breast: iterations of the delineation of a tumor; the white cross is the seed point at iteration 0; the black cross is the mouse cursor; the white path is the “cooled” one; the black path is the currently drawn path between the mouse cursor and the new seed point at each iteration.
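One possible implementation of the cooling criterion, following the TH_i update above, is sketched here; the data structures, threshold values and the convention that the last stable pixel along the active path becomes the new seed are assumptions of this sketch, not the authors' code.

```python
def update_cooling(time_hist, redraw_count, active_path, potential,
                   active_time, th_time=20.0, th_redraw=15):
    """One Path-Cooling iteration: update both counters for every pixel of the
    active path and return a pixel stable enough to become a new seed point
    (the last one found along the path), or None."""
    new_seed = None
    for p in active_path:
        redraw_count[p] = redraw_count.get(p, 0) + 1
        # time history weighted by the scaled potential P in [0, 1]:
        # pixels with a low potential cool faster (TH_i update above)
        time_hist[p] = time_hist.get(p, 0.0) + active_time * (1.0 - potential[p])
        if time_hist[p] > th_time and redraw_count[p] > th_redraw:
            new_seed = p
    return new_seed
```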
4 On-the-Fly Training Improvement
4.1 Training Path
Training, as described in [20], is based on the distribution of the feature values on a validated contour segment. As depicted in Fig. 5, we tested three different definitions of the training path, which can either be:
1. the last section of the frozen contour; training is applied at each setting of a seed point (white points of the path in Fig. 5);
2. the last section of the active boundary; training is applied at each mouse movement, i.e. at each path extraction (black points of the path in Fig. 5);
3. the set of points where a mouse motion event is sent to the system; training is applied at each such event (black crosses in Fig. 5).
Fig. 5. Several training paths: on this fluoroscopy image, the cursor position is shown with crosses, the frozen part of the path is in white, and the current free path in black.
We found that the first definition leads to better results. Training is effective when a new seed point is manually or automatically set. This setting of a seed point can be seen as the validation of a path segment. The free path, because of its high variability and dependence on the local potential, is not a suitable training area; neither is the set of mouse motion positions, because it constrains the mouse cursor to stay in the vicinity of the desired contour.
4.2 New Way of Training
We have developed an improvement of the classical training method [15,14]. In existing methods the training area is limited to a portion of the path itself and only takes positive information into account (see Section 2.3). The original technique we use is based on the addition of another training area based on "negative" information. The positive training area contains pixels that could belong to the contour, while the negative training area contains pixels that do not belong to the contour. The positive area has a reinforcing role while the negative area has a penalizing role in the cost assignment process. This improvement has two consequences: firstly, the new potential is more robust and secondly, the comparison of the distributions of the features (i.e. histograms) on each area helps adapt the weights of the individual cost functions in the total potential.
Definition of the positive and negative training areas. The main problem is to define suitable positive and negative areas that describe well enough which pixels belong to the contour and which don't. We tested four approaches for the negative area definition:
1. In the minimal box including the training path, the points closer to the path than a certain distance d are considered in the positive area, the other points of the box are considered in the negative area (see Fig. 6-(a));
2. In this box, the points closer to the path than a certain distance dp are considered in the positive area and the points further from the path than the
distance dp and closer to the path than a certain distance dn are considered in the negative area (see Fig. 6-(b));
3. The positive and negative areas are made from paths coming from the neighborhood of the click position. The paths coming from the nearest neighborhood form the positive area and the paths coming from the furthest neighborhood form the negative area (see Fig. 6-(c));
4. The training set of points is made from translations of the path: in the minimal box including the training path, the nearest translations are the positive area and the furthest translations are the negative area (see Fig. 6-(d)).
For each approach, it is possible to add a weighting function based on the distance to the training path. The first approach is not accurate enough for the
Fig. 6. Different definitions of the positive and negative areas: Positive area in white, negative area in gray, other points of the box in black; (a) Positive area is a ribbon around the path, negative area is the rest of the box; (b) Positive area is a ribbon around the path, negative area is a ribbon around the positive area; (c) Areas built by the paths coming from a region around the mouse cursor; (d) Areas built with translations of the training path.
definition of the negative area. Indeed, it is possible that other points of the box belong to another correct path section. This is the motivation for introducing the second approach. The third method seemed a priori to be the most accurate to define a negative area, but all the paths quickly merge, thus limiting the size of the training set. The second and last approaches are very similar and give the best results; the difference is that the former is a continuous version of the latter. We therefore choose the last approach: the positive area is the set of p nearest translations and the negative area is the set of n next translations of the training path. The translation direction is given by the mean normal direction to the training path. The different translations are weighted according to their distance to the training path (see Fig. 7-(d)). The negative/positive areas are symmetric for the gradient magnitude and the edge intensity features (Fig. 7-(a)). But, in order to consider the non-symmetric aspect of the inside and outside features, we adopt for them non-symmetric training areas, depending on the direction of the gradient on the path, the path direction and the considered feature. See examples in Fig. 7-(b)-(c). The positive/negative training
816
O. G´erard et al.
Fig. 7. Training areas: in white lines the positive area and in black lines the negative one. The longest line is the training path. (a) Symmetric; (b) Asymmetric for inside feature; (c) Asymmetric for outside feature; (d) Schematic weighting function.
sets of points are used to build two distinct histograms of the feature values. The function of these histograms is twofold: to build an adapted cost function for the feature (through the construction of an adapted CAF) and to adapt automatically the weight of the individual cost function in the global potential.
On-the-Fly adaptation of individual cost functions. Individual cost functions are built with a CAF applied to the feature. Training is used to dynamically modify this cost assignment function. With our method, the algebraic difference between the positive and the negative histograms is removed from the iteratively modified CAF and the resulting cost is normalized. The positive and negative histograms are scaled in the following way: firstly, to have the same scale for both training histograms, the negative histogram values are multiplied by the ratio between the number of points used to build the positive histogram and the number of points used to build the negative histogram:

H^-[i] = \frac{\mathrm{card}\{H^+\}}{\mathrm{card}\{H^-\}} \times H^-[i]
where H^+ and H^- are respectively the positive and negative histograms. Then, both histograms are normalized using their common min/max. The initialization used in [20] favors the gray values with the highest occurrence in the feature. This is not a good initialization for images where there is a large homogeneous background, as in Fig. 8-(a). We prefer an approach that does not carry any a priori assumption about the expected feature values: the CAF is initialized with a flat line, giving the same cost to every pixel. So, the CAF obtained during the process depends only on the past and the present training data. If the training contour is not uniform enough, producing too wide a positive histogram, the classical approach will favor feature values that are perhaps not specifically relevant to the expected contour. The negative information used in our method leads to a description of what the values should not be. Combining both positive and negative information thus produces a more suitable cost function. This effect is shown in Fig. 8 on the septum wall of an echocardiographic left-ventricle image. The trained feature is the inside intensity one (Fig. 8-(b)). The classical method uses a CAF initialized with the inverse of the distribution of the feature,
Fig. 8. Training on the septum wall of an echographic left-ventricle image: (a) echographic left-ventricle image; (b) inside Feature Ci ; (c to f) classical training: (c) initial CAF ; (d) training histogram; (e) resulting CAF with (f) corresponding cost function; (g to k) improved training: (g) initial CAF ; (h) positive and (i) negative training histograms; (j) resulting CAF with (k) corresponding cost function.
where black levels are predominant and thus favored (Fig. 8-(c)). The training histogram is scaled between 0 and 1 (Fig. 8-(d)) and removed from the initial CAF to build a new CAF (Fig. 8-(e)) favoring very specific values. With this example the training path is homogeneous and the resulting cost function (Fig. 8-(f)) is good. But with a training contour that is not uniform enough, the resulting potential will not be consistent. Our method initializes the CAF (Fig. 8-(g)) with an arbitrary value between 0 and 1. The positive and negative histograms (Fig. 8-(h) and 8-(i)) are jointly scaled and the difference between them is removed from the initial CAF, producing a histogram that is less specific but carries more accurate information (Fig. 8-(j)). Thus the new cost function (Fig. 8-(k)) is more relevant. On-the-Fly adaptation of the individual cost functions is shown in Fig. 9, where iterations of the modification of the CAF of the feature Ce are shown. Adaptation of the weights. One of the main advantages of using positive and negative training areas is the evaluation of the differences between the histograms, which helps adapt the weight of the corresponding individual cost function in the global potential. If the histograms are distinct enough, we can assume that the considered feature is discriminating enough and that its weight
Fig. 9. Results of the training on a left-ventricle image: first row - iterations of the real-time contour extraction with training, on an X-ray image of the heart left ventricle; second row - modification of the CAF of the feature Ce at the same iterations; third row - corresponding feature potential Ce at the same iterations.
should be more important. In a first approach we take the mean difference between both histograms as dissimilarity criterion, expressed as:

c = \frac{1}{256} \sum_{i=0}^{255} \left| H^+[i] - H^-[i] \right|.
Indeed, if the histograms are very similar this criterion will be very low, and if they are different, it will be high. In practice, this criterion often produces very low values and hardly ever values above 0.5. Hence a stretching function is used to transform it into a weight ωc associated with the corresponding cost function in the total potential. See Fig. 10 for examples of stretching functions.
Fig. 10. Examples of stretching functions: (a) Privileges small values of the cost; (b) penalizes small values of the cost; (c) penalizes values of the cost below a and favors values of the cost above a.
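The improved training step of this section can be summarized by the following sketch, under stated assumptions: a 256-bin CAF, joint min/max normalization of the two histograms, and a simple power-law stretching function standing in for one of the curves of Fig. 10.

```python
import numpy as np

def improved_training_step(caf, pos_values, neg_values, n_bins=256):
    """Improved training: update the CAF with the difference between the
    positive and negative histograms and derive a weight for the feature."""
    h_pos, _ = np.histogram(pos_values, bins=n_bins, range=(0, n_bins))
    h_neg, _ = np.histogram(neg_values, bins=n_bins, range=(0, n_bins))
    h_pos = h_pos.astype(float)
    # bring the negative histogram to the same number of samples as the positive one
    h_neg = h_neg.astype(float) * (h_pos.sum() / max(h_neg.sum(), 1.0))
    # normalize both histograms with their common min/max
    lo, hi = min(h_pos.min(), h_neg.min()), max(h_pos.max(), h_neg.max())
    h_pos = (h_pos - lo) / max(hi - lo, 1e-12)
    h_neg = (h_neg - lo) / max(hi - lo, 1e-12)
    new_caf = caf - (h_pos - h_neg)              # reinforce positive, penalize negative
    new_caf -= new_caf.min()
    new_caf /= max(new_caf.max(), 1e-12)
    # mean histogram difference c, stretched into a weight (illustrative power law)
    c = np.abs(h_pos - h_neg).mean()
    weight = min(1.0, c ** 0.5)
    return new_caf, weight
```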
5 Discussion
We have described a very general method of path extraction. This path extraction is interactive, works in real time, is able to achieve sub-pixel accuracy and makes use of on-line training to adapt its internal parameters. Figure 11 shows several results of the developed tool on different medical images. But of course, it can be applied to any type of digital image. The described work is not focused
Fig. 11. Results on different datasets: First row - Extraction of a guide-wire in a fluoroscopy image; second row - tumor delineation in an ultrasound breast image; third row - extraction of the left ventricle in an echocardiographic image.
on the best features to compute for the path extraction problem at hand (a low-level task subject to a lot of ad-hoc choices), but rather tries to make the best use of the available features in order to give satisfactory results to the user. The "live" satisfaction of the user is a crucial input for the method and several methods have been proposed to "measure" it (see the discussions in Sections 3.3 and 4.1). The proposed training method is clearly able to adapt the use of each individual feature, because the training process is rather simple and general, and because we make simultaneous use of both positive (reinforcing) and negative (penalizing) information, see Section 4.2. This training can also work in real time because it acts on the cost assignment function level and not in the lower
feature level (which is more time-consuming to compute). Thus, to make the best use of our work, a lot of features should be computed (once, at the start of the extraction process) and proposed for selection to the method. Of course, our work is not limited to the sample list of features presented in Section 2.1, and both "general" and specific features should compete. The competition we propose is also based on the automatic setting of the individual weight applied to each feature (see Section 4.2). Using positive and negative information allows us to estimate the pertinence of each feature and thus to adapt the corresponding weight accordingly. As presented here, this automatic weight adaptation requires further improvements to take full advantage of this promising capability. Then, the proposed method could be applied to a database of images using a wide range of features, leading to an "automatic" selection (provided that an experienced user has used the tool) of the most relevant features for the current task.
6 Conclusion
We have developed interactive, real-time, user-guided image segmentation software. The user has full control of the drawing process (which is very important for physicians). Also, the non-specialist user can outline the contour of an object in an image without very precise drawing, for example with the track-ball on an echograph. We are currently testing this feature for a dedicated echographic vessel detection application. Several improvements have been brought to the published techniques for interactive extraction of optimal paths. A very general optimal path extraction method has been used efficiently, producing very precise paths in real time. Finally, a new method of On-the-Fly training has been developed to adapt, during extraction of the contour, the individual cost functions and their relative weights in the total potential.
References 1. W. Barrett and E. Mortensen, “Interactive Live-Wire boundary extraction”, Medical Image Analysis, vol. 1, no. 4, pp. 331–341, 1997. 2. V. Caselles, R. Kimmel, G. Sapiro, “Geodesic active contours”, in Proc. of ICCV’95, Cambridge, USA, pp. 694–699, 1995. 3. L. D. Cohen, ”On active contour models and balloons”, CVGIP: Image Understanding, vol. 53, no. 2, pp. 211-218, March 1991. 4. L. D. Cohen and R. Kimmel, ”Global Minimum for Active Contour Models: A Minimal Path approach”, in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’96), pp. 666–673, 1996. 5. L. D. Cohen and R. Kimmel, ”Fast marching the global minimum of active contours”, in Proc. IEEE International Conference on Image Processing (ICIP’96), pp. I:473–476, 1996. 6. L.D. Cohen and R. Kimmel, “Global Minimum for Active Contour Models: A Minimal Path approach”, IJCV, vol. 24, no. 1, pp. 57–78, Aug. 1997.
7. T. Deschamps, Curve and Shape Extraction with Minimal Path and Level-Sets techniques - Applications to 3D Medical Imaging, Ph.D. thesis, Universit´e Paris IX, December 2001. 8. T. Deschamps and L.D. Cohen ”Minimal paths in 3D images and application to virtual endoscopy”, in Proc. sixth European Conference on Computer Vision, pp. II:543-557, 2000. 9. T. Deschamps and L.D. Cohen, “Fast extraction of minimal paths in 3D images and applications to virtual endoscopy”, Medical Image Analysis, vol. 5, no. 4, pp. 281–299, December 2001. 10. E.W. Dijkstra, “A note on two problems in connection with graphs”, Numerische Mathematic, vol. 1, pp. 269–271, 1959. 11. A.X. Falcao, J.K. Udupa, S. Samarasekera, B.E. Hirsch, “User-steered image boundary segmentation”, in SPIE Proceedings, vol. 2710, pp. 278–288, 1996. 12. A.X. Falcao, J.K. Udupa, S. Samarasekera, S. Sharma, B.E. Hirsch, R. de A. Lotufo, “User-Steered Image Segmentation Paradigms: Live-Wire and Live-Lane”, GMIP, vol. 60, no. 4, pp. 233–260, 1998. 13. D. Geiger, A. Gupta, L. Costa, J. Vlontzos, “Dynamic programming for detecting, tracking, and matching deformable contours”, IEEE Trans. PAMI, vol. 17, no. 3, pp. 294–302, March 1995. 14. M. Greff, O. G´erard, T. Deschamps, “Adaptation of potential terms in real-time optimal path extraction”, Patent Pending, April 2001, FR 01 401 696.8. 15. M. Greff, “Real-time user-guided image segmentation”, Internship report, Institut National des T´el´ecommunications, July 2001. 16. M. Kass, A. Witkin, D. Terzopoulos, “Snakes: Active contour models”, IJCV, vol. 1, no. 4, pp. 321–331, 1988. 17. N. Merlet, J. Zerubia, “A curvature-dependent energy function for detecting lines in satellite images”, in Proc. of Scandinavian Conference on Image Analysis, Tromso, Norway, pp. 699–706, May 1993. 18. E. Mortensen, B. Morse, W. Barrett, J.K. Udupa, “Adaptive boundary detection using live-wire two-dimensional dynamic programming”, in IEEE Proc. of Computers in Cardiology, pp. 635–638, October 1992. 19. E.N. Mortensen and W.A. Barrett, “Intelligent Scissors for Image Composition”, in Proc. of Computer Graphics, SIGGRAPH’95, pp. 191–198, 1995. 20. E.N. Mortensen and W.A. Barrett, “Interactive Segmentation with Intelligent Scissors”, GMIP, vol. 60, no. 5, pp. 349–384, September 1998. 21. E.N. Mortensen and W.A. Barrett, “Toboggan-Based Intelligent Scissors with a Four Parameter Edge Model”, in Proc. of the IEEE CVPR’99, pp. II:452–458, 1999. 22. J.A. Sethian, Level set methods: Evolving Interfaces in Geometry, Fluid Mechanics, Computer Vision and Materials Sciences, Cambridge University Press, 1999.
Matching and Embedding through Edit-Union of Trees Andrea Torsello and Edwin R. Hancock Dept. of Computer Science, University of York Heslington, York, YO10 5DD, UK
[email protected]
Abstract. This paper investigates a technique to extend the tree edit distance framework to allow the simultaneous matching of multiple tree structures. This approach extends a previous result that showed the edit distance between two trees is completely determined by the maximum tree obtained from both trees with node removal operations only. In our approach we seek the minimum structure from which we can obtain the original trees with removal operations. This structure has the added advantage that it can be extended to more than two trees and it imposes consistency on node matches throughout the matched trees. Furthermore, through this structure we can get a "natural" embedding space of tree structures that can be used to analyze how tree representations vary in our problem domain.
1 Introduction
Recently, there has been considerable interest in the structural abstraction of 2D shapes using shock-graphs [11,17,19]. The shock-graph is a characterization of the differential structure of the boundaries of 2D shapes. It is constructed by labeling points on the Blum skeleton of an object. This is done by measuring the rate of change with skeleton length of the radius of the maximal circle inscribed within the object boundary. The shock labels indicate whether the radius is increasing, decreasing, locally maximum or minimum. These cases correspond to situations where the object boundary is tapering, constricting, or forming a local bulge. Recently, several authors have shown how to recognize shapes by matching shock-graphs. For instance, Pelillo, Siddiqi and Zucker [14] have used a novel relaxation labeling method inspired by evolutionary game-theory. Sebastian, Kimia, and Klein [12] have performed matching using graph edit distance. Although graph-matching allows the pairwise comparison of shock-graphs, it does not allow the shape-space of shock-graphs to be explored in detail. Graph-matching may provide a fine measure of distance between structures, and this in turn may be used to cluster similar graphs. However, it does not result in an ordering of the graphs that has metrical significance under structural variations due to graded shape-changes. This latter approach has been at the heart of recent developments in the construction of deformable models and continuous shape-spaces [10]. Here shape variations have been successfully captured
by embedding them in a vector-space. The dimensions of this space span the modes of shape-variation. For instance, Cootes and Taylor [7] have shown how such shape-spaces can be constructed using the eigenvectors of a landmark covariance matrix. Sclaroff and Pentland [15], on the other hand, use the elastic modes of boundaries to define the shape-space. In this paper we take the view that although the comparison of shock-graphs, and other structural descriptions of shape, via graph matching or graph edit distance has proved effective, it is in some ways a brute-force approach which is at odds with the non-structural approaches to recognition, which have concentrated on constructing shape-spaces that capture the main modes of variation in object shape. Hence, we aim to address the problem of how to organize shock-graphs into a shape-space in which similar shapes are close to one another, and dissimilar shapes are far apart. In particular, we aim to do this in a way such that the space is traversed in a relatively uniform manner as the structures under study are gradually modified. In other words, the aim is to embed the graphs in a vector-space where the dimensions correspond to principal modes in structural variation. There are a number of ways in which this can be achieved. The first is to compute the distance between shock-graphs and to use multidimensional scaling to embed the individual graphs in a low-dimensional space [22]. However, as pointed out above, this approach does not necessarily result in a shape-space where the dimensions reflect the modes of structural variation of the shock-graphs. The second approach is to extract feature vectors from the graphs and to use these as a shape-space representation. A shape-space can be constructed from such vectors by performing modal analysis on their covariance matrix. However, when graphs are of different sizes, the problem arises of how to map the structure of a shock-graph to a vector of fixed length. This problem can be circumvented using graph spectral features. This approach is explored in a companion paper. In this paper we take a different approach to the problem. We aim to embed shock trees in a pattern space by mapping them to vectors of fixed length. We do this as follows. We commence from a set of shock-trees representing different shapes. From this set, we construct a super-tree from which each tree may be obtained by the edit operations of node and edge removal. Hence each shock-tree is a subtree of the super-tree. The super-tree is constructed so that it minimizes the total edit distance to the set of shock-trees. To embed the individual shock-trees in a vector-space we allow each node of the super-tree to represent a dimension of the space. Each shock-tree is represented in this space by a vector which has non-zero components only in the directions corresponding to its constituent nodes. The non-zero components of the vectors are the weights of the nodes. In this space, the edit distance between trees is the L1 norm between their embedded vectors. Hence the shape-space is arrived at by construction from the individual shock-trees. An important ingredient of this construction is the way in which we impose consistency on the separate trees. By constructing the tree-union, we impose edge-consistency constraints on the individual node correspondences
among the individual constituent trees. The consequence of this is that the L1 norm between embedded vectors provides a more accurate representation of the distribution of trees in shape-space. We exploit this property to locate clusters of trees. We experiment with our new method on sets of shock-trees. Here we perform multidimensional scaling on the distances between trees. We demonstrate that the clusters delivered by the L1 norms between trees in the vector-space are better than those which result if we use the raw edit distances between pairs of trees. The proposed method provides a route to graph clustering. This problem has been addressed previously through pairwise clustering [13], but that approach is based on a coarse measure of similarity. The proposed method extracts a more detailed model of the structure present in the graphs. It is worth mentioning similar work by Shokoufandeh et al., in which graphs are embedded using spectral encoding [16].
2 Shock Tree
The practical problem tackled in this paper is the clustering of 2D binary shapes based on the similarity of their shock-trees. The idea of characterizing boundary shape using the differential singularities of the reaction equation was first introduced into the computer vision literature by Kimia, Tannenbaum and Zucker [11]. The idea is to evolve the boundary of an object to a canonical skeletal form using the reaction-diffusion equation. The skeleton represents the singularities in the curve evolution, where inward moving boundaries collide. The reaction component of the boundary motion corresponds to morphological erosion of the boundary, while the diffusion component introduces curvature dependent boundary smoothing. Once the skeleton is to hand, the next step is to devise ways of using it to characterize the shape of the original boundary. Here we follow Zucker, Siddiqi, and others, by labeling points on the skeleton using so-called shock-classes [19]. According to this taxonomy of local differential structure, there are different classes associated with behavior of the radius of the maximal circle from the skeleton to the nearest pair of boundary points. The so-called shocks distinguish between the cases where the local maximal circle has maximum radius, minimum radius, constant radius or a radius which is strictly increasing or decreasing. We abstract the skeletons as trees in which the level in the tree is determined by their time of formation [16,19]. The later the time of formation, and hence their proximity to the center of the shape, the higher the shock in the hierarchy.
3 Error-Tolerant Matching of Trees and Edit-Distance
The problem of how to measure the similarity of pictorial information which has been abstracted using graph-structures has been the focus of sustained research activity for over twenty years in the computer vision literature. Moreover, the problem has recently acquired significant topicality with the need to develop
ways of retrieving images from large data-bases. Stated succinctly, the problem is one of inexact or error-tolerant graph-matching. Early work on the topic included Barrow and Burstall's idea [1] of locating matches by searching for maximum common subgraphs using the association graph, and the extension of the concept of string edit distance to graph-matching by Fu and his co-workers [8]. The idea behind edit distance [23] is that it is possible to identify a set of basic edit operations on the nodes and edges of a structure, and to associate with these operations a cost. The edit distance is found by searching for the sequence of edit operations that will make the two graphs isomorphic with one another and which has minimum cost. The set of edit operations can be problem specific, but a common choice is:
– node removal: remove a node and link its children to its parent;
– node insertion: the dual of node removal;
– node relabel: change the label associated with a node.
By making the evaluation of structural modification explicit, edit distance provides a very effective way of measuring the similarity of relational structures. Moreover, the method has considerable potential for error-tolerant object recognition and indexing problems. Unfortunately, the task of calculating edit distance is a computationally hard problem and most early efforts can be regarded as being goal-directed. Zhang and Shasha [27] have investigated a specialization of the tree edit distance problem, which involves adding the constraint that the solution must maintain the order of the children of a node. With this order among siblings, they showed that the tree-matching problem is in P and gave an algorithm to solve it. In subsequent work they showed that the more general unordered case is indeed an NP-hard problem [28]. Among other approaches we mention the ones developed by Pelillo et al. [14] and by Bartoli et al. [2], in which the graph-theoretic notion of a path string is used to transform the tree isomorphism problem into a single max clique problem. There is, however, a strong connection between the computation of the maximum common subtree and the tree edit distance. In [5] Bunke showed that, under certain constraints applied to the edit-cost function, the maximum common subgraph problem and the graph edit distance problem are computationally equivalent. This is not directly true for trees, because of the added constraint that a tree must be connected. But, extending the concept to the common edited subtree, we can use common substructures to find the minimum cost edited tree isomorphism. Hierarchical graphs have an order relation induced by paths: given two nodes a and b, (a, b) is in this relation if and only if there is a path from a to b. When the directed graph is acyclical, this relation can be shown to be an order relation. The requirement that matches respect this relation and that the edited trees be connected prevents us from applying Bunke's result directly to tree matching and searching for a common subgraph. The constraint to the edit-cost function proposed by Bunke in [5] is that the cost of deleting and reinserting the same element with a different label is
not greater than the cost of relabeling it. In this way we can find an optimal edit sequence without the need for a relabel operation. With this constraint on relabel cost, we are left with only node removal and node insertion operations to be performed on the data tree. Since a node insertion on the data tree is dual to a node removal on the model tree, we can further reduce the number of operations to be performed to only node removal, as long as we perform the operations on both trees.
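For illustration, the node removal operation on a simple child-list tree representation could look as follows (a hypothetical data structure, not the one used in the paper): the removed node's children are re-attached, in order, to its parent.

```python
class TreeNode:
    def __init__(self, label, weight=1.0):
        self.label = label
        self.weight = weight
        self.parent = None
        self.children = []

    def add_child(self, child):
        child.parent = self
        self.children.append(child)

def remove_node(node):
    """Node removal edit operation: delete `node` and link its children to its parent."""
    parent = node.parent
    if parent is None:
        raise ValueError("cannot remove the root with this operation")
    idx = parent.children.index(node)
    for offset, child in enumerate(node.children):
        child.parent = parent
        parent.children.insert(idx + offset, child)   # splice children in place
    parent.children.remove(node)
```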
3.1 Inexact Tree Matching as a Common Substructure Problem
Transforming node insertions in one tree into node removals in the other allows us to use only structure-reducing operations. This, in turn, means that the optimal matching is completely determined by the subset of nodes left after the minimum removal sequence. Hence, we can pose the edit distance problem as a particular substructure isomorphism problem. Since node removal operations respect the order relation implicit in the hierarchy, we can reduce the substructure isomorphism problem into subproblems in a divide and conquer approach [22]. This approach derives from a number of observations. First, given two trees t1 and t2, there are two subtrees t1' and t2' rooted at nodes v and v' such that the matching induced by edit distance between those two nodes is equivalent to that obtained with the original trees. Hence the best match between t1 and t2 is equivalent to the best match given the association of root nodes (v, v'). Furthermore, this match can be found examining only descendants of v and v'. If we can express the best match given the association of root nodes (v, v') as a function of lower-level matches of descendants of v and v', we can build the solution to our matching problem bottom up. In the following sections we will show how we can find the best match using the lower-level matches.
3.2 Editing the Transitive Closure of a Tree
In this section we show the relations between the graph-theoretic concept of the transitive closure of a directed acyclic graph and the edit distance. An immediate remark is that for each node removal operation Ev removing node v from the tree t, we can define the corresponding edit operation Ev on the closure Ct of the tree t. In both cases the edit operation removes the node v, all the incoming edges, and all the outgoing edges. It is important to note that the transitive closure operation and the node removal operation commute, that is we have:

Lemma 1. Ev(C(t)) = C(Ev(t)).

We call a subtree s of Ct obtainable if for each node v of s there cannot be two children a and b so that (a, b) is in Ct. In other words, given two nodes a and b, siblings in s, s is obtainable if and only if there is no path from a to b in t. We can now prove the following:

Theorem 1. A tree t̂ can be generated from a tree t with a sequence of node removal operations if and only if t̂ is an obtainable subtree of the directed acyclic graph Ct.
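A minimal sketch of these two notions, assuming trees are given as child-list dictionaries: the transitive closure is computed as a set of (ancestor, descendant) pairs, and a candidate subtree of the closure is tested for obtainability by checking that no two siblings are themselves related by a path.

```python
def transitive_closure(children):
    """children: dict mapping each node to the list of its children.
    Returns the set of (ancestor, descendant) pairs of the tree."""
    closure = set()
    def visit(node):
        desc = set()
        for c in children.get(node, []):
            desc |= {c} | visit(c)
        closure.update((node, d) for d in desc)
        return desc
    for node in children:
        visit(node)
    return closure

def is_obtainable(s_children, closure):
    """s_children: adjacency of a candidate subtree of the closure Ct.
    Obtainable iff no node has two children a, b with (a, b) in the closure."""
    for v, kids in s_children.items():
        for a in kids:
            for b in kids:
                if a != b and (a, b) in closure:
                    return False
    return True
```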
Using this result, we can show that the minimum cost edited tree isomorphism between two trees t and t' is, structurally, a common consistent subtree of the two directed acyclic graphs Ct and Ct'. The minimum cost edited tree isomorphism is a tree that can be obtained from both the model tree t and the data tree t' with node removal operations. By virtue of the theorem above, this tree is an obtainable subtree of both Ct and Ct'; furthermore, the tree must be generated with minimum combined edit cost.
3.3 Cliques and Common Obtainable Subtrees
In this section we show how to use these results to induce a divide and conquer approach to edited tree matching. Given two trees t and t' to be matched, we calculate the transitive closures Ct and Ct' and look for a common obtainable tree that induces the optimal matches. The maximum such tree corresponds to the maximum common obtainable subtree of Ct and Ct'. For each pair of nodes v and w of the two trees to be matched, we divide the problem into a maximum common obtainable subtree rooted at v and w. That is, we fix the matching of v to w and we search for the maximum edited subtree common to the two subtrees rooted at v and w. We show that, given the cardinality of the subtree rooted at each child of v and w, we can transform the search for the maximum common substructure into the search for a max weighted clique. Solving this problem for each pair of nodes, and looking for the maximum among each node pair, we can find the isomorphism linked to the minimum edit distance. We commence by transforming the problem from the search for the minimum edit cost linked to the removal of some nodes, to the maximum of a utility function linked to the nodes that are retained. To do this we assume that we have a weight wi assigned to each node i, that the cost of matching a node i to a node j is |wi − wj|, and that the cost of removing a node is equivalent to matching it to a node with weight 0. We define the set M ⊂ N^t × N^{t'} of pairs of nodes in t and t' that match, the set L^t = {i ∈ N^t | ∀x, ⟨i, x⟩ ∉ M} composed of nodes in the first tree that are not matched to any node in the second, and the set R^{t'} = {j ∈ N^{t'} | ∀x, ⟨x, j⟩ ∉ M}, which contains the unmatched nodes of the second tree. With these definitions, we can write:
d(t, t') = \sum_{i \in L^t} w_i + \sum_{j \in R^{t'}} w_j + \sum_{\langle i,j \rangle \in M} |w_i - w_j| = \sum_{i \in N^t} w_i + \sum_{j \in N^{t'}} w_j - 2 \sum_{\langle i,j \rangle \in M} \min(w_i, w_j). \qquad (1)

We call the quantity

U(M) = \sum_{\langle i,j \rangle \in M} \min(w_i, w_j)

the utility of the match M. Clearly the match that maximizes the utility minimizes the edit distance.
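The relation between utility and edit distance in Eq. (1) can be checked directly on a toy example (the weight dictionaries and the match are illustrative):

```python
def utility(match, w1, w2):
    """U(M): sum over matched pairs of min(w_i, w_j)."""
    return sum(min(w1[i], w2[j]) for i, j in match)

def edit_distance(match, w1, w2):
    """d(t, t'): total node weight of both trees minus twice the utility (Eq. 1)."""
    return sum(w1.values()) + sum(w2.values()) - 2 * utility(match, w1, w2)

# toy example: node weights of two trees and a candidate match
w1 = {"a": 1.0, "b": 2.0, "c": 1.5}
w2 = {"x": 1.0, "y": 1.0}
match = [("a", "x"), ("b", "y")]
print(edit_distance(match, w1, w2))   # 6.5 - 2 * 2.0 = 2.5
```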
Let us assume that we know the utility of the best match rooted at every descendant of v and w. We aim to find the set of siblings with the greatest total utility. To do this we make use of a derived structure similar to the association graph introduced by Barrow in [1]. The nodes of this structure are pairs drawn from the Cartesian product of the descendants of v and w, and each pair corresponds to a particular association between a node in one tree and a node in the other. We connect two such associations if and only if there is no inconsistency between the two associations, that is, the corresponding subtree is obtainable. This means that we connect nodes (p, q) and (r, s) if and only if there is no path connecting p and r in t and there is no path connecting q and s in t'. Furthermore, we assign to each association node (a, b) a weight equal to the utility of the best match rooted at a and b. The maximum weight clique of this graph is the set of consistent siblings with maximum total utility, hence the set of children of v and w that guarantees the optimal isomorphism. The utility of the maximum common consistent subtree rooted at v and w will be the weight of the isomorphisms rooted at the children of v and w plus the contribution of v and w, that is, the weight of the clique plus the minimum of the weights of v and w. Given a method to obtain a solution for the maximum weight clique problem, we can use it to obtain the solution to our isomorphism problem. We refer to [3,22] for heuristics for the weighted clique problem.
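The sketch below builds on this construction with a brute-force maximum-weight clique search over the candidate child associations; in practice one would use the heuristics cited in [3,22], and the consistency predicate and weights are assumed to be supplied by the caller.

```python
from itertools import combinations

def max_weight_clique_match(pairs, weight, consistent):
    """pairs: candidate (p, q) associations between descendants of v and w;
    weight[(p, q)]: utility of the best match rooted at p and q;
    consistent(a, b): True if the two associations can coexist.
    Returns the consistent set of associations with maximum total weight
    (brute-force enumeration, exponential in the number of pairs)."""
    best, best_w = [], 0.0
    for k in range(1, len(pairs) + 1):
        for subset in combinations(pairs, k):
            if all(consistent(a, b) for a, b in combinations(subset, 2)):
                w = sum(weight[a] for a in subset)
                if w > best_w:
                    best, best_w = list(subset), w
    # utility rooted at (v, w) = best_w + min(weight of v, weight of w)
    return best, best_w
```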
4 Edit-Intersection and Edit-Union
In the previous sections we have seen how the edit distance between two trees is completely determined by the set of nodes that do not get removed by edit operations and, therefore, get matched. That is, in a sense, we are taking the intersection of the sets of nodes of the two structures. With this approach we match the trees by extracting a structure that can be obtained from our original trees by removing some nodes. We have seen how the edit distance between two trees is related to this intersection structure (see Figure 1).

Fig. 1. Edit-Intersection of two trees.

We would like to extend the concept to more than two trees so that we can compare a shape tree to a whole set T of trees. Moreover, this allows us to determine how a new sample relates to a previous distribution of tree structures. Formally, we would like to find the match that minimizes the sum of the edit distances between the new tree t* and each tree t ∈ T, with the added constraint that if node a in the new tree t* is matched to node b in a tree t1 ∈ T and to node c in another tree t2 ∈ T, then b must be matched to c, i.e.

\langle a, b \rangle \in M \wedge \langle a, c \rangle \in M \Rightarrow \langle b, c \rangle \in M,
where M is the "matches to" relation on nodes. To find this match we could find the maximum substructure that can be obtained from any tree in the set by removing appropriate nodes; unfortunately, by not keeping unmatched nodes we drop too much information: the set of common nodes becomes marginal and we lose information about how the nodes are distributed in the various structures. To use Bunke's analogy [5], the maximum common substructure gives us information about the mean of the trees, but it completely drops any information about how sample trees distribute around this mean. To overcome this limitation we can calculate a union of the nodes: a structure from which we can obtain any tree in our set by removing appropriate nodes, as opposed to the intersection of nodes, which is a structure that can be obtained by removing nodes from the original trees. Any such structure has the added advantage of implicitly creating an embedding space for our trees: we are guaranteed that any node in any tree matches to a node in this structure. Assigning to each node a coordinate in a vector space V, we can associate to tree t the vector v ∈ V so that v_i = w_i, where w_i is the weight of the node of t associated with coordinate i.
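To make the embedding concrete, the sketch below assigns one coordinate per union node and represents each tree by the vector of its node weights; with uniform weights, the L1 distance between the two toy vectors coincides with the edit distance between the corresponding trees. All names and structures are illustrative.

```python
import numpy as np

def embed(tree_weights, node_to_coord, dim):
    """tree_weights: weights of the tree's nodes, keyed by union node;
    node_to_coord: coordinate index of each union node."""
    v = np.zeros(dim)
    for node, w in tree_weights.items():
        v[node_to_coord[node]] = w
    return v

# toy union with four nodes; two trees sharing nodes 'a' and 'b'
node_to_coord = {"a": 0, "b": 1, "c": 2, "d": 3}
t1 = {"a": 1.0, "b": 1.0, "c": 1.0}
t2 = {"a": 1.0, "b": 1.0, "d": 1.0}
v1, v2 = embed(t1, node_to_coord, 4), embed(t2, node_to_coord, 4)
print(np.abs(v1 - v2).sum())   # L1 distance = 2.0 = edit distance (remove c, remove d)
```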
4.1 Union of Two Trees
Once more, the edit-union of two trees is completely determined by the set of matched nodes. Start with the two trees and iteratively merge nodes that are matched. At the end you end up with a directed acyclical graph with multiple paths connecting various nodes (see Figure 2). This structure, thus, has more
Fig. 2. Edit-Union of two trees.
Fig. 3. Edit-Union is not always a tree.
links than necessary and you cannot obtain the first tree by node removal operations alone: you need to remove the superfluous edges. If in the figure we eliminate the edges connecting node b to nodes e,f and g, we obtain a tree. Starting from this tree we can obtain either one of the original trees by node
removal operations alone. Furthermore, the transitive closure of this tree and the closure of the unaltered structure are identical. One would hope that such a structure would always be a tree, so that we can use the matching technique already described to compare a tree to a group of trees. Unfortunately, it is not always possible to find a tree such that we can edit it to obtain the original trees: see, for example, Figure 3. In this figure α and β are subtrees. Because of the constraints posed by matching trees α and trees β, nodes b and b' cannot be matched, and neither can b be a child of b' nor b' a child of b. The only alternative is to keep the two alternative paths separate: this way we can obtain one of the original trees by removing node b' and the other by removing b. Actually, removing the nodes is not enough: shortcutting edges need to be removed but, once again, the transitive closure of the union minus one of these nodes is identical to the closure of the corresponding original tree.
4.2 Matching a Tree to a Union
As we have seen, in general the union of two trees is a directed acyclic graph, while our approach can only match trees: it would fail on structures with multiple paths from a node a to a node b, since it would count any match in the subtree rooted at b twice. Hence, given a joined representation, we cannot directly use our approach to compare a tree to a tree set or a distribution of trees. Fortunately, we do not need to perform a generic match between two generic directed acyclic graphs: in an edit-union we may have multiple paths between node a and node b, but each tree can have only one, hence multiple paths are mutually exclusive. If we constrain our search to match nodes in only one path and we match the union to a tree, we are guaranteed not to count the same subtree multiple times. Interestingly, this constraint can be merged with the obtainability constraint: we say that a match is obtainable if for each node v there cannot be two children a and b and a node c such that there is a path, possibly of length 0, from a to c and one from b to c. This constraint reduces to obtainability for trees when c = b, but it also prevents a and b from belonging to two separate paths joining at c. Hence, from a node where multiple paths fork, we can extract children matches from one path only. We want to find the match, consistent with the obtainability constraint, that minimizes the sum of the edit distances between the new tree and each tree in the set. To this end we can maximize the sum of the utilities
U(M) = Σ_{t′∈T} Σ_{⟨i,j⟩∈M} min(w_i^t, w_j^{t′}),
where M ⊂ N^t × N^{T^∪} is the set of matches between the nodes N^t of the tree t and the nodes N^{T^∪} of the union structure T^∪ of the set of trees T, and w_i^t is the weight assigned in tree t to node i. When the edit distance is uniform (all weights are 1), the total utility of a single match is equal to the number of trees that have an instance of that node (see Figure 4). Solving the modified weighted clique problem we obtain the correspondences between the nodes in the
new tree and the nodes in each tree in the set, while the edit distance obtained is the sum of the distances from the new tree to each tree in T. To be able to calculate this quantity we keep, for each node in the union structure, the weights of the matched nodes. A way to do this is to assign to each node in the union a vector of dimension equal to the number of trees in the set. The ith coordinate of this vector is the weight of the corresponding node in the ith tree, or 0 if the ith tree does not have a node matching the current node. This representation also allows us to easily obtain the coordinates of each tree in the set in the embedding space induced by the union: the ith weight of the jth node is the jth coordinate of the ith tree. Such an embedding is defined modulo reordering of trees and nodes.
Fig. 4. The weight of a node in the union accounts for every node mapped to that node.
It is worth noting that this approach can be extended to match two union structures, as long as at most one has multiple paths to a node. To do this we iterate through each pair of weights drawn from the two sets, that is, we define the utility as
U(M) = Σ_{t∈T1} Σ_{t′∈T2} Σ_{⟨i,j⟩∈M} min(w_i^t, w_j^{t′}),
where M ⊂ N(T1^∪) × N(T2^∪) is the set of matches between the nodes of the union structures T1^∪ and T2^∪. The requirement that no more than one union has multiple paths to a node is needed to avoid double counting.
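As a concrete illustration of the utilities above, the following Python sketch computes U(M) for a candidate correspondence set between a tree and a union, assuming the per-node weight-vector representation of union nodes introduced above; the function and argument names are ours, not the paper's:

import numpy as np

def union_match_utility(match, tree_weights, union_weights):
    """Utility U(M) of a correspondence set M between a tree t and a union T^u.

    match         : iterable of (i, j) pairs matching node i of the tree to union node j
    tree_weights  : 1-D array, tree_weights[i] = w_i^t
    union_weights : 2-D array, union_weights[j, k] = weight of union node j in the k-th
                    joined tree (0 if that tree has no node mapped to j)
    """
    total = 0.0
    for i, j in match:
        # sum over the trees already in the union of min(w_i^t, w_j^{t'})
        total += np.minimum(tree_weights[i], union_weights[j]).sum()
    return total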
4.3 Joining Multiple Trees
In the previous paragraphs we have seen that we can construct the edit-union of a set of trees and how to compare a tree to this superstructure. We now want to show how to construct such a structure. Finding the superstructure that minimizes the total distance between the trees in the set is computationally infeasible, so we propose a suboptimal iterative approach that at each iteration extends the union by adding a new tree to it. This is done by matching the tree to the union and then using the matched nodes to construct the union of the two structures. That is, we pick a new tree t∗ and we match it against the union T^∪(t), to obtain the union T^∪(t+1). We proceed until we have joined every tree. In order to increase the accuracy of the approximation, we want to merge trees with smaller distance first. This is because we can be reasonably confident that, if the distance is small, the extracted correspondences are correct.
We could start with the set of trees, merge two of them, replace them with their union, and reiterate until we end up with only one structure; i.e., at each iteration pick two trees t1 and t2 so that the distance d(t1, t2) is minimal, merge the two trees, and reinsert the union structure t^∪_{1,2} into our set of trees to be merged. Unfortunately, since we have no guarantee that the edit-union is a tree, we might end up trying to merge two graphs with multiple paths to a node, and our matching algorithm cannot cope with that. For this reason, if merging two trees gives a union that is not a tree, we discard the union and try the next-best match. When no trees can be merged without duplicating paths, we merge the remaining structures using the first merging approach described. This way we are guaranteed to merge at each step at most one multi-path structure.
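A schematic version of this merging strategy is sketched below. The routines distance, merge, and is_tree stand in for the paper's weighted edit-distance matching, edit-union construction, and tree test; they are placeholders supplied by the caller, not implementations of those steps:

def join_trees(trees, distance, merge, is_tree):
    """Greedily build an edit-union, always trying to merge the closest pair first."""
    structures = list(trees)
    while len(structures) > 1:
        # rank all pairs of current structures by distance, closest first
        pairs = sorted((distance(a, b), i, j)
                       for i, a in enumerate(structures)
                       for j, b in enumerate(structures) if i < j)
        accepted = None
        for _, i, j in pairs:
            candidate = merge(structures[i], structures[j])
            if is_tree(candidate):           # keep only unions that are still trees
                accepted = (i, j, candidate)
                break
        if accepted is None:
            # no pair merges into a tree: fold the remaining structures one by one,
            # so that at most one multi-path structure takes part in each merge
            result = structures[0]
            for other in structures[1:]:
                result = merge(result, other)
            return result
        i, j, candidate = accepted
        structures = [s for k, s in enumerate(structures) if k not in (i, j)]
        structures.append(candidate)
    return structures[0]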
5 Experimental Results
We evaluate the new approach on the problem of shock tree matching. In order to assess the quality of the approach we compare the embeddings obtained with those obtained using the graph clustering method described in [22]. In particular, we compare the first two principal components of the embedding generated by joining purely structural skeletal representations with a 2D multi-dimensional scaling of the pairwise distances of the shock-trees weighted with some geometrical information. The addition of matching consistency across shapes allows the embedding to better capture the structural information present in the shapes, yielding embeddings comparable to, if not better than, those provided by localized geometrical information. We run three experiments with 4, 5, and 9 shapes each. In each experiment the shapes belong to two or more distinct visual clusters. In order to avoid scaling effects due to differences in the number of nodes, we normalize the embedding vectors so that they have L1 norm equal to 1, and then we extract the first 2 principal components.
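For reference, this normalization and projection step can be written in a few lines of Python; the function below is our own sketch and assumes the rows of the input matrix are the union-embedding vectors of the trees:

import numpy as np

def embed_2d(union_vectors):
    """L1-normalize the embedding vectors and project onto the first two principal components."""
    X = np.asarray(union_vectors, dtype=float)
    X = X / np.abs(X).sum(axis=1, keepdims=True)     # unit L1 norm removes size effects
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:2].T                             # coordinates in the first 2 components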
Fig. 5. Top: embedding through union. Bottom: multi-dimensional scaling of pairwise distance
Fig. 6. Top: embedding through union. Bottom: multi-dimensional scaling of pairwise distance
Fig. 7. Top: embedding through union. Bottom: multi-dimensional scaling of pairwise distance
Figure 5 shows a clear example where the embedding obtained through edit-union is better than that obtained through multi-dimensional scaling of the pairwise distances. In this case the pairwise distance algorithm consistently underestimates the distance between shapes belonging to different clusters. This is a general problem of pairwise matching: it works very well when the shapes are close, when the extracted correspondence is reliable, but as the shapes move further apart the advantage the correct correspondence has over alternative ones diminishes, until, eventually, another match is selected, which reports a higher distance. The result of this is a consistent underestimation of the distance as the shapes move further apart in shape space. Figures 6 and 7 show examples where the distance in shape space is not big enough to observe the described behavior, yet the embedding obtained through union fares well against the pairwise edit-distance, especially taking into account the fact that it uses only structural information while the edit-distance matches shown weight the structure with geometrical information. In particular, Figure 7 shows a better ordering of shapes, with brushes being so tightly packed that they overlap in the image. It is interesting to note how the union embedding puts the monkey wrench (at the top) somewhere in-between pliers and wrenches: the algorithm is able to consistently match the head to the heads of the wrenches, and the handles to the handles of the pliers.
5.1 Synthetic Data
To augment these real-world experiments, we have performed the embedding on synthetic data. The aim of the experiments is to characterize the ability of the approach to generate a shape space. To meet this goal we have randomly generated some prototype trees and, from each tree, we generated five or ten structurally perturbed copies. The procedure for generating the random trees was as follows: we commence with an empty tree (i.e. one with no nodes) and we iteratively add the required number of nodes. At each iteration nodes are added as children of one of the existing nodes. The parents are randomly selected with uniform probability from among the existing nodes. The weights of the newly added nodes are selected at random from an exponential distribution with mean 1. This procedure will tend to generate trees in which the branch ratio is highest closest to the root. This is quite realistic of real-world situations, since shock trees tend to have the same characteristic. To perturb the trees we simply add nodes using the same approach. In our experiments the size of the prototype trees varied from 5 to 20 nodes. As we can see from Figure 8, the algorithm was able to clearly separate the clusters of trees generated by the same prototype. Figure 8 shows three experiments with synthetic data. The images at the top are produced by embedding 5 structurally perturbed trees per prototype: trees 1 to 5 are perturbed copies of
the first prototype, and trees 6 to 10 of the second. The bottom image shows the result of the experiment with 10 structurally perturbed trees per prototype: trees 1 to 10 belong to one cluster, trees 11 to 20 to the other. In each image the clusters are well separated.
Fig. 8. Synthetic clusters.
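The random-tree generator described in Section 5.1 is simple enough to state directly; the sketch below uses a parent-array representation of our own choosing:

import random

def random_tree(n_nodes, rng=None):
    """Random weighted tree: each new node picks a uniform random parent among existing nodes."""
    rng = rng or random.Random()
    parent, weight = [None], [rng.expovariate(1.0)]          # root node
    for i in range(1, n_nodes):
        parent.append(rng.randrange(i))                      # uniform over existing nodes
        weight.append(rng.expovariate(1.0))                  # exponential weight, mean 1
    return parent, weight

def perturb(tree, n_extra, rng=None):
    """Structurally perturb a tree by adding n_extra nodes with the same procedure."""
    rng = rng or random.Random()
    parent, weight = list(tree[0]), list(tree[1])
    for _ in range(n_extra):
        parent.append(rng.randrange(len(parent)))
        weight.append(rng.expovariate(1.0))
    return parent, weight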
6 Conclusions
In this paper we investigated a technique to extend the tree edit distance framework to allow the simultaneous matching of multiple tree structures. With this approach we can impose a consistency of node correspondences between matches, avoiding the underestimation of the distance typical of pairwise edit-distance approaches. Furthermore, through this method we can get a "natural" embedding space of tree structures that can be used to analyze how tree representations vary in our problem domain. In a set of experiments we apply this algorithm to match shock graphs, a graph representation of the morphological skeleton. The results of these experiments are very encouraging, showing that the algorithm is able to group similar shapes together in the generated embedding space.
References 1. H. G. Barrow and R. M. Burstall, Subgraph isomorphism, matching relational structures and maximal cliques, Inf. Proc. Letter, Vol. 4, pp.83, 84, 1976. 2. M. Bartoli et al., Attributed tree homomorphism using association graphs, In ICPR, 2000. 3. I. M. Bomze, M. Pelillo, and V. Stix, Approximating the maximum weight clique using replicator dynamics, IEEE Trans. on Neural Networks, Vol. 11, 2000. 4. H. Bunke and G. Allermann, Inexact graph matching for structural pattern recognition, Pattern Recognition Letters, Vol 1, pp. 245-253, 1983. 5. H. Bunke and A. Kandel, Mean and maximum common subgraph of two graphs, Pattern Recognition Letters, Vol. 21, pp. 163-168, 2000. 6. W. J. Christmas and J. Kittler, Structural matching in computer vision using probabilistic relaxation, PAMI, Vol. 17, pp. 749-764, 1995. 7. T. F. Cootes, C. J. Taylor, and D. H. Cooper, Active shape models - their training and application, CVIU, Vol. 61, pp. 38-59, 1995. 8. M. A. Eshera and K-S Fu, An image understanding system using attributed symbolic representation and inexact graph-matching, PAMI, Vol 8, pp. 604-618, 1986. 9. L. E. Gibbons et al., Continuous characterizations of the maximum clique problem, Math. Oper. Res., Vol. 22, pp. 754-768, 1997 10. T. Heap and D. Hogg, Wormholes in shape space: tracking through discontinuous changes in shape, ICCV, pp. 344-349, 1998. 11. B. B. Kimia, A. R. Tannenbaum, and S. W. Zucker, Shapes, shocks, and deformations I, International Journal of Computer Vision, Vol. 15, pp. 189-224, 1995. 12. T. Sebastian, P. Klein, and B. Kimia, Recognition of shapes by editing shock graphs, in ICCV, Vol. I, pp. 755-762, 2001. 13. B. Luo, et al., A probabilistic framework for graph clustering, in CVPR, Vol. I, pp. 912-919, 2001.
14. M. Pelillo, K. Siddiqi, and S. W. Zucker, Matching hierarchical structures using association graphs, PAMI, Vol. 21, pp. 1105-1120, 1999. 15. S. Sclaroff and A. P. Pentland, Modal matching for correspondence and recognition, PAMI, Vol. 17, pp. 545-661, 1995. 16. A. Shokoufandeh, S. J. Dickinson, K. Siddiqi, and S. W. Zucker, Indexing using a spectral encoding of topological structure, in CVPR, 1999. 17. K. Siddiqi and B. B. Kimia, A shock grammar for recognition, in CVPR, 507-513, 1996. 18. K. Siddiqi, S. Bouix, A. Tannenbaum, and S. W. Zucker, The hamilton-jacobi skeleton, in ICCV, pp. 828-834, 1999. 19. K. Siddiqi et al., Shock graphs and shape matching, Int. J. of Comp. Vision, Vol. 35, pp. 13-32, 1999. 20. K-C Tai, The tree-to-tree correction problem, J. of the ACM, Vol. 26, pp. 422-433, 1979. 21. A. Torsello and E. R. Hancock, A skeletal measure of 2D shape similarity, Int. Workshop on Visual Form, LNCS 2059, 2001. 22. A. Torsello and E. R. Hancock, Efficiently computing weighted tree edit distance using relaxation labeling, in EMMCVPR, LNCS 2134, pp. 438-453, 2001 23. W. H. Tsai and K. S. Fu, Error-correcting isomorphism of attributed relational graphs for pattern analysis, Sys., Man, and Cyber., Vol. 9, pp. 757-768, 1979. 24. J. T. L. Wang, K. Zhang, and G. Chirn, The approximate graph matching problem, in ICPR, pp. 284-288, 1994. 25. R. C. Wilson and E. R. Hancock, Structural matching by discrete relaxation, PAMI, 1997. 26. K. Zhang, A constrained edit distance between unordered labeled trees, Algorithmica, Vol. 15, pp. 205-222, 1996. 27. K. Zhang and D. Shasha, Simple fast algorithms for the editing distance between trees and related problems, SIAM J. of Comp., Vol. 18, pp. 1245-1262, 1989. 28. K. Zhang, R. Statman, and D. Shasha, On the editing distance between unorderes labeled trees, Inf. Proc. Letters, Vol. 42, pp. 133-139, 1992.
A Comparison of Search Strategies for Geometric Branch and Bound Algorithms Thomas M. Breuel Xerox Palo Alto Research Center 3333 Coyote Hill Road Palo Alto, CA 94304 [email protected]
Abstract. Over the last decade, a number of methods for geometric matching based on a branch-and-bound approach have been proposed. Such algorithms work by recursively subdividing transformation space and bounding the quality of match over each subdivision. No direct comparison of the major implementation strategies has been made so far, so it has been unclear what the relative performance of the different approaches is. This paper examines experimentally the relative performance of different choices in the implementation of branch-and-bound algorithms for geometric matching: alternatives for the computation of upper bounds across a collection of features, and alternatives for the order in which search nodes are expanded. Two major approaches to computing the bounds have been proposed: the matchlist based approach, and approaches based on point location data structures. A second issue that is addressed in the paper is the question of search strategy; branch-and-bound algorithms traditionally use a "best-first" search strategy, but a "depth-first" strategy is a plausible alternative. These alternative implementations are compared on an easily reproducible and commonly used class of test problems: a statistical model of feature distributions, and matching within the COIL-20 image database. The experimental results show that matchlist based approaches outperform point location based approaches on common tasks. The paper also shows that a depth-first approach to matching results in a 50-200 fold reduction in memory usage with only a small increase in running time. Since matchlist-based approaches are significantly easier to implement and can easily cope with a much wider variety of feature types and error bounds than point location based approaches, they should probably be the primary implementation strategy for branch-and-bound based methods for geometric matching.
1 Introduction
Matching of points or other geometric primitives under geometric transformations is an important problem in many applications, including computer vision and robotics. Early work on matching has included the Generalized Hough Transform [2], heuristic search [8], and a variety of other techniques. Over the last decade, branch-and-bound techniques have become increasingly popular for geometric matching problems. Such techniques work by recursively subdividing the space of parameters of the geometric
transformations under consideration and computing upper bounds on the possible quality of match for any match possible within those subregions of parameter space[4]. Branch and bound techniques are attractive because they can guarantee optimality of the returned solutions to within specified numerical tolerances and because they appear to have complexities and running times similar to those of alignment methods[6]. Two major approaches to branch-and-bound techniques for geometric matching have emerged over the last decade. The first is an approach based on matchlists and (optionally) a depth-first search strategy[4]. The second is based on data structures for fast proximity searches or discrete Voronoi diagrams [14,12,9] and has its origins in earlier hierarchical matching schemes organized around multi-resolution image analysis [3,10]. Matchlist-based approaches to geometric matching are very easy to implement: all a user needs to supply is a quality-of-match function for a pair of features (and an upper bound function derived from the quality of match function). An efficient, selfcontained matchlist-based branch-and-bound implementation takes a few hundred lines of C++ or Fortran code with no reference to complex data structures. This makes it easy to apply matchlist-based branch-and-bound techniques to complex matching tasks like the matching of line segments[6], text line models[7], and MDL descriptions[5]. In contrast, in approaches that rely on geometric data structures for fast proximity searches, using features like points with associated orientations, extended line segments, or groups of geometric features becomes increasingly difficult because the corresponding data structures for proximity searches can be difficult to implement. However, it is not a priori obvious that matchlist-based approaches are particularly efficient. We might guess that the use of a point location or range query data structure could result in significant speedups, as some researchers have argued[14]. To date, no direct comparison of the two approaches has been carried out. This paper presents experimental comparisons of the two approaches, as well as a more in-depth analysis of the statistical properties of matchlists. It also presents a direct comparison of depth-first and best-first matching strategies.
2 Branch and Bound Algorithms for Geometric Matching
2.1 An Instance of the Geometric Matching Problem
To simplify the presentation for the purpose of this paper, we will be concrete and examine the geometric matching problem among point sets in two dimensions under translations and rotations; more general formulations can be found in the references [6]. Define a model to be a set of points M = {m_1, ..., m_r} ⊆ R² and an image to be another set of points B = {b_1, ..., b_s} ⊆ R². We consider a bounded set of isometric transformations T_all; that is, the set of transformations T : R² → R² consisting of translations and rotations, and parameterize these transformations by a vector (Δx, Δy, α)^T ∈ T_all, where T_all = [Δ_min, Δ_max] × [Δ_min, Δ_max] × [0, 2π) ⊆ R³ is the space of all transformations. We also assume a bounded error notion of quality of match Q(M, B, ε; T) under the Euclidean distance; where it is clear from context, we will omit the dependence of Q on M, B, and/or ε. That is, the quality of match assigned to a transformation T is the number of model points the transformation brings within an error bound of ε of some image
point. If we write [predicate] for the standard indicator function, which assumes the value 1 if the predicate is true and 0 otherwise, we can write this quality of match function as
Q(T) = Σ_{m∈M} max_{b∈B} [ ||T(m) − b|| < ε ].   (1)
The task of a geometric matching algorithm is to find a transformation T_max (usually not unique) that maximizes the quality of match for a given M, B, and ε:
T_max(M, B, ε) = argmax_{T∈T_all} Q(T; M, B, ε).   (2)
This formulation can be easily generalized to other transformation spaces and other quality of match measures, most importantly, maximum likelihood matches under a Gaussian error model; for details, see [6].
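Equation 1 translates directly into code; the following numpy sketch (our own naming conventions, not part of the original implementation) counts the model points brought within ε of some image point:

import numpy as np

def quality_of_match(T, model, image, eps):
    """Bounded-error quality of match Q(T) of Equation 1.

    T     : callable mapping an (r, 2) array of model points to their transformed positions
    model : (r, 2) array of model points
    image : (s, 2) array of image points
    eps   : error bound epsilon
    """
    tm = T(np.asarray(model, dtype=float))
    d = np.linalg.norm(tm[:, None, :] - np.asarray(image, dtype=float)[None, :, :], axis=2)
    return int((d.min(axis=1) < eps).sum())   # model points within eps of some image point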
2.2 Branch and Bound Algorithms
Let us now turn to a description of a branch and bound algorithm for geometric matching. As input, the algorithm is given a set of model points M = {m_1, ..., m_r} ⊆ R² and a set of image points B = {b_1, ..., b_s} ⊆ R². In terms of these points, we define a quality of match function Q(T) as in Equation 1, as well as an upper bound Q̂(T) ≥ max_{T∈T} Q(T) (the details of how this upper bound is computed will be discussed below). For the purposes of this paper, we consider the space of transformations of R² consisting of translations and rotations. Here, x0, x1, y0, and y1 define the minimal and maximal translations the search algorithm is going to consider. Using these preliminaries, we can organize the search for a globally optimal solution to Equation 2 as follows:
Algorithm 1
1. The algorithm maintains a priority queue of search states. When two search states have the same priority, the state with the lower depth in the search tree is preferred.
2. Initially, we add a node to the priority queue representing the space of all transformations, T_all.
3. Each search state is associated with a subregion of transformation space T_k ⊆ T_all. It is further associated with an upper bound Q̂_k = Q̂(T_k) ≥ Q(T) for all T ∈ T_k; the upper bound serves as the priority of the state.
4. The algorithm removes the state with the highest upper bound from the priority queue.
5. Pick some transformation T ∈ T_k; if Q(T) = Q̂(T_k), terminate the search and return T as a solution.
6. Otherwise, we split the region T_k into two disjoint subregions T_{2k} and T_{2k+1} along its largest dimension and insert these subregions back into the priority queue.
This type of algorithm is known as a branch and bound algorithm in the global optimization and operations research literature. The particular variant described in Algorithm 1 is a best-first search. In order to obtain a depth-first search, we make a few modifications:
Fig. 1. A random instance of geometric matching problems used for various empirical performance measurements throughout the paper. The model is described in Section 3.1.
– Instead of using Q̂(T) as the priority in the priority queue, we use the depth of the search, k, as the priority.
– Instead of terminating in Step 5, we record the new solution if it is better than the previously best solution but do not terminate.
– In Step 6 we do not enqueue subregions if their upper bound is already worse than the best solution found so far.
– The algorithm terminates when the search queue is empty.
A depth-first search can also be implemented without use of a priority queue [4], but the priority-queue based algorithm is presented here because it allows us to use essentially the same implementation for comparing the performance of depth-first and best-first strategies.
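The search itself is short to write down. The sketch below follows Algorithm 1 and the depth-first modifications above for a box in (Δx, Δy, α) space; the bound and quality functions are supplied by the caller, ties are broken by insertion order rather than by depth, and all names are our own:

import heapq

def branch_and_bound(bound, quality, region, tol=1e-3, depth_first=False):
    """Best-first (Algorithm 1) or depth-first branch and bound over a box in transformation space.

    bound(region) -> upper bound on the match quality over the region
    quality(T)    -> quality of a single transformation T (e.g. Equation 1 at the box center)
    region        : tuple of (lo, hi) intervals, e.g. ((x0, x1), (y0, y1), (0.0, 2 * 3.14159))
    """
    def center(r):
        return tuple((lo + hi) / 2.0 for lo, hi in r)

    best_q, best_T = float("-inf"), None
    counter = 0                                  # tie-breaker so heap entries stay comparable
    root_ub = bound(region)
    # heapq is a min-heap: negate the bound (best-first) or the depth (depth-first)
    heap = [((0.0 if depth_first else -root_ub), counter, 0, root_ub, region)]
    while heap:
        _, _, depth, ub, r = heapq.heappop(heap)
        if ub <= best_q:
            continue                             # cannot improve on the best solution found so far
        T = center(r)
        q = quality(T)
        if q > best_q:
            best_q, best_T = q, T
        if q >= ub:
            if not depth_first:
                return T, q                      # Step 5: realized quality meets the upper bound
            continue                             # depth-first: record the solution and keep searching
        widths = [hi - lo for lo, hi in r]
        i = max(range(len(r)), key=lambda k: widths[k])
        if widths[i] < tol:
            continue
        lo, hi = r[i]
        for half in ((lo, (lo + hi) / 2.0), ((lo + hi) / 2.0, hi)):
            child = tuple(half if k == i else r[k] for k in range(len(r)))
            child_ub = bound(child)
            if child_ub <= best_q:
                continue                         # modified Step 6: do not enqueue hopeless regions
            counter += 1
            prio = -(depth + 1) if depth_first else -child_ub
            heapq.heappush(heap, (prio, counter, depth + 1, child_ub, child))
    return best_T, best_q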
2.3 Geometric Computation of Upper Bounds
In the discussion so far, there has been no mention of how to compute the upper bounds Q̂(T). Under a bounded error model, computation of this upper bound amounts to computing, for each model feature, whether some image feature might fall within the given error bound of a transformed model feature. In other words, we bound each term in Equation 1. The algorithm described in [4] takes advantage of special properties of linear error bounds under translation, rotation, and scale [1] or affine transformations that allow this test to be carried out easily and efficiently using a small number of dot products, resulting in a tight upper bound. For other kinds of transformations and error measures, a tight upper bound is difficult to compute, but we do not actually require a tight upper bound for the algorithm to converge. Therefore, branch-and-bound algorithms can be formulated using more general upper bounds (e.g., [9]) that are easier to derive. Briefly, we start by determining a bound on the location of each model point m_j under any transformation T ∈ T. This bound can be expressed as any convenient geometric
region, for example a bounding rectangle or a bounding circle in the image plane. For the purpose of exposition, let us assume that we express this bound as a circular error bound of diameter δ around some point Tc(m) for some Tc ∈ T (preferably "central" in that region). Tc and δ can, in general, be dependent on m. That is, we choose Tc and δ such that ∀T ∈ T: ||Tc(m) − T(m)|| ≤ δ. Then it is easy to see (using the triangle inequality) that
∀T ∈ T, b ∈ R²: ||T(m) − b|| ≤ ||Tc(m) − b|| + δ.   (3)
Finally, we can use this to bound Q over a subregion T of transformation space. Let T be any transformation in T. Then, by bounding the terms of the sum individually using Equation 3:
max_{T∈T} Q(T) = max_{T∈T} Σ_{m∈M} max_{b∈B} [ ||T(m) − b|| < ε ]   (4)
≤ Σ_{m∈M} max_{b∈B} [ ||Tc(m) − b|| < ε + δ ].   (5)
Efficient Evaluation of Upper Bounds
We could, of course, evaluate the upper bound in Equation 4 directly. This requires |B| |M | evaluations at each node in the search tree, a fairly expensive proposition. A simple optimization is to speed up the search for matching image points, that is, points b that fall within a distance of + δ by using a point location data structure [12]. Then, the evaluation of Equation 4 can be carried out in |M | steps. A particularly simple point location data structure is the distance transform, which has also been applied in branch-and-bound type geometric matching algorithms [3,10]. An alternative approach is based on matchlists [4]. In a matchlist-based approach, we associate with each state in the priority queue in Algorithm 1, not just a region in transformation space and a bound, but also a list of all the correspondences between model points and image features that are consistent with matching under the error bounds and the transformations represented by that region. As we subdivide regions in transformation space, we only need to examine correspondences that are consistent with the parent region.
A matchlist approach is simpler to implement because it does not require the introduction of a separate point location data structure. For example, a complete implementation of the branch and bound algorithm used in this paper is about 900 lines of C++ code, including data structures; the ANN library[11] is about 6900 lines of code. In fact, the difficulty of implementing point-location based approaches for features or error bounds that are more complex than simple point features has driven the choice of benchmark problems used in this paper: by using simple point features in the benchmarks, we can use standard, highly-optimized implementations of point-location data structures like the ANN library. Using other feature types would have forced the use of heuristics or much more complex data structures in the point location-based implementations, putting point location based approaches at a disadvantage and complicating the interpretation of the results. Higher-dimensional transformation spaces (e.g., including scale changes or affine transformations) were not used in these experiments because matching unoriented unlabeled point features optimally in higher dimensional transformation spaces is costly using any known approach and because such transformation spaces often require more complex error models.
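A matchlist-based evaluation of the bound in Equation 5 might look like the following sketch: the parent's list of candidate correspondences is filtered with the ε + δ test, which yields both the bound and the child's matchlist. It reuses the center_transform_and_delta helper sketched in the previous section, and the names are our own:

import numpy as np

def matchlist_bound(box, model, image, matchlist, eps):
    """Upper bound of Equation 5 over a box, together with the pruned matchlist for its children.

    matchlist : list of (model_index, image_index) pairs consistent with the parent region
    """
    survivors, matched_models = [], set()
    for i, j in matchlist:
        tc_m, delta = center_transform_and_delta(box, model[i])
        if np.linalg.norm(tc_m - np.asarray(image[j], dtype=float)) < eps + delta:
            survivors.append((i, j))
            matched_models.add(i)
    # each model point contributes at most 1 (the max over image points in Equation 5)
    return len(matched_models), survivors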
3 Experimental Results
To compare the performance of matchlist and point-location based approaches, this paper reports two sets of experiments: a direct comparison of matchlist and point-location methods, and results from a more detailed instrumentation of a matchlist-based matching algorithm.
3.1 Datasets
In order to present actual results on running times and performance, we need to pick a space of geometric matching problems over which to measure and compare the performance of different matching algorithms. For this paper, we will be using two sets of data. The first consists of randomly constructed matching problems in which images and models consist of uniformly distributed points in subsets of R2 . This model has the advantage of being easy to reproduce and easy to understand; it also has been used previously in the literature [12]. The second dataset consists of images from the standard COIL-20[13] database. For the simulated datasets, each random problem instance is generated as follows. Each model consists of 20 points uniformly sampled from the rectangle [−100, 100] × [−100, 100]. Each image consists of 10 points randomly selected from the model, rotated by a random angle in the interval [0, 2π) and translated by a random translation drawn from [0, 512]×[0, 512]. Each such model point is additionally perturbed by an error vector randomly and uniformly selected from the set {v ∈ R2 : ||v|| < 5}; this represents a 5% error on the location of model features in the image. Additionally, each image contains a random number (between 10 and 160) of background points, uniformly selected from the rectangle [0, 512] × [0, 512]. The image points (both model and background points) are randomly permuted before being used as input to the matching algorithms.
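The random problem generator is easy to reproduce; the following sketch (our own function name and seed handling) follows the description above:

import numpy as np

def random_instance(n_clutter, rng=None):
    """One random matching problem: a 20-point model and a cluttered, transformed image."""
    rng = rng or np.random.default_rng()
    model = rng.uniform(-100, 100, size=(20, 2))
    visible = model[rng.choice(20, size=10, replace=False)]        # 10 model points appear in the image
    a = rng.uniform(0, 2 * np.pi)
    R = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
    pts = visible @ R.T + rng.uniform(0, 512, size=2)
    # perturb by an error vector drawn uniformly from the disk of radius 5
    direction = rng.normal(size=(10, 2))
    direction /= np.linalg.norm(direction, axis=1, keepdims=True)
    pts += direction * 5 * np.sqrt(rng.uniform(size=(10, 1)))
    clutter = rng.uniform(0, 512, size=(n_clutter, 2))             # background points
    image = np.vstack([pts, clutter])
    return model, image[rng.permutation(len(image))]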
Fig. 2. Sample models and images from the COIL-20 database used in the performance measurements. Shown are (a) four models matched against the images, (b) a sample image, (c) edges extracted from the sample image and used by the matching algorithm.
In the results reported below, measurements are averages over 200 problem instances. Simulations were carried out on a 400MHz Pentium laptop. There are a wide variety of possible choices for parameters in these kinds of simulations. The choices made above appear to represent the qualitative behavior of the algorithms reasonably well within a statistical model that is easy to explain and reproduce. No claims are made that this simple uniform model represents accurately the statistics of all real geometric matching problems, but some geometric matching problems appear to have distributions that are fairly uniform. For experiments with the COIL-20 database, edges were extracted using a Canny edge detector, the resulting edges were polygonized, and sampled uniformly at 4 pixels. This gives rise to images consisting of between 82 and 283 edge samples. No gradient information was used in the matches (see the Discussion below for the rationale). As models, four object parts were selected manually and subjected to the same feature extraction process. Examples of these images are shown in Figure 2. For the benchmarks, each of the four parts was matched against the same set of 50 randomly selected images, resulting in a total of 200 runs of the algorithms.
3.2 Direct Comparison of Point Location and Matchlists Methods
To determine the relative performance of a point location approach and a matchlist approach, a branch and bound algorithm was implemented with two alternative upper bound functions. The first upper bound computation uses the point location approach described by Mount et al. [12]. The point location data structure (ANN) used in that implementation is publicly available [11] and was used as the point location data structure for point-location-based upper bound computations in the experiments. As an additional optimization, the point location-based implementation took advantage of matchlists by eliminating non-matching model features from further consideration in child nodes of the search tree. It was verified that this resulted in some speedup compared to a naive implementation that requires time proportional to |B|.
Fig. 3. Relative performance of the matchlist method compared to the point location method. Both axes show running times in seconds. Note the different scales of the axes. The slope of the curve is 3.7, meaning that the matchlist method runs on average 3.7 times faster than the point location method.
On the randomly generated problem instances, using the matchlist based approach resulted in speedups by factors of 2.43 and 3.75 (for 120 and 10 points of clutter, respectively) compared to the point location approach using the ANN library. These results are reproduced by the experiments on the COIL-20 data set. Figure 3 shows a scatter plot of running times of the matchlist-based method compared to the point location-based method (note the different scales of the axes). Each datapoint represents a run of each algorithm on the same pair of model and image. The slope of the line is 3.7, meaning that the matchlist based approach runs on average 3.7 times faster than the point location method on this dataset.
3.3 Statistical Properties of Matchlists
These results are perhaps surprising at first glance. In fact, the initial matchlist in a matchlist based approach is of size |M|·|B|, and the initial few subdivisions do not reduce the matchlist at all. Only once T_k becomes small enough that the uncertainty ε + δ of the projected model features becomes sufficiently small so as not to include all image features will the matchlist be reduced below that size. The efficiency of a matchlist based approach to evaluating Q̂(T) then depends on the size of the average matchlist that occurs during the search. If most search states are expanded with small matchlists, a matchlist based approach is efficient. If most search states were expanded with large matchlists, the matchlist based approach would be hopelessly inefficient compared to
Fig. 4. Size of the average matchlist. The matching problem used is as given in Section 3.1. As in the other results, the x-axis gives the amount of clutter; the y-axis shows the number of correspondences on the average matchlist.
approaches based on point location. To address this question, the branch and bound matching algorithm was instrumented to keep statistics about the size of matchlists at various depths in the search tree. The question of the size of matchlists can then be easily answered by collecting statistics from actual runs of the algorithm. For the problem defined in Section 3.1, these results are shown in Figure 4. As we can see, the average size of the matchlist l̄ is a little over two, meaning that there are, on average, a little over two image features on the matchlist for each model feature. This makes it very difficult to outperform the matchlist based approach using a point location data structure, since the lookup performed by the point location data structure would have to take less time than the time for two computations of ||m − b|| (plus a small amount of bookkeeping overhead). In fact, this result was already suggested early on by execution profiling of implementations in [4], which showed that the system spent roughly equal amounts of time in the computation of Tc(m) and ||m − b||, making further reductions in the time spent on distance computations of limited benefit. It is instructive to look at these statistics in a little more detail. Figure 5 contains two types of curves. The solid curve indicates what fraction of the total number of search states that are expanded by the algorithm are expanded below a certain depth. We see that only a small fraction of the states is expanded below depth 12. The broken line indicates the average size of the matchlist for search states at a certain depth. It shows that the average size of the matchlist becomes small below the depth at which most states occur during the search.
Fig. 5. Fraction of nodes expanded at different depths, and average number of image features matching each model feature at different depths. These results are for the problem described in Section 3.1; the different curves are at a clutter of 10, 20, 40, 80, 120, and 160. The solid curve indicates what fraction of the total number of search states that are expanded by the algorithm are expanded below a certain depth. The broken line indicates the average size of the matchlist for search states at a certain depth.
There is a simple heuristic explanation for this behavior: the algorithm must keep subdividing regions in transformation space recursively until it can start pruning the search tree, leading to an initial exponential growth in the number of nodes. However, pruning is possible only once the matchlist has shrunk significantly. Therefore, exponential growth of the number of nodes will tend to occur somewhat beyond the point where the matchlist has shrunk, resulting in the observed complexity. The results in Figure 6 suggest a nearly perfectly exponential growth up to a certain depth, followed by the onset of pruning.
3.4 Best First vs. Depth First
So far, we have remained within a traditional branch-and-bound framework, by expanding the most promising search node, no matter at what depth in the search tree it may occur. We might call this a "best first" search algorithm. An alternative approach is to use a "depth first" approach. A depth-first search can be implemented by using search depth instead of the upper bound Q̂ as the priority in the priority queue, as described in Section 2.2.
Fig. 6. Number of nodes expanded at different depths (logarithmic scale). Notice the initial exponential growth. These results are for the problem described in Section 3.1; the different curves are at a clutter of 10, 20, 40, 80, 120, and 160.
With this modification, we are no longer guaranteed that the first solution that satisfies our acceptance criteria in Step 5 is, in fact, optimal. The algorithm therefore needs to continue searching until the complete search tree has been examined. At first sight, this may seem to result in a much larger number of node expansions and therefore might be considerably less efficient. What makes this algorithm practical is the observation that when the algorithm has found a candidate solution in Step 5, we need not expand further any states that we encounter whose upper bound estimate is less than, or equal to, the quality of the best solution we have found so far. In practice, this means that the additional cost in terms of runtime of a depth-first search strategy is often modest. Experimental results demonstrating this point are shown in Figure 7. Furthermore, experimentally, the relative overhead becomes smaller the larger the geometric matching problem becomes, meaning that larger overhead is paid on small problems which are already solved quickly. Why might we want to adopt a depth-first search strategy? Because the amount of memory it requires is very much smaller than that of a best first strategy. Over a common range of parameters explored in our experiments, the depth first search approach requires between 1/50 and 1/200 the amount of memory of the best first strategy (Figure 8). On systems with limited amounts of memory, a depth-first approach may be the only feasible approach for implementing branch-and-bound algorithms. Even on general purpose workstations with large amounts of memory, the moderate increase in running times associated with a depth first search may be justifiable in return for greatly reduced memory requirements when implementing these algorithms on embedded systems.
Fig. 7. Running times of the depth first and best first search strategies compared. The problem is as described in Section 3.1, with the number of clutter points given by the x-axis. The left y-axis shows the running time in seconds, the right y axis shows the ratio of the depth-first running times to the best-first running times.
A depth-first approach also makes it easier to tolerate the larger memory footprint associated with splitting regions T into 2^n subregions, rather than using binary splits in Step 6 of Algorithm 1, resulting in performance that can equal or surpass that of a best-first search with binary splits (data not shown).
4 Discussion
This paper has compared alternative implementations of two fundamental aspects of branch-and-bound algorithms for geometric matching. The first is the question of how to compute upper bounds on the quality of match function Q(T) over a given set of transformations. The second is the question of which order to expand search nodes in. The experimental results presented in this paper suggest that on common and nontrivial problems, namely uniformly distributed model and image features under bounded error, a matchlist-based approach outperforms a point-location based approach. It was also demonstrated that these results carry over to matching experiments with real image data extracted from the COIL-20 database. Of course, results from empirical performance measurements on two benchmark problems do not guarantee that the advantage of matchlist based implementations holds over all possible problem domains. However, the additional results presented in Section 3.3 show a simple way of determining whether a matchlist-based approach is efficient: by measuring the average number of image features matching each model feature.
Fig. 8. Comparison of the memory requirements of the depth first and best first search strategies. The problem is as described in Section 3.1, with the number of clutter points given by the x-axis. The left y-axis shows the number of correspondences (in thousands) that need to be retained in the search queue (proportional to the memory requirements), the right y-axis shows the ratio of the best-first to the depth-first memory requirements.
This information can be obtained by a simple and efficient instrumentation of the matchlist-based matching code, or even from a standard execution profile. This allows us to use a simple matchlist-based approach first. Only when an execution profile suggests that a matchlist-based algorithm spends a large amount of time in, or a large number of invocations of, feature distance computations compared to geometric transformations of model features, would we need to consider other implementation strategies. The empirical fact that matchlist-based approaches are often more efficient than point-location based approaches is good news because point location based methods are more difficult to implement even in the simplest cases (they require an extra data structure), and become increasingly complex for the kinds of feature types and error bounds encountered in real-world matching problems. For example, implementing point location data structures for features associated with gradient information, extended features like line segments, or error bounds that depend on, and vary with, the transformation (e.g., error bounds that are large perpendicular to the gradient and small parallel to the gradient) requires work that goes significantly beyond traditional k-D tree or trie methods. In contrast, these real-world complexities are easily handled in a matchlist based approach because, using matchlists, an implementor only needs to supply a quality-of-match function comparing an individual model feature that has already undergone transformation with an image feature (and possibly an upper bound function). Another way of looking at matchlist based approaches is that they construct, implicitly during the
branch-and-bound search itself, a proximity search data structure that is well-adapted to the needs of the matching problem. The second set of experimental results deals with the tradeoff between depth-first and best-first (i.e., a priority-queue based approach) search strategies. It is not surprising that a depth-first approach uses much less space than a best-first search, but it has not been clear until now whether the potentially larger number of nodes that needs to be explored in a depth-first approach would put such an approach at a severe disadvantage. The results presented in this paper suggest that the overhead of a depth-first approach is modest and the reduction in memory usage is large. A depth-first approach therefore warrants serious consideration in the implementation of branch-and-bound methods for geometric matching. Of course, it is impossible to test and compare all the different conditions under which a geometric matching algorithm might be used. The author hopes that other practitioners will see this paper as an incentive for comparing different approaches to the implementation of branch-and-bound algorithms for geometric matching problems.
References 1. Henry S. Baird. Model-Based Image Matching Using Location. MIT Press, Cambridge, MA, 1985. 2. D. H. Ballard. Generalized hough transform to detect arbitrary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(2):111–122, 1981. 3. G Borgefors. Hierarchical chamfer matching: A parametric edge matching algorithm. PAMI, 10:849–865, 1988. 4. T. M. Breuel. Fast Recognition using Adaptive Subdivisions of Transformation Space. In IEEE Conference on Computer Vision and Pattern Recognition, pages 445–451, 1992. 5. T. M. Breuel. Higher-Order Statistics in Object Recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 707–708, 1993. 6. T. M. Breuel. Branch-and-bound algorithms for geometric matching problems. Submitted to Computer Vision and Image Understanding, 2001. 7. T. M. Breuel. Robust least square baseline finding using a branch and bound algorithm. In Document Recognition and Retrieval VIII, SPIE, San Jose, 2001. 8. E. Grimson. Object Recognition by Computer. MIT Press, Cambridge, MA, 1990. 9. M. Hagedoorn and R. C. Veltkamp. Reliable and efficient pattern matching using an affine invariant metric. Technical Report RUU-CS-97-33, Dept. of Computing Science, Utrecht University, 1997. 10. D.P. Huttenlocher, G.A. Klanderman, and W.J. Rucklidge. Comparing images using the hausdorff distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(9):850– 63, 1993. 11. D. M. Mount and S. Arya. ANN: A library for approximate nearest neighbor searching. In CGC 2nd Annual Fall Workshop on Computational Geometry, 1997. 12. D.M. Mount, N.S. Netanyahu, and J. Le Moigne. Efficient algorithms for robust feature matching. Pattern Recognition, 32(1):17–38, 1999. 13. S. Nene, S. Nayar, and H. Murase. Columbia object image library: Coil. Technical Report CUCS-006-96, Department of Computer Science, Columbia University, 1996. 14. Clark F. Olson. Locating geometric primitives by pruning the parameter space. Pattern Recognition, 34(6):1247–1256, June 2001.
Face Recognition from Long-Term Observations Gregory Shakhnarovich, John W. Fisher, and Trevor Darrell Artificial Intelligence Laboratory Massachusetts Institute of Technology {gregory,fisher,trevor}@ai.mit.edu
Abstract. We address the problem of face recognition from a large set of images obtained over time - a task arising in many surveillance and authentication applications. A set or a sequence of images provides information about the variability in the appearance of the face which can be used for more robust recognition. We discuss different approaches to the use of this information, and show that when cast as a statistical hypothesis testing problem, the classification task leads naturally to an information-theoretic algorithm that classifies sets of images using the relative entropy (Kullback-Leibler divergence) between the estimated density of the input set and that of stored collections of images for each class. We demonstrate the performance of the proposed algorithm on two medium-sized data sets of approximately frontal face images, and describe an application of the method as part of a view-independent recognition system.
1 Introduction
Recognition in the context of visual surveillance applications is a topic of growing interest in computer vision. Face recognition has generally been posed as the problem of recognizing an individual from a single "mug shot", and many successful systems have been developed (e.g. Visionics, Bochum/USC). Separately, systems for tracking people in unconstrained environments have become increasingly robust, and are able to track individuals for minutes or hours [8, 5]. These systems typically can provide images of users at low or medium resolution, possibly from multiple viewpoints, over long periods of time. For optimal recognition performance, the information from all images of a user should be included in the recognition process. In such long-term recognition problems, sets of observations must be compared to a list of known models. Typically the models themselves are obtained from sets of images, and this process can be thought of as a set matching problem. There are many possible schemes for integrating information from multiple observations, using either "early" or "late" integration approaches. Early integration schemes might consist of selecting the "best" observation from each set using some quality metric, or simply averaging together all observations prior to classification. In the case of late integration, the common statistical approach is to take the product of the likelihoods of each observation. This approach, while
simple and well justified under certain conditions, often fails to account for the presence of outliers in the observation data as well as the fact that the data are expected to be observed with a certain variability (e.g., repeated observations of the most likely face of an individual do not yield a typical set of faces for that individual, if we presume human faces occur with some natural variability). We propose an approach to face recognition using sets of images, in which we directly compare models of probability distributions of observed and model faces. Variability in the observed data is considered when comparing distribution models, and thus outliers and natural variation can be handled properly. Many approaches to distribution modeling and matching are possible within this general framework. In this paper we develop a method based on a Gaussian model of appearance distribution and a matching criterion using the Kullback-Leibler divergence measure. In our method, observations and models can be compared efficiently using a closed-form expression. In addition to being potentially more accurate, our approach can be more efficient than approaches that require comparing each observation with each model example image since our model distributions are compactly parameterized. We have evaluated our approach on two different data sets of frontal face images. The first was a collection of people observed at a distance in frontal view in an indoor office environment. The second was computed using a recently proposed scheme for integration across multiple viewpoints [15], in which images from several cameras with arbitrary views of the user's face were combined to generate a virtual frontal view. With both data sets, our distribution matching technique offered equal or improved recognition performance compared to traditional approaches, and showed robustness with respect to the choice of model parameters. Throughout the paper we will assume independence between any two samples in a set of images, thus ignoring, for example, the dynamics of facial expression. While disregarding a potentially important cue, this will also remove the assumption of consecutive frames, making it easier to recognize a person from sparse observations such as are available, for instance, in surveillance systems, where the subject does not face the camera all the time. Furthermore, the training set may be derived from sparse or unordered observations rather than a sequence. The remainder of the paper is organized as follows. We start with a discussion of previous work on this problem and of known methods appropriate for classifying sets of images. In Sections 3 and 4 we present the statistical analysis leading to our distribution matching algorithm, described in detail in Section 5. Section 6 contains a report on the empirical comparative study of the discussed methods, and is followed by conclusions and discussion in Section 7.
2 Previous Work
The area of recognition from a single face image is very well established. Both local, feature-based and global, template-based methods have been extensively
explored [3]. The state of the art today is defined by a family of recognition schemes related to eigendecomposition of the data directly [11] and through Linear Discriminant Analysis [1], or to local feature analysis of images [13, 17], all of which achieve very high performance. However, face recognition from an image sequence, or more generally from a set of images, has been a subject of relatively few published studies. The common idea in most of the published work is that recognition performance can be improved by modeling the variability in the observed data. Recently proposed detection and tracking algorithms use the dynamics of the consecutive images of a face, and integrate recognition into the tracking framework. In [6], the variation of the individual faces is modeled in the framework of the Active Appearance Model. In [7], the "temporal signature" of a face is defined, and a feed-forward neural network is used in order to classify the sequence. In [2], people are recognized from a sequence of rotating head images, using trajectories in a low-dimensional eigenspace. They assume that each image is associated with a known pose – a situation which is uncommon in practice. In all three papers, the temporal constraints of the sequence (i.e., the images being consecutive in time) are crucial. We are interested in recognition under looser assumptions, when the images are not necessarily finely sampled, or even ordered in time.
2.1 Late Integration Strategies
A number of methods are applicable to the situation where multiple observations of the same person's face are accumulated over time. This is an instance of the more general problem of fusion of evidence from multiple measurements. Kittler et al. in [9] present a statistical interpretation of a number of common methods for cross-modal fusion, such as the product, maximum, and majority rules, which are also appropriate for late integration over a set of observations from a single modality. For example, it can be shown that under the assumption that the samples in the set are distributed i.i.d., the product rule is the maximum likelihood classifier of the set:

$$ w^* = \arg\max_{w_i} \prod_{t=1}^{n} p(x_t \mid w_i) = \arg\max_{w_i} p\bigl(X^{(n)} \mid w_i\bigr), \tag{1} $$
where x_t is the t-th sample in the set of n observations X^{(n)}. The max rule chooses the identity with the highest estimated likelihood of a single face image, while the mean rule prefers the identity that gives the highest mean likelihood over the input images. The majority rule, which is an instance of a voting strategy, observes classification decisions made on all of the input images separately, and picks the label assigned to the largest number of images. Though lacking a clear statistical interpretation, the same combination rules can be applied to scores instead of likelihood values (e.g., taking the product or mean of the distances in feature space).
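As an illustration, the following is a minimal sketch of these late-integration rules, assuming the per-image class likelihoods p(x_t | w_i) have already been computed and stored in an array (all function and variable names are illustrative, not taken from the paper):

```python
import numpy as np

def late_integration(likelihoods, rule="product"):
    """Combine per-image class likelihoods for one observation set.

    likelihoods : (n_images, K) array, likelihoods[t, i] = p(x_t | w_i).
    Returns the index of the selected class.
    """
    eps = 1e-300  # guard against log(0)
    if rule == "product":      # ML classifier of the set under i.i.d. samples, as in Eq. (1)
        scores = np.sum(np.log(likelihoods + eps), axis=0)
    elif rule == "max":        # identity with the highest single-image likelihood
        scores = np.max(likelihoods, axis=0)
    elif rule == "mean":       # identity with the highest mean likelihood
        scores = np.mean(likelihoods, axis=0)
    elif rule == "majority":   # vote with per-image decisions
        votes = np.argmax(likelihoods, axis=1)
        scores = np.bincount(votes, minlength=likelihoods.shape[1])
    else:
        raise ValueError(rule)
    return int(np.argmax(scores))
```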
2.2 Early Integration and the MSM Algorithm
The above combination rules are generally "late" integration strategies, in that they combine likelihoods rather than observations. An alternative approach consists of combining the input images themselves and mapping the whole sequence to a feature space in which the classification is performed. The simplest example of such a technique is classification of X^{(n)} based on the mean image x̄. More sophisticated approaches capture higher-order moments of the distribution. In particular, the Mutual Subspace Method (MSM) of Yamaguchi et al. [18] is noteworthy. In MSM, a test set of images is represented by the linear subspace spanned by the principal components of the data. This subspace is then compared to each of the subspaces constructed for the training data, with dissimilarity measured by the minimal principal angle between the subspaces (i.e., the minimal angle between any two vectors in the two subspaces). This approach works even with coarsely sampled sequences, or with general unordered sets of observations. However, it does not consider the entire probabilistic model of face variation, since the eigenvalues corresponding to the principal components, as well as the means of the samples, are disregarded in the comparison. The schematic examples in Figure 1 illustrate the shortcomings of ignoring the eigenvalues and means. In Figure 1(a) the three 2D ellipses correspond to the two principal components of three data sets of 300 points obtained from 3D Gaussian distributions p0 (solid line), p1 (dotted) and p2 (dashed) with the same mean but different covariance matrices. In fact p0 is the same as p1 but with noise that slightly rotates the ellipse plane. In terms of distribution divergence – either K-L or L2 – p0 is closer to p1 than to p2. However, the Mutual Subspace approach fails to recognize this, since the 2D principal subspaces of p1 and p2 are identical. The shortcomings of the subspace angle matching are even clearer in Figure 1(b). Here the two principal components of p0 and p2 again lie in the same subspace, while the principal subspace of p1 is slightly rotated. In addition, in this case important information about the similarity of p0 and p1 is contained in the positions of the means, which is disregarded by the MSM. The MSM approach has the desirable feature that it builds a compact model of the distribution of observations. However, it ignores important statistical characteristics of the data and thus, as we show in Section 4, its decisions may be statistically sub-optimal. In this paper we develop an alternative approach, which takes into account both the means and the covariances of the data and is grounded in the statistical approach to classification.
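For comparison, the following is a minimal sketch of MSM-style matching by the smallest principal angle between principal subspaces. The subspace dimension and the decision not to center the data are modeling choices, and all names are illustrative:

```python
import numpy as np

def principal_subspace(X, m):
    """X: (d, n) matrix with one vectorized image per column.
    Returns a (d, m) orthonormal basis for the top m principal directions."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :m]

def msm_score(X_test, X_train, m=5):
    """MSM-style similarity: cosine of the minimal principal angle between
    the two m-dimensional principal subspaces (larger means more similar)."""
    U1 = principal_subspace(X_test, m)
    U2 = principal_subspace(X_train, m)
    singular_values = np.linalg.svd(U1.T @ U2, compute_uv=False)
    return float(singular_values.max())   # = cos(theta_min)
```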
3 Statistical Recognition Framework
Fig. 1. Illustration of the difference between the Mutual Subspace Method and the distribution divergence method. The 2D ellipses, embedded in 3D, correspond to the first two principal components of the estimated distributions: (a) p1 and p2 have the same mean; (b) p1 and p2 have different means. By the minimal principal angle measure, p0 (solid lines) is closer to p2 (dotted), while in terms of distribution similarity – KL-divergence or L2-norm – p1 (dashed) must be chosen.

We assume that images of the kth person's face are distributed according to an underlying probability rule pk, and that the input face images are distributed
according to some p0. The task of the recognition system is then to find the class label k* satisfying

$$ k^* = \arg\max_k \Pr(p_0 = p_k), \tag{2} $$
subject to Pr(p0 = pk*) ≥ 1 − δ for a given confidence threshold δ, or to conclude the lack of such a class (i.e., to reject the input). Note that score-producing classifiers, which choose the identity that maximizes a score function and not a probability, must effectively assume that the posterior class probability of the identities is monotonic in the score function. Setting the minimal score threshold sufficient to make a decision is equivalent to setting δ. In this paper, we do not deal with rejection mechanisms, and instead assume that p0 is in fact equal to some pk in the database of K subjects. Therefore, given a set of images distributed according to p0, solving (2) amounts to optimally choosing between K hypotheses of the kind that in statistics is sometimes referred to as the two-sample hypothesis: given two samples (in our case, the training set of the kth subject and the test set), do these two sets come from the same distribution? When only a single sample point from p0 is available, it is known that the optimal hypothesis test is performed by choosing the k maximizing the posterior probability of pk given this sample. When a larger sample set is available, the optimal test becomes a comparison of the posteriors of the different models given this set. If all the classes have equal prior probabilities, this becomes a comparison of the likelihoods of the test set under the different class models. In the following section, we discuss ways of performing this test, and its relationship to the Kullback-Leibler divergence between distributions. In reality the distributions pk, as well as p0, are unknown and need to be estimated from data. In this paper, we follow [12] and estimate the densities in the space of frontal face images by a multivariate Gaussian. Each subject has its own
density, which is estimated based on the training samples of that subject's face. Our algorithm, described below, does not require (nor use) temporal adjacency of the images in the input sequence.
4 Density Matching and Hypothesis Testing
Here we show the relationship of classical hypothesis testing to density matching via K-L divergence. A conclusion of the analysis, supported by subsequent experiments, is that when recognition is done by comparing sets of observations, direct estimation of KL-divergence between densities inferred from training data (i.e., the model densities) and densities inferred from samples under test is a principled alternative to standard approaches. This follows from an analysis of the K-ary hypothesis test, which can be stated as follows:

$$ \begin{aligned} H_1 &: X_0^{(n)} \sim p_1(x) \\ &\;\;\vdots \\ H_K &: X_0^{(n)} \sim p_K(x) \end{aligned} \tag{3} $$

with the goal of determining which hypothesis, H_k, best explains the data X_0^{(n)}. The notation X_k^{(n)} ∼ p_j(x) implies that the kth sample set is drawn from the jth density. In the discussion which follows, X_0^{(n)} = {x_0^1, ..., x_0^n} indicates the sample set under test and p_0(x) the density from which it is drawn, while X_k^{(n)} (k = 1, ..., K) indicates the kth sample set and p_k(x) the model density inferred from that sample set. Although not commonly used, the K-ary hypothesis test
$$ \begin{aligned} H_1 &: X_1^{(n)} \sim p_0(x) \\ &\;\;\vdots \\ H_K &: X_K^{(n)} \sim p_0(x) \end{aligned} \tag{4} $$
is, under mild conditions [10], equivalent to (3). In (3) we quantify and compare the ability of our inferred model densities to explain the samples under test, while in (4) we infer the density of our test samples and then quantify and compare its ability to explain our model samples. Assuming that the samples x^i are i.i.d. (independent and identically distributed) and the classes have equal prior probability, it is well known, by the Neyman-Pearson lemma (e.g., see [14]), that an optimal statistic for choosing one hypothesis over the other is the log-likelihood function; that is,

$$ H_k = \arg\max_k \sum_i \log p_k\!\left(x_0^i\right) \tag{5} $$

for the hypothesis test of (3), and

$$ H_k = \arg\max_k \sum_i \log p_0\!\left(x_k^i\right) \tag{6} $$
for the hypothesis test of (4). Note the implication that all samples are used in the determination of the best hypothesis. Under the i.i.d. assumption it can be shown that the log-likelihood of samples drawn from p_l(x) under the model distribution p_k(x) has the following expected (with respect to p_l(x)) and asymptotic behavior [4]:

$$ E_{p_l}\!\left[\frac{1}{N}\sum_{i=1}^{N}\log p_k\!\left(x_l^i\right)\right] = -\bigl(H(p_l(x)) + D\left(p_l(x)\,\|\,p_k(x)\right)\bigr) \tag{7} $$
$$ \lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}\log p_k\!\left(x_l^i\right) = -\bigl(H(p_l(x)) + D\left(p_l(x)\,\|\,p_k(x)\right)\bigr), \tag{8} $$

where D(p_l‖p_k) is the well-known asymmetric Kullback-Leibler (K-L) divergence [10], defined as

$$ D(p\,\|\,q) = \int p(x)\log\frac{p(x)}{q(x)}\,dx. \tag{9} $$

It can be shown that D(p_l‖p_k) ≥ 0, with equality only when p_l(x) = p_k(x). The K-L divergence quantifies the ability of one density to explain samples drawn from another, formalized by the information-theoretic notion of relative entropy. In the context of K-ary hypothesis testing we see that, asymptotically (as we gather more samples) and in expectation (given a finite sample set), we choose the model density p_k(x) which is closest to the sample density p_l(x) in the K-L divergence sense. In this light we might recast the K-ary hypothesis test as one of estimating the K-L divergence between the distribution underlying our training data and the distribution underlying the sample data under test. The difference between the two hypothesis tests is in the direction in which one computes the divergence. The hypothesis selection rules of (5) and (6) become

$$ H_k = \arg\max_k \; -D\left(p_0\,\|\,p_k\right) \tag{10} $$
$$ H_k = \arg\max_k \; -\bigl(H(p_k) + D\left(p_k\,\|\,p_0\right)\bigr), \tag{11} $$
respectively. Consequently, we will investigate several methods by which one can approximate the K-ary hypothesis test via direct computation or estimation of the K-L divergence. Depending on the assumptions made, we unavoidably introduce errors into the distribution estimates, which in turn introduce errors into our estimate of the K-L divergence. In the first, and most commonly used, method we assume a parametric density form (Gaussian) with parameters estimated from the training data. The log-likelihood of the samples under test is computed using (5), plugging in the estimated Gaussian densities. This is equivalent to first finding the Gaussian densities which are closest to the "true" model densities in the K-L sense and then, among these Gaussian distributions, finding the one which is closest in the K-L sense to the distribution of the sample under test.
Secondly, we estimate the parameters of the distribution of our test sample and subsequently evaluate the log-likelihood of our training data using (6). This is equivalent to first finding the Gaussian distribution which is closest to the "true" test distribution and then, among the training distributions, selecting the one which is closest to the estimated testing density in the K-L sense. Finally, using a recent result [19], we estimate Gaussian distribution parameters for both training and test data and compute K-L divergences analytically.
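As an illustration of the first of these plug-in strategies, the following is a minimal sketch of selecting a subject by the summed log-likelihood of the test set under each estimated Gaussian model, as in (5). SciPy is used here for the Gaussian log-density, and all names are illustrative rather than taken from the paper:

```python
import numpy as np
from scipy.stats import multivariate_normal

def classify_set_by_loglik(X_test, models):
    """Plug-in method: summed log-likelihood of the test set under each
    subject's estimated Gaussian (Eq. 5).

    X_test : (n, d) array of vectorized test images.
    models : list of (mean, covariance) pairs, one per subject.
    Returns the index of the winning subject.
    """
    scores = [multivariate_normal.logpdf(X_test, mean=m, cov=C,
                                         allow_singular=True).sum()
              for m, C in models]
    return int(np.argmax(scores))
```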
5 Classification Based on Kullback-Leibler Divergence
In this section we define the proposed classification scheme based on the KL-divergence between the estimated model and input probability densities, and present the computational details of our scheme. The first phase of our algorithm consists of modeling the data distribution as a single multivariate Gaussian constructed from two factors – the density in the principal subspace and the isotropic "noise" component in the orthogonal subspace. This is a fairly standard procedure. The second, novel phase is the closed-form evaluation of the KL-divergence between the model estimated for the input set and the ones estimated for each subject from the training data. The resulting values can be treated as a score for classification or rejection decisions.
5.1 Computation of DKL(pk‖p0)
In the general case of two multivariate distributions, evaluating the KL-divergence is a difficult task, typically performed by means of numerical integration, and computationally expensive, especially in high dimensions. However, a recent result [19] for the case of two normal distributions pk and p0 provides a closed-form expression:

$$ D_{KL}(p_0\,\|\,p_k) = \frac{1}{2}\log\frac{|\Sigma_k|}{|\Sigma_0|} + \frac{1}{2}\,\mathrm{Tr}\!\left(\Sigma_0\Sigma_k^{-1} + \Sigma_k^{-1}(\bar{x}_k-\bar{x}_0)(\bar{x}_k-\bar{x}_0)^T\right) - \frac{d}{2}, \tag{12} $$

where d is the dimensionality of the data (the number of pixels in the images), x̄k and x̄0 are the means of the training set for the kth subject and of the input set, respectively, and Σk and Σ0 are the covariance matrices of the estimated Gaussians for the kth subject and for the input set, respectively. After estimating Σ0 and finding its eigendecomposition, an operation taking O(d³), we can compute the determinant in (12) in linear time, by taking the product of the eigenvalues. Calculation of the matrix products will require an additional workload of O(d³). Therefore, for K subjects in the database, we need O(Kd³) operations. To compute DKL(pk‖p0), we exchange the indices 0 and k in (12). Since the KL-divergence is an asymmetric measure, we should expect the results to be different for the two "directions".
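A minimal sketch of the closed-form evaluation in (12), assuming the means and covariance matrices have already been estimated (names are illustrative):

```python
import numpy as np

def gaussian_kl(mean0, cov0, mean_k, cov_k):
    """Closed-form D_KL( N(mean0, cov0) || N(mean_k, cov_k) ), as in Eq. (12)."""
    d = mean0.shape[0]
    diff = mean_k - mean0
    cov_k_inv = np.linalg.inv(cov_k)
    # log |cov_k| / |cov0| via slogdet for numerical stability
    _, logdet_k = np.linalg.slogdet(cov_k)
    _, logdet_0 = np.linalg.slogdet(cov0)
    term_log = logdet_k - logdet_0
    term_tr = np.trace(cov_k_inv @ cov0) + diff @ cov_k_inv @ diff
    return 0.5 * (term_log + term_tr - d)
```

The opposite direction of the divergence is obtained by swapping the two argument pairs.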
5.2 Normal Approximation of Face Appearance Densities
For the sake of completeness, we describe our calculation of Σ0 and Σk, which is a straightforward application of PCA, as done, e.g., in [12]. Let Φk be the orthonormal matrix of eigenvectors (columns) and Λk be the diagonal matrix of non-increasing eigenvalues λ_{k1} ≥ λ_{k2} ≥ ... ≥ λ_{kd} of the auto-correlation matrix S_k = (X_k^{(n)} − x̄_k)(X_k^{(n)} − x̄_k)^T of the kth set of images. We will denote by the subscript k the components of such a decomposition for the training set of images of the kth subject, and by the subscript 0 the components of the decomposition of the test image set. The maximum likelihood estimate of a multi-dimensional Gaussian from the data is N(·; x̄_k, S_k). However, following the assumption that the true dimensionality of the data (with the noise excluded) is lower than d, we choose a subset of M_k ≤ n eigenvectors corresponding to the desired retained energy E (alternatively one can set M_k directly; while both techniques are heuristic, we found the latter to be less robust with respect to discovering the true dimensionality): Σ_{i=1}^{M_k−1} λ_i < E ≤ Σ_{i=1}^{M_k} λ_i. The chosen eigenvectors define the principal linear subspace L_k. Our model must still explain the variance in the complementary orthogonal subspace L_k^⊥ ⊥ L_k. Following [12], we replace each one of the remaining diagonal elements of Λ_k by their mean ρ_k. This solution, which estimates the distribution in L_k^⊥ as isotropic Gaussian noise with variance ρ_k, can be shown to minimize the KL-divergence from the true distribution, if the Gaussianity assumption is correct. The estimated density of the faces in the kth class is then the product of the Gaussian densities in the two subspaces, and can be written in the full d-dimensional space as

$$ p_k(x) = \mathcal{N}(x;\, \bar{x}_k, \Sigma_k), \tag{13} $$

where x̄_k is the mean of the images in the kth class, and

$$ \Sigma_k = \Phi_k \Lambda_k^E \Phi_k^T, \qquad \Lambda_k^E = \mathrm{diag}\bigl(\lambda_{k1},\ldots,\lambda_{kM_k},\rho_k,\ldots,\rho_k\bigr), \qquad \rho_k = \frac{1}{d-M_k}\sum_{i=M_k+1}^{d}\lambda_{ki}. $$
Similarly, the estimated density of the observed face sequence is

$$ p_0(x) = \mathcal{N}(x;\, \bar{x}_0, \Sigma_0). \tag{14} $$
It is important to point out that x̄_k and Σ_k need to be computed only once for each subject, at the time the subject is entered into the database.
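A minimal sketch of this estimation step, assuming vectorized face images and interpreting the retained energy E as a fraction of the total eigenvalue sum, consistent with the experiments below (names are illustrative):

```python
import numpy as np

def fit_face_density(X, energy=0.9):
    """Fit N(mean, Sigma) with a principal subspace plus an isotropic residual,
    following the construction of Eqs. (13)-(14).

    X : (d, n) array, one vectorized face image per column.
    Returns (mean, Sigma) in the full d-dimensional space.
    """
    d, n = X.shape
    mean = X.mean(axis=1, keepdims=True)
    Xc = X - mean
    # Eigen-decomposition of the scatter matrix via SVD of the centered data
    U, s, _ = np.linalg.svd(Xc, full_matrices=True)
    eigvals = np.zeros(d)
    eigvals[:len(s)] = s ** 2                  # eigenvalues of Xc Xc^T
    # Choose M so that the retained energy reaches the fraction `energy`
    frac = np.cumsum(eigvals) / eigvals.sum()
    M = int(np.searchsorted(frac, energy) + 1)
    # Replace the trailing eigenvalues by their mean (isotropic noise variance rho)
    rho = eigvals[M:].mean() if M < d else 0.0
    lam = np.concatenate([eigvals[:M], np.full(d - M, rho)])
    Sigma = (U * lam) @ U.T                    # Phi diag(lam) Phi^T
    return mean.ravel(), Sigma
```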
6 Experiments
In order to evaluate the relative performance of the proposed algorithm, we compared it to several other integration algorithms mentioned above. The comparisons were made on two data sets, described below, both containing frontal
face images. One data set was obtained with a single camera using a face detector to collect images, while the other used a recently proposed virtual view rendering technique to generate frontal pose images from a set of arbitrary views. In our experiments we compared the performance of the following algorithms:
– Our distribution comparison algorithm based on KL-divergence
– MSM
– Comparison by the mean log-likelihood of the set
– Comparison by the log-likelihood of the mean face
– Max / min / mean combination rules for likelihoods of individual images
– Max / min / mean combination rules for classifier scores of individual images (distances from the principal subspace)
– Majority vote, both by likelihoods and by scores of individual images
Due to space constraints, and to keep the graphs scaled conveniently, we report only the results on those methods that achieved reasonably low classification error for some energy level. The methods not reported performed significantly worse than the reported ones.
Fig. 2. Representative examples from the data used in the experiments: (a) Conversation video; (b) Visual Hull generated face views. Note the pose and lighting variation in (a), and the synthetic rendering artifacts in (b).
6.1 "Conversation" Face Sequences
To evaluate the discussed methods on conventional data obtained with a single monocular camera, we experimented with a data set (courtesy of Mitsubishi Electric Research Lab) containing video clips of 29 subjects filmed during a conversation; we refer to this data set as the Conversation data. Each person is represented by one clip captured with a CCD camera at 30 fps. The faces were detected using a fast face detection algorithm [16], and resized to 22×22 pixel gray level images. False positives were manually removed. In all of the images, the pose is frontal within about 15 degrees, with an even smaller range of tilt and yaw. The first 500 frames for each person were used as a training set, and the rest were used for testing. To estimate the behavior
of the discussed algorithms for different input sizes, we partitioned the test images into sets of 60, 150 and 500 frames (2, 5 and 17 seconds) – 240, 95 and 25 sets, respectively. The examples shown in Figure 2(a) are representative of the amount of variation in the data. We decided to perform the recognition task on low-resolution images (22×22) for two reasons. One is the interest in comparing performance to that on virtually rendered images, described in the following section, which are small by necessity. The second is the reduction in computation and storage space. Figure 3 shows the results for 60 and 150 frames. In both cases, DKL(pk‖p0) achieved performance statistically indistinguishable from that of the other top-performing algorithms. For sets of 60 images its error rate was 0.04%, while for sets of 150 frames its performance (and also that of a few other algorithms) was perfect. The optimal energy cut-off point for most algorithms seems to be around 0.7, but some algorithms, including ours, show robustness around this point, staying at low error for a much wider range of E. In the absence of a clear principled way of choosing the dimensionality of the principal subspace, this is an encouraging quality. In our experiments with sets of 500 frames, the difference between many of the algorithms vanishes as they achieve zero error for some PCA dimension. We did observe, however, more robust performance with respect to PCA dimensionality with the KL-divergence, mean face likelihood, and max rule on the likelihoods.
Fig. 3. Experiments with the Conversation data, sequences of (a) 60 and (b) 150 frames. Recognition error is plotted against the energy retained in the principal subspace for DKL(p0‖pk), MSM, the mean log-likelihood, the likelihood of the mean, and the max single likelihood. Note that for 150 frames, three algorithms achieve a zero error rate.
6.2 View-Independent Recognition with Rendered Images
We have previously developed a methodology for view-independent multi-modal recognition using multiple synchronized views. In this framework, views of the subject are rendered from desired viewpoints using a fast implementation of image-based Visual Hulls (VH). The viewpoints are set based on the subject's pose, estimated from the trajectory, and according to the mode of operation of the relevant classifiers. The identity is then derived from combining an eigenspace-based face recognizer, operating on nearly frontal views of the face, and a gait recognizer, which extracts geometric features from a sequence of profile silhouettes [15].
Fig. 4. Experiments with Visual Hull-generated data, sequences of about 60 frames: (a) performance of the set recognition algorithms for different PCA dimensions (recognition error vs. energy in the principal subspace) for DKL(p0‖pk), MSM, the mean log-likelihood, the likelihood of the mean, the max single likelihood, and majority by likelihood; (b) recognition accuracy as a function of rank threshold for DKL, MSM, the log-likelihood of the set, and the log-likelihood of the mean.
The input images to the face classifier are rendered using a virtual camera placed to obtain a frontal view of the user. Thus the face image is assumed to be in roughly canonical scale and orientation, to this extent solving the pose variation problem. However, the problem of changing expression, hair style, etc. remains. In addition, images are of low resolution (22×22 in our implementation) due to the typical distance of about 10 meters between the subjects and the physical cameras. Some examples of images used for training and as input to the face recognizer in our VH-based system are given in Figure 2(b). In general, recognition performance on this data set is lower than it would be on typical data. An error rate of 20% was reported in [15] for face recognition on a larger data set that included very short input sequences. It is also worth mentioning that the frontal face view is sought within a small spatial angle of the initial estimate of the face orientation, and often a face is detected in a number of virtual views that are nearly frontal. Therefore, the
number of available virtual face images for t time frames is typically larger than t, sometimes by an order of magnitude. The results shown in Figure 4 were computed by leave-one-out cross validation. Each one of the 133 sets of at least 60 synthetic faces of one of the 23 subjects was used as the input, while the rest of the data was used for training. All of the classification algorithms were evaluated for different energy thresholds E varying from 0.1 (for which typically a single principal component is chosen) to 0.99 (typically around 100 components). As in the experiments with the Conversation data, we did not address the issue of choosing the right E, but it can be seen from the plots that in all data sets, the E producing the best results is in the range of 0.75–0.9. And again, the behavior of some algorithms, including ours, is notably more robust than that of the others with respect to the choice of E. For this data set, DKL(p0‖pk) achieves its best performance for E = 0.75 at a misclassification rate of 18%, compared to 24% for the log-likelihood of the input set computed as in (5). For most values of E the error rate associated with DKL(p0‖pk) is lower than that of the other methods. To quantify the robustness of the method, Figure 4(b) shows the classification performance as a function of rank threshold. That is, the vertical axis corresponds to the percentage of the correct identities within the first n, as ranked by different measures. The robust performance of our algorithm (solid line) is similar to that of the mean face likelihood.
7 Discussion, Conclusions, and Future Work
Our algorithm for recognition from a set of observations is based on classifying a model built from the set, rather than classifying the individual observations and combining the results. This approach, motivated by casting the classification of sets as a statistical hypothesis testing problem, while not uncommon in the statistical community, to the best of our knowledge has not been used in vision. Our experimental results are in accordance with the theoretical analysis, and support using statistically-motivated density matching for classification. An additional potential benefit of our algorithm, though not fully realized in our current implementation, is in further reducing the computational costs of recognition by using sequential PCA and more sophisticated matrix manipulation algorithms. We intend to continue the experiments while extending the range of the data to a larger number of classes and greater variability. For such data, the assumption that the underlying distributions are Gaussian, which allows simple computation, may not be true. We are currently investigating using alternative statistical models. Our current integration strategy ignores the dynamics of the sequence. In fact, it classifies sets rather than sequences of images, and therefore does not assume meaningful temporal constraints between the images. We expect that including the dynamics of the face appearance, when available, into the algorithm would improve the classification.
Another interesting direction is to extend the distribution matching approach to features other than principal components of images. For instance, one may estimate the distributions of salient features, using nostril, eye, mouth etc. detectors. Alternatively, an Active Appearance Model may be fit to each image in the training and input sets, and the distributions of the computed feature values may be compared. Finally, the general approach introduced in this paper can be applied to recognition not only of faces, but of any objects with variation in appearance across images.
References

1. Peter N. Belhumeur, Joao P. Hespanha, and David J. Kriegman. Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):711–720, July 1997.
2. Zoran Biuk and Sven Loncaric. Face Recognition from Multi-Pose Image Sequence. In Proceedings of 2nd Int'l Symposium on Image and Signal Processing and Analysis, pages 319–324, Pula, Croatia, 2001.
3. R. Brunelli and T. Poggio. Face recognition: features vs. templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(10):1042–1052, 1993.
4. Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley and Sons, New York, 1991.
5. Trevor Darrell, David Demirdjian, Neal Checka, and Pedro Felzenswalb. Plan-view trajectory estimation with dense stereo background models. In Proceedings of the International Conference on Computer Vision, Vancouver, BC, July 2001.
6. G.J. Edwards, C.J. Taylor, and T.F. Cootes. Improving Identification Performance by Integrating Evidence from Sequences. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition, volume 1, pages 486–491, 1999.
7. S. Gong, A. Psarrou, I. Katsoulis, and P. Palavouzis. Tracking and Recognition of Face Sequences. In European Workshop on Combined Real and Synthetic Image Processing for Broadcast and Video Production, pages 96–112, Hamburg, Germany, 1994.
8. Ismail Haritaoglu, David Harwood, and Larry S. Davis. W4: Real-time surveillance of people and their activities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), August 2000.
9. Josef Kittler, Mohamad Hatef, Robert P.W. Duin, and Jiri Matas. On Combining Classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3):226–239, March 1998.
10. Solomon Kullback. Information Theory and Statistics. Wiley and Sons, New York, 1959.
11. Baback Moghaddam, Tony Jebara, and Alex Pentland. Bayesian face recognition. Pattern Recognition, 33:1771–1782, 2000.
12. Baback Moghaddam and Alex Pentland. Probabilistic visual learning for object representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):696–710, 1997.
13. Penio S. Penev and J. J. Atick. Local feature analysis: A general statistical theory for object representation. Network: Computation in Neural Systems, 7(3):477–500, 1996.
14. Louis L. Scharf. Statistical Signal Processing: Detection, Estimation, and Time Series Analysis. Addison-Wesley Publishing Company, New York, 1990.
15. Gregory Shakhnarovich, Lily Lee, and Trevor Darrell. Integrated Face and Gait Recognition From Multiple Views. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition, Lihue, HI, December 2001.
16. Paul A. Viola and Michael J. Jones. Robust real-time object detection. Technical report, COMPAQ Cambridge Research Laboratory, Cambridge, MA, February 2001.
17. L. Wiskott, J. M. Fellous, N. Kruger, and C. von der Malsburg. Face recognition by elastic bunch graph matching. In Proceedings of International Conference on Image Processing, pages 456–463, Heidelberg, 1997. Springer-Verlag.
18. Osamu Yamaguchi, Kazuhiro Fukui, and Ken-ichi Maeda. Face recognition using temporal image sequence. In IEEE International Conference on Automatic Face & Gesture Recognition, pages 318–323, Nara, Japan, 1998.
19. Shintaro Yoshizawa and Kunio Tanabe. Dual differential geometry associated with the Kullback-Leibler information on the Gaussian distributions and its 2-parameter deformations. SUT Journal of Mathematics, 35(1):113–137, 1999.
Helmholtz Stereopsis: Exploiting Reciprocity for Surface Reconstruction

Todd Zickler¹, Peter N. Belhumeur², and David J. Kriegman³

¹ Center for Comp. Vision & Control, Yale University, New Haven, CT 06520-8285, [email protected]
² Department of Computer Science, Columbia University, New York, NY 10027, [email protected]
³ Beckman Institute, University of Illinois, Urbana-Champaign, Urbana, IL 61801, [email protected]
Abstract. We present a method – termed Helmholtz stereopsis – for reconstructing the geometry of objects from a collection of images. Unlike most existing methods for surface reconstruction (e.g., stereo vision, structure from motion, photometric stereo), Helmholtz stereopsis makes no assumptions about the nature of the bidirectional reflectance distribution functions (BRDFs) of objects. This new method of multinocular stereopsis exploits Helmholtz reciprocity by choosing pairs of light source and camera positions that guarantee that the ratio of the emitted radiance to the incident irradiance is the same for corresponding points in the two images. The method provides direct estimates of both depth and field of surface normals, and consequently weds the advantages of both conventional and photometric stereopsis. Results from our implementations lend empirical support to our technique.
1 Introduction
In this paper, we present a method for reconstructing the geometry of a surface that has arbitrary and unknown surface reflectance, as described by the bidirectional reflectance distribution function (BRDF) [17]. This method does not make the assumption that the BRDF is Lambertian or of some other parametric form, and it enables the reconstruction of surfaces for which the BRDF is anisotropic and spatially varying. Helmholtz stereopsis works by exploiting the symmetry of surface reflectance – it chooses pairs of light source and camera positions that guarantee that the relationship between pixel values at corresponding image points depends only on the surface shape (and is independent of the BRDF). The BRDF of a surface point, denoted f_r(î, ê), is the ratio of the outgoing radiance to the incident irradiance. Here, î is the direction of an incident light
P. N. Belhumeur was supported by a Presidential Early Career Award IIS-9703134, NSF KDI-9980058, NIH R01-EY 12691-01, and NSF ITR ITS-00-85864. D. J. Kriegman was supported by NSF IIS 00-85864, NSF CCR 00-86094, and NIH R01-EY 12691-01.
Fig. 1. The setup for acquiring a pair of images that exploits Helmholtz reciprocity. First an image is acquired with the scene illuminated by a single point source, as shown on the left. Then, a second image is acquired after the positions of the camera and light source are exchanged, as shown on the right.
ray, and ê is the direction of the outgoing ray. These are typically written as directions in a coordinate frame attached to the tangent plane of the surface. The BRDF is not an arbitrary four-dimensional function, as it is symmetric about the incoming and outgoing angles: f_r(î, ê) = f_r(ê, î). This symmetry condition was first enunciated by Helmholtz ([10], p. 231) and is commonly referred to as Helmholtz reciprocity. To see how reciprocity can be used for stereopsis, consider obtaining a pair of images as shown in Fig. 1. The first image is captured while the object is illuminated by a single point light source. The second image is captured once the camera and light source have been exchanged. That is, the camera's center of projection is moved to the former location of the light source, and vice versa. By acquiring images in this manner, Helmholtz reciprocity ensures that, for any mutually visible scene point, the ratio of the emitted radiance (in the direction of the camera) to the incident irradiance (from the direction of the light source) is the same for both images. This is not true for general stereo pairs that are acquired under fixed illumination (unless the BRDF of the surface is Lambertian). In [14], it was shown that three or more pairs of images acquired in this manner are needed to establish a matching constraint, which leads to a multinocular stereo imaging geometry. It was also shown that in addition to this constraint being useful for searching depth, it contains sufficient information to directly estimate the surface normal at each point without taking derivatives of either the images or the depth map. This is similar to conventional photometric stereopsis, but here the BRDF may be unknown and non-parametric. The fact that the normal field can be estimated in addition to depth is quite significant in that it allows the surface to be reconstructed in regions of constant image brightness. For these places on the object's surface where traditional
stereopsis fails to recover any information, Helmholtz stereopsis is able to recover the field of surface normals. This information can then be integrated – as it is in photometric stereopsis – to recover the surface shape. We should point out that in [14], Helmholtz stereopsis was introduced as one of two methods for reconstructing surfaces with arbitrary BRDFs. In that paper, only a simple binocular demonstration of the matching constraint was shown, one which could be applied under a very restricted imaging situation (fronto-parallel surfaces and small baselines); in contrast, this paper provides a full multinocular description, analysis, and implementation of Helmholtz stereopsis. In the next section, we describe the reciprocity-based method and follow in Sect. 4 with experimental results of our current implementation. Since the method combines the advantages of conventional multinocular stereopsis (estimation of depth) with those of photometric stereo (estimation of the normal field), we have summarized the similarities and differences in Sect. 3.
2 Helmholtz Stereopsis
In this section we describe our method for reconstructing surfaces with arbitrary BRDFs using a form of multinocular stereopsis. For a more detailed derivation, the reader is referred to [14]. As stated in the introduction, the method differs from standard stereopsis in that it is an active reconstruction technique – in addition to changing the camera position, the illumination of the scene is manipulated to exploit Helmholtz reciprocity. Before describing Helmholtz stereopsis, it will be helpful to provide a framework for general N-view stereopsis. Consider n calibrated cameras whose centers of projection are located at o_c for c = 1, ..., n. Define a camera centered at o_p to be the principal camera. This camera is used to parametrize the depth search, and while it can be one of the cameras located at o_c, it need not be a physical camera (i.e., it can be virtual). Given a point q in the principal image, there is a one-parameter family of n-point sets {q_1, ..., q_n} in the other images that could correspond to q. We can parametrize this family by the depth d, and by defining a discrete set of possible values for d (d ∈ D = {d_0, ..., d_{N_D}}) we can index this family of n-point sets, Q(d) = {q_c(d), c = 1, ..., n}. A multinocular matching constraint provides a method for deciding, given a set of image intensities measured at the points Q(d), whether or not the hypothesized depth value d could correspond to a true surface point. In the case of traditional dense stereo, the surface is assumed to be Lambertian, and the constraint is simply I_1(q_1(d)) = I_2(q_2(d)) = ··· = I_n(q_n(d)), where I_c(q_c) is the intensity at point q_c in the image centered at o_c. (Note that many other stereo methods exist in which the constraint involves filtered intensities as opposed to the image intensities themselves.) Using this framework, we can proceed to develop a matching constraint for reciprocal image pairs. What is unique to Helmholtz stereopsis is that this constraint is independent of the BRDF of the surface.
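A minimal sketch of this depth parametrization under a pinhole model, assuming the principal pixel's viewing ray and the other cameras' 3×4 projection matrices are available (all names are illustrative):

```python
import numpy as np

def candidate_correspondences(o_p, ray_dir, depths, projections):
    """For each hypothesized depth d, compute the image points q_c(d) in the
    other cameras that could correspond to the principal-image pixel.

    o_p         : (3,) center of the principal camera.
    ray_dir     : (3,) unit viewing ray through the principal pixel q.
    depths      : iterable of candidate depth values (the set D).
    projections : list of 3x4 camera matrices, one per camera.
    Returns a dict mapping d -> list of (2,) pixel coordinates [q_1(d), ..., q_n(d)].
    """
    Q = {}
    for d in depths:
        p = o_p + d * ray_dir                # hypothesized surface point
        p_h = np.append(p, 1.0)              # homogeneous coordinates
        pts = []
        for P in projections:
            x = P @ p_h
            pts.append(x[:2] / x[2])         # perspective division
        Q[d] = pts
    return Q
```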
A single reciprocal pair of images is gathered by interchanging the positions of the light source and the camera's center of projection, as shown in Fig. 1. As shown in [14], because of Helmholtz reciprocity, given such an image pair the image irradiance values measured at corresponding pixels in the left and right images satisfy the constraint

$$ \left( i_l \frac{\hat{v}_l}{|o_l - p|^2} - i_r \frac{\hat{v}_r}{|o_r - p|^2} \right) \cdot \hat{n} = w(d) \cdot \hat{n} = 0, \tag{1} $$

where i_l and i_r are image irradiance measurements (obtained from a radiometrically calibrated camera), o_l and o_r are the left and right camera centers, p is a surface point, and v̂_l and v̂_r are unit vectors that point from p to the camera centers (see Fig. 1). For calibrated cameras and a hypothesized value for the depth d, values for these vectors and points can be computed (we write w(d) to denote this fact), and only the surface normal n̂ is unknown. Note that the vector w(d) lies in the plane defined by p, o_r and o_l (the epipolar plane). If we capture N_P of these reciprocal pairs (each from a different pair of positions o_l, o_r), we will have N_P linear constraints of this form. Let W(d) be the N_P × 3 matrix whose ith row is given by $w_i(d) = i_{l_i}\,\hat{v}_{l_i}/|o_{l_i}-p|^2 - i_{r_i}\,\hat{v}_{r_i}/|o_{r_i}-p|^2$. Then the set of constraints from (1) can be expressed as

$$ W(d)\,\hat{n} = 0. \tag{2} $$
Clearly, for the correct depth value d′ the surface normal lies in the null space of W(d′), and it can be estimated from a noisy matrix using singular value decomposition. In addition, W(d′) will be rank 2, and this can be used as a necessary condition when searching over depth. Note that at least three camera/light source positions are needed to exploit this constraint. An implementation of a system that uses this constraint for surface reconstruction will be presented in Sect. 4. Before that, we present a comparison of this method with some existing reconstruction techniques.
3 Comparison with Existing Methods
In principle, Helmholtz stereopsis has a number of advantages when compared to conventional multinocular stereopsis and photometric stereopsis. Figure 2 summarizes these. While our implementation may not fully reveal these advantages (we do not make explicit use of available half-occlusion indicators for detecting depth discontinuities), we believe that future refinements will.

Assumed BRDF

Most conventional dense stereo reconstruction methods assume that scene radiance is independent of viewing direction, i.e., that surface reflectance is Lambertian. However, the majority of surfaces are not Lambertian and therefore violate this assumption. For these surfaces, large-scale changes in scene radiance occur
Fig. 2. A comparison of Helmholtz stereopsis with conventional dense multinocular and photometric stereopsis. A more detailed discussion of the subtleties of the entries in this table is given in Sect. 3.
as specularities shift with viewpoint, and small-scale changes occur everywhere on the surface. In addition, if the BRDF is spatially varying, these changes may occur differently at every point on the surface. Using traditional dense stereopsis, establishing correspondence in this situation is difficult, if at all possible. Most sparse, or feature-based, stereo methods also rely (albeit less heavily) on the Lambertian assumption – if the BRDF is arbitrary, the detected feature points may be viewpoint or lighting dependent. Whereas viewpoint is manipulated in conventional stereopsis, in photometric stereopsis the viewpoint remains fixed while the illumination is varied. These methods provide an estimate of the field of surface normals which is then integrated to recover the surface depth. Similar to conventional dense multinocular stereopsis, many photometric methods assume that the BRDF is Lambertian [12,19,21]. The methods that do not make this assumption either assume that the BRDF (or the reflectance map) is completely known a priori, or can be specified using a small number of parameters [9,11,16,20]. These parametric BRDFs are often derived from physical models of reflectance and are therefore restricted to a limited class of surfaces. When the form of the BRDF is unknown, or when the form of the BRDF is spatially varying, there is insufficient information to reconstruct both the geometry and the BRDF. In [13], a hybrid method with controlled lighting and object rotation was used to estimate both surface structure and a non-parametric reflectance map. This is similar to our method in that it: 1) is an active imaging technique that exploits changes in viewpoint and illumination; and 2) considers a general, non-parametric BRDF. However, the method requires that the BRDF is both isotropic and uniform across the surface (the present method makes no such assumptions). The assumptions made about surface reflectance for three reconstruction techniques – conventional, photometric, and Helmholtz stereopsis – are summarized diagrammatically in Fig. 3. Note that many natural surfaces actually have
Fig. 3. A summary of the assumptions made about surface reflectance (Lambertian, parametric/known, or arbitrary) by three reconstruction techniques. Both conventional dense multinocular stereopsis and photometric stereopsis assume the BRDF is either Lambertian or of some other parametric form. Yet many natural surfaces (e.g., human skin, the skin of a fruit, glossy paint) do not satisfy these assumptions. In contrast to the other methods, Helmholtz stereopsis makes no assumption about the BRDF.
surface reflectance in the rightmost region of the figure and are thus excluded from reconstruction by conventional techniques. In Helmholtz stereopsis, because the relationship between the image intensities of corresponding points does not depend on viewpoint, non-Lambertian radiometric events such as specularities appear fixed to the surface of the object. In contrast with conventional (fixed illumination) stereo images, these radiometric events become reliable features, and they actually simplify the matching problem.

Recovered Surface Information

In conventional binocular or multinocular stereopsis, depth is readily computed. Typically, the output of the system is a discrete set of depth values at pixel or sub-pixel intervals (a depth map). In most cases, unless a regularization process is used to smooth the depth estimates, the normal field found by differentiating the recovered depth map will be very noisy. Instead of direct differentiation of the depth map, regularized estimates of the normal field can be obtained, for example, based on an assumption of local planarity [5], or through the use of an energy functional [1]. In contrast to these methods, photometric stereopsis provides a direct estimate of the field of surface normals which is then integrated (in the absence of depth discontinuities) to obtain a surface. Helmholtz stereopsis is similar to photometric stereopsis (and different from the regularization techniques used in conventional stereopsis) in that the normal field is directly estimated at each point based on the photometric variation across reciprocal image pairs. In this way, Helmholtz stereopsis combines the advantages of conventional and photometric methods by providing both a direct estimate of the surface depth and the field of surface normals. It also provides information about the location of depth discontinuities (see below). Note that for applications such as
Helmholtz Stereopsis
875
Helmholtz stereopsis
photometric stereopsis
conventional stereopsis
PLANAR; NEARLY CONSTANT ALBEDO SMALL CURVATURE; NEARLY CONSTANT ALBEDO HIGH CURVATURE OR TEXTURE
Fig. 4. A summary of the surface properties required for Lambertian surface reconstruction by conventional, photometric and Helmholtz stereo techniques. Even when the BRDF is Lambertian, conventional stereopsis is only capable of recovering surface geometry in regions of texture (i.e. varying albedo) or high curvature (i.e. edges). Neither photometric stereopsis nor Helmholtz stereopsis suffer from this limitation.
image-based rendering and image-based modeling, a good estimate of the normal field is critical for computing intensities and accurately measuring reflectance properties.

Constant Intensity Regions

Dense stereo and motion methods work best when the surfaces are highly textured; when they are not textured, regularization is needed to infer the surface. (This can be achieved, for example, using a statistical prior [8,15,18,2] or through regularized surface evolution [6].) Sparse stereo and motion methods also have difficulty in these regions. These methods only reconstruct the geometry of corresponding feature points, so by their nature, they cannot directly reconstruct smoothly curving surfaces whose reflectance properties are constant. In contrast, photometric stereo techniques and Helmholtz stereopsis are unaffected by lack of texture, since they can effectively estimate the field of normals which is then integrated to recover depth. See Fig. 4.

Depth Discontinuities

Depth discontinuities present difficulties for both traditional and photometric stereopsis. As mentioned above, when there is a depth discontinuity, it does not make sense to integrate the normal field that is output by photometric stereo techniques; photometric stereopsis cannot resolve depth discontinuities at the boundaries of objects. Likewise, traditional stereo algorithms often have trouble locating depth discontinuities. This difficulty arises for two reasons. First, if a background object has regions of constant intensity and the discontinuity in depth occurs within one of these regions, it is quite difficult to reliably locate
the boundary of the foreground object. Second, depth discontinuities induce half-occlusion in adjacent regions of the image. These regions, which are not visible in at least one of the images, often confuse the matching process. Helmholtz stereopsis simplifies the task of detecting depth discontinuities since the lighting setup is such that the shadowed and half-occluded regions are in correspondence. To see this, consider the reciprocal pair shown in Fig. 1. When a surface point is in shadow in the left image, it is not visible in the right image, and vice versa. The shadowed regions in the images of a Helmholtz pair can therefore be used to locate depth discontinuities. If one uses a stereo matching algorithm that exploits the presence of half-occluded regions for determining depth discontinuities [1,3,4,8], then these shadowed regions may significantly enhance the quality of the depth reconstruction.

Active vs. Passive Imaging

Like photometric stereopsis and unlike conventional stereopsis, Helmholtz stereopsis is active. The scene is illuminated in a controlled manner, and images are acquired as lights are turned on and off. Clearly, a suitable optical system can be constructed so that the camera and light source are not literally moved, but rather a virtual camera center and light source are co-located. Alternatively, as will be shown in the next section, a simple system can be developed that captures multiple reciprocal image pairs with a single camera and a single light source.
4 Implementation and Results
In the previous sections a number of claims were made about the capabilities of Helmholtz stereopsis as a surface reconstruction technique. This section describes an implementation of a Helmholtz stereo system, and gives results that support those claims. Specifically, in this section we will give examples of:
– reconstructions of surfaces with arbitrary BRDFs (surfaces that are not Lambertian or approximately Lambertian)
– the recovery of both surface depth and the normal field
– reconstructions of surfaces in regions of constant image brightness

Capturing Reciprocal Images

To evaluate the use of Helmholtz reciprocity we have implemented a system that enables the acquisition of multiple reciprocal image pairs with a single camera and light source. These are mounted on a wheel as shown schematically in Fig. 5a. First, suppose an image is captured with the wheel in the position shown in this figure. If the wheel is rotated by 180° and another image is captured, the two images will form a reciprocal pair, and corresponding image irradiance values will satisfy the constraint in (1). It is clear that we can capture any number of these pairs by rotating the wheel through 360° while stopping to capture images at reciprocal positions.
Fig. 5. (a) A wheel is used to capture multiple reciprocal image pairs employing a single camera and light source. By rotating the wheel through 360°, any number of fixed-baseline pairs can be captured. For example, images captured at o_{l2} and o_{r2} will form a reciprocal pair. (b) An example of the wheel design shown in (a). The light source consists of a standard 100W frosted incandescent bulb fitted with a small aperture.
A picture of such a system is shown in Fig. 5b. The camera is a Nikon Coolpix 990, and the light source consists of a standard 100W frosted incandescent bulb fitted with a small aperture. The camera is both geometrically and radiometrically calibrated. The former means that the intrinsic parameters and the extrinsic parameters of each camera position are known, while the latter means that we know the mapping from scene radiance values to pixel intensities (including optical fall-off, vignetting, and the radiometric camera response). Since the lamp is not an ideal isotropic point source, it also requires a radiometric calibration procedure in which we determine its radiance as a function of output direction. An example of a set of images captured using this system is shown in Fig. 6. For all results shown in this paper the diameter of the wheel was 38 cm and the distance from the center of the wheel to the scene was approximately 60 cm. The reconstructions were performed from the viewpoint of a virtual principal camera located at the center of the wheel. We chose this camera to be orthographic to ensure uniform sampling of object space.

Using the Matching Constraint

In Sect. 2 we described a matrix constraint equation that can be used to recover the surface depth and orientation corresponding to each point q in the principal view. How this constraint is used was not specified; there are a number of possible methods, many of which can be adapted from conventional stereo algorithms. Our goal is to demonstrate the feasibility of Helmholtz stereopsis in general, so a discussion of possible methods is outside the scope of this paper. Instead,
Fig. 6. An example of 6 reciprocal image pairs captured using our rig. Image pairs are arranged vertically.
we have chosen one particularly simple implementation which will be described here. Results for four different surfaces follow in the next section. For each point q, and for each depth value d ∈ D = {d_1, d_2, ..., d_{N_D}}, we can construct a matrix W_q(d) as in (2). If the hypothesized depth corresponds to a true surface point, this matrix will be rank 2, and the surface normal will be uniquely determined as the unit vector that spans its 1-D null space. (Note that since each row of W lies in the epipolar plane defined by p, o_{l_i}, and o_{r_i}, no two rows of W will be collinear, so rank(W) ≥ 2.) In the presence of noise, W will generally be rank three, and we require a measure for the coplanarity of the row vectors w_i. Since we know that rank(W) ≥ 2, a suitable measure (and one that works well in practice) is the ratio of the second to the third singular value of W. Given each matrix W_q(d), we compute the singular value decomposition W = UΣV^T, where Σ = diag(σ_1, σ_2, σ_3), σ_1 ≥ σ_2 ≥ σ_3. Then, our support measure to be used in the depth search is the ratio

$$ r_q(d) = \frac{\sigma_2}{\sigma_3}. \tag{3} $$
Note that at correct depth values, the ratio r_q(d) will be large. The condition shown in (2) is a necessary condition that will be satisfied by true values of surface depth, but it is not sufficient. One way to resolve the ambiguity is to make some assumptions about the shape of the surface. (The BRDF remains arbitrary.) One of the simplest methods, analogous to SSD matching in conventional binocular stereo, is to assume that the surface depth is locally constant. In the search for the depth at image point q∘, we consider the ratio r_q(d) at this point as well as at points in a small rectangular window $\mathcal{W}$ around q∘. Then, the estimated depth at this point is given by

$$ d_{q_\circ} = \arg\max_{d \in D} \sum_{q \in \mathcal{W}} r_q(d). \tag{4} $$
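A minimal sketch of this per-pixel computation, assuming the calibrated geometry and radiometrically corrected irradiance samples for each reciprocal pair are already available at the hypothesized point (all names are illustrative):

```python
import numpy as np

def support_and_normal(i_left, i_right, v_left, v_right, o_left, o_right, p):
    """Build W(d) from N_P reciprocal pairs at a hypothesized surface point p.
    Returns the support ratio sigma2/sigma3 (Eq. 3) and the right singular
    vector spanning the near-null space, used below as the normal estimate.

    i_left, i_right : (N_P,) irradiance measurements for each pair
    v_left, v_right : (N_P, 3) unit vectors from p toward the camera centers
    o_left, o_right : (N_P, 3) camera centers
    p               : (3,) hypothesized surface point at depth d
    """
    wl = i_left[:, None] * v_left / np.sum((o_left - p) ** 2, axis=1, keepdims=True)
    wr = i_right[:, None] * v_right / np.sum((o_right - p) ** 2, axis=1, keepdims=True)
    W = wl - wr                                   # N_P x 3 matrix of Eq. (2)
    _, s, Vt = np.linalg.svd(W)
    ratio = s[1] / max(s[2], 1e-12)               # support measure r_q(d)
    normal = Vt[-1]                               # spans the (near-)null space
    return ratio, normal

def window_depth_search(ratio_fn, depths, window_pixels):
    """Depth search of Eq. (4): ratio_fn(q, d) returns r_q(d); the depth with
    the largest ratio summed over the window around the pixel of interest wins."""
    scores = [sum(ratio_fn(q, d) for q in window_pixels) for d in depths]
    return depths[int(np.argmax(scores))]
```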
Once we have estimated the depth d′, the linear least squares estimate of the surface normal is

$$ \hat{n}_{q_\circ} = \arg\min_{\hat{n}} \left\| W_{q_\circ}(d')\,\hat{n} \right\|^2, \qquad \|\hat{n}\| = 1, \tag{5} $$
which is simply given by the right singular vector corresponding to the smallest singular value of W_{q∘}(d′). Note that the depth map that is recovered using (4) will have low resolution due to the assumption of local depth constancy. This initial estimate of the depth can be refined using the high frequency information provided by the field of surface normals. An example of this will be shown in the next section. As a final note, this algorithm makes no attempt to detect half-occluded regions even though this information is available through the visible shadows. We have chosen this method simply to demonstrate that reciprocity can be exploited for reconstruction. As shown in the next section, despite its simplicity, the surface reconstructions are quite good.

Results

Figures 7-10 show the results of applying this procedure. Each figure consists of: (a) one of the input images of the object, (b) the estimated depth, and (c) the recovered field of surface normals. Note that the viewpoints of the displayed input images are slightly different from the reconstruction viewpoints due to the use of a virtual principal camera. Figure 7 is a demonstration of a surface reconstruction in the case of nearly constant image brightness. This surface (a wax candle) is a member of the class of surfaces described at the top of Fig. 4, and it is an example of a case in which conventional stereopsis has difficulty. Notice that Helmholtz stereopsis accurately estimates the normal field, even though the depth estimates are poor. The poor depth estimates are expected since at a principal image point q, the ratio r_q(d) will be nearly constant for a small depth interval about the true surface depth. The normals are accurate, however, since each corresponding matrix W_q(d) will have approximately the same null space. Figure 8 shows the results for a surface that is clearly non-Lambertian. (The specularities on the nose, teeth and feet attest to this fact.) Note that the reconstruction method is not expected to succeed in regions of very low albedo (e.g., the background as well as the iris of the eyes) since these regions are very sensitive to noise. Figures 9 and 10 show two more examples of surface reconstructions. Again, note that the recovered field of surface normals is accurate despite the low resolution of the depth estimates, even in regions of nearly constant image brightness. As mentioned at the end of the last section, it is possible to obtain a more precise surface reconstruction by integrating the estimated normal field. The examples above demonstrate that the normal field is accurately estimated, even in regions where the depth is not. To illustrate how surfaces can be reconstructed
Fig. 7. This figure shows: (a) one of 36 input images (18 reciprocal pairs), (b) the recovered depth map, and (c) a quiver plot of the recovered normal field. As expected, even though we obtain a poor direct estimate of the depth, the surface normals are accurately recovered. Note that the image in (a) is taken from a position above the principal view used for reconstruction.
Fig. 8. As in the previous figure: (a) one of 34 input images (17 reciprocal pairs), (b) the recovered depth map, and (c) a quiver plot of the recovered normal field. Regions of very small albedo (e.g. the iris of the eyes, the background) are sensitive to noise and erroneous results are expected there. Elsewhere, the depth and orientation are accurately recovered. A 9 × 9 window was used in the depth search.
in this way, we enforced integrability (using the method of Frankot and Chellappa [7] with a Fourier basis) and integrated the vector fields shown in Figs. 7c and 10c. The results are shown in Figs. 11 and 12. As seen in these figures, the high frequency information provided by the surface normals enables the recovery of precise surface shape – more precise than what we could expect from most conventional n-view stereo methods. Note that it would be possible to obtain a similar reconstruction using photometric stereo, but this would require an assumed model for the reflectance at each point on the surface.
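As a concrete illustration of this integration step, here is a minimal sketch (my own, not the authors' implementation) of a Frankot–Chellappa-style projection onto an integrable gradient field using a Fourier basis; the conversion from unit normals to gradients and the implicit periodic boundary handling are simplifying assumptions.

```python
import numpy as np

def integrate_normals(normals):
    """Recover a height map z(x, y) from a field of unit normals by enforcing
    integrability in a Fourier basis (Frankot-Chellappa style sketch).

    normals: (H, W, 3) array of unit surface normals (nx, ny, nz) with nz > 0.
    Returns a height map defined up to an additive constant.
    """
    nx, ny, nz = normals[..., 0], normals[..., 1], normals[..., 2]
    nz = np.where(np.abs(nz) < 1e-6, 1e-6, nz)    # guard against division by zero
    p = -nx / nz                                   # estimated dz/dx
    q = -ny / nz                                   # estimated dz/dy

    H, W = p.shape
    wx = np.fft.fftfreq(W) * 2 * np.pi             # angular spatial frequencies
    wy = np.fft.fftfreq(H) * 2 * np.pi
    u, v = np.meshgrid(wx, wy)

    P, Q = np.fft.fft2(p), np.fft.fft2(q)
    denom = u**2 + v**2
    denom[0, 0] = 1.0                              # avoid 0/0 at the DC term
    Z = (-1j * u * P - 1j * v * Q) / denom         # least-squares integrable fit
    Z[0, 0] = 0.0                                  # mean height is arbitrary
    return np.real(np.fft.ifft2(Z))
```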
Fig. 9. A reconstruction of the marked interior region (a) of a ceramic figurine. Figures (b) and (c) show the depth map and the quiver plot of the normal field. The low resolution of the depth map is caused by the 11 × 11 window used in the depth search, but this does not affect the accuracy of the estimated normal field. Again, 18 reciprocal image pairs were used.
Fig. 10. A reconstruction for the face of a plastic doll (a). Figures (b) and (c) show the depth map and the surface normals. 18 reciprocal image pairs and a 9 × 9 window were used.
Fig. 11. The surface that results from integrating the normal field shown in Fig. 7c. A vertex for every third surface point is shown, and the surface is rotated for clarity.
Fig. 12. Three views of the surface that results from integrating the normal field shown in Fig. 10c. To demonstrate the accuracy of the reconstruction, we have refrained from texture mapping the albedo onto the recovered surface, and a real image taken from each corresponding viewpoint is displayed. The specularities on the doll’s face clearly show that the surface is non-Lambertian.
5 Discussion
This paper presents a surface reconstruction method that does not make any assumptions about the BRDF of the observed surface. It combines the advantages of both conventional N-view stereopsis and photometric stereopsis in that it directly estimates both the depth and the field of surface normals of an object. In contrast to these methods, however, it can recover this geometric information for surfaces that have arbitrary, unknown, and possibly spatially varying BRDFs. The method is termed Helmholtz stereopsis, as it works by exploiting the physical principle known as Helmholtz reciprocity. Helmholtz stereopsis is a form of active multinocular stereo, and as such, it requires that multiple images be captured under controlled illumination. This paper presented an implementation of a simple wheel design that is capable of gathering these images in a controlled manner with a single camera and a simple approximation to a point light source. The results from this implementation demonstrate its ability to recover surface shape. The goal of this paper was to show empirically that the reciprocity condition satisfied by the BRDF can be exploited for surface reconstruction; there are a number of possibilities for future work. The rig that was used here was manually rotated and required a full geometric calibration a priori. One could imagine an automated system with a servo motor, a video camera replacing the digital still camera, and a self-calibration routine. In addition, although the reciprocal image pairs described in the paper contain information that can be used for locating depth discontinuities, this information was not explicitly used in our implementation. The correspondence between shadowed and half-occluded regions is nevertheless a powerful source of information, one that we plan to exploit in the future. Our model should be adapted to account for surface interreflections, and one could explore additional configurations for capturing Helmholtz pairs and a more direct method of combining depth and surface orientation information.
References
1. P. Belhumeur. A binocular stereo algorithm for reconstructing sloping, creased, and broken surfaces in the presence of half-occlusion. In Int. Conf. on Computer Vision, Berlin, Germany, 1993.
2. P. Belhumeur. A Bayesian approach to binocular stereopsis. Int. J. Computer Vision, 19, 1996.
3. P. Belhumeur and D. Mumford. A Bayesian treatment of the stereo correspondence problem using half-occluded regions. In Proc. IEEE Conf. on Comp. Vision and Patt. Recog., pages 506–512, 1992.
4. I. J. Cox, S. Hingorani, B. M. Maggs, and S. B. Rao. Stereo without disparity gradient smoothing: a Bayesian sensor fusion solution. In British Machine Vision Conference, pages 337–346, 1992.
5. F. Devernay and O. Faugeras. Computing differential properties of 3-D shapes from stereoscopic images without 3-D models. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 208–213, 1994.
6. O. Faugeras and R. Keriven. Variational principles, surface evolution, PDE's, level set methods, and the stereo problem. IEEE Trans. Image Processing, 7(3):336–344, 1998.
7. R. T. Frankot and R. Chellappa. A method for enforcing integrability in shape from shading algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-10(4):439–451, 1988.
8. D. Geiger, B. Ladendorf, and A. Yuille. Occlusions in binocular stereo. In Proc. European Conf. on Computer Vision, Santa Margherita Ligure, Italy, 1992.
9. H. Hayakawa. Photometric stereo under a light-source with arbitrary motion. J. Optical Society of America A, 11(11):3079–3089, Nov. 1994.
10. H. v. Helmholtz. Treatise on Physiological Optics. Dover (New York), 1925.
11. K. Ikeuchi. Determining surface orientations of specular surfaces by using the photometric stereo method. IEEE Trans. Pattern Anal. Mach. Intelligence, 3(6):661–669, 1981.
12. M. Langer and S. W. Zucker. Shape-from-shading on a cloudy day. J. Opt. Soc. Am., pages 467–478, 1994.
13. J. Lu and J. Little. Reflectance and shape from images using a collinear light source. Int. J. Computer Vision, 32(3):1–28, 1999.
14. S. Magda, T. Zickler, D. Kriegman, and P. Belhumeur. Beyond Lambert: Reconstructing surfaces with arbitrary BRDFs. In Int. Conf. on Computer Vision, 2001.
15. L. Matthies. Stereo vision for planetary rovers: Stochastic modeling to near real-time implementation. Int. Journal of Computer Vision, 8(1):71–91, July 1992.
16. S. Nayar, K. Ikeuchi, and T. Kanade. Determining shape and reflectance of hybrid surfaces by photometric sampling. IEEE J. of Robotics and Automation, 6(4):418–431, 1990.
17. F. Nicodemus, J. Richmond, J. Hsia, I. Ginsberg, and T. Limperis. Geometric considerations and nomenclature for reflectance. Monograph 160, National Bureau of Standards (US), 1977.
18. T. Poggio, V. Torre, and C. Koch. Computational vision and regularisation theory. Nature, 317:314–319, 1985.
19. W. Silver. Determining shape and reflectance using multiple images. Master's thesis, MIT, 1980.
20. H. Tagare and R. deFigueiredo. A theory of photometric stereo for a class of diffuse non-Lambertian surfaces. IEEE Trans. Pattern Anal. Mach. Intelligence, 13(2):133–152, February 1991.
21. R. Woodham. Analysing images of curved surfaces. Artificial Intelligence, 17:117–140, 1981.
Minimal Surfaces for Stereo
Chris Buehler (MIT/LCS), Steven J. Gortler (Harvard University), Michael F. Cohen (Microsoft Research), Leonard McMillan (MIT/LCS)
Abstract. Determining shape from stereo has often been posed as a global minimization problem. Once formulated, the minimization problems are then solved with a variety of algorithmic approaches. These approaches include techniques such as dynamic programming, min-cut, and alpha-expansion. In this paper we show how an algorithmic technique that constructs a discrete spatial minimal cost surface can be brought to bear on stereo global minimization problems. This problem can then be reduced to a single min-cut problem. We use this approach to solve a new global minimization problem that naturally arises when solving for three-camera (trinocular) stereo. Our formulation treats the three cameras symmetrically, while imposing a natural occlusion cost and uniqueness constraint.
1 Introduction
Determining shape from stereo has often been posed as a global minimization problem [2, 13]. In these formulations one solves for the shape that best predicts the observed image data and is consistent with a set of prior assumptions. These priors often penalize discontinuities and occlusions in the solutions. Once formulated, the minimization problems are then solved with a variety of algorithmic approaches. These approaches include techniques such as dynamic programming [8], min-cut [15, 11], and alpha-expansion [19]. In this paper we show how an algorithmic technique that constructs a discrete spatial minimal cost surface (which we will abbreviate MS for minimal surface) can be brought to bear on stereo global minimization problems. We first show that finding a minimal cost surface within a discrete spatial complex is a natural generalization of the problem of finding a minimal cost path in a planar graph. We then describe how the global MS can be found in polynomial time by solving for the min-cut over the dual graph of the input spatial complex. As an example of its utility, we show how the MS approach can be used to easily derive the min-cut graph problem used by Ishikawa and Geiger to solve a two-camera stereo problem [11]. We extend the MS approach to solve a new global minimization problem that naturally arises when solving for trinocular stereo. Our formulation treats the three cameras symmetrically, while imposing a natural occlusion cost and uniqueness constraint. This has not been achieved in previous algorithms utilizing more than two cameras. To aid our formulation, we also utilize trinocular rectification [1] in a novel way.
2 Previous Work
There have been many approaches in the literature to formulate stereo as a global minimization problem. Each approach makes slightly different choices as to the space of
Table 1. Comparison of algorithms according to our desired criteria. Notes: 1. chmr96 also describes a full image generalization using annealing methods. 2. rc98 scales each of the counted disparity jumps by the local matching cost. 3. kz01 also investigate a fully general smoothness penalty but find that it leads to NP-Complete subproblems.

                       gly95    chmr96        ig98    rc98      bvz98    kz01       MS
  scanline/fullImage   sl       sl (1)        fi      fi        fi       fi         fi
  number of cams       2        2             2       any       any      2          3
  uniqueness           y        y             y       n         n        y          y
  occlusion penalty    y        y             y       n         n        y          y
  monotonicity         y        y             y       n         n        n          y
  smoothness penalty   general  count+zigzag  count   count (2) general  zigzag (3) count
  polynomial time      y        y             y       y         n        n          y
allowed solutions and the functional minimized. These different formulations also give rise to different algorithmic solutions, some of which run in polynomial time and some of which are NP-Complete. To help put our algorithm in proper context we compare some of the highlights from the literature along a variety of axes (see Table 1).
Scanline vs. full image. Some algorithms operate on each scanline independently, while others treat the entire image as a single problem. Single scanline approaches include the work by [8, 9]. The advantage of the scanline approach is that the associated minimization problems can be solved exactly and efficiently using dynamic programming techniques. Generalizing dynamic programming beyond the single scanline has proven quite difficult, and many early “full image” formulations appealed to approximation methods (such as simulated annealing) [8]. Other full image formulations result in NP-Complete problems [20, 12]. The only full-image formulations that have exact and efficient solutions have used min-cut techniques [15, 11]. Scanline formulations originally solved using dynamic programming approaches could also have been solved using shortest-path approaches. While the concept of dynamic programming does not generalize well to higher dimensions, we show that the idea of shortest path does generalize to higher dimensions as a minimal cost surface problem, which can be solved efficiently using min-cut techniques. This is the basis for our full image approach.
Uniqueness. If each pixel represents a single point in space, then a uniqueness constraint should enforce that a pixel in one camera does not match more than one pixel in a different camera. Exceptions are sometimes made as in the so-called tilted-plane solution [11]. When more than two cameras are used, there may be data from a subset of cameras that support a solution with multiple surface elements on the line of sight of one pixel [16]. In our work we enforce the uniqueness constraint.
Occlusion. Many formulations penalize pixels in one camera that are unmatched with any pixel in other cameras, justified with Bayesian arguments as in [9, 3]. We use a similar occlusion penalty.
Monotonicity. Many of the two-camera approaches have included monotonicity constraints (i.e., features do not change their left-right ordering between the left and right images). Although most appropriate for small baseline configurations, monotonicity together with the uniqueness constraint facilitates the use of dynamic programming approaches. One notable exception is the work of [12] where the monotonicity constraint is lifted. Unfortunately, their very general formulation leads to an NP-Complete problem. In our work we show how monotonicity can be generalized naturally to the three-camera setting as part of an efficient MS solution.
Number of cameras. Many previous formulations restrict themselves to two cameras. Methods that use three or more cameras in general position can be expected to be more robust since they can exploit both horizontal and vertical parallax. Unfortunately, the multi-camera approaches [15, 20] have difficulty in penalizing occlusions and enforcing uniqueness in the input cameras. In other words, their solutions do not penalize unmatched pixels, and may allow for a pixel in one camera to match numerous pixels in other cameras. In our work we show how a two-camera formulation can be generalized naturally to three cameras in general position, while still penalizing occlusions and enforcing uniqueness.
Smoothness. Most previous formulations define a penalty for non-smoothness in the resulting shape, again justified with Bayesian arguments [9, 3]. The various formulations differ greatly in how non-smoothness is penalized. In most cases non-smoothness is measured by summing the absolute value of the disparity jumps between neighboring matches [11]. Given uniqueness and monotonicity constraints, this penalty is equivalent to an occlusion penalty; the only way to create a neighboring match with a disparity difference of k is to have k unmatched pixels (see Figure 1). Cox [8] also adds a “zig zag” penalty to favor solutions with a few large discontinuities over solutions with many small discontinuities. Some authors [9, 20] have proposed the use of completely general smoothness terms (for example the square of the disparity jumps between neighboring matches). Unfortunately, the use of completely general costs in full image formulations has led to problems with NP-Complete complexity. In the work of [12] the authors first propose the use of fully general costs, but find that not only is the resulting problem NP-Complete, but even their subproblems are too. They therefore settle on a cost that only measures the existence of discontinuities, not their magnitude. Our three camera algorithm uses a smoothness cost based on the absolute values of the disparity differences which are captured simultaneously with the occlusion penalty.
Algorithms and complexity. Solving these minimization problems leads to algorithms with varying complexity. Formulations using algorithms based on dynamic programming and min-cut have polynomial efficiency, while the most general formulations give rise to NP-Complete problems. It remains an open problem to fully understand where the boundary lies between efficiently solvable formulations and intractable ones. In this paper, we extend the bounds of solvable formulations by presenting polynomial time algorithms for minimal cost surfaces and show how they can be applied naturally to a three camera stereo problem.
Fig. 1. Primal graph: Black edges are possible occlusion/disparity-discontinuity edges (pink when selected). Blue edges are possible match edges (red when selected).
Minimal Cost Surfaces. Polynomial time algorithms for computing discrete spatial minimal cost surfaces appear to be absent from the computer science literature. In the crystallography literature, a construction almost identical to ours is described in [18]. One can interpret our minimal cost surface algorithm as an application of the voxel labeling algorithm of [17]. In their algorithm, one chooses one of two labels for each voxel in a spatial complex. A cost is used to describe the likelihood of a label being used for a specific voxel, and a smoothness cost is used to penalize neighboring voxels with differing labels. If one thinks of the two labels as in front of or in back of a separating surface, one can interpret the smoothness cost as a cost associated with the face separating two voxels. Our work has also been very influenced by the construction of [15], which informally suggests the relationship between min-cut and minimal cost surfaces.
3 Polynomial Time Minimal Cost Surfaces
In this section we describe the discrete spatial minimal cost surface problem (MS), which we will later use to solve our stereo formulation. Since MS is a generalization of the planar shortest path (PSP) problem, we begin by describing it first. Then we will show how to solve the PSP problem as a min-cut problem. Of course, since more efficient shortest path algorithms exist, this would not be considered a practical approach for shortest path. But the more efficient algorithms do not generalize to higher dimensions while the min-cut approach does.
3.1 Planar Shortest Path
Problem formulation. In the PSP problem, we are given as input an embedded planar graph consisting of vertices V, edges E, and faces F (a CW-complex). We require that
Fig. 2. Dual graph: Pink edges correspond to dual edges of the primal graph. Red edges are selected edges that cross the min-cut of this graph.
the complex has the disk-like topology, B². We associate with each of the edges e_i a positive cost c(e_i). Two special vertices on the exterior of the graph are labeled b_0 and b_1. We denote by S the set of curves made up of edges from E and whose boundary (zero-manifold) is made up of the vertices {b_0, b_1}. For each curve s ∈ S we define its cost as the sum of its edge costs, $C(s) = \sum_{e \in s} c(e)$. The shortest path problem is to find the curve s* ∈ S with minimal cost. There are many efficient algorithms for the PSP problem, as well as for shortest paths in non-planar graphs [7]. Figure 1 shows a particular PSP problem that will be used below to solve a single scanline stereo problem.
Oriented PSP. For our stereo formulation we will need to solve an oriented MS problem, so we describe here the oriented PSP problem. In an oriented PSP problem, we give each edge a direction (for example from right to left in Figure 1) and restrict solutions to directed paths. An equivalent representation that generalizes better to higher dimensions is to associate an orientation (normal vector) with each edge (e.g., with normals pointing towards the line connecting the camera centers in Figure 1), and require the solution path to have a consistent orientation.
Reduction to Min-cut. The PSP problem can be solved in polynomial time by reduction to min-cut. First we construct the dual graph of the planar graph (Figure 2). That is, we associate with each face f_i a dual vertex v̂_i. We also create a single source vertex r̂ and a single sink vertex k̂. With each non-exterior edge e_i bounding two faces f_j and f_k, we associate a dual edge ê_i that connects v̂_j and v̂_k. The two boundary vertices b_0 and b_1 partition the exterior edges into two sets which we call upper U and lower L. For each exterior edge e_i in U, we associate a dual edge ê_i which connects the sink k̂ with the single face that e_i bounds. For each exterior edge e_i in L, we associate a dual edge ê_i which connects the source r̂ with the single face that e_i bounds. We set the capacity of each dual edge using the cost of the associated primal edge, c(ê_i) = c(e_i).
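A minimal sketch of this reduction (not from the paper) is given below. It assumes the planar complex is supplied as a table of primal edges with their costs, the face(s) they bound, and an upper/lower label for exterior edges, and it uses the standard max-flow/min-cut routine from networkx.

```python
import networkx as nx

def psp_via_mincut(edges):
    """Planar shortest path by min-cut on the dual graph (sketch).

    edges: dict mapping a primal edge id to a record of the (assumed) form
           {"cost": c, "faces": (f,) or (f, g), "side": None | "U" | "L"},
           where "faces" lists the face(s) the edge bounds and "side" labels
           exterior edges as upper (U) or lower (L).
    Returns the ids of the primal edges dual to the min-cut, i.e. the path.
    """
    G = nx.DiGraph()
    SRC, SNK = "source", "sink"
    for eid, e in edges.items():
        if e["side"] == "U":            # exterior edge: dual edge to the sink
            u, v = e["faces"][0], SNK
        elif e["side"] == "L":          # exterior edge: dual edge to the source
            u, v = SRC, e["faces"][0]
        else:                           # interior edge joins its two faces
            u, v = e["faces"]
        # Unoriented case: both directions carry the primal edge cost.
        G.add_edge(u, v, capacity=e["cost"], primal=eid)
        G.add_edge(v, u, capacity=e["cost"], primal=eid)

    cut_value, (reachable, non_reachable) = nx.minimum_cut(G, SRC, SNK)
    return {d["primal"] for a, b, d in G.edges(data=True)
            if a in reachable and b in non_reachable}
```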
Def: A cut ŝ is a partition of the vertices v̂ into two sets R̂ and K̂ with r̂ ∈ R̂ and k̂ ∈ K̂.
Def: The cost of a cut is the sum of the capacities of the edges between R̂ and K̂.⁴
Theorem: The edges across the min-cut ŝ* are dual to the edges of the minimal cost path s*.
A formal proof is beyond the scope of this paper but will appear in a forthcoming technical report. Informally, the reason why this theorem is true is that a cut of the dual graph is a partition of the dual vertices into two sets. This cut then corresponds to a partition of the primal faces into two spatial regions.⁵ The dual edges across the cut correspond to the primal edges forming the boundary between these two spatial regions. Thus there is a natural duality between cuts and paths.
Directed cuts for oriented curves. Given an oriented spatial graph, an associated directed dual graph can be built. The min-cut of this directed dual graph will correspond to an oriented path in the oriented primal graph. The directed dual graph is created as follows. Given an oriented primal edge e_i with a normal specified by the input, we associate with it in the dual graph a pair of opposite facing directed edges. One edge, ê_i^+, in the direction of the specified normal, is assigned the capacity of the primal edge as before, c(ê_i^+) = c(e_i). A second edge, ê_i^-, in the direction opposite of the normal, is given an infinite capacity, c(ê_i^-) = ∞; this effectively eliminates it from possible min-cut solutions.⁶ Once again, the edges across the min-cut of this dual graph are dual to the minimal oriented path in the primal graph.
3.2 Minimal Cost Surface
The generalization of the PSP problem to higher dimensions is straightforward.⁷ In the discrete spatial minimal cost surface (MS) problem, we are given as input a CW-complex embedded in R³ consisting of vertices, edges, faces and cells. We require that the embedded complex has the ball-like topology, B³. Associated with each face f_i is a cost c(f_i). Also input is a closed curve B made up of a set of edges on the exterior of the complex. We denote by S the set of surfaces made up of faces from F and whose boundary (one-manifold) is the curve B. For each surface s ∈ S we define its cost as $C(s) = \sum_{f \in s} c(f)$. The problem is to find the surface s* with minimal cost. In this paper, we describe a polynomial time algorithm for solving this problem. In the oriented surface problem, each face has an associated normal given as input. The set of satisfactory surfaces must have a consistent set of normals.
⁴ When we deal with directed dual graphs for the oriented surface problem the cost will be the sum of the edges from R̂ to K̂.
⁵ In a formal proof, one needs to show that each of these two spatial regions is simply connected and is consistent with the boundaries b_0 and b_1. If there are zero cost edges in the primal graph, then there potentially can be a disconnected zero cost path loop in the minimal cost path. This is easy to detect and remove.
⁶ If one simply removed these edges instead, then one could end up with a min-cut solution whose associated primal edges do not form a valid path in the primal graph.
⁷ Unlike the minimal curve problem, which can be solved efficiently even for non-planar graphs, the minimal cost surface problem can only be solved efficiently for a complex that is embedded in three dimensional space. Without this restriction, the minimal cost surface problem becomes NP-Complete by reduction to min-cost planar triangulation [14].
The minimal cost surface can be found by reduction to min-cut. First we construct the dual graph of the spatial complex. That is, we associate with each cell c_i a dual vertex v̂_i. We also create a single source vertex r̂ and a single sink vertex k̂. With each non-exterior face f_i bounding two cells c_j and c_k, we associate a dual edge ê_i that connects v̂_j and v̂_k. The boundary edges B partition the exterior faces into parts which we call upper U and lower L. For each exterior face f_i in U, we associate a dual edge ê_i which connects the sink k̂ with the single cell that f_i bounds. For each exterior face f_i in L, we associate a dual edge ê_i which connects the source r̂ with the single cell that f_i bounds. We set the capacity of each dual edge equal to the cost of the associated primal face, c(ê_i) = c(f_i).
Theorem: The edges across the min-cut ŝ* are dual to the faces of the minimal cost surface s*.
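For the oriented version, the only change on the dual side is how capacities are assigned. The small helper below (a sketch in the same spirit as the earlier one, with an assumed cell_from/cell_to convention for the face normal) encodes the rule c(ê^+) = c(f), c(ê^-) = ∞ from Sec. 3.1.

```python
import networkx as nx

def add_oriented_face(G, face_id, cost, cell_from, cell_to):
    """Add the pair of directed dual edges for one oriented face.

    cell_from, cell_to: dual vertices (cells, or the source/sink for exterior
    faces); the face normal is taken to point from cell_from towards cell_to,
    which is an assumed convention for this sketch.
    """
    # Crossing the face along its normal costs c(f).
    G.add_edge(cell_from, cell_to, capacity=cost, primal=face_id)
    # Crossing against the normal is forbidden: omitting the capacity
    # attribute makes networkx treat this arc as having infinite capacity,
    # so it never appears in a finite min-cut.
    G.add_edge(cell_to, cell_from, primal=face_id)
```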
4 Two-Camera Stereo by Minimal Cost Surface
The minimal cost surface formulations can be used to solve a number of minimization problems that arise in stereo vision. We will describe a sequence of three problems. We start with a scanline formulation for two camera stereo just to establish the basic idea. Then we will describe a full image formulation for two camera stereo. This MS problem will reduce to a min-cut problem quite similar to that of Ishikawa and Geiger [11]. Finally we describe a novel three camera stereo formulation.
4.1 Two Camera Single Scanline
For a “single scanline two camera” problem, we can set up the following formulation. A stereo solution is described by a set of matching pairs of pixels from the left and right cameras. We enforce uniqueness; a pixel can be matched at most once. We also enforce a monotonicity/ordering constraint on these matches. A match between a left and right pixel costs us some amount based on the inverse of the likelihood that the two pixels are images of the same point in space. For example, we can use the squared difference of their colors or a more complex metric that examines correlations of larger windows. An unmatched pixel costs us some fixed amount (this cost may be derived using Bayesian methods [9, 3], and can even be altered based on the “edginess” of that image region). We define two matches to be neighbors if either their left or right pixels are adjacent. We impose a smoothness penalty on two neighboring matches with differing disparities that are both in the solution, which is simply the absolute value of the difference between the two posited disparities. We can solve for the global minimum of this problem by solving for the shortest path in the planar graph shown in Figure 1. Each quadrilateral grid cell represents the intersection of the projective extrusions of two pixels. (We think of each pixel as an interval in its scanline, not as a point.) We bisect each quadrilateral with a cross edge (shown in blue) projecting to exactly a one pixel-interval in each of the two images. These “match edges” represent possible pixel matches. The cost of these edges is based on the observed pixel differences. The black edges, created by projecting the pixels’ boundaries, project into pixel-intervals in one image and no pixels (an interpixel point) in the other image. These edges represent changes in disparity between matches and
Fig. 3. The intersection of the extrusions of one rectified pixel from each of two cameras forms a cuboidal voxel. The blue quadrilateral represents a possible match surface. All outer faces of the voxel represent possible disparity discontinuities.
thus carry the cost of the smoothness penalty. They also represent an unmatched or occluded pixel. With all edges, we associate normals that face the line connecting the two camera centers. If we impose both the uniqueness and monotonicity constraint, then a valid stereo solution corresponds to an oriented path in this graph between the outer boundary vertices b_0 and b_1. The cost of the path will be the sum of the match costs plus the smoothness cost (the count of the number of occluded (pink) edges chosen). This shortest path problem can be solved using Dijkstra’s algorithm, but for explanatory reasons, we will reduce it to a min-cut problem. We do this by building the dual graph, adding a source and sink vertex as shown in Figure 2, and then solving for the min-cut. One should note the similarity of this graph with Figure 3 from [11]. The main reason for the differences between this graph and that of [11] arises from the inclusion of tilted plane (non-unique matching) solutions allowed in their formulation.
4.2 Two Camera Full Image
We can use the minimal cost surface construction described above to solve for a two camera full image problem. We build a spatial complex for the minimal cost surface problem which is effectively a 3D extrusion of the planar graph of Figure 1. In this construction, the costs used in this problem are the same as the single scanline formulation above, except that vertical neighbors are also considered for the smoothness penalty. More specifically, we associate pixel values with quadrilateral pixel regions of their image planes (not single points). These pixel regions are projectively extruded to create a 3D voxel grid. Each of these voxels is bisected with a match quadrilateral drawn in blue in Figure 3. The blue quadrilaterals represent matches and are priced according to the color differences of the two associated pixels. The four vertical faces are priced to represent occlusion and smoothness costs. The two horizontal faces are priced to represent the smoothness costs (the horizontal faces on the upper and lower exterior are given zero cost, since there are no neighboring pixels to measure smoothness). We define as the boundary B the leftmost edge of the complex, the rightmost edge of the complex, and any arbitrary connecting paths on the upper and lower exteriors of the complex (these can be arbitrary since the upper and lower exterior faces have zero cost).
Fig. 4. Pairwise rectification of three images. Rectified quadrilateral pixels are divided into two triangular (inner and outer) subpixels.
Once again, we solve this problem by solving for a min-cut of the appropriate dual graph. This graph again has strong similarities to the graph in Figure 5a of [11].
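For reference, the single-scanline formulation of Sec. 4.1 can be written down directly as a shortest-path computation. The sketch below is illustrative only (not the authors' code): it uses a plain squared intensity difference as the match cost and a single constant for the occlusion/disparity-change edges, both simplifying assumptions.

```python
import networkx as nx

def scanline_stereo(left_row, right_row, occlusion_cost):
    """Single-scanline two-camera stereo as a shortest path (cf. Sec. 4.1).

    left_row, right_row: 1-D sequences of pixel intensities on one epipolar
    line. Diagonal edges are match edges (squared intensity difference);
    horizontal/vertical edges are occlusion / disparity-change edges with a
    fixed cost. Returns the list of matched index pairs (i, j).
    """
    n, m = len(left_row), len(right_row)
    G = nx.DiGraph()
    for i in range(n + 1):
        for j in range(m + 1):
            if i < n and j < m:          # match edge for pixels (i, j)
                w = (float(left_row[i]) - float(right_row[j])) ** 2
                G.add_edge((i, j), (i + 1, j + 1), weight=w, match=(i, j))
            if i < n:                    # left pixel i left unmatched
                G.add_edge((i, j), (i + 1, j), weight=occlusion_cost, match=None)
            if j < m:                    # right pixel j left unmatched
                G.add_edge((i, j), (i, j + 1), weight=occlusion_cost, match=None)

    # Uniqueness and monotonicity hold by construction: all edges move forward.
    path = nx.dijkstra_path(G, (0, 0), (n, m), weight="weight")
    return [G.edges[a, b]["match"] for a, b in zip(path, path[1:])
            if G.edges[a, b]["match"] is not None]
```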
5 Three-Camera Stereo by Minimal Cost Surface
In this section we show how a minimal cost surface problem can be used to solve a three camera, full image formulation. We do not assume any special relationship between the cameras, nor choose any one to serve as a base, but rather treat them symmetrically. To help keep the bookkeeping tractable, we first perform trinocular rectification [1] of the images (see Figure 4). To accomplish this, we find the trifocal plane passing through the three camera centers and reproject the pictures onto an image plane parallel to this trifocal plane. Between each pair of images, there is a set of epipolar lines parallel to the line connecting those two cameras’ centers. This defines two sets of lines in each image, resulting in a quadrilateral pixel grid on each image. Each quadrilateral pixel is then bisected to create two triangular subpixels, an inner (pink) and outer (blue) triangle as shown in Figure 4. We then resample the images with one color value per triangle. The reason for this bisection will be described below. This rectification does not introduce a large amount of distortion when the cameras are configured to be roughly facing in the direction perpendicular to the trifocal plane; but degenerate configurations are possible, such as when the cameras face towards the trifocal plane. This rectification has the following nice property: if one takes one quadrilateral pixel from each of the three images, and projectively extrudes these pixels into three-space, and then takes the intersection of these three solids, this intersection will either consist of the empty set or it will consist of a cuboidal voxel (see Figure 5). This voxel has 6 quadrilateral faces. Each of these quadrilateral faces projects back as a quadrilateral pixel in one image, and as an edge between two quadrilaterals in the other two images. This voxel grid forms the basis for our minimal cost surface problem.
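The geometric ingredients of this rectification are straightforward to compute. The sketch below (my own illustration) derives the trifocal-plane normal and the three pairwise baseline directions from the camera centers, which is the starting point of the reprojection; building the actual rectifying homographies additionally requires the camera calibration and resampling, which are omitted, and the assignment of C1, C2, C3 to the left/right/top cameras is an assumption.

```python
import numpy as np

def trinocular_rectification_frame(C1, C2, C3):
    """Geometry underlying trinocular rectification (sketch).

    C1, C2, C3: 3-vectors with the three camera centers.
    Returns the unit normal of the trifocal plane (the plane through the three
    centers) and the three unit baseline directions; in the rectified images
    the epipolar lines between a camera pair are parallel to that pair's
    baseline direction.
    """
    C1, C2, C3 = (np.asarray(c, dtype=float) for c in (C1, C2, C3))
    n = np.cross(C2 - C1, C3 - C1)
    n /= np.linalg.norm(n)                 # trifocal-plane normal

    def unit(v):
        return v / np.linalg.norm(v)

    baselines = {"left-right": unit(C2 - C1),
                 "right-top": unit(C3 - C2),
                 "top-left": unit(C1 - C3)}
    return n, baselines
```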
Fig. 5. The intersection of the projection of three rectified pixels forms a cuboidal voxel. The pink triangular face projects in the three images as inner triangular subpixels forming a possible front match surface. The blue triangular face projects in the three images as outer triangular subpixels forming a possible back match surface. All the external triangular faces of the cuboid are possible occlusion surfaces.
5.1 Three-Way Match Surfaces Only
In the simplest three camera formulation, we introduce match surfaces that are seen in subpixels of all three cameras. In other words, we will not consider matches between only two of the three cameras. In the scanline and two-camera full image formulations, we introduced bisecting edges and faces respectively into the graph which projected as pixels in both images. In this formulation we introduce two bisecting triangle faces into each voxel (see Figure 5), a front and rear triangle face. A front triangle face has the property that it projects to inner triangle subpixels in all three images. The rear triangle face projects to three outer triangular subpixels. These front and rear triangle faces trisect the voxel into three cells we call front, middle and back (see Figure 6). The edges of these two triangle faces bisect the 6 quadrilateral faces of the voxel into 12 triangular faces. The front and rear triangle faces in each voxel represent possible simultaneous three-camera-matches. The match cost for each triangle face is based on the color differences of the three triangular subpixels that it projects to. The outer faces of a voxel represent unmatched subpixels, and are assigned an occlusion cost. Just like in the single scanline formulation, this occlusion cost also accounts for a smoothness term, since disparity changes can only occur along with occlusions. As in the two camera case, in the three camera formulation we enforce a monotonicity constraint that states that 3 features seen in all three images must have the same clockwise ordering. Enforcing the uniqueness and monotonicity constraints implies that a valid stereo solution can be represented as a single oriented surface in this complex. The orientation of all these 14 triangle faces in a voxel is given by the normal that points towards the trifocal plane. This stereo formulation can be solved as a minimal cost surface problem over the complex we have described. To derive an efficient solution for the minimal cost surface,
Fig. 6. The front and back match triangles subdivide the voxel into three cells. In the dual graph, each cell has an associated vertex.
we once again construct the dual graph to this spatial complex. We create one vertex dual to each volumetric region. In particular, there are three vertices dual to the three cells within each cuboidal voxel; one vertex dual to the tetrahedron in front of the front triangle, one dual to the tetrahedron behind the back triangle, and one dual to the middle region bounded by the two triangles (see Figure 6). These three dual vertices are connected by dual edges: front-middle and middle-back. Dual vertices from neighboring voxels are connected with dual edges piercing each of the 12 outer triangular faces of the voxel. The capacity of each edge is set to the cost associated with the face it crosses. The two edges within a voxel (pink and blue in Figure 6) represent the two possible matches. The 12 other (black) edges represent occlusions.
5.2 Adding Two-Way Match Surfaces
In a solution to the “three-way matches only” formulation, a (triangular) subpixel in one image either matches exactly one pixel in both of the other two images, or matches
Fig. 7. The middle cell of Figure 6 is divided up into 8 subcells using 3 planes.
Fig. 8. Left: The middle cell of a cuboidal voxel in the two-way formulation. Right: The dual graph of a cuboidal voxel in this formulation.
no pixels in either of the two other images. By creating a more complicated spatial complex, we can allow the solution to also include surface elements that project with nonzero area to only two out of the three images. This is done as follows. The front and back cells of the cuboidal voxel in Figure 6 remain intact. We then introduce three new diagonal (2-camera) planes dividing the middle cell into 8 tetrahedra as shown in Figure 7. Each of the three introduced planes has been split up into 4 triangular faces by the other two planes. Each of these triangular faces projects as one half of an inner or outer triangular subpixel in two cameras (see Figure 8). Thus, in our rectified images, we now consider each original quadrilateral pixel as being decomposed into four triangular subpixels. Weights for these faces are described in the discussion section below. The minimal cost surface is found by reduction to min-cut. The dual graph of one entire cuboid is visualized on the right side of Figure 8.
5.3 Bounds
In the three rectified images (see Figure 4) we index pixels as i, j (left to right, bottom to top) in camera left, j, k in camera right (bottom to top, right to left) and k, i in camera top (right to left, top to bottom). Cuboidal voxels are then indexed as i, j, k. Not all possible voxels will fall completely in the three fields of view. Those outside are discarded. The user can also set minimal and maximal allowed (positive) disparities. The disparity range can be conservatively enforced by testing each voxel using appropriate affine expressions of i + j + k. Voxels not satisfying these constraints are also discarded. The remaining voxels make up our volume of interest. A boundary curve B must be chosen to partition the external faces into two sets L and U. The dual external edges will connect to the source and the sink respectively. We generalize the boundary selection used in the single scanline formulation to select
Fig. 9. Rectification. Left: three images as seen on trifocal plane. Right: resampled and indexed by i,j,k.
the boundary here (see Figure 1). We first find the plane parallel to the trifocal plane which has the maximal area intersection with the volume. All external faces that are completely in front of this plane are placed in L, while the others are placed in U. This maximal plane test allows us to include the most possible three-camera matches in the solution.
5.4 Complexity
If there are n pixels in each of the three images and we restrict our solutions to d possible disparity levels, then the spatial complex has nd voxels. With two-way faces included, this results in 10nd cells and 20nd faces; the dual graph has correspondingly many vertices and edges. Using known flow algorithms [5], this leads to a worst case time complexity of O((nd)² log(nd)).
6 Implementation and Results
We have coded our minimal cost surface stereo algorithm and tested it on a number of datasets. An example of our rectification is shown in Figure 9. On the left are the three original images. On the right we show the resampled trinocularly rectified images. Common features in any pair of images must agree on any shared coordinates (i, j, or k). In our implementation, we calculate only a single color for each quadrilateral pixel; we do not compute separate values for the triangular subpixels. We used the publicly available PRF library to solve our min-cut problems [6]. To calculate color similarities for the match edge costs, we used the method of [4]; use of this method greatly increased the quality of the results. The occlusion weights were set by a user-controlled parameter. The algorithm was run with two-way matches included,
Fig. 10. Comparison of this algorithm to rc98 and kz01 on two datasets (image columns, left to right: left image, rc98, kz01, MS). Unmatched pixels drawn in red. Note: the gray levels between the different algorithms are not calibrated.
on a 1GHz Pentium III, running Windows. Execution times were on the order of 1-5 minutes. Figure 10 shows our results and compares them to those obtained using two other algorithms. The datasets used were the Tsukuba dataset, and the toys data set from [10]. The Tsukuba dataset was cropped to 256 by 256 pixels to alleviate physical memory limitations. The toy dataset images are 310 by 210 pixels. In both datasets, we used the minimum baseline spacing giving us roughly 10-12 disparity levels. We used these data as input for the publicly available implementations of rc98 [15] and kz01 [12]. As input to our algorithm we used triplets of images in an “L” configuration with both vertical and horizontal parallax. As input to the other two codes, we used a left-right stereo pair. We adjusted their parameters attempting to produce the best results we could. In Figure 10, we show the (lower) left image of the input. The outputs are shown as grayscale disparity maps. Unmatched pixels are drawn in red. In our algorithm a pixel is made up of 4 subpixels; for simplicity, we colored the pixel by the disparity of the frontmost three-way matching subpixel. Pixels that had only occlusion or two-way faces were drawn in red. The MS algorithm appeared to give better results than rc98 for both datasets. MS seemed to do about the same as kz01 on the Tsukuba dataset and slightly better than kz01 on the toys data set (especially on the receding plane of the table top).
7 Discussion
In this paper we have described an efficient method for solving for minimal cost surfaces in a 3D spatial structure. We have shown how this can be applied to a novel three camera stereo formulation. Optimally, one could use the two-way faces to model surfaces that are seen in two views but occluded in a third view. Unfortunately, an oriented surface in our complex cannot contain any regions with depth complexity greater than one as seen from any of
the three views. Thus an oriented surface cannot represent such partially hidden layers. Therefore, we use the two-way faces not to represent matches, but as another way of representing occlusions; we price them as occlusions, and display these faces as red in the output. In practice we found that the inclusion of these faces greatly improved the results simply because they allow the minimal cost surfaces to have more directional flexibility, and provide a more accurate measurement of the occluded “surface area”. In future work we would like to explore algorithms that enforce uniqueness, but allow for true two camera matches at depths that are occluded in a third view. Future work also includes gaining a better understanding of where the boundary between NP-Complete and polynomially solvable stereo problems lies.
References
1. N. Ayache and C. Hansen. Rectification of images for binocular and trinocular stereovision. Proc. International Conference on Pattern Recognition, 1988, pages 11–16.
2. H. Baker. Depth from edge and intensity based stereo. PhD thesis, University of Illinois at Urbana-Champaign, 1981.
3. P. Belhumeur and D. Mumford. A Bayesian treatment of the stereo correspondence problem using half-occluded regions. IEEE CVPR '92.
4. S. Birchfield and C. Tomasi. A pixel dissimilarity measure that is insensitive to image sampling. IEEE PAMI, 20(4):401–406, 1998.
5. C. Chekuri, A. Goldberg, D. Karger, M. Levine, and C. Stein. Experimental study of minimum cut algorithms. Proc. ACM SODA, 1997.
6. B. Cherkassky and A. Goldberg. PRF library from www.star-lba.com/goldberg/soft.html.
7. T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. MIT Press, 2001.
8. I. Cox, S. Hingorani, B. Maggs, and S. Rao. A maximum likelihood stereo algorithm. Computer Vision and Image Understanding, 63(3), 1996.
9. D. Geiger, B. Ladendorf, and A. Yuille. Occlusion and binocular stereo. IJCV, 14:211–226, 1995.
10. Aaron Isaksen. Toys dataset, by request.
11. H. Ishikawa and D. Geiger. Occlusions, discontinuities, and epipolar lines in stereo. Proc. ECCV 98.
12. V. Kolmogorov and R. Zabih. Computing visual correspondence with occlusions using graph cuts. Proc. ICCV 2001.
13. Y. Ohta and T. Kanade. Stereo by intra and inter scanline search using dynamic programming. IEEE PAMI, 7(2):139–154, 1985.
14. J. O'Rourke. Computational Geometry in C. Cambridge University Press, 1994.
15. S. Roy and I. Cox. A maximum-flow formulation of the n-camera stereo correspondence problem. Proc. ICCV 1998.
16. S. Seitz and C. Dyer. Photorealistic scene reconstruction by voxel coloring. IJCV, 35(2):151–173, 1999.
17. D. Snow, P. Viola, and R. Zabih. Exact voxel occupancy with graph cuts. Proc. IEEE CVPR 2000.
18. J. Sullivan. A Crystalline Approximation Theorem for Hypersurfaces. PhD thesis, Dept. of Mathematics, Princeton University, 1990.
19. Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. Proc. ICCV 1999.
20. Y. Boykov, O. Veksler, and R. Zabih. Markov random fields with efficient approximations. Proc. IEEE CVPR 1998.
Finding the Largest Unambiguous Component of Stereo Matching
Radim Šára
Center for Machine Perception, Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University, Technická 2, 166 27 Prague, Czech Republic
[email protected]
Abstract. Stereo matching is an ill-posed problem for at least two principal reasons: (1) because of the random nature of match similarity measure and (2) because of structural ambiguity due to repetitive patterns. Both ambiguities require the problem to be posed in the regularization framework. Continuity is a natural choice for a prior model. But this model may fail in low signal-to-noise ratio regions. The resulting artefacts may then completely spoil the subsequent visual task. A question arises whether one could (1) find the unambiguous component of matching and, simultaneously, (2) identify the ambiguous component of the solution and then, optionally, (3) regularize the task for the ambiguous component only. Some authors have already taken this view. In this paper we define a new stability property which is a condition a set of matches must satisfy to be considered unambiguous at a given confidence level. It turns out that for a given matching problem this set is (1) unique and (2) it is already a matching. We give a fast algorithm that is able to find the largest stable matching. The algorithm is then used to show on real scenes that the unambiguous component is quite dense (10–80%) and error-free (total error rate of 0.3–1.4%), both depending on the confidence level chosen.
1 Introduction
The problem of computational stereopsis has been studied in the computer vision field for about four decades [1]. From the very beginning it was clear that binocular stereo matching is an ill-posed problem, since, except in special cases, the solution to a given matching problem is not unique. Not much effort has been spent on understanding the inherent ambiguity of stereopsis, except for the work on the visual hull [2] and the recent work of Baker, Sim, and Kanade [3]. Our working hypothesis is that stereoscopic ambiguity originates from three independent factors: (1) Data uncertainty due to insufficient signal-to-noise ratio (SNR) in images of weakly textured objects; (2) Structural ambiguity due to the presence of periodic structures combined with the inability of a small number of cameras to capture the entire light field; (3) Task formulation: If the stereoscopic vision goal is posed as scene reconstruction from image projections the problem
is that the same image set can be generated by infinitely many scenes even if the above two factors do not manifest themselves [4,2]. The Class 1 ambiguity is the most serious one because it is never completely avoidable. Data uncertainty can be partly reduced by using better cameras or more discriminable match quality measures [5]. It cannot be suppressed by using more cameras [3]. It has long been observed experimentally that more cameras can reduce Class 2 ambiguity significantly [6,7]. To our knowledge the necessary and sufficient conditions for eliminating this ambiguity have not yet been stated. The Class 3 ambiguity was studied in the work on the photo-hull [2]. Images from any number of cameras are photo-consistent with uncountably many scenes. One avoids the Class 3 ambiguity by separating the stereo matching problem from the scene reconstruction problem. The stereo matching itself, which is nothing but establishing the correspondence mapping between two (or among more) images, is not ill-posed in this sense. The Class 1 and 2 ambiguities require a prior model to regularize the stereoscopic matching problem. Usually, some form of uniform continuous prior is imposed. The early solutions based on the heuristic of cohesiveness [8] culminated in the classical algorithm by Pollard, Mayhew and Frisby [9]. Although some interest remained [10], many authors departed from this path and tried to pose the problem in the framework of statistical estimation theory with a probabilistic continuity prior. The ML estimators of the early 1990s, based on dynamic programming, were later put on a more sound basis by [11,12] (assuming continuity along epipolar lines) and by others [13,14,15] (assuming isotropic continuity prior). Recent disparity component matching [16] and network flow [17,18,19] formulations of the matching task also include an isotropic prior model but their computational complexity is lower. Since the most important cause of ambiguity is poor SNR, one has to be very careful in selecting a proper regularizer. Consider a depth discontinuity under low signal-to-noise ratio (from at least one side of the discontinuity in the image). Strong artefacts may occur if a continuity prior is used, see, for instance, the MAP results in Fig. 3, where the thin stripes are not recovered under low SNR. The prior model becomes locally stronger than the constraints given by the input data, which results in an interpolation effect within the boundaries surrounding the weakly-textured region. But such interpolation is quite often erroneous, especially in scenes of deep range and relatively thin objects. Various attempts to switch the regularizer off at discontinuities rely either on detecting image edges under the assumption that they coincide with object boundaries [20,21] or on weighting the regularizer by image gradient [13]. This coincidence does not always occur. Alternatively, they rely on identifying high-tension regions in the regularized solution [22], releasing the continuity prior there, and re-computing the solution. Yet another alternative is to use robust regularizers [15]. None of those methods proved fully satisfactory, mainly because they do not work where they are supposed to: under simultaneous occurrence of Class 1 and Class 2 ambiguity.
It seems a possible solution would be to reject ambiguous correspondences. If ambiguities of the first two classes are present we may want to pose the problem as finding the unambiguous component of the matching. This has been successfully attempted by Manduchi and Tomasi [23]. They were able to automatically find (Class 2) unambiguous points and match them. The resulting disparity maps are rather sparse. But it is clear that the aim cannot be at dense matching. We therefore pose stereo matching as a two-part task:
P1. Establishing the largest possible unambiguous matching.
P2. Partitioning the images into three components: 1. a binocularly visible region where reliable matches can be found, 2. a monocularly visible region due to identified occlusions, 3. an ambiguous region where selecting the unique solution to the matching problem requires regularization.
In this paper we will be interested in a matching identifying the largest possible image subset as unambiguous at a given confidence level. This is where we differ from [23]. As will be seen in Sec. 4 there are natural scenes where the component is quite large. Is there any principle that could be used to find the largest unambiguous component of stereoscopic matching in a simple way? Let us first develop a bit of insight on a simple case: suppose (1) there is a signature assigned to each pixel in input images, (2) the signatures of corresponding points are equal, and (3) all pixel signatures are unique. Under such restricted conditions the (actual) matches dominate all alternative matches in their signature similarity measure. It is then easy to find such dominant pairs. We call the result the dominant matching. An example of a dominant matching algorithm is the popular ‘symmetric maximum method.’ In the noise-free case dominant matching leaves only half-occluded regions unmatched. If signatures become corrupted by random noise (which is closer to reality), dominant matching fails in a peculiar way. Under Class 1 ambiguity, the signature similarity measure can be thought of as a random variable. Some correct matches receive lower similarity and some other candidate matches receive higher similarity. Since there are usually many competitors to a given correct match, the probability of a false match becoming dominant is low compared to the probability of a correct match ceasing to be dominant. The result is that under uncertainty due to noise the dominant matching becomes rather sparse but the component of the matching that survives is exceptionally error-free [24]. The sparsity is due to the strength of the dominance condition. Fortunately, dominant matching turns out to be a special case of stable matching which was introduced in [24]. General stable matching has a greater potential to be dense. Roughly speaking, matching M is stable if every match p ∈ M has no competitor at all or if each of its competitors q (q ∉ M) has a competitor r ∈ M (cf. Fig. 1). Match q is a competitor to a match p if they cannot both be members of the same matching (due to uniqueness and/or ordering constraints, for instance) and the signature similarity of q is greater than or equal to that of p [24].
Obviously, the q would be a better candidate to be matched if there were no other constraints than the value of its signature similarity alone. As will be seen later in this paper, stability is a global property of the matched set and stable matching can be computed by a non-iterative algorithm.
Fig. 1. Let M = {p, r} be a matching (full circles) on a 2 × 2 matching problem. Match p = (i, j) of similarity 0.8 has a competitor q = (i, l) of similarity 0.9. The q has a competitor r = (k, l) of similarity 1.0. The pair p is thus stable in M (its competitor has a competitor from M ). The r is trivially stable since it has no competitor.
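The check illustrated in Fig. 1 is mechanical; the toy sketch below (illustrative only, not from the paper) encodes the 2 × 2 similarity table and verifies that M = {p, r} is stable, taking competitors to be pairs that share a row or column (i.e., only the uniqueness constraint) and have similarity at least as high.

```python
def competitors(pair, similarity, threshold):
    """Pairs sharing a row or column with `pair` whose similarity is >= threshold."""
    i, j = pair
    return [(k, l) for (k, l), c in similarity.items()
            if (k, l) != (i, j) and (k == i or l == j) and c >= threshold]

def is_stable(M, similarity):
    """Every competitor of a match in M must itself have a competitor in M."""
    for p in M:
        for q in competitors(p, similarity, similarity[p]):
            if q in M:
                continue                      # excluded by uniqueness anyway
            if not any(r in M for r in competitors(q, similarity, similarity[q])):
                return False
    return True

# The 2 x 2 example of Fig. 1: rows i, k; columns j, l.
similarity = {("i", "j"): 0.8, ("i", "l"): 0.9,
              ("k", "j"): 0.1, ("k", "l"): 1.0}
M = {("i", "j"), ("k", "l")}                  # p = (i, j), r = (k, l)
print(is_stable(M, similarity))               # True: q = (i, l) has competitor r in M
```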
After only slight changes in the definition (concerning the sharpness of the inequalities) stable matching is able to reject Class 2 ambiguities in the noise-free case. Unfortunately, it is unable to reject Class 1 ambiguities and Class 2 ambiguities under noise. The reason is two-fold: (1) stable matching as introduced above is always complete,1 (2) the inequalities in the definition are sensitive to even small noise. In Sec. 3 we formalize the problem of stable matching in a way that there is a margin allowed in the inequalities, which reduces the noise sensitivity. The margin will be computed from confidence intervals for signature similarity measure for each potential pair. We call the result Confidently Stable Matching. It is complete only in the case of no ambiguity in input data and is capable of rejecting both Class 1 and Class 2 ambiguities under noise, which is our ultimate goal. Some experimental results will be shown in Sec. 4.
2 Preliminaries
For simplicity we assume throughout this paper that we are solving the calibrated stereo problem in which epipolar lines coincide with image lines. This simplifies the definition of the ordering constraint and keeps the definitions simple and intuitive. We will assume matching is done on individual epipolar lines independently; as will be seen in the experimental section, this does not create any serious artifacts. Only the binocular matching problem is discussed here, as it is the most difficult case in stereoscopic vision. A generalization to polynocular stereo is possible but requires a slightly more general theory. We first need to re-formulate and/or generalize several known concepts so that they better serve our definition of stability and its properties. Let I, J = {1, 2, . . . , N} be two sets indexing pixels on the left-image and the right-image epipolar lines, respectively, and let P ⊆ I × J.
Footnote 1: A complete matching explains all pixels as either matched or occluded; see later for a precise definition.
The element p = (i, j) ∈ P will be called a (candidate) pair. Let c be the signature similarity measure assigning to each pair p a real number c(p) (see Footnote 2). We will visualize P as a matching table, as shown in Fig. 2, in which each crossing (circled or not) represents a pair from I × J. Note that, in the case of sparse matching problems, I × J may or may not be fully populated by P.

Fig. 2. A pair p in the matching table P (left), the X-zone X(p) (center, empty circles), and the F-zone F(p) (right).
The X-zone X(p) ⊆ P of a pair p = (i, j) ∈ P (see Footnote 3) consists of all pairs q ∈ P with the same row index i or column index j, except for the pair p itself, as shown in Fig. 2 (center). Similarly, the F-zone F(p) ⊆ P of a pair p ∈ P (see Footnote 4) is formed by the two opposite quadrants around (i, j), as shown in Fig. 2 (right). The union of X(p) and F(p) will be called the FX-zone, FX(p) = X(p) ∪ F(p). A bipartite matching M is a subset of P in which each i ∈ I and each j ∈ J is represented at most once. For the sake of brevity, we omit the word ‘bipartite’ in the following text. A subset M ⊆ P is a matching iff for each p, q ∈ M it holds that p ∉ X(q) (this is the so-called ‘uniqueness constraint’ [8]). The cardinality of a matching M is denoted by |M|. A maximum cardinality matching has the greatest number of pairs possible. A matching M is monotonic iff for each two pairs p, q ∈ M, p = (i, j), q = (k, l) such that k > i it holds that l > j. A subset M ⊆ P is monotonic iff for each p, q ∈ M it holds that p ∉ F(q) (this is the so-called ‘ordering constraint’ [26]).

Let us, for a moment, consider general subsets M of P (not necessarily matchings). A pair p ∈ P has an inhibition zone Z(p) ⊆ P if, whenever p ∈ M, no pair q ∈ Z(p) may be in M. Two pairs p, q such that p ∈ Z(q) will be called discordant. For instance, the X-zone is the inhibition zone for matchings and the F-zone is the inhibition zone for monotonic subsets of P; see also Observation 1 later. If p ∈ Z(q) holds iff q ∈ Z(p), the discordance relation is symmetric. In this paper we will assume that the inhibition zone Z always generates a symmetric discordance relation; for short, we will say the zone is symmetric. The X-zone and the F-zone shown in Fig. 2 are both symmetric.
Footnote 2: The theory is independent of the choice of this measure; you may safely imagine we talk about normalized cross-correlation computed over small pixel neighborhoods.
Footnote 3: The name of the X-zone was chosen because of its shape.
Footnote 4: The F-zone stands for forbidden zone [25].
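The two zones are easy to express directly from the definitions above. The following sketch is illustrative only (hypothetical helper predicates, not code from the paper); X(p) encodes the uniqueness constraint and F(p) the ordering constraint.

def in_X_zone(p, q):
    """q lies in X(p): same row or same column, q != p (uniqueness)."""
    (i, j), (k, l) = p, q
    return q != p and (k == i or l == j)

def in_F_zone(p, q):
    """q lies in F(p): the two opposite quadrants around p (ordering),
    i.e. row and column indices change in opposite directions."""
    (i, j), (k, l) = p, q
    return (k - i) * (l - j) < 0

def in_FX_zone(p, q):
    return in_X_zone(p, q) or in_F_zone(p, q)

# A subset M is a monotonic matching iff no two of its pairs are in
# each other's FX-zone.
def is_monotonic_matching(M):
    return all(not in_FX_zone(p, q) for p in M for q in M if p != q)

print(in_X_zone((2, 3), (2, 7)))                        # True: same row
print(in_F_zone((2, 3), (4, 1)))                        # True: ordering violated
print(is_monotonic_matching([(1, 2), (3, 5), (4, 6)]))  # True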
In the next section we define the confident stability condition, describe its properties in view of the discussion given in the introduction, and present a simple and fast matching algorithm.
3 Confidently Stable Matching
Definition 1 (Matching Problem). Let P be a matching table, let c : P → ℝ and ∆ : P → ℝ₀⁺ (the nonnegative reals) be bounded functions such that [c(p) − ∆(p), c(p)] is the confidence interval associated with the signature similarity for pair p ∈ P, and let Z(p) be a symmetric inhibition zone for each p ∈ P. Then (P, c, ∆, Z) will be called a matching problem.

We will be interested in matchings that remain stable for the worst possible combination of match similarities from the confidence intervals. Such matchings will be only weakly sensitive to Class 1 or Class 2 ambiguity. They can be regarded as unambiguous with certainty related to the confidence level chosen. In practice, this is of course limited by the capture probability of the confidence interval estimator. We will now state the confident stability condition, which defines when a subset of the matching table P is considered unambiguous. Later we present an algorithm, with a correctness proof, that finds the largest subset with this property.

Definition 2 (Confident Stability). Let (P, c, ∆, Z) be a matching problem. A subset S ⊆ P is confidently stable if for each p ∈ S and every q ∈ Z(p) the following holds: if c(q) ≥ c(p) − ∆(p) then there must be r ∈ S ∩ Z(q) such that c(r) > c(q) + ∆(r).

Thus, stability is a global property of subsets of P. It is not a one-step look-ahead rule in a discrete minimization strategy. In fact, there is no obvious cost function associated with this problem. Later we will see there is a non-iterative algorithm that finds the stable subset of maximum cardinality. See [24] for the reasoning that results in this definition. It is easy to see [24] that the following holds:

Observation 1. If X ⊆ Z then S is a matching. If F ⊆ Z then S is monotonic.

Before we review the basic properties of Def. 2 we need to introduce the following concept:

Definition 3 (Completeness). A subset M ⊆ P is complete iff for each pair p ∉ M the intersection Z(p) ∩ M is non-empty.

A complete subset (matching) is ‘the most dense’: it explains all pixels in an image as either matched or half-occluded. As discussed above, we are not searching for a complete matching unless the matching problem is unambiguous. The following lemma tells us what happens if there is no ambiguity:

Lemma 1. Let (P, c, ∆, Z) be a matching problem. If ∆(p) ≡ 0 for all pairs p in a confidently stable subset S ⊆ P and no two discordant pairs are of equal cost c(·), then S is complete.
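The condition of Definition 2 can be verified directly for any candidate subset S. The sketch below is illustrative only; it assumes c and delta are dictionaries over pairs and in_zone(p, q) is a predicate testing q ∈ Z(p) (for instance the FX-zone test sketched in Sec. 2).

def is_confidently_stable(S, P, c, delta, in_zone):
    """Check Definition 2: for every p in S and every q in Z(p) with
    c(q) >= c(p) - delta(p), there must be r in S ∩ Z(q) with
    c(r) > c(q) + delta(r)."""
    Sset = set(S)
    for p in Sset:
        for q in P:
            if q == p or not in_zone(p, q):
                continue
            if c[q] >= c[p] - delta[p]:
                if not any(in_zone(q, r) and c[r] > c[q] + delta[r]
                           for r in Sset):
                    return False
    return True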
In general, the converse is not true (S can be complete even if ∆ ≢ 0). The proof of this lemma is given in [24]. Finally, we give the existence theorem:

Theorem 1. For any given matching problem (P, c, ∆, Z) there is a unique (possibly empty) confidently stable subset S ⊆ P of maximum cardinality.

The theorem is proven by defining a quasi-stable subset M ⊆ P, which is not unique but always exists and is always complete, and by showing that any confidently stable subset S must be a subset of some quasi-stable subset M. Quasi-stability is defined by choosing ∆ ≡ 0 and making the second inequality in Def. 2 not sharp. Then the observation is made that any two complete quasi-stable subsets M differ in equal-cost discordant pairs only. But such pairs cannot be in S, which means S is a subset of the intersection of all quasi-stable subsets of P. It is then shown that if there are two confidently stable subsets in P they must be nested. By induction one gets that there is a unique confidently stable matching of maximum cardinality, which completes the proof. Details of the proof are found in [27].

It may seem that the confidently stable subset is empty quite often. In [5] we show that the sparsity of stable matching is governed by the discriminability D(c) of the signature similarity function c. Let T be the set of correct matches for a matching problem. Discriminability is the probability of an inequality event,

D(c) = P( c(p | p ∈ T) > c(p | p ∉ T) ).   (1)

As long as D(c) > 0.5 the stable subset is not empty. As shown in a numerical experiment in [5], common similarity measures do differ in their discriminability, but all of them approach the undecidability limit of D(c) = 0.5 for signal-to-noise ratios well below −20 dB. This means that if (i) the images do not exhibit Class 2 ambiguity (repetitive texture) everywhere and (ii) we are able to obtain sufficiently tight confidence interval estimates for our image measurements, confidently stable matching will be non-empty.
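Discriminability (1) can be estimated empirically when ground-truth correspondences are available. The following Monte-Carlo sketch is illustrative only (it is not the experiment of [5]); the lists of correct and incorrect pairs and the similarity dictionary c are assumed inputs.

import random

def estimate_discriminability(correct, incorrect, c, trials=100000):
    """Monte-Carlo estimate of D(c) = P(c(correct pair) > c(incorrect pair))."""
    wins = 0
    for _ in range(trials):
        p = random.choice(correct)      # a pair from the ground-truth set T
        q = random.choice(incorrect)    # a pair outside T
        if c[p] > c[q]:
            wins += 1
    return wins / trials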
The algorithm for finding confidently stable matching proceeds as follows:

Algorithm 1
Input: Confidently stable matching problem (P, c, ∆, Z).
Output: Confidently stable subset S.
Procedure:
1. Initialize U := ∅, M := ∅, S := ∅, R := ∅, and for all p ∈ P let c̃(p) := c(p).
2. If P is empty, terminate.
3. Remove from P the subset L of pairs of the largest cost c̃(p) in P.
4. For all r ∈ L do:
   a) if r ∈ U then R := R ∪ Z(r),
   b) else if r ∉ M then
      i. c̃(r) := c(r) − ∆(r),
      ii. P := P ∪ {r},
      iii. M := M ∪ {r},
      iv. U := U ∪ Z(r).
5. For all r ∈ (L ∩ M) \ R do:
   a) S := S ∪ {r},
   b) P := P \ (Z(r) ∪ {r}).
6. Go to Step 2.

Theorem 2 (Alg. 1 Correctness). For any matching problem (P, c, ∆, Z), Alg. 1 terminates and finds the confidently stable matching of maximum cardinality.

In the proof one first notices that the difference between the confidently stable matching S and the quasi-stable matching M is that S contains no pairs in the inhibition zones of converting pairs C ⊆ P \ M. The task of the converting pairs is thus to remove some pairs from M to obtain S. One then shows there is a strict partial ordering on C, for any ∆, that allows us to uniquely identify C in P. One shows that the set M in the algorithm is a quasi-stable matching on (P, c, ∆, Z). The statement of the theorem follows as a corollary. Details are found in [27].

To summarize the algorithm: it constructs a quasi-stable subset M from pairs in P in the order of their decreasing cost c, and it simultaneously identifies converting pairs C, also in the order of their decreasing cost c − ∆. The quasi-stable subset is found by the successive selection of the largest-cost pair r in P (which is always quasi-stable [24]) in Steps 3 and 4(b)iii, followed by the removal of its inhibition zone Z(r) from the set of all potential matches P in Step 5b. This is repeated until P is empty. But every quasi-stable pair p is forced to cycle through the loop 3–6 twice. When it appears in Step 4 for the first time, its cost c(p) has the value of the upper limit of the confidence interval. Such a pair must be quasi-stable. Its cost is decreased to c(p) − ∆(p) and it is returned to P in Step 4(b)ii. This is followed by the update, in Step 4(b)iv, of the set U of all potential converting pairs. The same quasi-stable p is bound to appear in Step 4 for the second time. Then, unless it is in the inhibition zone of some converting pair q ∈ C (i.e., unless it is present in the set of removed pairs R), it is confidently stable and is added to the set S in Step 5a. Pairs that are not quasi-stable cycle only once; when they are recognized as such by testing their presence in the set U in Step 4a, it is known they must be converting pairs. A converting pair q prevents any pair from its inhibition zone Z(q) from becoming confidently stable (by adding Z(q) to the set R). Note that some pairs from P are never visited in Step 3, as they are deleted from P in Step 5b. The correctness of the whole procedure is guaranteed by Theorem 2; its proof gives more insight into how the algorithm works [27].

The worst-case time complexity of this algorithm is O(d²n) if P is an n × n matching table and the disparity search interval is d. The implementation of this algorithm is quite simple for FX-stable matching. One uses a property of FX-zones: the union of their complements in P is a set E ⊂ P that is blockwise monotonic. An update of the type Q := Q ∪ FX(r) can be done in O(d) time and a query of the type r ∈ Q can be done in O(1) time. More implementation details are found in [27].
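A direct, unoptimized transcription of Algorithm 1 may make the bookkeeping above easier to follow. The sketch below is illustrative only: equal-cost pairs are processed in arbitrary order, it does not use the O(d²n) blockwise zone bookkeeping of [27], and the cost/interval dictionaries and the zone function (e.g. built from the FX-zone predicates of Sec. 2) are assumed inputs.

def confidently_stable_matching(P, c, delta, zone):
    """Literal transcription of Algorithm 1.
    P     : iterable of candidate pairs (the matching table)
    c     : dict pair -> similarity
    delta : dict pair -> confidence-interval half-width
    zone  : function pair -> set of pairs in its inhibition zone Z(p)
    """
    P = set(P)
    c_tilde = dict(c)                       # c~(p) := c(p)
    U, M, S, R = set(), set(), set(), set()
    while P:                                # Step 2
        top = max(c_tilde[p] for p in P)
        L = {p for p in P if c_tilde[p] == top}
        P -= L                              # Step 3: remove largest-cost pairs
        for r in L:                         # Step 4
            if r in U:                      # 4a: r is a converting pair
                R |= zone(r)
            elif r not in M:                # 4b: r is quasi-stable, re-insert it
                c_tilde[r] = c[r] - delta[r]
                P.add(r)
                M.add(r)
                U |= zone(r)
        for r in (L & M) - R:               # Step 5: confirmed confidently stable
            S.add(r)
            P -= zone(r) | {r}
        # Step 6: loop back to Step 2
    return S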
4 Experimental Results
We restrict our experimental results to FX-stable subsets, for which Z(p) = F(p) ∪ X(p). Note that this choice implies the confidently stable subset is a monotonic matching (it obeys the ordering constraint). The ordering constraint, represented by the F part of the inhibition zone (see Fig. 2), is a weak regularizer.

We first show results on a designed artificial scene (see Fig. 3). The scene consists of three long thin textured stripes in front of a textured plane. The stripes are approximately 14 pixels wide in 300 × 337 images. The object-background distance was adjusted such that the width of the half-occluded region is exactly equal to the width of the stripes. The scene was illuminated by a controlled, stabilized illuminant whose adjustable intensity was used to vary texture contrast. Disparity slowly varies across both the object and the background in the range of (−17, 20). Disparity search was done in the full range of ±337. We use this dataset for measuring the accuracy of stereo algorithms [24]. For match similarity, the modified normalized cross-correlation [28],

c(W_L, W_R) = 2 cov(W_L, W_R) / (var W_L + var W_R) ∈ [−1, 1],   (2)

was computed for 5 × 5 image windows W_L, W_R. A rigorous approach to constructing confidence intervals for stereoscopic matching is discussed in [29]. We approximate the interval by the sensitivity of c(·, ·) to independent noise as

∆(W_L, W_R) = max( α · 4 |c(W_L, W_R)| / (var W_L + var W_R), β ),   α, β ≥ 0.   (3)

Note that it is not necessary for ∆ to measure the properties of the image autocorrelation function along the epipolar line direction: the X component of the inhibition zone, together with the confident stability constraint, takes care of (locally) flat or periodic autocorrelation functions.

Computed disparity maps are shown in Fig. 3. White encodes missing matches (half-occlusions and false negatives). The electronic version of this paper uses color-coded maps where errors stand out very clearly. Low-contrast and high-contrast images of the same scene were tested and compared with ground truth. The percentage of matching error relative to the matched set is 0.72% in the low-contrast image set and 0.47% in the high-contrast set. Results are compared with the Maximum A Posteriori (MAP) solution using the Cox et al. algorithm based on dynamic programming [30] (see Footnote 5). The matching error percentages for MAP are 59% and 12%, respectively. Errors were evaluated using the methodology described in [24], which measures the consistency of matching results over a wide range of SNR.
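Computing c and ∆ for a single pair of windows amounts to a few lines. The sketch below is illustrative only; it uses the reconstructed form of (3) above (an assumption on my part) and plugs in the values α = 10, β = 0.02 recommended later in this section.

import numpy as np

def similarity_and_interval(WL, WR, alpha=10.0, beta=0.02):
    """Modified normalized cross-correlation (2) and the confidence-interval
    half-width (3) for two image windows of equal size."""
    WL = WL.astype(float).ravel()
    WR = WR.astype(float).ravel()
    var_sum = WL.var() + WR.var()      # a guard for var_sum == 0 would be
                                       # needed for completely textureless windows
    cov = np.mean((WL - WL.mean()) * (WR - WR.mean()))
    c = 2.0 * cov / var_sum                         # eq. (2), lies in [-1, 1]
    d = max(alpha * 4.0 * abs(c) / var_sum, beta)   # eq. (3)
    return c, d

# Example on two random 5x5 windows:
rng = np.random.default_rng(0)
WL = rng.integers(0, 256, (5, 5))
WR = WL + rng.normal(0, 5, (5, 5))
print(similarity_and_interval(WL, WR))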
Footnote 5: The regularization penalty of the Cox et al. algorithm was set to 500 to achieve the best possible results for the high-contrast image pair.
Fig. 3. Matching results on the stripes scene. Left images of high and low contrast (1st column), disparity maps (2nd column) and segmentations (3rd column) by confidently stable matching with α = 10, β = 0.02 (segmentation colors: green–matched; blue–half-occluded; red–ambiguous, many matches possible; yellow–ambiguous, matched/occluded uncertainty); the disparity map by the MAP algorithm is shown for comparison (last column).
We demonstrate the performance of Alg. 1 for various choices of α and β on the Shrub image pair in Fig. 4. The central object of the scene is a sign posted on a thin pole just in front of a tall hedge. The leafed shrubs cannot be considered a surface at the scale of image resolution, so no continuity prior can be used. Five shrubs can be distinguished; they differ by their distance from the camera. The roadside is imaged at a high inclination angle relative to the camera pair, so in addition to the lack of texture (Class 1 ambiguity) it also exhibits large relative affine distortion between the images. The rim of the curb is parallel to epipolar lines in the left half of the image and is thus difficult to match (Class 2 ambiguity). The background wall images as a very regular texture (Class 2 ambiguity) and is thus difficult to match without a continuity prior. The dark space in the open window does not contain any stereoscopic information. Besides having no natural texture, the window glass partly reflects the featureless cloudy sky (both Class 1 ambiguity). The two images do not correspond to exactly the same scene because of breeze artefacts due to the time difference in image capture.

Disparity maps shown in Fig. 4 are given for varying α (constant in rows) and β (constant in columns). In the first row (α = 0) we can see that β by itself cannot resolve ambiguity due to low SNR (Class 1): the noisiness in the shadowed window area persists up to large values of β (2.5% of the signature similarity range). The structural ambiguity (Class 2) is detected quite early: the number of mismatches on the brick wall and on the rim of the curb decreases fast with increasing β. In the first column (β = 0) we can see that α by itself does not detect structural ambiguity very well: a large proportion of mismatches remains on the wall and the curb rim. The Class 1 ambiguity, on the other hand, is detected early.
Fig. 4. The Shrub image pair (the widest stereo baseline of the set). Matching results for varying α (rows: α = 0, 10, 20) and β (columns). Disparity color coding is shown beside the images (light grey: high disparity, black: low disparity, white: holes). Matching window 5 × 5, disparity search range (10, 30), image size 480 × 512, mean CPU time 8.2 sec (P7, 1 GHz); the percentage of matched pixels is given under the maps. See the PDF version with color-coded maps, where matching errors are more apparent.
We conclude that α is related to the ability to detect Class 1 ambiguity and β is related to the ability to detect Class 2 ambiguity. Both have to be non-zero. Increasing both of them decreases the disparity-map density but also eliminates errors. Our experience so far with the algorithm shows that α = 10 and β = 0.02 is suitable for a broad range of natural scenes. Under high structural ambiguity, increasing β up to 0.04 suffices. Note that a few points of high disparity remain in the bushes for even very wide confidence intervals. These are due to the wind artefacts mentioned above.

Another example of an outdoor scene is shown in Fig. 5. The building is the Branitz Castle in Brandenburg, Germany (built 1772). The image is a section of a 360° panorama captured by a non-perspective camera with two vertically aligned line sensors placed above each other and rotating around a common vertical axis [31]. Epipolar lines are vertical in these images.
Fig. 5. Left image and disparity map of Branitz Castle. Disparity coding is the same as in Fig. 4. Matching window 7 × 7 pixels, α = 10, β = 0.02, disparity search range (0, 90), image size 2037 × 4421, 28% matched.
This example was selected because of its great scene depth and because it exhibits various combinations of Class 1 and Class 2 ambiguity. The disparity map is shown in Fig. 5. A few significant isolated errors appear on the pilasters and near the bottom image edge. Even small and thin objects are matched: the flower pots, the lions, the lamp posts, the tree trunk supported by the three pales, the wire loop on the flagpole, and the tips of the lightning rods on the roof corners and the chimneys. The individual branches of the foreground tree are well distinguished. Ambiguous regions are suppressed, most notably on the façade and the staircase. Note the tiled roof is matched quite densely; the stripes of false negatives are sampling artefacts (due to spatial frequencies close to the Nyquist limit). Note the disparity map is especially dense on the lawn, the shrubbery alongside the building, and the foliage in the background, i.e. in regions of irregular texture. We conclude that the combination of Class 2 and Class 1 ambiguity is the worst problem. But Class 2 ambiguity can be reduced by using more views; with an independent view the disparity-map density would increase significantly.

To partially compare confidently stable matching with other stereo matching algorithms, we ran Alg. 1 on the Scharstein and Szeliski test set [32]. The error was evaluated only on the set of matches found by the algorithm. The results shown in Fig. 6 were computed for α = 10, β = 0.02 and a 5 × 5 matching window, using correlation (2) and (3). The average running time (1 GHz P7 CPU) was 3.1 sec. We can see that a straightforward stereo matching implementation based on the stability principle alone can already be ranked as the fourth best (Feb 2002).
[Fig. 6 shows the four test images: tsukuba, sawtooth, venus, map.]

dataset     density   BO¯     BT¯     BD
tsukuba     45%       1.4%    0.22%   7.8%
sawtooth    52%       1.6%    0.03%   14.0%
venus       40%       0.8%    0.04%   10.0%
map         74%       0.3%    0.20%   3.6%
Fig. 6. Results on the Middlebury data set. BO¯ is the percentage of mismatches in the non-occluded region, BT¯ is the percentage of mismatches in textureless areas, and BD is the percentage of errors at disparity discontinuities [32].
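The density and error figures above are computed only over the matches the algorithm actually returns. A hypothetical sketch of that evaluation (not the official Middlebury code; names and the 1-disparity tolerance are assumptions):

import numpy as np

def density_and_error(disparity, ground_truth, tolerance=1.0):
    """Density = fraction of pixels matched; error = fraction of matched
    pixels whose disparity deviates from ground truth by more than
    `tolerance`. Unmatched pixels are NaN in `disparity`."""
    matched = ~np.isnan(disparity)
    density = matched.mean()
    wrong = np.abs(disparity[matched] - ground_truth[matched]) > tolerance
    return density, wrong.mean()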
5 Conclusions
We have defined confidently stable matching, which is able to (1) find the largest unambiguous component of stereo matching and (2) identify the part of the solution where regularization is necessary to resolve ambiguities. We have seen on three natural scenes and on three artificial scenes that the unique component of stable matching is large (10% to 80% of image density) even if the confidence interval for the match similarity measure is wide (2.5% of its range). Moreover, the experiments show that confidently stable matching has a very low error rate (0.3% to 1.4% of all matches) and that even very small objects are correctly matched.

The advantage of the stability principle is that it gives great freedom in defining what constitutes a set of good matches. Most importantly, since stability is a property of partial orderings, it is easy to extend the results to multi-criterial matching, which opens the possibility of influencing the low-level vision process by a high-level scene understanding process by means of meta-rules defining the relative importance of the individual criteria. This is a topic for our future research.

Acknowledgements. The author was supported by the Grant Agency of the Czech Republic under project GACR 102/01/1371 and by the Czech Ministry of Education under project MSM 210000012. Part of this work was conducted while the author was visiting the CITR group at the University of Auckland; the support of CITR is gratefully acknowledged. The Branitz Castle images are courtesy of DLR Berlin, and the Shrub images are courtesy of Carnegie Mellon University.
References

1. Julesz, B.: Towards the automation of binocular depth perception (AUTOMAP-1). In: IFIPS Congress, Munich (1962)
2. Kutulakos, K.N., Seitz, S.M.: A theory of shape by space carving. IJCV 38 (2000) 199–218
3. Baker, S., Sim, T., Kanade, T.: A characterization of inherent stereo ambiguities. In: Proc ICCV. (2001) 428–435
4. Schlesinger, M.I.: Ambiguity in stereopsis. Personal communication (1998)
5. Šára, R.: Failure analysis of stable matchings. Research report (in preparation)
6. Stewart, C.V., Dyer, C.R.: The trinocular general support algorithm: A three-camera stereo algorithm for overcoming binocular matching errors. In: Proc ICCV. (1988) 134–138
7. Satoh, K., Ohta, Y.: Occlusion detectable stereo using a camera matrix. In: Proc ACCV. (1995) 331–335
8. Marr, D.: A note on the computation of binocular disparity in a symbolic, low-level visual processor. A.I. Memo 327, AI Lab, MIT (1974)
9. Pollard, S.B., Mayhew, J.E.W., Frisby, J.P.: PMF: A stereo correspondence algorithm using a disparity gradient limit. Perception 14 (1985) 449–470
10. Zitnick, C.L., Kanade, T.: A cooperative algorithm for stereo matching and occlusion detection. IEEE Trans PAMI 22 (2000) 675–684
11. Belhumeur, P.N.: A Bayesian approach to binocular stereopsis. IJCV 19 (1996) 237–260
12. Bobick, A.F., Intille, S.S.: Large occlusion stereo. IJCV 33 (1999) 181–200
13. Robert, L., Deriche, R.: Dense depth map reconstruction: A minimization and regularization approach which preserves discontinuities. In: Proc ICIP. (1992) 123–127
14. Barnard, S.T.: Stochastic stereo matching over scale. IJCV 3 (1989) 17–32
15. Scharstein, D., Szeliski, R.: Stereo matching with nonlinear diffusion. IJCV 28 (1998) 155–174
16. Boykov, Y., Veksler, O., Zabih, R.: Disparity component matching for visual correspondence. In: Proc Conf CVPR. (1997) 470–475
17. Ishikawa, H., Geiger, D.: Occlusions, discontinuities, and epipolar lines in stereo. In: ECCV. (1998) 232–248
18. Roy, S., Cox, I.J.: A maximum-flow formulation of the n-camera stereo correspondence problem. In: Proc ICCV. (1998) 492–499
19. Kolmogorov, V., Zabih, R.: Computing visual correspondence with occlusions using graph cuts. In: Proc ICCV. (2001) 508–515
20. March, R.: Computation of stereo disparity using regularization. Pattern Recognition Letters 8 (1988) 181–187
21. March, R.: A regularization model for stereo vision with controlled continuity. Pattern Recognition Letters 10 (1989) 259–263
22. Jung, D.Y., Oh, J.H., Lee, S.C., Lee, C.H., Nam, K.G.: Stereo matching by discontinuity-preserving regularization. J of Elect Eng and Inf Sci 4 (1999) 452–8
23. Manduchi, R., Tomasi, C.: Distinctiveness maps for image matching. In: Proc ICIAP. (1999) 26–31
24. Šára, R.: Sigma-delta stable matching for computational stereopsis. Research Report CTU–CMP–2001–25, Center for Machine Perception, Czech Technical University (2001) [ftp://cmp.felk.cvut.cz/pub/cmp/articles/sara/Sara-TR-2001-25.pdf].
25. Krol, J.D., van der Grind, W.A.: Rehabilitation of a classical notion of Panum’s fusional area. Perception 11 (1982) 615–619
26. Yuille, A.L., Poggio, T.: A generalized ordering constraint for stereo correspondence. A.I. Memo 777, AI Lab, MIT (1984)
27. Šára, R.: A fast algorithm for confidently stable matching. Research Report CTU–CMP–2002–03, Center for Machine Perception, Czech Technical University (2002) [ftp://cmp.felk.cvut.cz/pub/cmp/articles/sara/Sara-TR-2002-03.pdf].
28. Moravec, H.P.: Towards automatic visual obstacle avoidance. In: Proc IJCAI. (1977) 584
29. Mandelbaum, R., Kamberova, G., Mintz, M.: Stereo depth estimation: a confidence interval approach. In: Proc ICCV. (1998) 503–509
30. Cox, I.J., Hingorani, S., Maggs, B.M., Rao, S.B.: Stereo without disparity gradient smoothing: a Bayesian sensor fusion solution. In: Proc BMVC. (1992) 337–346
31. Scheibe, K., Korsitzky, H., Reulke, R., Scheele, M., Solbrig, M.: EYESCAN – a high resolution digital panoramic camera. In: Int Wkshp Robot Vision, Auckland (2001) 77–83
32. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Technical Report MSR–TR–2001–81, Microsoft Research, Redmond, WA (2001) To appear in IJCV.
Author Index
Adams, N.J. IV-82 Agarwal, S. IV-113 Ahuja, N. IV-685 Alvarez, L. I-721 Anandan, P. II-883 Aner, A. IV-388 Ansar, A. IV-282 Araujo, H. IV-237 Attias, H. I-736 Aubert, G. III-365 Auf der Maur, D. III-180 August, J. III-604 Avidan, S. III-747 Bajaj, C. III-517 Barla, A. IV-20 Barlaud, M. III-365 Barnard, K. IV-97 Barreto, J.P. IV-237 Bartoli, A. II-340 Bazin, P.-L. II-262 Beal, M.J. I-736 Belhumeur, P.N. III-869 Bell, J.W. IV-358 Belongie, S. III-21, III-531 Bhat, K.S. I-551 Bhotika, R. III-112 Bischof, H. IV-761 Black, M.J. I-476, I-692, I-784, IV-653 Blake, A. I-645, IV-67 Blanz, V. IV-3 Bø, K. III-133 Borenstein, E. II-109 Botello, S. IV-560 Boult, T. IV-590 Boyer, E. IV-221 Brand, M. I-707 Bregler, C. I-801 Bretzner, L. III-759 Breuel, T.M. III-837 Brodsky, T. I-614 Buehler, C. III-885 Buhmann, J.M. III-577
Caenen, G. III-180 Calderon, F. IV-560 Carcassoni, M. I-266 Carlsson, S. I-68, I-629, II-309 Carneiro, G. I-282, I-297 Caspi, Y. I-753 Castellani, U. II-805 Chang, J.T. II-31 Chantler, M. III-289 Charbonnier, P. I-492 Chaumette, F. IV-312 Chazelle, B. II-642 Cheeseman, P. II-247 Chefd’hotel, C. I-251 Chellappa, R. II-277, III-681 Chen, H. I-236 Chen, Y. IV-546 Chung, F. III-531 Cipolla, R. II-155, II-852 Clerc, M. II-495 Cobzas, D. II-415 Cohen, L.D. III-807, IV-531 Cohen, M.F. III-885 Collins, R. II-657 Comaniciu, D. I-173, III-561 Cootes, T.F. III-3, IV-621 Cornelis, K. II-186 Corpetti, T. I-676 Costeira, J. II-232 Coughlan, J.M. III-453 Cremers, D. II-93 Criminisi, A. I-508 Crouzil, A. IV-252 Cuf´ı, X. III-408 Daniilidis, K. II-140, IV-282 Darrell, T. III-592, III-851 David, P. III-698 Davies, R.H. III-3 Davis, L.S. I-18 DeCarlo, D. IV-327 Del Bue, A. III-561 DeMenthon, D. III-698 Deriche, R. I-251, I-721 Deschamps, T. III-807 Deutscher, J. IV-175
Dick, A.R. II-852 Dickinson, S. III-759 Dobkin, D. II-642 Donato, G. III-21 Doorn, A.J. van I-158 Dornaika, F. IV-606 Drbohlav, O. II-46 Drew, M.S. IV-823 Duci, A. III-48 Duraiswami, R. III-698 Duygulu, P. IV-97 Dyer, C. IV-131 Eklundh, J.-O. III-469 Elder, J. IV-606 Ernst, F. II-217 Everingham, M. IV-34 Fablet, R. I-476 Faugeras, O. I-251, II-790 Favaro, P. II-18, II-735 Feldman, D. I-614 Felsberg, M. I-369 Ferrari, V. III-180 Ferreira, S.J. III-453 Ferrie, F.P. IV-267 Finkelstein, A. II-642 Finlayson, G.D. IV-823 Fisher, J.W. III-592, III-851 Fisher, R.B. IV-146 Fitzgibbon, A. III-304, III-487 Flandin, G. IV-312 Fleet, D.J. I-692, III-112 Florack, L. I-143, I-190 Forsyth, D.A. III-225, IV-97 Fossum, R. II-201 Fowlkes, C. III-531 Freitas, J.F.G. de IV-97 Freixenet, J. III-408 Frey, B.J. IV-715 Fua, P. II-325, II-704, III-163 Fukui, K. III-195 Funkhouser, T. II-642 Fusiello, A. II-805 Gangnet, M. I-645, I-661 Gao, X. IV-590 Gee, J. III-621 Georgescu, B. II-294 G´erard, O. III-807
Geusebroek, J.-M. I-99 Geyer, C. II-140 Giannopoulos, P. III-715 Giblin, P. II-718 Gilboa, G. I-399 Goldberger, J. IV-461 Goldenberg, R. I-461 Gomes, J. II-3 Gong, S. IV-670 Gool, L. Van II-170, II-186, II-572, II-837, III-180, IV-448 Gortler, S.J. III-885 Graham, J. IV-517 Granger, S. IV-418 Greenspan, H. IV-461 Greff, M. III-807 Grenander, U. I-37 Grossberg, M.D. I-220, IV-189 Guo, C.-e. III-240, IV-793 Gurdjos, P. IV-252 Hadjidemetriou, E. I-220 Han, F. III-502 Hancock, E.R. I-266, II-63, II-626, III-822 Hartley, R.I. II-433, II-447 Harville, M. III-543 Hassner, T. II-883 Hayman, E. III-469 Hebert, M. III-651, III-776 Heidrich, W. II-672 Hermes, L. III-577 Hertz, T. IV-776 Hillenbrand, U. III-791 Hirzinger, G. III-791 Hordley, S.D. IV-823 Huang, K. II-201 Hue, C. I-661 Huggins, P.S. I-384 Ieng, S.-S. I-492 Ilic, S. II-704 Irani, M. I-753, II-883 Isard, M. IV-175 Jagersand, M. II-415 Javed, O. IV-343 Jehan-Besson, S. III-365 Jelinek, D. II-463 Jepson, A.D. I-282, I-692 Jiang, G. I-537
Author Index Jin, H. II-18 J¨onsson, C. III-759 Jojic, N. I-736, IV-715 Kahl, F. II-447 Kam, A.H. IV-297 Kamberov, G. II-598 Kamberova, G. II-598 Kambhamettu, C. II-556, IV-206 Kaminski, J.Y. II-823 Kanatani, K. III-335 Kang, S.B. I-508, III-210 Kaucic, R. II-433 Kazhdan, M. II-642 Kender, J.R. IV-388, IV-403 Khosla, P.K. I-551 Kim, J. III-321 Kimia, B.B. II-718, III-731 Kimmel, R. I-461 Klein, P.N. III-731 Koenderink, J.J. I-158, IV-808 Kohlberger, T. II-93 Kolmogorov, V. III-65, III-82 Koˇseck´a, J. IV-476 Kozera, R. II-613 Kriegman, D.J. III-651, III-869 Kr¨uger, V. IV-732 Kuehnel, F.O. II-247 Kuijper, A. I-143, I-190 Kutulakos, K.N. III-112 Lachaud, J.-O. III-438 Lafferty, J. III-776 Lasenby, J. I-524 Lazebnik, S. III-651 Leclerc, Y. III-163 Lee, A.B. I-328 Lee, M.-S. I-614 Lee, S.W. II-587 Leonardis, A. IV-761 Levin, A. II-399, III-635 Lhuillier, M. II-125 Li, S.Z. IV-67 Li, Y. III-210 Lin, S. III-210 Lindeberg, T. I-52, III-759 Liu, C. II-687 Liu, T. IV-403 Liu, X. I-37 Liu, Y. II-657
Lo, B.P.L. III-381 Loy, G. I-358 Lu, W. IV-297 Luong, Q.-T. III-163 Ma, Y. II-201 MacCormick, J. IV-175 Maciel, J. II-232 Mahamud, S. III-776 Maki, A. III-195 Malik, J. I-312, III-531, III-666 Malis, E. IV-433 Malladi, R. I-343 Malsburg, C. von der IV-747 Maluf, D.A. II-247 Manning, R. IV-131 Markham, R. IV-502 Marroquin, J.L. I-113, IV-560 Martinec, D. II-355 Mart´ı, J. III-408 Mathiassen, J.R. III-133 Maybank, S. IV-373 Mayer, A. IV-461 McGunnigle, G. III-289 McMillan, L. III-885 Medioni, G. III-423 Meer, P. I-236, II-294 ´ I-676 M´emin, E. Mikolajczyk, K. I-128 Mindru, F. IV-448 Mirmehdi, M. IV-502 Mitran, M. IV-267 Mittal, A. I-18 Mitter, S. III-48 Mojsilovic, A. II-3 Moons, T. IV-448 Mori, G. III-666 Morris, R.D. II-247 Muller, H. IV-34 Mu˜noz, X. III-408 Murino, V. II-805 Murray, D.W. I-82 Nakashima, A. III-195 Narasimhan, S.G. III-148, IV-636 Nayar, S.K. I-220, I-508, III-148, IV-189, IV-636 Ng, J. IV-670 Ng, T.K. I-205
Nicolescu, M. III-423 Noakes, L. II-613
Rouy, E. II-790 Rudzsky, M. I-461
Odone, F. IV-20 Oliensis, J. II-383 Osareh, A. IV-502 Overveld, K. van II-217
Samaras, D. III-272 Samet, H. III-698 S´anchez, J. I-721 ˇ ara, R. II-46, III-900 S´ Savarese, S. II-759 Sawhney, H.S. I-599 Schaffalitzky, F. I-414 Scharstein, D. II-525 Schiele, B. IV-49 Schmid, C. I-128, III-651, IV-700 Schmidt, M. III-289 Schn¨orr, C. II-93 Schwartz, S. I-173 Schweitzer, H. IV-358, IV-491 Sebastian, T.B. III-731 Seitz, S.M. I-551 S´en´egas, J. III-97 Sethi, A. III-651 Shah, M. IV-343 Shakhnarovich, G. III-851 Sharp, G.C. II-587 Shashua, A. II-399, III-635 Shechtman, E. I-753 Shental, N. IV-776 Shokoufandeh, A. III-759 Shum, H.-Y. II-510, II-687, III-210, IV-67 Sidenbladh, H. I-784 Siebel, N.T. IV-373 Sigal, L. I-784 Skavhaug, A. III-133 Skoˇcaj, D. IV-761 Smelyansky, V.N. II-247 Smeulders, A.W.M. I-99 Sminchisescu, C. I-566, I-769 Soatto, S. II-735, III-32, III-48 Sochen, N.A. I-399 Sommer, G. I-369 Speyer, G. I-432 Srivastava, A. I-37 Strecha, C. II-170 Sturm, P. II-867, IV-221 Sullivan, J. I-629 Sun, J. II-510 Sung, K.K. I-205 Swaminathan, R. I-508 Szeliski, R. I-508, II-525
Pajdla, T. II-355 Pal, C. IV-715 Papadopoulo, T. I-721 Paragios, N. II-78, II-775 Pavel, M. IV-776 Payrissat, R. IV-252 Pece, A.E.C. I-3 Pedersen, K.S. I-328 Pennec, X. IV-418 P´erez, P. I-645, I-661, I-676 Perona, P. II-759 Petrou, M. III-289 Plaenkers, R. II-325 Pollefeys, M. II-186, II-837 Ponce, J. III-651 Pont, S.C. IV-808 Popovi´c, J. I-551 Prados, E. II-790 Prince, J.L. IV-575 Qian, G. II-277 Quan, L. I-537, II-125 Raba, D. III-408 Ragheb, H. II-626 Ramesh, V. I-173, II-775, III-561, IV-590 Ravve, I. I-343 Ren, X. I-312 Richard, F. IV-531 Ringer, M. I-524 Rivera, M. I-113, III-621 Rivlin, E. I-461 Robertson, D.P II-155 Robles-Kelly, A. II-63 Rogers, M. IV-517 Romdhani, S. IV-3 Ronchetti, L. II-805 Ronfard, R. IV-700 Rosin, P.L. II-746 Roth, D. IV-113, IV-685 Rother, C. II-309 Rousson, M. II-78, II-775
Author Index Tam, R. II-672 Tarel, J.-P. I-492 Taton, B. III-438 Taylor, C.J. II-463, III-3, IV-621 Teicher, M. II-823 Tell, D. I-68 Terzopoulos, D. I-447 Thacker, N. IV-621 Thomas, B. IV-34, IV-502 Tohka, J. III-350 Tong, X. III-210 Tordoff, B. I-82 Torr, P.H.S. II-852 Torre, F. De la IV-653 Torresani, L. I-801 Torsello, A. III-822 Trajkovic, M. I-614 Triggs, B. I-566, I-769, IV-700 Tschumperl´e, D. I-251 Tsin, Y. II-657 Tsui, H.-t. I-537 Tu, Z. III-393, III-502 Twining, C.J. III-3 Ullman, S.
II-109
Varma, M. III-255 Vasconcelos, N. I-297 Vasilescu, M.A.O. I-447 Veltkamp, R.C. III-715 Vemuri, B.C. IV-546, IV-560 Verbiest, F. II-837 Vermaak, J. I-645, I-661 Verri, A. IV-20 Vetter, T. IV-3 V´ezien, J.-M. II-262 Vidal, R. II-383 Vogel, J. IV-49 Wang, B. I-205 Wang, C. III-148 Wang, S.-C. I-583 Wang, Y. I-583, III-272, IV-793 Wang, Z. IV-546 Waterton, J.C. III-3 Wehe, D.K. II-587 Weijer, J. van de I-99 Weinshall, D. I-614, IV-776
Werman, M. I-432 Werner, T. II-541 Wexler, Y. III-487 Wieghardt, J. IV-747 Wilczkowiak, M. IV-221 Wilinski, P. II-217 Williams, C.K.I. IV-82 Wolf, L. II-370 Worrall, A.D. I-3 Wu, F. IV-358 Wu, Y.N. III-240, IV-793 W¨urtz, R.P. IV-747 Wyngaerd, J. Vanden II-572 Yang, G.-Z. III-381 Yang, M.-H. IV-685 Yang, R. II-479 Yau, W.-Y. IV-297 Yezzi, A.J. III-32, III-48, IV-575 Yu, Y. II-31 Yu, Z. III-517 Zabih, R. III-65, III-82, III-321 Zalesny, A. III-180 Zeevi, Y.Y. I-399 Zelinsky, A. I-358 Zhang, C. II-687 Zhang, H. IV-67 Zhang, W. IV-476 Zhang, Y. II-556 Zhang, Z. II-479, IV-67, IV-161 Zhao, W.Y. I-599 Zheng, N.-N. II-510 Zhou, S. III-681, IV-732 Zhou, W. IV-206 Zhu, L. IV-67 Zhu, S.-C. I-583, III-240, III-393, III-502, IV-793 Zhu, Y. I-173 Zhu, Y. I-583 Zickler, T. III-869 Zisserman, A. I-414, I-537, II-541, III-255, III-304, III-487 Z¨oller, T. III-577 Zomet, A. II-370 Zucker, S.W. I-384 ˇ c, J. II-746 Zuni´