Image-Based Modeling
Long Quan
Foreword by Takeo Kanade
Long Quan, The Hong Kong University of Science & Technology, Dept. of Comp. Sci. & Engineering, Clear Water Bay, Kowloon, Hong Kong SAR
[email protected]
ISBN 978-1-4419-6678-0    e-ISBN 978-1-4419-6679-7    DOI 10.1007/978-1-4419-6679-7
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2010930601
© Springer Science+Business Media, LLC 2010
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
To my loving family
Foreword
Obtaining a three-dimensional model of an object or a scene from images is a useful capability with a wide range of applications, from autonomous robotics to industrial vision and consumer digital entertainment. As such, it has been a topic of intensive study since the early days of computer vision, but it has eluded a general solution. 3D modeling requires more than reconstructing a cloud of 3D points from images; it requires a detailed and precise representation, whose form often depends on the individual object.

This book, written by Long Quan, a pioneering and veteran researcher of the field, presents impressive results of modeling hair, trees and buildings from images, probably some of the best we have seen. The success is due to the interdisciplinary efforts of Long Quan and his collaborators in recent years. Recognizing that it is not yet feasible to devise a completely general method of representing arbitrary objects, they developed an innovative methodology that combines data-driven reconstruction techniques from computer vision with model-based techniques borrowed from the field of computer graphics to reach the goal of modeling a given type of object.

This book guides you on the journey of 3D modeling, from theory with elegant mathematics to applications with beautiful 3D model pictures. Written in a simple, straightforward, and concise manner, it lets readers learn the state of the art of 3D reconstruction and modeling.

Takeo Kanade
U. A. and Helen Whitaker University Professor of Computer Science and Robotics
Carnegie Mellon University
Pittsburgh, PA, USA
Preface
Image-based modeling refers to techniques that generate a 3D model of a real object or scene from input 2D photographs or images of that object or scene. It brings together analytical and automatic techniques from computer vision with synthetic and interactive techniques from computer graphics. The high-quality and photorealistic nature of such models finds broad application in virtual reality, video games, and TV/movie post-production. The more recent Google Earth and Microsoft Virtual Earth 3D platforms push this type of modeling to an extreme, with every corner of the world, from the inside out, needing to be modeled and mapped. Given the gigantic scale of the ambition, this inevitably requires efficient image-based techniques.

The purpose of this book is to give a systematic account of image-based modeling, from image pixels to texture-mapped models. The content is divided into three parts: geometry, computation and modeling. The 'geometry' part presents the fundamental geometry of multiple views and outlines the most fundamental algorithms for 3D reconstruction from multiple images. The 'computation' part begins with feature points extracted from an image and ends with the computation of the camera positions of the images and the 3D coordinates of the observed image points. The 'modeling' part structures the computed 3D point clouds into finalized mathematical representations of 3D objects of different categories with specific methods. We start from small-scale smooth objects with implicit surfaces. Then we successively move from the linear structures of human hair to the tree structures of real trees and plants. Finally we come to two-dimensional rectilinear façade modeling, then assemble façades into buildings, and collect buildings into large-scale cities.

Much of the material in this book has been derived from years of research with my students and collaborators. The material is coherent, but may not be exhaustive in treating alternative modeling methods. Methodologically, in the geometry of multiple views, we concentrate on algebraic solutions to the geometry problems with minimal data. We summarize the essence of vision geometry in the three fundamental algorithms of five, six, and seven points. In computing camera positions and 3D points, we advocate and develop a quasi-dense approach. It fills the gap left by the insufficiency of the standard sparse structure from motion in all key aspects: density, robustness, and accuracy. In the
modeling stage, general-purpose modeling from point clouds is still too difficult to be envisaged for the time being. Thus, we systematically introduce prior knowledge of the class of objects to be modeled. The prior knowledge for a specific class of objects is encoded in terms of generative rules and generic models; for instance, tree branches are self-similar, and the leaves of the same type of tree share a common generic shape. The generative rules and the generic models, plus some mild, well-designed user-interactive tools, are the keys to the success of our data-driven modeling approach.

Pedagogically, there are more specialized textbooks in computer vision for geometry and structure from motion, but less is more: we choose a more succinct treatment focusing on the crème de la crème of vision geometry. Yet our treatment is thorough and complete, accessible to readers who are not familiar with the subject. We have also tried to bring together the same conceptual objects, such as cameras and geometry, that are often noted and termed differently in vision and graphics. This will ease comprehension for readers of different backgrounds. The interdependence between chapters has been reduced, so a reader may jump directly into a specific chapter of interest.

This book is intended for graduate students, researchers, and industry professionals working in the areas of computer vision and computer graphics. Those in robotics, virtual reality and photogrammetry may find this book useful as well. The content of the book is appropriate for a one-semester graduate course, with the possibility of taking Parts I and II as lecture materials and Part III as reading and discussion materials.

Clear Water Bay, Hong Kong, March 2010
Long Quan
Acknowledgements
The book is mostly about the work that I have conducted with my collaborators and students. In particular, I would like to thank Marc-André Ameller, Tian Fang, Olivier Faugeras, Patrick Gros, Richard Hartley, Radu Horaud, Sing Bing Kang, Takeo Kanade, Zhongdan Lan, Steve Lin, Lu Yuan, Maxime Lhuillier, Guang Jiang, Roger Mohr, Eyal Ofek, Sylvain Paris, Peter Sturm, Eric Thirion, Bill Triggs, Heung-Yeung Shum, Jian Sun, Ping Tan, Tieniu Tan, Hung-tat Tsui, Jingdong Wang, Zhexi Wang, Yichen Wei, Jianxiong Xiao, Gang Zeng, Honghui Zhang, Peng Zhao, and Andrew Zisserman.

Finally, I would like to express my gratitude to my beloved daughters Jane and Viviane for their support and even critiques, to my wife Pei-Hong for her forbearance, and to my parents for their constant encouragement.
Notation
Unless otherwise stated, we use the following notation throughout the book.
• Scalars are lower-case letters (e.g., x, y, z) or lower-case Greek letters (e.g., α, λ).
• Vectors are bold lower-case letters (e.g., x, v).
• Matrices or tensors are bold upper-case letters (e.g., A, P, T).
• Polynomials and functions are lower-case letters (e.g., f(x, y, z) = 0).
• Geometric points or objects are upper-case letters (e.g., A, B, P).
• Sets are upper-case letters (e.g., I, V).
• An arbitrary number field is k; the real numbers are R and the complex numbers are C.
Contents
Foreword  vii
Preface  ix
Acknowledgements  xi
Notation  xiii
1 Introduction  1

Part I Geometry: fundamentals of multi-view geometry

2 Geometry prerequisite  7
  2.1 Introduction  8
  2.2 Projective geometry  8
    2.2.1 The basic concepts  8
    2.2.2 Projective spaces and transformations  10
    2.2.3 Affine and Euclidean specialization  16
  2.3 Algebraic geometry  21
    2.3.1 The simple methods  21
    2.3.2 Ideals, varieties, and Gröbner bases  23
    2.3.3 Solving polynomial equations with Gröbner bases  24

3 Multi-view geometry  29
  3.1 Introduction  30
  3.2 The single-view geometry  30
    3.2.1 What is a camera?  30
    3.2.2 Where is the camera?  35
    3.2.3 The DLT calibration  37
    3.2.4 The three-point pose algorithm  39
  3.3 The uncalibrated two-view geometry  42
    3.3.1 The fundamental matrix  43
    3.3.2 The seven-point algorithm  45
    3.3.3 The eight-point linear algorithm  46
  3.4 The calibrated two-view geometry  47
    3.4.1 The essential matrix  47
    3.4.2 The five-point algorithm  49
  3.5 The three-view geometry  53
    3.5.1 The trifocal tensor  54
    3.5.2 The six-point algorithm  58
    3.5.3 The calibrated three views  63
  3.6 The N-view geometry  66
    3.6.1 The multi-linearities  66
    3.6.2 Auto-calibration  68
  3.7 Discussions  72
  3.8 Bibliographic notes  72

Part II Computation: from pixels to 3D points

4 Feature point  77
  4.1 Introduction  78
  4.2 Points of interest  78
    4.2.1 Tracking features  78
    4.2.2 Matching corners  80
    4.2.3 Discussions  81
  4.3 Scale invariance  82
    4.3.1 Invariance and stability  82
    4.3.2 Scale, blob and Laplacian  82
    4.3.3 Recognizing SIFT  83
  4.4 Bibliographic notes  84

5 Structure from Motion  85
  5.1 Introduction  86
    5.1.1 Least squares and bundle adjustment  86
    5.1.2 Robust statistics and RANSAC  88
  5.2 The standard sparse approach  90
    5.2.1 A sequence of images  92
    5.2.2 A collection of images  93
  5.3 The match propagation  94
    5.3.1 The best-first match propagation  94
    5.3.2 The properties of match propagation  97
    5.3.3 Discussions  101
  5.4 The quasi-dense approach  103
    5.4.1 The quasi-dense resampling  103
    5.4.2 The quasi-dense SFM  104
    5.4.3 Results and discussions  111
  5.5 Bibliographic notes  117

Part III Modeling: from 3D points to objects

6 Surface modeling  121
  6.1 Introduction  122
  6.2 Minimal surface functionals  123
  6.3 A unified functional  124
  6.4 Level-set method  124
  6.5 A bounded regularization method  125
  6.6 Implementation  127
  6.7 Results and discussions  129
  6.8 Bibliographic notes  136

7 Hair modeling  137
  7.1 Introduction  138
  7.2 Hair volume determination  139
  7.3 Hair fiber recovery  140
    7.3.1 Visibility determination  140
    7.3.2 Orientation consistency  141
    7.3.3 Orientation triangulation  141
  7.4 Implementation  142
  7.5 Results and discussions  144
  7.6 Bibliographic notes  148

8 Tree modeling  149
  8.1 Introduction  150
  8.2 Branch recovery  153
    8.2.1 Reconstruction of visible branches  153
    8.2.2 Synthesis of occluded branches  155
    8.2.3 Interactive editing  157
  8.3 Leaf extraction and reconstruction  159
    8.3.1 Leaf texture segmentation  159
    8.3.2 Graph-based leaf extraction  162
    8.3.3 Model-based leaf reconstruction  165
  8.4 Results and discussions  167
  8.5 Bibliographic notes  174

9 Façade modeling  177
  9.1 Introduction  178
  9.2 Façade initialization  180
    9.2.1 Initial flat rectangle  181
    9.2.2 Texture composition  181
    9.2.3 Interactive refinement  183
  9.3 Façade decomposition  184
    9.3.1 Hidden structure discovery  184
    9.3.2 Recursive subdivision  185
    9.3.3 Repetitive pattern representation  186
    9.3.4 Interactive subdivision refinement  187
  9.4 Façade augmentation  188
    9.4.1 Depth optimization  188
    9.4.2 Cost definition  190
    9.4.3 Interactive depth assignment  190
  9.5 Façade completion  192
  9.6 Results and discussions  192
  9.7 Bibliographic notes  197

10 Building modeling  199
  10.1 Introduction  200
  10.2 Pre-processing  201
  10.3 Building segmentation  203
    10.3.1 Supervised class recognition  203
    10.3.2 Multi-view semantic segmentation  205
  10.4 Building partition  207
    10.4.1 Global vertical alignment  208
    10.4.2 Block separator  208
    10.4.3 Local horizontal alignment  209
  10.5 Façade modeling  210
    10.5.1 Inverse orthographic composition  211
    10.5.2 Structure analysis and regularization  213
    10.5.3 Repetitive pattern rediscovery  216
    10.5.4 Boundary regularization  217
  10.6 Post-processing  218
  10.7 Results and discussions  219
  10.8 Bibliographic notes  224

List of algorithms  227
List of figures  229
References  237
Index  249
Part I
Geometry: fundamentals of multi-view geometry
Chapter 1
Introduction
3D modeling creates a mathematical representation of a 3D object. In computer graphics, people usually use specialized software, for instance Maya or Google SketchUp, to interactively create models. It is common to use a few photographs as references and textures to generate models with such a modeling tool. The more systematic use of object images as the input to generate a photorealistic 3D model of an object is referred to as 'image-based modeling', and it generates models of physically existing objects. More importantly, such a modeling process can be automated, and therefore can be scaled up for applications. More fundamentally, how to recover the lost third dimension of objects from a collection of 2D images is the ultimate goal of computer vision. Image-based modeling and 3D reconstruction from images are precisely the meeting point of computer graphics and vision, with 'modeling' connoting the finalized 3D representation of objects and 'reconstruction' emphasizing the recovery of the lost depth of an image.

The applications of 3D modeling are part of the modern digital age, for instance virtual reality, 3D games, and TV/movie post-production and special effects. The scientific ambition of enabling a camera-equipped computer system with human-like perceptual abilities, such as environment learning and navigation for robotics, requires at least an automatic 3D geometric modeling of the sensed environment. The more recent internet 3D platforms, Google Earth and Microsoft Virtual Earth, set the goal of modeling each and every object on earth, from large to small.

Why image-based and not active 3D scanners? Scanners are currently the prevalent technology for capturing the 3D geometry of objects, but they are invasive and cannot scan certain materials. They are also not as flexible and scalable as a camera for objects of different sizes. More importantly, scanners only generate 3D points, which are of higher accuracy and density, but which still need to be modeled into objects. The insufficiency of the usual 'texture' image of a 3D scanner makes data-driven modeling much more difficult, as image-based segmentation and recognition are often the keys to the success of the modeling. In many applications, the photorealism of the texture is more important than the geometry. Fortunately, more and more 3D scanners are now able to capture a texture image of better quality. The methodology developed in this book can also be applied to integrate scanned 3D points, and it
will result in a more robust process with far more refined geometrical details in the future.
Fig. 1.1 A small-scale modeling example. (a) One of the twenty-five images captured by a handheld camera. (b) The reconstructed quasi-dense 3D points and their segmentation. (c) A desired segmentation of an input image. (d) A rendered image of the reconstructed 3D model at a similar viewpoint, based on the segmentation results in (b) and (c).
The organization of the book

The book is structured into three parts, from the fundamentals to applications.

Part I—Geometry. This part is on the geometry of 3D computer vision. It is divided into two chapters. The first chapter (Chapter 2) reviews concepts such as projective geometry, algebraic geometry and computer algebra. The second chapter (Chapter 3) presents the geometry of a single view, two views, three views and N views. There are many multi-view algorithms in the literature, but only a few of them have proven fundamental over years of research and practice, for their robustness, generality and minimality. Vision geometry essentially comes down to the presentation and understanding of three fundamental algorithms: the five-point algorithm for two calibrated views, the six-point algorithm for three uncalibrated views, and the seven-point algorithm for two uncalibrated views. These algorithms are the computational engine of any structure-from-motion system.
Part II—Computation. This part is on how to obtain 3D points and camera positions from the input images, which is referred to as 'structure from motion' or 3D reconstruction. The part is divided into two chapters. The first chapter (Chapter 4) discusses feature points and their extraction from an image. The second chapter (Chapter 5) first presents the standard sparse structure from motion, then focuses on the quasi-dense approach to structure from motion. The sparse structure from motion is now the standard in computer vision, but it suffers from an insufficient density of reconstructed 3D points for modeling. The quasi-dense approach not only provides sufficiently dense 3D points for the objects to be explicitly modeled, but it also results in a more robust structure from motion than the sparse method, which is critical for applications.

Part III—Modeling. This part is on how to obtain 3D objects in terms of a mathematical representation from unstructured 3D points. It is divided into five chapters, each of which is dedicated to the modeling of a specific class of objects. The first chapter (Chapter 6) reconstructs smooth surfaces such as human faces and daily objects. The second chapter (Chapter 7) reconstructs hair geometries, fiber by fiber. The third chapter (Chapter 8) models a variety of trees. The fourth chapter (Chapter 9) models façades of buildings. The fifth chapter (Chapter 10) automatically generates models of city buildings.

To illustrate, we show two examples of image-based modeling. The first, shown in Figure 1.1, is the modeling of a small-scale object with input images captured by a handheld camera. The second, illustrated in Figure 1.2, is the modeling of a large-scale city with input images systematically captured by cameras mounted on a vehicle, the street-view images of Google.
Fig. 1.2 A large-scale city modeling example in the downtown Pittsburgh area, from images (shown at the bottom) captured at ground level by a camera mounted on a vehicle.
The challenge in turning unstructured 3D points into objects lies in the fact that objects are of different natures and require different mathematical representations. For instance, a smooth object is handled by an implicit surface, while a hair fiber is represented by a space curve. The intrinsic ill-posedness of reconstruction requires strong prior knowledge of the objects. In our approach, we systematically introduce the
prior knowledge, represented as generative rules and generic models of the class of objects. For instance, tree branches have a self-similar tree structure, which is an example par excellence of a generative rule. The leaves of a given type of tree have similar shapes, so all the leaves can be parameterized by a generic model. Obviously, prior to this data-driven modeling, it is necessary to group the data points and pixels that belong to the same 'object' into the same cluster of points and pixels. The concept of an object is subjective: in the example shown in Figure 1.1, the whole plant might be considered as one object for some applications, whereas each individual leaf might be considered as an independent object for others, depending on the required scale of the modeling. We introduce a joint segmentation framework that integrates both the registered 3D points and the input images.
Chapter 2
Geometry prerequisite
This chapter revisits some basic concepts from projective geometry, algebraic geometry and computer algebra. For projective geometry, we present, first intuitively and then more formally, the concepts of homogeneous coordinates and projective spaces. Then we introduce metric properties in a projective language, paving the way for camera modeling. For algebraic geometry and computer algebra, we primarily target the introduction of Gröbner bases and eigenvalue methods for solving the polynomial systems of vision geometry problems. The purpose is to provide the minimum needed for readers to follow the book. More refined and detailed treatments of the topics can be found in the excellent textbooks [196] for projective geometry and [27, 26] for algebraic geometry and computer algebra.
2.1 Introduction

The mathematical foundation of 3D geometric modeling is geometry. We are familiar with Euclidean geometry and with the methods of Cartesian coordinates integrated into algebra and calculus as part of our standard engineering curriculum. We are less well acquainted with the old subject of projective geometry and the newer area of computer algebra, as they are not part of that curriculum. Both of them, however, play fundamental roles in advancing the geometry of computer vision. Projective geometry is the natural language for describing camera geometry and central projection. Historically, projective geometry was motivated and developed in the search for a mathematical foundation for the techniques of 'perspective' used by painters and architects. The success of projective geometry is due above all to the systematic introduction of points at infinity and imaginary points. Computer algebra has been developed for manipulating systems of polynomial equations since the 1960s. Many of the difficult vision geometry problems are now efficiently solved with concepts and tools from computer algebra.
2.2 Projective geometry

2.2.1 The basic concepts

Basic geometry concepts

Geometry concepts are built into our mathematical learning. The classical axiomatic approach to geometry proves theorems from a few geometric axioms or postulates. Since Descartes, we usually define a Euclidean coordinate frame; points are then converted into coordinates and geometric figures are studied by means of algebraic equations. This is the analytical geometry we learned in high school.

Affine and vector space. In modern language, a vector space is an affine space. For instance, an affine plane is the two-dimensional vector space R^2. A point P on the plane is identified with a vector $\overrightarrow{AP} = a\,\overrightarrow{AB} + b\,\overrightarrow{AC}$, where the three non-collinear points A, B, and C specify a vector basis $\overrightarrow{AB}$ and $\overrightarrow{AC}$. Affine geometry thus reduces de facto to linear algebra. Theoretically, there is only a difference of point of view between the concept of a vector space and that of an affine space: each affine space becomes a vector space when we fix one point as the origin, and each vector space is an affine space when we 'forget' the 'zero'. The affine coordinates (a, b) of the point P are the coefficients of the linear combination in the given basis. Geometrically, the affine coordinates can also be viewed as the relative distances to the x-unit length AB and the y-unit length AC, which are obtained by parallel-projecting onto the non-rectangular frame with A as the origin and AB and AC as the x- and y-axes, as illustrated in Figure 2.1. In affine geometry, there is no distance and no angle, only parallelism and ratios.
Fig. 2.1 Left: affine coordinates as coefficients of linear combinations. Right: affine coordinates as ratios.
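To make the affine-coordinate construction concrete, here is a small numeric sketch in Python with NumPy; the particular points A, B, C and P are arbitrary choices for illustration, not taken from the book.

import numpy as np

# Affine coordinates (a, b) of P in the (possibly non-rectangular) frame (A; AB, AC):
# solve AP = a*AB + b*AC for (a, b).
A = np.array([1.0, 1.0])
B = np.array([4.0, 2.0])
C = np.array([2.0, 5.0])
P = np.array([3.0, 4.0])

M = np.column_stack([B - A, C - A])                     # basis vectors AB and AC as columns
a, b = np.linalg.solve(M, P - A)                        # the affine coordinates of P
print(a, b)
print(np.allclose(A + a * (B - A) + b * (C - A), P))    # True: P is recovered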
Euclidean space. If we define a scalar product or dot product '·' in the vector space, we have orthogonality expressed as the vanishing of the scalar product, x · y = 0, for two vectors x and y. An affine or vector space equipped with a scalar product is a Euclidean space, or metric space, in which we do Euclidean or metric geometry. For instance, the distance between two points A and B is the vector length (x · x)^{1/2} for x = $\overrightarrow{AB}$. The angle between AB and BC is $\arccos\big((x \cdot y)/\sqrt{(x \cdot x)(y \cdot y)}\big)$ between the two vectors x = $\overrightarrow{AB}$ and y = $\overrightarrow{BC}$. All these metric concepts are expressed by the scalar product.
Intuitive projective concepts

The elements of both affine and Euclidean spaces are geometric points, all of which are finite points of R^n, with or without a scalar product.

Points at infinity. On a plane, we know that two non-parallel lines intersect at a point, but two parallel lines cannot. Imagine that two parallel lines do meet, at a special point we call a point at infinity for that group of parallel lines. Clearly, different groups of parallel lines meet at different points at infinity. Simply adding these missing points at infinity to the finite points of the plane R^2 gives an extended plane that we call the projective plane P^2. More generally, a projective space of any dimension is simply an affine space plus the missing points at infinity:

    P^n = R^n + {points at infinity}.

In projective geometry, metric properties do not exist, nor does parallelism, but collinearity and incidence are kept. Different geometries speak differently of points at infinity. Euclidean geometry says that parallel lines exist but never meet. Affine geometry says that parallel lines exist and meet at a special point. Projective geometry says that any two lines, including parallel lines, always meet at a point. A point at infinity becomes a normal point in projective space. A point at infinity is visible as a finite vanishing point in a photograph!

Homogeneous coordinates. The difficulty we face is how to represent a point at infinity algebraically.
A point P on a real Euclidean line usually has a real number x ∈ R as its coordinate, which is the distance to the origin. The point at infinity of the line cannot be represented by the symbol ∞, which is merely a notation, not a number. Instead, we take a pair of real numbers (x_1, x_2) and let their ratio be the usual coordinate,

    x = x_1 / x_2,   when x_2 ≠ 0.

The pair of numbers (x_1, x_2) is the homogeneous coordinates of the point, and the number x is the usual inhomogeneous coordinate of the point if it is not at infinity. Intuitively, when x_2 → 0, x_1/x_2 → ∞. The point represented by the homogeneous coordinates (x_1, 0) is the point at infinity. Of course, the point at infinity has only homogeneous coordinates and no inhomogeneous representation, as we cannot divide by zero. We see that the homogeneous representation of a point is not unique, since λ(x_1, x_2) ≡ (x_1, x_2) for any λ ≠ 0. The point at infinity is (1, 0), as (1, 0) ≡ x_1(1, 0) ≡ (x_1, 0). The point (0, 1) is the origin. The representation (0, 0) is invalid and does not represent any point. All pairs of such proportional numbers define an equivalence class by λ(x_1, x_2) ≡ (x_1, x_2). Each class is a valid projective point, and the set of all these classes forms a projective line.
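As a quick illustration, the following Python sketch de-homogenizes points of the projective line; the helper name dehomogenize is ours, not the book's.

def dehomogenize(x1, x2):
    """Usual coordinate x = x1/x2 of the point (x1, x2), or None for the point at infinity."""
    return None if x2 == 0 else x1 / x2

# (3, 2) and (6, 4) are proportional homogeneous coordinates of the same projective point.
print(dehomogenize(3.0, 2.0), dehomogenize(6.0, 4.0))   # 1.5 1.5
# (1, 0) is the point at infinity: it has no inhomogeneous coordinate.
print(dehomogenize(1.0, 0.0))                           # None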
2.2.2 Projective spaces and transformations

Projective spaces

Definition. Let k be a number field, either the real numbers R or the complex numbers C. We first define an equivalence relation for non-zero points x = (x_1, ..., x_{n+1})^T ∈ k^{n+1} − {0} by setting x ∼ x′ if there is a non-zero number λ such that x = λx′. The quotient space of the set of equivalence classes of the relation ∼,

    P^n = (k^{n+1} − {0}) / ∼,

is a projective space of dimension n. A projective space P^n(k), or simply P^n, is thus the set of non-zero equivalence classes determined by the relation ∼ on k^{n+1}. Any element x = (x_1, ..., x_{n+1})^T of an equivalence class is called homogeneous coordinates of the corresponding point of P^n.

This projective space, built as a quotient of homogeneous coordinates, is so far only a space that does not inherit the algebraic structures of k^{n+1}. It is obviously not a vector space; it does not even have a zero! The only structure it inherits is the notion of linear dependence of points, which encodes collinearity! A point P
by x is said to be linearly dependent on a set of points P_i by x_i if there exist λ_i such that x = Σ_i λ_i x_i.

Essential properties of a projective space follow immediately from the definition.
• A point is represented by homogeneous coordinates.
• The homogeneous coordinates of a given point are not unique, but are all multiples of one another.
• The (n + 1)-dimensional zero vector 0 does not represent any point in any projective space.
• A line is the set of points linearly dependent on two distinct points.
• A hyper-plane is a set of linearly dependent points described by u^T x = 0.
• Duality: u^T x = 0 can be viewed as x^T u = 0 by transposition. Geometrically, u^T x = 0, as a set of coplanar points, is dual to x^T u = 0, as a set of hyper-planes intersecting at the point x.
• The set of points at infinity in P^n is the hyper-plane x_{n+1} = 0. Finite points are (x_n, 1)^T and points at infinity are (x_n, 0)^T in P^n, where x_n denotes the vector of the first n coordinates.
• The relation between k^n (not k^{n+1}!) and P^n: all (finite) points of k^n are embedded in P^n by (x_1, ..., x_n)^T → (x_1, ..., x_n, 1)^T. The finite points of P^n, that is, those not at infinity so as to exclude x_{n+1} = 0, are mapped back to k^n by (x_1, ..., x_{n+1})^T → (x_1/x_{n+1}, ..., x_n/x_{n+1})^T. Precisely, the points at infinity are those not reached by this injection.
Projective transformations

Among the transformations between two projective spaces of the same dimension, P^n → P^n, consider the simplest linear transformation in homogeneous coordinates, represented by an (n + 1) × (n + 1) matrix A_{(n+1)×(n+1)}:

    λ x′ = A_{(n+1)×(n+1)} x.

The crucial fact is that this homogeneous linear transformation maps collinear points into collinear points! This can be verified easily by taking three points, one of which is linearly dependent on the other two, and transforming them with the above matrix. Collinearity is the very defining property of projective geometry, so this linear transformation in homogeneous coordinates is called a projective transformation, a collineation, or a homography. A projective transformation has (n + 1) × (n + 1) − 1 = n(n + 2) degrees of freedom, which can be determined from n + 2 points, since each point contributes n inhomogeneous equations. Not surprisingly, this is the
same number of points that define a projective basis in the given space. Obviously, the basis points in two spaces completely determine the transformation. The set of projective transformations A(n+1)×(n+1) forms a projective group that is the general linear group of dimension n + 1 denoted as GL(n + 1, k).
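A minimal numeric check of the collinearity-preserving property, in Python with NumPy; the random matrix stands in for an arbitrary plane homography (the case n = 2), an illustrative assumption of ours.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))        # an arbitrary 3x3 homography of the plane
p1 = np.array([1.0, 2.0, 1.0])
p2 = np.array([3.0, -1.0, 1.0])
p3 = 2.0 * p1 + 5.0 * p2               # linearly dependent on p1 and p2: the three points are collinear
q1, q2, q3 = A @ p1, A @ p2, A @ p3    # transformed points
# The determinant of the transformed triple vanishes: the images remain collinear.
print(np.linalg.det(np.column_stack([q1, q2, q3])))     # ~ 0, up to rounding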
The 1D projective line

The 1D projective line is simple.

Points. The homogeneous coordinates of a point on a projective line are x = (x_1, x_2)^T. There is only one point at infinity, λ(1, 0) (not two!). Any de-homogenized ratio x_1/x_2 or x_2/x_1 admits an invariant interpretation as a cross-ratio of four points or four numbers.

Cross-ratios. The cross-ratio of four numbers a, b, c, and d is

    (a, b; c, d) = \frac{(a - c)/(b - c)}{(a - d)/(b - d)},

which is a ratio of relative distances (and relative distances are the affine coordinates of a point). The cross-ratio is the fundamental projective invariant quantity, like the distance for Euclidean geometry and the relative distance for affine geometry. Figure 2.2 shows the invariance of the cross-ratio of four points on a line: (A, B; C, D) = (A′, B′; C′, D′).
Fig. 2.2 The invariance of a cross-ratio.
A projective transformation, or homography, of the line is given by

    λ \begin{pmatrix} x'_1 \\ x'_2 \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}.

The topology. The projective line is closed, as there is only one point at infinity. It is topologically equivalent to a circle.
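The invariance of the cross-ratio under such a homography can be checked numerically; the following Python sketch uses arbitrary points and an arbitrary non-singular 2 × 2 matrix of our choosing.

def cross_ratio(a, b, c, d):
    """Cross-ratio (a, b; c, d) of four numbers (inhomogeneous coordinates on a line)."""
    return ((a - c) / (b - c)) / ((a - d) / (b - d))

def homography(x, m):
    """1D homography in inhomogeneous form: x -> (m00*x + m01) / (m10*x + m11)."""
    return (m[0][0] * x + m[0][1]) / (m[1][0] * x + m[1][1])

points = [0.0, 1.0, 3.0, 4.0]
m = [[2.0, 1.0], [1.0, 3.0]]                       # an arbitrary non-singular matrix
mapped = [homography(x, m) for x in points]
print(cross_ratio(*points), cross_ratio(*mapped))  # both 1.125: the cross-ratio is preserved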
The 2D projective plane

The projective plane is the most convenient space in which to introduce concepts and describe properties.

Points.
• The homogeneous coordinates of a point of the projective plane are x = (x_1, x_2, x_3)^T.
• The points at infinity are those characterized by (x_1, x_2, 0)^T, i.e., x_3 = 0, which is the line at infinity. There is one point at infinity per direction.
• Four points, no three of which are collinear, define a projective basis. The coordinates of the basis points can be chosen arbitrarily as long as they are independent; the simplest possible ones are called the canonical or standard coordinates of the basis points, which are (1, 0, 0), (0, 1, 0), (0, 0, 1) and (1, 1, 1).
Fig. 2.3 Definition of (inhomogeneous) projective coordinates (α, β) of a point P on a plane through four reference points A, B, C and D.
Lines. Two distinct points are linearly independent. A line is the set of points x linearly dependent on two distinct points x_1 and x_2, x = λx_1 + μx_2, which is equivalent to |x, x_1, x_2| = 0. This leads to the linear form l^T x = 0 of the equation of the line, in which l = x_1 × x_2. The cross-product here is merely a notational device coming from the vanishing determinant. If two points x_1 and x_2 define a line l = x_1 × x_2, then dually, two lines l_1 and l_2 define a point x = l_1 × l_2, which is the intersection point of the pencil of concurrent lines generated by l_1 and l_2: |l, l_1, l_2| = 0.

Example 1. A first line, through the two points (0, 0)^T and (0, 1)^T, is (0, 0, 1)^T × (0, 1, 1)^T = (1, 0, 0)^T, i.e., x = 0. A second line, through the two points (1, 0)^T and
(1, 1)^T, is (1, 0, 1)^T × (1, 1, 1)^T = (1, 0, −1)^T, i.e., x − 1 = 0. These two lines intersect at the point (0, 1, 0)^T = (1, 0, 0)^T × (1, 0, −1)^T, which is the point at infinity of the y-axis. This example illustrates that these basic geometric operations are much more easily carried out with homogeneous coordinates than with Cartesian coordinates (a short code check of this example appears at the end of this subsection).

The set of all lines through a fixed point is called a pencil of lines, and the fixed point is the vertex of the pencil.

Conics. A curve described by a second-degree equation in the plane is a conic curve. It can be written as

    a x^2 + b xy + c y^2 + d xt + e yt + f t^2 = 0

in homogeneous coordinates (x, y, t), or in matrix form

    x^T C x = 0,

where C is a 3 × 3 homogeneous, symmetric matrix. A conic matrix C has five degrees of freedom, so five points determine a unique conic. The line tangent to a conic C at a given point x is λl = Cx. The dual conic of a given point conic C is l^T C^{−1} l = 0, which is a conic in line coordinates, also called a line conic or conic envelope.

Two points are said to be conjugate with respect to a conic if the cross-ratio of these two points and the two points in which the line they define meets the conic is in harmonic division, i.e., equals −1. With any point x we can associate a polar line l = Cx; accordingly, the point x is called the pole of the polar line.

A conic is generated by the intersection points of corresponding lines of two homographic pencils of lines, and the vertices of the pencils lie on the conic. This is Steiner's theorem for the projective generation of the conic.

The topology. The real projective plane P^2(R) is topologically equivalent to a sphere in space with one disc removed and replaced by a Möbius strip. The Möbius strip has, after all, a single circle as its boundary, and all that we are asking is that the points of this boundary circle be identified with those of the boundary circle of the hole in the sphere. The resulting closed surface is not orientable and is equivalent to the projective plane.
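Here is the numeric check of Example 1 promised above, a small Python/NumPy sketch of lines as cross products of homogeneous points and of their intersection as the cross product of the two lines (all results are homogeneous, hence defined only up to scale).

import numpy as np

l1 = np.cross([0, 0, 1], [0, 1, 1])   # line through (0,0) and (0,1): (-1,0,0) ~ (1,0,0), i.e. x = 0
l2 = np.cross([1, 0, 1], [1, 1, 1])   # line through (1,0) and (1,1): (-1,0,1) ~ (1,0,-1), i.e. x - 1 = 0
p  = np.cross(l1, l2)                 # intersection of the two lines
print(l1, l2, p)                      # p ~ (0, 1, 0): the point at infinity of the y-axis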
The 3D projective space

The 3D projective space extends the 2D projective plane with more elaborate structures.

Points. The homogeneous coordinates of a point of 3D projective space are (x_1, x_2, x_3, x_4)^T. The points at infinity, (x_1, x_2, x_3, 0)^T, i.e., x_4 = 0, form a plane, the plane at infinity. There is a line at infinity for each pencil of parallel planes.

Planes. Three distinct points are linearly independent. A plane is the set of points x linearly dependent on three distinct points x_1, x_2 and x_3, x = ax_1 + bx_2 + cx_3, which is

    |x, x_1, x_2, x_3| = 0,
which is the linear form u^T x = 0 of the equation of the plane. Points and planes (not lines!) are dual in space: for instance, three points define a plane and three planes intersect at a point. The set of all planes through a fixed line is called a pencil of planes, and the fixed line is the axis of the pencil.

Lines. A line is the set of points x = ax_1 + bx_2 linearly dependent on two distinct points x_1 and x_2. A space line has only 4 degrees of freedom. Count them properly! One way is to imagine that a line is defined by two points, each moving on a fixed plane, and a point moving on a fixed plane has only 2 degrees of freedom. The Grassmannian or Plücker coordinates of a line in space are the six 2 × 2 minors p_{ij} = |ij| of the 2 × 4 matrix obtained by stacking x_1 and x_2. The vanishing determinant det(x, y, x, y) = 0 gives the quadratic identity

    Ω_{pp} = p_{01} p_{23} + p_{02} p_{31} + p_{03} p_{12} = 0.

These minors are therefore not independent: among the six of them, there are only 6 − 1 (scale) − 1 (identity) = 4 degrees of freedom. Two lines p and q are coplanar, i.e., they intersect, if and only if Ω_{pq} = 0 (a small code check of this criterion appears at the end of this subsection).

Quadrics. The first analogue of the conic, as a surface in 3D space, is a quadric surface. Analogous to a conic x^T C x = 0, a surface of degree two is x^T Q x = 0, where Q is a 4 × 4 homogeneous, symmetric matrix. A quadric has nine degrees of freedom, so nine points in general position determine a quadric. The dual quadric in plane coordinates is a plane quadric u^T Q^{−1} u = 0. A surface is ruled if through every point of the surface there is a straight line lying on the surface. A proper non-degenerate ruled quadric is the hyperboloid of one sheet; the degenerate ruled quadrics may be cones and planes. Ruled surfaces are of particular importance for the study of critical configurations of vision algorithms. A ruled surface is generated by the intersection lines of corresponding planes of two homographic pencils of planes.

Twisted cubics. The second analogue of the conic, as a curve in 3D space, is the twisted cubic, which is represented through a non-singular 4 × 4 matrix A and parameterized by θ as

    (x_1, x_2, x_3, x_4)^T = A (θ^3, θ^2, θ, 1)^T.

This is analogous to a conic (x_1, x_2, x_3)^T = A (θ^2, θ, 1)^T with a non-singular 3 × 3 matrix A. A twisted cubic has twelve degrees of freedom: the sixteen entries of the
matrix minus one for the homogeneity and minus three for a 1D homography of the parameter θ. Six points in general position define a unique twisted cubic.

The topology. The real projective space P^3(R) is topologically equivalent to a sphere in four-dimensional space with antipodal points identified, which is the topology of the rotation group SO(3). It is orientable.
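The Plücker coordinates and the coplanarity test translate directly into code. The sketch below, in Python/NumPy with helper names of our own choosing, computes the six minors of two stacked points and evaluates the bilinear form Ω on pairs of lines.

import numpy as np

def pluecker(x, y):
    """Plücker coordinates (p01, p02, p03, p23, p31, p12) of the line through the
    homogeneous 4-vectors x and y: the six 2x2 minors of the stacked 2x4 matrix."""
    M = np.vstack([x, y]).astype(float)
    m = lambda i, j: M[0, i] * M[1, j] - M[0, j] * M[1, i]
    return np.array([m(0, 1), m(0, 2), m(0, 3), m(2, 3), m(3, 1), m(1, 2)])

def omega(p, q):
    """Bilinear Plücker form: zero iff the two lines are coplanar (i.e., they intersect)."""
    return p[0]*q[3] + p[1]*q[4] + p[2]*q[5] + p[3]*q[0] + p[4]*q[1] + p[5]*q[2]

x_axis = pluecker([0, 0, 0, 1], [1, 0, 0, 1])
y_axis = pluecker([0, 0, 0, 1], [0, 1, 0, 1])
skew   = pluecker([0, 0, 1, 1], [0, 1, 1, 1])   # parallel to the y-axis but lifted in z
print(omega(x_axis, x_axis))   # 0: the quadratic identity Omega_pp = 0
print(omega(x_axis, y_axis))   # 0: the two axes intersect (at the origin)
print(omega(x_axis, skew))     # nonzero: the two lines are skew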
2.2.3 Affine and Euclidean specialization

We have seen how a projective space is extended from an affine space by adding the special points at infinity, which lose their special status in projective space. Now we will see how a projective space is specialized into an affine and a Euclidean space by singling out special points and lines, which then enjoy the special status of invariance in the specialized spaces. This is Klein's view of projective geometry as an umbrella geometry under which affine, similarity, and Euclidean geometries all reside as sub-geometries. A geometry is associated with a group of transformations that leaves the properties of the geometry invariant.

Projective to affine

Affine geometry involves the specialization of one invariant line in 2D and one invariant plane in 3D. The affine specialization is linear.

The line at infinity. The line at infinity defines what is affine in geometry. Parallel lines meet at a point at infinity. The projective conic specializes into an ellipse, a parabola, or a hyperbola in affine space, depending on the number of intersection points of the conic with the line at infinity.
Projective to affine Affine geometry involves a specialization of one invariant line in 2D and one invariant plane in 3D. Affine specialization is linear. The line at infinity. The line at infinity defines what is affine in geometry. The parallel lines meet at a point at infinity. The projective conic specializes as an ellipse, parabola, and hyperbola in the affine space depending on the number of the intersection points of a conic with the line at infinity. l∞
Fig. 2.4 Affine classification of a conic: an ellipse, a parabola, or a hyperbola is a conic intersecting the line at infinity in 0, 1 (double), or 2 points, respectively.
The 2D affine transformation. Affine geometry is characterized by the group of transformations that leaves the points at infinity invariant. The invariance is global, not point by point, that is, a point at infinity is usually mapped into a different point at infinity, but it cannot be mapped into a finite point. Class matters! Start from an arbitrary projective transformation
    A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}.

Choose a line at infinity; for convenience, we take x_3 = 0. If this line is to be left invariant by the transformation, the transformed line must again be x'_3 = 0. This constraint imposes a_{31} = a_{32} = 0 (with a_{33} ≠ 0, which can be normalized to 1), so an affine transformation is of the form

    A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ 0 & 0 & 1 \end{pmatrix}.

It is straightforward to verify by inspection that all such matrices form a group of affine transformations, which is a sub-group of the group of projective transformations. Now, for all finite points x_3 ≠ 0, we can de-homogenize the homogeneous coordinates (x_1, x_2, x_3)^T by x = x_1/x_3 and y = x_2/x_3. Then we obtain

    \begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} a_{13} \\ a_{23} \end{pmatrix},

which is the familiar form of the affine transformation in inhomogeneous coordinates: a linear transformation plus a translation, valid only for finite points.

The plane at infinity. Each plane has one line at infinity. All the lines at infinity of all planes form the plane at infinity of space.

The 3D affine transformation. A 3D affine transformation, leaving invariant the plane at infinity x_4 = 0, becomes

    \begin{pmatrix} A_{3×3} & a_{3×1} \\ 0_{1×3} & 1 \end{pmatrix}.

Affine to Euclidean

Euclidean geometry involves the specialization of an invariant pair of points in 2D and an invariant conic in 3D. The Euclidean specialization is quadratic.

The circular points. The circular points define what is Euclidean from the projective and affine point of view. What are these points? Let us first do the impossible: intersect two concentric circles of different radii,

    x^2 + y^2 = 1,    x^2 + y^2 = 4.

There is no intersection point if we stay in the finite space characterized by the above inhomogeneous coordinates and equations. We first homogenize the equations through x → x_1/x_3 and y → x_2/x_3 to extend Euclidean space to projective space,
    x_1^2 + x_2^2 = x_3^2,    x_1^2 + x_2^2 = 4x_3^2.

Subtracting the two equations leads to 3x_3^2 = 0, so x_3 = 0 and x_1^2 + x_2^2 = 0, which gives (x_1/x_2)^2 = −1, or x_1/x_2 = ±i. Thus, the intersection points of the two concentric circles are the pair of points (±i, 1, 0), which we call the circular points and denote by I and J with coordinates i and j. They are called circular points because all circles go through them! This can easily be verified by substituting the circular points back into the equation of any circle (this is verified in the code sketch below). We never see them in Euclidean space, as they are complex points at infinity! A circle is determined by three points, the other two points being the circular points.

The orthogonality. The circular points are an algebraic device to specify the metric properties of a Euclidean plane. The orthogonality of two lines is expressed by requiring that the cross-ratio of the intersection points of the two lines with the line at infinity and the circular points be in harmonic division:

    (A, B; I, J) = −1.

Furthermore, the angle between the two lines can be measured through this cross-ratio by Laguerre's formula,

    θ = \frac{i}{2} \ln (A, B; I, J).

The congruency. The congruency replaces the homography as the one-dimensional Euclidean transformation. Two pencils are congruent if they are related by a rotation, so that the angle between corresponding elements (points, lines or planes) is constant. For instance, the projective generation of the conic and the quadric from homographic pencils of lines and planes becomes a kind of Euclidean generation of the circle and the orthogonal quadric from congruent pencils of lines and planes. In particular, a circle is generated by two congruent pencils of lines specified by three points.

The 2D Euclidean transformation. (Similarity) Euclidean geometry is obtained from the group of transformations that leaves the circular points invariant as a pair, so (i, 1) may be mapped to (−i, 1). The invariance

    λ \begin{pmatrix} ±i \\ 1 \end{pmatrix} = A_{2×2} \begin{pmatrix} ±i \\ 1 \end{pmatrix}

constrains s^2 + c^2 = 1, such that

    \begin{pmatrix} x' \\ y' \end{pmatrix} = ρ \begin{pmatrix} c & s \\ −s & c \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} a_{13} \\ a_{23} \end{pmatrix},

which is the familiar similarity transformation if we parameterize c and s with an explicit angle θ as c = cos θ and s = sin θ. The factor ρ is a global scaling.
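These properties of the circular points can be verified symbolically. The following sketch uses SymPy (our choice of tool, not one the book prescribes) to check that every circle passes through the circular points and that two perpendicular directions are harmonic with respect to them.

import sympy as sp

x, y, t, d, e, f = sp.symbols('x y t d e f')
circle = x**2 + y**2 + d*x*t + e*y*t + f*t**2          # a general circle, homogenized with t
I_pt = {x: 1, y: sp.I, t: 0}                           # circular point I = (1, i, 0)
J_pt = {x: 1, y: -sp.I, t: 0}                          # circular point J = (1, -i, 0)
print(sp.simplify(circle.subs(I_pt)), sp.simplify(circle.subs(J_pt)))   # 0 0: every circle contains I and J

def cross_ratio(a, b, c, d_):
    """Cross-ratio of four collinear points given by homogeneous 2-vectors (here on the line at infinity)."""
    det = lambda u, v: u[0]*v[1] - u[1]*v[0]
    return sp.simplify(det(a, c) * det(b, d_) / (det(a, d_) * det(b, c)))

A_dir, B_dir = (1, 0), (0, 1)                # points at infinity of two perpendicular directions
I_dir, J_dir = (1, sp.I), (1, -sp.I)         # the circular points (first two coordinates)
print(cross_ratio(A_dir, B_dir, I_dir, J_dir))   # -1: harmonic division, i.e., orthogonality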
The absolute conic. The pair of circular points i and j, also called the absolute points, are jointly described by the equations x_1^2 + x_2^2 = 0, x_3 = 0. The absolute conic is their extension in 3D,

    x_3^T x_3 ≡ x_1^2 + x_2^2 + x_3^2 = 0,    x_4 = 0,

where x_3 = (x_1, x_2, x_3)^T denotes the vector of the first three coordinates; it is made up of the circular points of all planes. The absolute conic has no real points; it is purely imaginary. It is the algebraic device that specifies the metric properties of Euclidean space. Two planes are perpendicular if they meet the plane at infinity in a pair of lines conjugate for the absolute conic. Two lines are perpendicular if they meet the plane at infinity in a pair of points conjugate for the absolute conic.

The 3D Euclidean transformation. A similarity transformation leaves the absolute conic invariant, globally. Writing x'_3 = A_{3×3} x_3, the invariance of the absolute conic,

    x_3'^T x_3' = x_3^T A_{3×3}^T A_{3×3} x_3 = x_3^T x_3,

constrains A_{3×3}^T A_{3×3} = I_{3×3}, which is the defining property for A_{3×3} to be a rotation matrix R! Recall that a 3 × 3 matrix R is an orthogonal matrix representing a 3D rotation if R R^T = I_{3×3}. It has three degrees of freedom, which may be chosen as the three rotation angles around the three axes. The similarity transformation now becomes

    \begin{pmatrix} R_{3×3} & a_{3×1} \\ 0_{1×3} & ρ \end{pmatrix}.

If we fix the global scale ρ = 1, we obtain the familiar rigid transformation in homogeneous form,

    \begin{pmatrix} x' \\ y' \\ z' \\ 1 \end{pmatrix} = \begin{pmatrix} R_{3×3} & t_{3×1} \\ 0_{1×3} & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix},

or in the usual inhomogeneous form

    \begin{pmatrix} x' \\ y' \\ z' \end{pmatrix} = R \begin{pmatrix} x \\ y \\ z \end{pmatrix} + t.
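A small numeric sketch in Python/NumPy of the homogeneous rigid transformation, with an arbitrary rotation angle and translation chosen by us for illustration.

import numpy as np

theta = np.deg2rad(30.0)
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])   # a rotation about the z-axis
t = np.array([1.0, 2.0, 3.0])
assert np.allclose(R @ R.T, np.eye(3))                 # R is orthogonal: R R^T = I

T = np.eye(4)
T[:3, :3] = R
T[:3, 3] = t                                           # homogeneous rigid transformation [R t; 0 1]
X = np.array([0.5, -1.0, 2.0, 1.0])                    # a finite point in homogeneous coordinates
print(T @ X)                                           # equals (R x + t, 1)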
The geometry of cameras

The purpose of reviewing projective geometry is that camera geometry is more intrinsically described by it. We will see the following highlights in the coming chapter.
• A pinhole camera without nonlinear optical distortion is a projective transformation from a projective space of dimension three, P^3, to a projective space of dimension two, P^2, which is represented by a 3 × 4 matrix (a minimal numeric sketch follows this list).
• The intrinsic parameters of the camera, which describe its metric properties, correspond to the absolute conic, which specializes a projective geometry into a Euclidean geometry. If a camera is calibrated, then the absolute conic is fixed.
• The epipolar geometry of two cameras is that of two pencils of planes, which are in homographic relation if the cameras are not calibrated, and in congruent relation if the cameras are calibrated.
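Here is the minimal sketch referred to in the first bullet, in Python/NumPy. The factorization of the 3 × 4 matrix into an intrinsic matrix K and a pose (R, t) anticipates the next chapter; the particular numbers are arbitrary assumptions for illustration.

import numpy as np

K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])        # an assumed intrinsic matrix (focal length, principal point)
R = np.eye(3)                                # camera rotation
t = np.array([[0.0], [0.0], [5.0]])          # camera translation
P = K @ np.hstack([R, t])                    # the 3x4 projection matrix of a pinhole camera
X = np.array([0.2, -0.1, 2.0, 1.0])          # a 3D point in homogeneous coordinates (P^3)
x = P @ X                                    # its image in homogeneous coordinates (P^2)
print(x[:2] / x[2])                          # pixel coordinates after de-homogenization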
Summary

The hierarchy of geometries in projective language is summarized in Tables 2.1 and 2.2 for 2D planes and 3D spaces. The term 'incidence' should broadly include collinearity and tangency. Notice that the invariant elements of a given transformation are different from the invariant elements of a given group of transformations. The invariants of a group specify different geometries, while the invariants of a transformation merely characterize the type of transformation within the same group or geometry. For instance, the line at infinity is invariant for the affine group, while an eigenvector of a given transformation representing a point is a fixed point of that transformation.

Table 2.1 Hierarchy of plane geometries. (The original table also illustrates the deformation of a unit square under each group; those drawings are not reproduced here.)

Geometry   | Transformation group                                  | D.o.f. | Invariant properties                             | Invariant quantities | Invariant elements
Projective | [a_11 a_12 a_13; a_21 a_22 a_23; a_31 a_32 a_33]      | 8      | incidence (including collinearity and tangency)  | cross-ratio          | -
Affine     | [a_11 a_12 a_13; a_21 a_22 a_23; 0 0 1]               | 6      | incidence, parallelism                           | ratio                | line at infinity
Similarity | [cos th sin th t_1; -sin th cos th t_2; 0 0 rho]      | 4      | incidence, parallelism, orthogonality            | ratio, angle         | line at infinity, circular points
Euclidean  | [cos th sin th t_1; -sin th cos th t_2; 0 0 1]        | 3      | incidence, parallelism, orthogonality            | distance, angle      | line at infinity, circular points
Table 2.2 Hierarchy of space geometries.

Geometry   | Transformation group           | D.o.f. | Invariant properties                             | Invariant quantities | Invariant elements
Projective | A_{4x4}                        | 15     | incidence (including collinearity and tangency)  | cross-ratio          | -
Affine     | [A_{3x3} a_{3x1}; 0_{1x3} 1]   | 12     | incidence, parallelism                           | ratio                | plane at infinity
Similarity | [R_{3x3} t_{3x1}; 0_{1x3} rho] | 7      | incidence, parallelism, orthogonality            | ratio, angle         | plane at infinity, absolute conic
Euclidean  | [R_{3x3} t_{3x1}; 0_{1x3} 1]   | 6      | incidence, parallelism, orthogonality            | distance, angle      | plane at infinity, absolute conic
2.3 Algebraic geometry

Many geometry problems lead to a system of polynomial equations of the form

f_1(x_1, ..., x_n) = 0,
...
f_s(x_1, ..., x_n) = 0,

where the f_i are polynomials in n variables with coefficients from the real number field. The goal is to solve this system of polynomial equations.
2.3.1 The simple methods

The fundamental theorem of algebra states that a polynomial equation of degree n in one variable,

f(x) = x^n + c_{n-1} x^{n-1} + ... + c_1 x + c_0,

has n solutions over the field C of complex numbers.

• The companion matrix. Solving a polynomial equation in one variable, f(x) = 0, is equivalent to computing the eigenvalues of its companion matrix

[ 0 0 ... 0 -c_0     ]
[ 1 0 ... 0 -c_1     ]
[ 0 1 ... 0 -c_2     ]
[ . . ... . .        ]
[ 0 0 ... 1 -c_{n-1} ].

Numerically, the power method for eigenvalues or the Newton-Raphson method for finding all the roots of a polynomial in one variable could be used for effective solutions [172]; a computational sketch is given after Example 2 below.
• Bézout's theorem. For a polynomial system, if the degrees of the polynomials are n_1, n_2, ..., n_m, then there are n_1 x n_2 x ... x n_m solutions. This count is exact if we count properly with complex points, points at infinity, and multiplicities of points.
• The Sylvester resultant. A system of polynomials can be solved by elimination, reducing the system to a polynomial in only one variable, much as for a system of linear equations. We mention two ways of eliminating variables. The first elimination method is the classical Sylvester resultant of two polynomials. The second method requires a few more advanced algebraic geometry concepts based on Gröbner bases, which will be presented in the next paragraph. Given two polynomials f, g in k[x_1, ..., x_n],

f = a_0 x^l + ... + a_l,
g = b_0 x^m + ... + b_m,

viewed as polynomials in x, the Sylvester matrix of f and g with respect to x, denoted Sylvester(f, g, x), is the (l + m) x (l + m) matrix whose first m columns carry the coefficients a_0, a_1, ..., a_l of f and whose last l columns carry the coefficients b_0, b_1, ..., b_m of g, each column shifted down by one row with respect to the previous one:

[ a_0            b_0            ]
[ a_1  a_0       b_1  b_0       ]
[ a_2  a_1  .    b_2  b_1  .    ]
[ .    .    a_0  .    .    b_0  ]
[ a_l  .    a_1  b_m  .    b_1  ]
[      a_l  .         b_m  .    ]
[           a_l            b_m  ].

The resultant of f and g with respect to x is the determinant of the Sylvester matrix, i.e., Resultant(f, g, x) = Det(Sylvester(f, g, x)), which is free of the variable x, so that the variable x is 'eliminated'. In Maple, for instance, we can use the command 'resultant()' to compute the resultant.

Example 2 Let f = xy - 1 and g = x^2 + y^2 - 4. Then

Resultant(f, g, x) = det [ y   0    1       ]
                         [ -1  y    0       ]
                         [ 0   -1   y^2 - 4 ]  = y^4 - 4y^2 + 1,

which eliminates the variable x.
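Both devices are easy to exercise on a computer. The following is a minimal sketch, assuming NumPy and SymPy are available; the monic cubic in the first part is an arbitrary illustration, and the second part reproduces Example 2.

# Companion-matrix root finding and the Sylvester resultant of Example 2.
# Assumes NumPy and SymPy; the monic cubic below is an arbitrary illustration.
import numpy as np
from sympy import symbols, resultant, expand

# Companion matrix: roots of f(x) = x^3 - 6x^2 + 11x - 6 = (x-1)(x-2)(x-3).
c = [-6.0, 11.0, -6.0]                 # c0, c1, c2 of x^3 + c2 x^2 + c1 x + c0
n = len(c)
C = np.zeros((n, n))
C[1:, :-1] = np.eye(n - 1)             # sub-diagonal of ones
C[:, -1] = [-ci for ci in c]           # last column holds -c0, ..., -c_{n-1}
print(np.sort(np.linalg.eigvals(C)))   # approximately [1. 2. 3.]

# Sylvester resultant of Example 2: eliminate x from f and g.
x, y = symbols('x y')
f = x*y - 1
g = x**2 + y**2 - 4
print(expand(resultant(f, g, x)))      # y**4 - 4*y**2 + 1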
2.3.2 Ideals, varieties, and Gröbner bases

A monomial is a finite product of variables x_1^{a_1} ... x_n^{a_n} with non-negative exponents a_1, ..., a_n. A polynomial f in x_1, ..., x_n with coefficients in a number field k (either R or C) is a finite linear combination of monomials. The sum and product of two polynomials are again polynomials. The set of all polynomials in x_1, ..., x_n with coefficients in k, denoted by k[x_1, ..., x_n], is a ring, the polynomial ring, which is also an infinite-dimensional vector space.

Definition. An ideal I is a subset of the ring k[x_1, ..., x_n] satisfying

1. 0 is in I.
2. If f, g are in I, then f + g is in I.
3. If f is in I and h is in k[x_1, ..., x_n], then hf is in I.

If f_1, ..., f_s are in k[x_1, ..., x_n], then I = <f_1, ..., f_s> is an ideal of k[x_1, ..., x_n], generated by f_1, ..., f_s. The ideal generated by the f_i consists of all polynomial consequences of the equations f_i = 0. Intuitively, the solutions of the ideal are the same as those of the generating polynomials.

Definition. The variety V(f_1, ..., f_s), a subset of k^n, is the set of all solutions of the system of equations f_1 = f_2 = ... = f_s = 0,

V = {(a_1, ..., a_n) in k^n : f_i(a_1, ..., a_n) = 0 for all 1 <= i <= s}.

A variety is determined by the ideal, not by the particular equations, which can be rearranged. A variety can be an affine variety in k^n defined by inhomogeneous polynomials, or a projective variety in P^n defined by homogeneous polynomials. In algebraic geometry terminology, finding the finitely many solutions of the polynomial equations f_i = 0 is equivalent to finding the points of the variety V(I) for the ideal I generated by the f_i.

The generating polynomials f_1, ..., f_s of an ideal I form a basis of I. The Hilbert basis theorem states that every ideal is finitely generated. But there are many different bases for an ideal. A Gröbner basis or standard basis is an ideal basis with special, interesting properties. If we want to decide the ideal membership of a polynomial f in k[x_1, ..., x_n] with respect to an ideal generated by a set of polynomials F = {f_1, ..., f_s}, then intuitively we may divide f by F. This means that f is expressed in the form

f = a_1 f_1 + ... + a_s f_s + r,

where the quotients a_1, ..., a_s and the remainder r are in k[x_1, ..., x_n]. If f is 'divisible', i.e. the remainder is zero, then f belongs to the ideal generated by F. But to characterize the remainder properly, it is necessary to introduce an ordering of the monomials. There are many choices of monomial orders.
Definition. We first fix an ordering on the variables, x_1 > x_2 > ... > x_n. The lexicographic order is analogous to the ordering of words used in dictionaries: x^a >_lex x^b if the left-most nonzero entry of the difference a - b is positive. The graded lexicographic order first orders by total degree, then breaks ties using the lexicographic order. The graded reverse lexicographic order also first orders by total degree, then breaks ties the other way round: x^a >_grevlex x^b if sum_i a_i > sum_i b_i, or if sum_i a_i = sum_i b_i and the right-most nonzero entry of the difference a - b is negative. The graded reverse lexicographic order is less intuitive, but is more efficient for computation.

Example 3 If x > y > z, we have the following orderings of the given monomials.
Lexicographic: x^3 > x^2 z^2 > x y^2 z > z^2,
Graded lexicographic: x^2 z^2 > x y^2 z > x^3 > z^2,
Graded reverse lexicographic: x y^2 z > x^2 z^2 > x^3 > z^2.

Definition. Given a monomial order, a finite subset G = {g_1, ..., g_t} of an ideal I is a Gröbner basis if

<LT(g_1), ..., LT(g_t)> = <LT(I)>,

where LT(f) is the leading term of a polynomial f. That is, a set of polynomials is a Gröbner basis if and only if the leading term of any element of I is divisible by one of the LT(g_i).

If G is a Gröbner basis of I = <f_1, ..., f_s>, the remainder of a polynomial f on division by G, denoted f^G, is uniquely determined, and f belongs to I if and only if f^G = 0.
2.3.3 Solving polynomial equations with Gröbner bases

There are two approaches to solving polynomial equations via Gröbner bases. The first approach uses a lexicographic Gröbner basis to obtain a polynomial in one variable. The second uses a non-lexicographic Gröbner basis to convert the system into an eigenvalue problem.
By elimination

A lexicographic Gröbner basis is to a polynomial system what Gaussian elimination is to a linear system. One of its generators is a polynomial in one variable only. So once we have a lexicographic Gröbner basis, we solve that generator with any of the numerical root-finding methods for one-variable polynomials, then back-substitute to find the solutions of the whole system. This is conceptually a simple and universal method that generalizes the usual techniques used to solve systems of linear equations.

Example 4 For the system of equations
x^2 + y^2 + z^2 = 4,
x^2 + 2y^2 = 5,
xz = 1,

a lexicographic Gröbner basis for the system (computed, for instance, with the command 'gbasis' in Maple) is

x + 2z^3 - 3z = 0,
y^2 - z^2 - 1 = 0,
2z^4 - 3z^2 + 1 = 0.

The last equation contains only the variable z. The main challenge is that the computation of a Gröbner basis with symbolic coefficients in the system of polynomials is difficult. A non-lexicographic basis is often easier to compute than a lexicographic one. This motivates the following eigenvalue method using a non-lexicographic Gröbner basis.
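Before turning to that method, Example 4 can be reproduced with any computer algebra system. The following is a minimal sketch assuming SymPy rather than Maple; SymPy returns monic generators, so the last one appears scaled as z^4 - (3/2)z^2 + 1/2.

# Lexicographic Groebner basis for Example 4, computed with SymPy (not Maple).
# The generators come back monic, i.e. scaled versions of those in the text.
import numpy as np
from sympy import symbols, groebner, Poly

x, y, z = symbols('x y z')
system = [x**2 + y**2 + z**2 - 4, x**2 + 2*y**2 - 5, x*z - 1]

G = groebner(system, x, y, z, order='lex')
for g in G:
    print(g)          # x + 2*z**3 - 3*z,  y**2 - z**2 - 1,  z**4 - 3*z**2/2 + 1/2

# The generator in z alone comes last here; solve it and back-substitute.
z_poly = Poly(G[-1], z)
for z0 in np.roots([float(c) for c in z_poly.all_coeffs()]):
    x0 = 3*z0 - 2*z0**3            # from the first generator: x = 3z - 2z^3
    print(z0, x0)                  # y then follows from y^2 = z^2 + 1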
Via eigenvalues

This approach exploits the structure of the polynomial ring k[x_1, ..., x_n] modulo an ideal I given by a non-lexicographic Gröbner basis G.

Definition. The quotient k[x_1, ..., x_n]/I is the set of equivalence classes

k[x_1, ..., x_n]/I = {[f] : f in k[x_1, ..., x_n]},

where the equivalence class of f modulo I, or the coset of f, is

[f] = f + I = {f + h : h in I}.

The key fact is that [f] = [g] if and only if f - g is in I. The quotient k[x_1, ..., x_n]/I is a ring, as we can add and multiply any two of its elements. Moreover, we can multiply by constants, so it is a vector space over k as well. This vector space is denoted by A = k[x_1, ..., x_n]/I; it is finite-dimensional for the systems of interest here, while the polynomial ring k[x_1, ..., x_n] as a vector space is infinite-dimensional.

Given a Gröbner basis G of I, if we divide f by G, then

f = h_1 g_1 + ... + h_t g_t + f^G.

The remainder f^G can be taken as a standard representative of [f]. There is a one-to-one correspondence between remainders and equivalence classes of the quotient ring. Furthermore, the remainder f^G is a linear combination of the monomials x^a not in <LT(I)>. This set of monomials,

B = {x^a : x^a not in <LT(I)>},
which is linearly independent, can therefore be regarded as a basis of the vector space A. The crucial fact is that when a system has only a finite number n of solutions, the vector space A = C[x_1, ..., x_n]/I has dimension n.

Definition. Given a polynomial f in C[x_1, ..., x_n], multiplication by f defines a map from the vector space A to itself,

[g] -> [g] . [f] = [g . f] in A.

This mapping is linear and can be represented by an n x n matrix A_f in the monomial basis B if n = Dim(A):

A_f : A -> A.

This matrix can be constructed from a Gröbner basis. Given a Gröbner basis G of I, we first obtain the monomial basis B of A as the monomials not divisible by the leading terms of G. Then we multiply each basis monomial of B by f, and compute the remainder of the product modulo G (we can use the command 'normalf()' in Maple, for instance). The vector of coefficients of the remainder in the basis B is a column of the matrix A_f.

Example 5 The set G = {x^2 + 3xy/2 + y^2/2 - 3x/2 - 3y/2, xy^2 - x, y^3 - y} is a Gröbner basis in graded reverse lexicographic order for x > y. Viewed in terms of the vector of monomials x = (x^2, xy^2, y^3, y^2, xy, y, x, 1)^T, it can be rearranged into matrix form:

G x = 0,  with  G = [ 1 0 0 1/2 3/2 -3/2 -3/2 0 ]
                    [ 0 1 0  0   0    0   -1  0 ]
                    [ 0 0 1  0   0   -1    0  0 ].     (2.1)

The leading monomials are {x^2, xy^2, y^3}. The non-leading monomials are B = {y^2, xy, y, x, 1}, which is a basis for A. We compute the product of each monomial in B with x and reduce it modulo G:

y^2 . x mod G = x = (0, 0, 0, 1, 0) b,
xy . x mod G = (3/2)y^2 + (3/2)xy - (1/2)y - (3/2)x = (3/2, 3/2, -1/2, -3/2, 0) b,
y . x mod G = xy = (0, 1, 0, 0, 0) b,
x . x mod G = -(1/2)y^2 - (3/2)xy + (3/2)y + (3/2)x = (-1/2, -3/2, 3/2, 3/2, 0) b,
1 . x mod G = x = (0, 0, 0, 1, 0) b,
where b = (y^2, xy, y, x, 1)^T. Each vector of coefficients is a column of the matrix A_x for the operator of multiplication by x, thus we obtain

A_x = [ 0   3/2  0  -1/2  0 ]
      [ 0   3/2  1  -3/2  0 ]
      [ 0  -1/2  0   3/2  0 ]
      [ 1  -3/2  0   3/2  1 ]
      [ 0    0   0    0   0 ].

Notice that some entries of A_x have already appeared in the 3 x 5 sub-matrix of the matrix G in Eq. 2.1. In practice, A_x is constructed by rearranging G.

The conclusion is that the eigenvalues of A_f are the values of f evaluated at the solution points of the variety V(I). Moreover, the left eigenvectors of A_f, i.e. the eigenvectors of the transpose A_f^T, are the vectors of basis monomials evaluated at the solution points. The solution points are therefore trivially contained in, or reconstructed from, these eigenvectors. If f = x_i, the eigenvalues of A_{x_i} are the x_i-coordinates of the solution points of V(I). The choice of f should guarantee that f takes distinct values at the different solution points; it often suffices to take f = x_1, or a linear combination of the variables if necessary. This matrix is the extension of the companion matrix of a single-variable polynomial to a system of polynomials. More details and proofs can be found in [26, 27].
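The construction of A_x in Example 5 can be mechanized. The following is a minimal sketch assuming SymPy and NumPy are available; SymPy's reduced plays the role of Maple's normalf(). It reduces x.b modulo G for each basis monomial, stacks the coefficient vectors as columns, and checks that the eigenvalues of A_x are the x-coordinates of the five solution points.

# Build the multiplication (action) matrix A_x of Example 5 and read off the
# x-coordinates of V(I) from its eigenvalues.
# Assumes SymPy and NumPy; sympy.reduced plays the role of Maple's normalf().
import numpy as np
from sympy import symbols, reduced, Poly, Rational

x, y = symbols('x y')
G = [x**2 + Rational(3, 2)*x*y + Rational(1, 2)*y**2 - Rational(3, 2)*x - Rational(3, 2)*y,
     x*y**2 - x,
     y**3 - y]                                   # grevlex Groebner basis, x > y

basis = [(0, 2), (1, 1), (0, 1), (1, 0), (0, 0)]  # exponents of y^2, xy, y, x, 1

A_x = np.zeros((5, 5))
for j, (p_exp, q_exp) in enumerate(basis):
    mono = x**p_exp * y**q_exp
    _, rem = reduced(x*mono, G, x, y, order='grevlex')  # remainder of x*b_j mod G
    poly = Poly(rem, x, y)
    for i, (a_exp, b_exp) in enumerate(basis):
        A_x[i, j] = float(poly.nth(a_exp, b_exp))       # coefficient of each basis monomial

print(A_x)
print(np.sort(np.linalg.eigvals(A_x).real))  # approx [-1, 0, 1, 1, 2]: x-coordinates of V(I)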
Chapter 3
Multi-view geometry
This chapter studies the geometry of a single view, two views, three views and N views. We will distinguish an uncalibrated setting from a calibrated one when necessary. The focus is on finding the algebraic solutions to the geometry problems with the minimal data. The entire multi-view geometry is condensed into three fundamental algorithms: the five-point algorithm for two calibrated views, the six-point algorithm for three uncalibrated views, and the seven-point algorithm for two uncalibrated views. With calibration and auto-calibration added, the chapter is summarized in seven algorithms, which form the computational engine for the geometry of multiple views in 'structure from motion' in the next chapter.
3.1 Introduction

A camera projects a 3D point in space to a 2D image point. A 'good' camera projects a straight line to a straight line. The geometry of a single camera specifies this 2D and 3D relationship while preserving the straightness of lines. We can 'triangulate' a 3D point in space if it is observed in at least two views, much as our two eyes do! The geometry of two or more views studies the triangulation of a space point from multiple views.

Intuitively, there are fundamentally two geometric constraints, known for a long time. The first is the collinearity constraint of a space point, its projection and the camera center for one view. The second is the coplanarity constraint of a space point, its two projections, and the camera centers for two views. These properties have often been presented in a Euclidean setting associated with a calibrated camera framework. But collinearity and coplanarity are fundamental incidence properties that do not require any metric measurements. They are better expressed in a projective geometry setting associated with an uncalibrated camera framework. This motivated the development of 3D computer vision in the 1990s [47, 80], and it has been enormously successful. With hindsight, the projective structure per se is not the goal, as it might have been suggested then, but it is an intrinsic representation that encodes the crucial correspondence information across images, and it requires no metric information about the camera as long as the camera is pinhole and the scene is rigid and static. A projective setting simplifies the parameterization by admitting more parameters, but most of the uncalibrated algorithms suffer from degeneracy in the presence of coplanar structures, while the calibrated versions do not. It is wrong not to use the partially known parameters of the camera to come back to a Euclidean setting [88], if that information is available. A calibrated setting using Euclidean geometry versus an uncalibrated one using projective geometry is not an ideological combat; it is more a matter of computational convenience encountered at different stages of the entire reconstruction pipeline, and each complements the other.

The literature on the subject is abundant. In the end, only the seven-point and six-point algorithms for two and three uncalibrated views and the five-point algorithm for two calibrated views are the known algorithms sufficient to characterize the whole vision geometry. The classical calibration and pose estimation still play an important role in the calibrated setting, while the modern auto-calibration is more conceptual than practical. The goal is to synthesize and summarize the vision geometry in seven algorithms for computational purposes.
3.2 The single-view geometry

3.2.1 What is a camera?

If cameras are free of nonlinear optical distortions, straight lines are projected onto straight lines. This is the ideal pinhole camera model. We first develop the model
in a Euclidean setting, then in a projective setting. Both of them lead to the same conclusion.
A Euclidean setting

Camera coordinate frame. We first specify a Cartesian coordinate frame x_c y_c z_c - o_c, centered at the projection center and aligned with the optical axis of a pinhole camera, as illustrated in Figure 3.1. The image plane, or retina, is located at the distance f from the projection center. This coordinate frame is called the camera-centered coordinate frame. The pinhole camera is mathematically a central projection; by similar triangles,

x/f = x_c/z_c  and  y/f = y_c/z_c.
Fig. 3.1 The central projection in the camera-centered coordinate frame.
In matrix form, the above central projection becomes

lambda (x, y, f)^T = [1 0 0 0; 0 1 0 0; 0 0 1 0] (x_c, y_c, z_c, 1)^T.

The remaining effort is only a matter of making the appropriate coordinate changes: to the observed pixels, expressed in a different 2D image coordinate frame, and to the object space, measured in a different 3D coordinate frame.

Image coordinate frame. The image plane xy - p is a 2D plane involving only x_c and y_c, and p. The point p is the intersection point (0, 0, f) of the optical axis z_c with the image plane at f. It is called the principal point. Image pixels, however, are observed and measured within the image plane from a corner, in a 2D
coordinate frame specified by uv - o. The pixel space uv may not be orthogonal, so it is generally an affine coordinate frame.
Fig. 3.2 The transformation within the image plane between the Cartesian frame xy − p and the affine frame uv − o.
Thus, the transformation between xy - p and uv - o is at most a plane affine transformation K with at most six parameters,

(u, v, 1)^T = K (x, y, 1)^T,  K = [ alpha_u  a        u_0 ]
                                  [ b        alpha_v  v_0 ]
                                  [ 0        0        1   ].

The point p = (u_0, v_0) specifies the offset between the two frames in pixels (the uv units). The alpha_u and alpha_v convert the focal length from x and y units (of mm, for instance) to u and v units of pixels in the horizontal and vertical directions, which might differ. The ratio of the difference in size in the two directions, alpha_u/alpha_v, is called the aspect ratio. The possible non-perpendicularity of the uv axes is the skew, accounted for by the parameter a. The parameter b would account for a rotation between the two frames aligning u and x, but we will see that this rotation is not independent and can be absorbed later into an external rotation of the 3D frame xyz - c. This finalizes the most general form of the upper triangular matrix of the five intrinsic parameters:

K = [ alpha_u  a        u_0 ]
    [ 0        alpha_v  v_0 ]
    [ 0        0        1   ].

There is no reason for the pixels in uv not to be square! This is true for most modern CCD cameras. Thus, there is no skew, so a = 0, and no aspect ratio, so alpha_u = alpha_v = f. The matrix of intrinsic parameters reduces to the three-parameter form

K = [ f  0  u_0 ]
    [ 0  f  v_0 ]
    [ 0  0  1   ].

In camera manufacturing, the best effort is made to align and center the optical axis with the retina sensor. It is reasonable to assume that (u_0, v_0) is at the center of the image plane for non-metric applications, or as a good initial guess of the true position. For metric applications with metric cameras in photogrammetry, all five intrinsic parameters need to be carefully readjusted through the calibration process, as do the nonlinear distortion parameters.
World coordinate frame. A 3D point or an object is first measured in its own 3D Euclidean coordinate frame, which is called the world coordinate frame x_w y_w z_w - o_w. The transformation between the world coordinate frame and the camera-centered frame is a rigid transformation in 3D space:

(x_c, y_c, z_c, 1)^T = [R t; 0 1] (x_w, y_w, z_w, 1)^T.

Now we can describe the whole projection of a point (x_w, y_w, z_w)^T in the world coordinate frame onto an image point (u, v)^T in pixels:

lambda (u, v, 1)^T = K [I_{3x3} 0] [R t; 0 1] (x_w, y_w, z_w, 1)^T.
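To make the chain of coordinate changes concrete, the following minimal sketch (NumPy; the focal length, principal point, pose and point are all made up for illustration) assembles P = K [R | t] for a three-parameter K and projects one world point into pixels.

# Assemble a pinhole projection P = K [R | t] and project one world point.
# Assumes NumPy; the focal length, principal point, pose and point are made up.
import numpy as np

f, u0, v0 = 800.0, 320.0, 240.0
K = np.array([[f, 0, u0],
              [0, f, v0],
              [0, 0,  1]])

theta = np.deg2rad(10.0)                       # small rotation about the y axis
R = np.array([[ np.cos(theta), 0, np.sin(theta)],
              [ 0,             1, 0            ],
              [-np.sin(theta), 0, np.cos(theta)]])
t = np.array([0.1, -0.2, 2.0])

P = K @ np.hstack([R, t.reshape(3, 1)])        # 3 x 4 projection matrix

xw = np.array([0.3, 0.1, 4.0, 1.0])            # homogeneous world point
u = P @ xw
print(u[:2] / u[2])                            # pixel coordinates (u, v)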
A projective setting

More abstractly, the physical pinhole camera is nothing but a mathematical central projection from P^3 to P^2. Such a projection (x, y, z, t)^T -> (u, v, w)^T preserves collinearity, i.e. lines in space are projected onto lines in the image. It is a linear transformation from P^3 to P^2, so it is represented by a 3 x 4 matrix P:

lambda (u, v, w)^T = P_{3x4} (x, y, z, t)^T.     (3.1)

This 3 x 4 projection matrix is the most general camera model if nonlinear optical distortion is not considered. The study of the camera now becomes the study of this 3 x 4 projection matrix.

Properties of the projection matrix.
• It has eleven degrees of freedom, as the twelve entries of the matrix are homogeneous, so there are at most eleven independent parameters describing a camera.
• The rank of P_{3x4} is at most three, as it has only three rows.
• It follows from the rank-three constraint that the one-dimensional kernel of the projection matrix is the projection center, or camera center, c, i.e. P_{3x4} c = 0. Writing P_{3x4} = (P_{3x3}, p_4), if the projection center is not at infinity, then c = (o, 1)^T with

o = -P_{3x3}^{-1} p_4.
The camera center, being the kernel of the projection matrix, is not affected by any projective transformation in space.
• Look at the row vectors of the matrix,

P = [ u^T ]
    [ v^T ]
    [ w^T ].

Each row vector is a four-vector, so it can be interpreted as a plane. The plane u^T x = 0 goes through the camera center and projects onto the image line u = 0. The plane v^T x = 0 goes through the camera center and projects onto the image line v = 0. The plane w^T x = 0 goes through the camera center and is parallel to the image plane. Again, the intersection of these three planes is the camera center c, the kernel of the projection matrix.
• The plane w^T x = 0 is called the principal plane. It is a plane going through the camera center and parallel to the image plane, so the points on the principal plane w^T x = 0 project onto w = 0, which is the line at infinity of the image plane. Two parallel planes intersect in the line at infinity that is common to the whole family of parallel planes.
• Look at the column vectors of P = (p_1, p_2, p_3, p_4). The first column vector p_1 is obtained by projecting the point (1, 0, 0, 0)^T by P; it is thus the image of the x-axis direction. Similarly, p_2 is the image of the y direction, and p_3 is the image of the z direction. The column p_4 is the image of the origin (0, 0, 0, 1)^T.
• It can be determined, or calibrated, from six 3D-2D point correspondences.

Decomposition of P. Using the QR decomposition theorem (RQ to be more exact), the 3 x 3 submatrix of the projection matrix becomes KR, where K is an upper triangular matrix and R is an ortho-normalized rotation matrix, such that P = K(R, t). The eleven parameters of P are thus restructured into five in K, three in R, and three in t. This is exactly the same count of parameters, or degrees of freedom, as the intrinsic and extrinsic composition of the projection matrix assembled from the usual Euclidean point of view in the previous section. This confirms that the 3 x 4 projection matrix is the most general description of a pinhole camera if we do not take the nonlinear optical distortions into account. A computational sketch of this decomposition is given at the end of this section.

The image of the absolute conic. Take a picture of the absolute conic x_3^T x_3 = 0, x_4 = 0 by the camera P. The image of the absolute conic is u^T (K K^T)^{-1} u = 0, which is the conic of the intrinsic parameters K. For a four-parameter camera without skew, the image of the absolute conic is
((u - u_0)/alpha_u)^2 + ((v - v_0)/alpha_v)^2 = i^2,
which is a pure imaginary ellipse.
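The decomposition P = K(R, t) mentioned above can be sketched with an off-the-shelf RQ factorization. The following assumes NumPy and SciPy (scipy.linalg.rq); the sign fixing that forces a positive diagonal on K is the usual convention, and the K, R, t used to build P are made up for a round-trip check.

# Decompose a projection matrix P = K (R, t) by RQ factorization of its 3x3 part.
# Assumes NumPy and SciPy; K, R, t below are made-up values used for a round trip.
import numpy as np
from scipy.linalg import rq

K_true = np.array([[800.0,   2.0, 320.0],
                   [  0.0, 790.0, 240.0],
                   [  0.0,   0.0,   1.0]])
theta = np.deg2rad(25.0)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.5, -0.3, 2.0])
P = K_true @ np.hstack([R_true, t_true.reshape(3, 1)])

K, R = rq(P[:, :3])                      # upper-triangular K times orthogonal R
S = np.diag(np.sign(np.diag(K)))         # force a positive diagonal on K
K, R = K @ S, S @ R                      # S is its own inverse, so K R is unchanged
t = np.linalg.solve(K, P[:, 3])          # t from the fourth column: K t = p4

print(np.allclose(K, K_true), np.allclose(R, R_true), np.allclose(t, t_true))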
3.2.2 Where is the camera?

We start with a geometric method that constructs the camera center from point correspondences, which is equivalent to camera calibration. Given a set of six point correspondences {A, B, C, D, E, F} in space and {a, b, c, d, e, f} in an image plane, we want to recover the unknown camera center O. The configuration is illustrated in Figure 3.4. In this section, we use lower-case letters for 2D points in an image plane and upper-case letters for 3D points in space.

Consider the pencil of five planes OAB, OAC, OAD, OAE, and OAF, which go through the axis OA and one of the points B, C, D, E, F, respectively. The lines of intersection of these planes with the image plane are ab, ac, ad, ae, and af. We can therefore obtain the cross-ratio of any four planes of the pencil by measuring the cross-ratio of the corresponding pencil of lines. For instance,

{OAB, OAC; OAD, OAE} = {ab, ac; ad, ae}.

Take a known plane P, for instance the plane BCD defined by the three points B, C, and D. The plane P intersects the pencil of planes with axis OA in a pencil of lines with vertex A', which is the point of intersection of the line OA with the plane P. The lines AB, AC, AD, AE, AF intersect the plane P in the known points B', C', D', E', and F', respectively.

Recall that Chasles's theorem states that for four points A, B, C, D in a projective plane, with no three of them collinear, the locus of the vertex of a pencil of lines passing through these four points and having a given cross-ratio is a conic (see Figure 3.3). Reciprocally, if M lies on a conic passing through A, B, C and D, the cross-ratio of the four lines MA, MB, MC, MD is a constant, independent of the point M on the conic.
Fig. 3.3 A Chasles conic defined by a constant cross-ratio of a pencil of lines.
The cross-ratio of the pencil of lines (A'B', A'C'; A'D', A'E') can be measured, so the point A' lies on the conic determined by B', C', D', E' in the plane P by Chasles's theorem. We can repeat the same construction while replacing the point E by the point F. Thus, A' also lies on a second conic, determined by B', C', D', and F' in the plane P (see Figure 3.4).
Fig. 3.4 Determining the view line associated with a known point A.
As a result, the point A' lies at the same time on two conics that already have the three common points B', C', and D'. Thus, the point A' must be the fourth intersection point of these two conics. Since the two conics are known and three of their intersection points are known, the remaining fourth intersection point is uniquely determined. We can therefore reconstruct the viewing line AA'. The same construction can be applied to any of the other points B, C, D, E, and F. Finally, the camera center O is the intersection point of the bundle of lines AA', BB', CC', DD', EE' and FF'!
The critical configurations

The critical configurations for a given method of a problem are those configurations for which the method fails to solve the problem or yields multiple solutions.
• When the six points in space are coplanar, this constructive method does not hold. Clearly, for instance, the cross-ratio of the pencil of lines (ab, ac; ad, ae) can no longer be measured, as the point a is confused with b, c, d, and e when A is coplanar with B, C, D, and E.
• Moreover, when the six points and the camera center lie on a twisted cubic, in both its non-degenerate and degenerate forms, the method fails. This is due to the fact that a twisted cubic in space is projected onto a plane as a conic if the projection center is on the twisted cubic. The lines that project the twisted cubic lie on a quadric cone, which is a degenerate ruled quadric. For instance, there will be only one conic for any chosen plane: the conic through the points A', B', C', D', E' and the conic through the points A', B', C', D', F' are the same conic, so it is impossible to construct the point A' as the intersection of two distinct conics.
Remarks • This constructive technique was presented by Tripp [236] in the planar case. The 3D extensions have been presented in [211, 144]. • This technique is not computational, but the construction is simple and elegant. More importantly, it has inspired the development of vision geometry using projective geometry. This is de facto an uncalibrated method in which only the camera center matters while the other camera parameters are bypassed.
3.2.3 The DLT calibration

Given a set of point correspondences x_i <-> u_i between 3D reference points x_i and 2D image points u_i, the camera calibration consists of determining the projection matrix P of the camera, which contains both the intrinsic and extrinsic parameters of the camera. The nonlinear distortions are not considered for the time being; they can be estimated beforehand or jointly refined in the final optimization.

From the projection equation of each point, lambda u_i = P x_i, we obtain two linear equations in the entries p_ij by taking the ratios of the first and third components and of the second and third components to eliminate the unknown lambda:

[ x_i  y_i  z_i  1  0    0    0    0  -u_i x_i  -u_i y_i  -u_i z_i  -u_i ]
[ 0    0    0    0  x_i  y_i  z_i  1  -v_i x_i  -v_i y_i  -v_i z_i  -v_i ] p_12 = 0,

where we pack the unknown entries p_ij into the twelve-vector p_12. For n given point correspondences, we have a homogeneous linear system of equations

A_{2n x 12} p_12 = 0.
As the unknowns are homogeneous, there are only eleven degrees of freedom. The dehomogenization can be one of the following constraints:
• p_34 = 1, which transforms the homogeneous system into an inhomogeneous system Ax = b.
• ||p_12|| = 1, which is directly solved by the SVD.
• p_31^2 + p_32^2 + p_33^2 = 1, which becomes a constrained linear system [52]. This constraint, proposed by Faugeras and Toscani, has the advantage of preserving the rigidity in the decomposition.

As soon as we have 5 1/2 points, or six points if we do not count half points, the system can be solved. The most convenient way might be through an SVD of the matrix A_{2n x 12} with the unknown vector p_12 normalized. This gives a good initial estimate of the camera matrix if the additional data normalization advocated by Hartley in [76] is carried out.

Algorithm 1 (The DLT calibration) Given at least six 3D-2D point correspondences x_i <-> u_i for i = 1, ..., n and n >= 6, compute the intrinsic parameters K of the camera and the rotation R and the translation t of the camera with respect to the points x_i.
1. Compute a 2D similarity transformation T_u such that the points u_i are translated to their centroid and re-scaled so that the average distance equals sqrt(2). Do the same to compute a 3D similarity transformation T_x for the x_i so that the average distance equals sqrt(3).
2. Apply u~_i = T_u u_i and x~_i = T_x x_i.
3. Form the A~_{2n x 12} matrix with the normalized points u~_i and x~_i.
4. Solve for p~_12 by taking the singular vector corresponding to the smallest singular value of A~_{2n x 12}.
5. Convert p~_12 into the matrix P~.
6. Undo the normalization by P = T_u^{-1} P~ T_x.
7. Decompose P to obtain the intrinsic parameters in K and the extrinsic parameters R and t.
The solution is unique.
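A minimal sketch of Algorithm 1 follows (NumPy; the normalization helper, the variable names and the synthetic test data are illustrative choices, and the final decomposition step is omitted for brevity).

# A minimal DLT sketch following Algorithm 1 (decomposition step omitted).
# Assumes NumPy; 'normalize' and all variable names are illustrative choices.
import numpy as np

def normalize(pts, target):
    """Similarity that centers pts and scales their mean distance to 'target'."""
    d = pts.shape[1]
    c = pts.mean(axis=0)
    s = target / np.mean(np.linalg.norm(pts - c, axis=1))
    T = np.eye(d + 1)
    T[:d, :d] *= s
    T[:d, d] = -s * c
    return T

def dlt(X, U):
    """X: n x 3 world points, U: n x 2 image points, n >= 6. Returns a 3 x 4 P."""
    Tx, Tu = normalize(X, np.sqrt(3)), normalize(U, np.sqrt(2))
    Xh = (Tx @ np.hstack([X, np.ones((len(X), 1))]).T).T
    Uh = (Tu @ np.hstack([U, np.ones((len(U), 1))]).T).T
    rows = []
    for (x, y, z, w), (u, v, _) in zip(Xh, Uh):
        rows.append([x, y, z, w, 0, 0, 0, 0, -u*x, -u*y, -u*z, -u*w])
        rows.append([0, 0, 0, 0, x, y, z, w, -v*x, -v*y, -v*z, -v*w])
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    P_tilde = Vt[-1].reshape(3, 4)            # singular vector of the smallest singular value
    return np.linalg.inv(Tu) @ P_tilde @ Tx   # undo the normalization

# Synthetic check: project random points with a known P and recover it up to scale.
rng = np.random.default_rng(0)
P_true = np.array([[800., 0., 320., 10.], [0., 800., 240., -5.], [0., 0., 1., 2.]])
X = rng.uniform(-1, 1, (10, 3)) + [0, 0, 4]
Uh = (P_true @ np.hstack([X, np.ones((10, 1))]).T).T
U = Uh[:, :2] / Uh[:, 2:]
P_est = dlt(X, U)
print(np.allclose(P_est / P_est[2, 3], P_true / P_true[2, 3]))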
The critical configurations • When the points are coplanar, the calibration method fails because of the rank deficiency of the matrix A2n×12 . • When the points and the camera center lie on a twisted cubic, including both non-degenerate and degenerate forms of the twisted cubic, the calibration is not unique [16]. The geometric proof is constructed in the previous section.
Remarks
• The half-point redundancy of the six points with respect to the strict minimum of 5.5 points in the calibration finds its ramification in the six-point algorithm of Section 3.5.2.
• A more elaborate calibration procedure usually includes a nonlinear optimization after the DLT solution, with additional nonlinear distortion parameters. The methods using a planar calibration pattern introduced in [254, 215] are also practical.
3.2.4 The three-point pose algorithm

Given a set of point correspondences x_i <-> u_i between 3D reference points x_i and 2D image points u_i, and given the intrinsic parameters K of the camera, the pose estimation consists of determining the position and orientation of the calibrated camera with respect to the known reference points. The camera pose is called space resection in photogrammetry. The difference between camera calibration and camera pose is that the intrinsic parameters of the camera need to be estimated for calibration, while they are known for pose estimation.
The distance constraint

An uncalibrated image point in pixels u_i and its calibrated counterpart x-bar_i are related by the known calibration matrix K such that u_i = K x-bar_i. The calibrated point x-bar_i = K^{-1} u_i is a three-vector representing a 3D direction in the camera-centered coordinate frame. For convenience, we assume the direction vector is normalized to a unit vector, x-bar := x-bar / ||x-bar||. The 3D point corresponding to the back-projection of an image point/direction x-bar is determined by a depth lambda as lambda x-bar. The depth lambda is the camera-point distance.

In summary, u is an image point in pixels; x-bar is the direction vector of an image point for a calibrated camera; x' = lambda x-bar is the space point corresponding to the image point u in the camera-centered coordinate frame; and x is the space point corresponding to the image point u in the world coordinate frame.

The distance between two 3D points represented by the vectors p and q is given by the cosine rule:

||p - q||^2 = ||p||^2 + ||q||^2 - 2 p^T q.

Applying this to the normalized direction vectors representing the 3D points in the camera frame, and using the fact that ||x-bar_p|| = 1, gives

d_pq^2 = lambda_p^2 + lambda_q^2 - c_pq lambda_p lambda_q,
Fig. 3.5 The geometric constraint for a pair of points.
where c_pq = 2 x-bar_p^T x-bar_q = 2 cos(theta_pq) is a known constant obtained from the image points, and d_pq is the known distance between the space points.
The three points

If we are given three points, we have three pairs of points and three quadratic equations, f_12(lambda_1, lambda_2) = 0, f_13(lambda_1, lambda_3) = 0, f_23(lambda_2, lambda_3) = 0, of the form

f_ij(lambda_i, lambda_j) := lambda_i^2 + lambda_j^2 - 2 lambda_i lambda_j cos theta_ij - d_ij^2 = 0,

in the three unknown distances lambda_1, lambda_2, lambda_3. The polynomial system has a Bézout bound of 2 x 2 x 2 = 8 solutions. But the quadratic equations do not have linear terms, so lambda_i -> -lambda_i preserves the form, and the eight solutions appear in four pairs. There are many ad hoc elimination techniques in the literature [206, 54] to effectively obtain a polynomial of degree four. Our favorite is the classical Sylvester resultant, which first eliminates lambda_3 from f_13(lambda_1, lambda_3) and f_23(lambda_2, lambda_3) to obtain a new polynomial h(lambda_1, lambda_2) in lambda_1 and lambda_2. Then we further eliminate lambda_2 from f_12(lambda_1, lambda_2) and h(lambda_1, lambda_2), and obtain a polynomial in lambda_1 of degree eight with only even terms. Letting x = lambda_1^2, the resulting polynomial of degree four has the form

a_4 x^4 + a_3 x^3 + a_2 x^2 + a_1 x + a_0 = 0,
where the coefficients are given in [182], for instance. The equation has at most four solutions for x and can be solved in closed form. Since lambda_i is positive, lambda_1 = sqrt(x). Then lambda_2 and lambda_3 are uniquely determined from lambda_1.
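The two-step Sylvester elimination can be scripted directly. The following sketch (SymPy and NumPy; the three 3D points and the camera placement are synthetic values chosen only for the test) forms the quartic in x = lambda_1^2 and checks that the true squared distance appears among its roots.

# Derive the quartic in x = lambda_1^2 for the three-point pose by Sylvester
# resultants, and check it on synthetic data.
# Assumes SymPy and NumPy; the points and camera below are made up for the test.
import numpy as np
from sympy import symbols, resultant, Poly, expand

# Synthetic configuration: camera at the origin, three known points in front of it.
X = np.array([[0.5, 0.2, 4.0], [-0.3, 0.4, 5.0], [0.1, -0.6, 4.5]])
lam_true = np.linalg.norm(X, axis=1)                  # true camera-point distances
xbar = X / lam_true[:, None]                          # unit viewing directions
pairs = [(0, 1), (0, 2), (1, 2)]
d = {p: np.linalg.norm(X[p[0]] - X[p[1]]) for p in pairs}
c = {p: float(xbar[p[0]] @ xbar[p[1]]) for p in pairs}   # cos(theta_ij)

l1, l2, l3 = symbols('l1 l2 l3')
f12 = l1**2 + l2**2 - 2*l1*l2*c[(0, 1)] - d[(0, 1)]**2
f13 = l1**2 + l3**2 - 2*l1*l3*c[(0, 2)] - d[(0, 2)]**2
f23 = l2**2 + l3**2 - 2*l2*l3*c[(1, 2)] - d[(1, 2)]**2

h = resultant(f13, f23, l3)                 # eliminate lambda_3
g = expand(resultant(f12, h, l2))           # eliminate lambda_2: degree 8, even terms only

p = Poly(g, l1)
quartic = [float(p.nth(k)) for k in (8, 6, 4, 2, 0)]   # coefficients a4, ..., a0 in x
print(np.roots(quartic))
print(lam_true[0]**2)                       # appears among the real, positive roots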
The absolute orientation

The camera-point distances lambda_i are then converted into the camera-centered 3D coordinates of the reference points, x'_i = lambda_i x-bar_i, with x-bar_i the unit vector along K^{-1} u_i. We are now given a set of corresponding 3D points, x_i <-> x'_i. We would like to compute a rigid transformation up to a scale, i.e. a similarity transformation R, t, and s, such that the set of points x_i maps to the set of points x'_i,

x'_i = s R x_i + t.

This is called the absolute orientation in the photogrammetry literature. As there are in total seven degrees of freedom for the similarity transformation, we need 2 1/3 points, or at least three points if we do not count a third of a point. It is elementary Euclidean geometry to find a closed-form solution from three points: the rotation maps the normal of the plane determined by the given three points, the scale is the ratio of the vector lengths relative to the centroid, and the translation is the de-rotated centroid. If more than three points are available, the best least-squares rotation is obtained in closed form using quaternions [45, 87]; the determination of the translation and the scale then follows immediately from the rotation.

Algorithm 2 (The three-point pose) Given the calibration matrix K of the camera and three 3D-2D point correspondences x_i <-> u_i for i = 1, ..., 3, compute the rotation R and translation t of the camera with respect to the points x_i.
1. Convert the 3D points x_i from coordinates into pair-wise distances d_ij. Convert the 2D image points u_i into the pair-wise angular measures cos theta_ij with the calibration matrix K.
2. Compute the coefficients of the fourth-degree polynomial in x from the quadratic equations.
3. Solve the equation in closed form. For each solution of x, get all the camera-point distances lambda_i.
4. Convert the distances lambda_i back into the 3D points x'_i.
5. Estimate the similarity transformation (the scale, the rotation and the translation) between the two sets of 3D points x_i and x'_i.
There are at most four solutions to R and t.
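Step 5 of Algorithm 2 can be sketched as follows (NumPy). Note that this version uses an SVD-based rotation fit rather than the quaternion solution cited in the text, and the test similarity and points are made up; for noise-free points the similarity is recovered exactly.

# Closed-form similarity (absolute orientation) between two 3D point sets.
# Assumes NumPy; an SVD-based rotation fit is used here instead of the
# quaternion method cited in the text, and the test data are made up.
import numpy as np

def absolute_orientation(X, Xp):
    """Find s, R, t with Xp ~ s R X + t (X, Xp are n x 3 arrays, n >= 3)."""
    cX, cXp = X.mean(axis=0), Xp.mean(axis=0)
    A, B = X - cX, Xp - cXp
    U, _, Vt = np.linalg.svd(A.T @ B)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])   # keep det(R) = +1
    R = Vt.T @ D @ U.T
    s = np.sqrt((B**2).sum() / (A**2).sum())      # ratio of spreads (noise-free case)
    t = cXp - s * R @ cX
    return s, R, t

# Synthetic check with three points and a known similarity.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (3, 3))
angle = np.deg2rad(40.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0],
                   [np.sin(angle),  np.cos(angle), 0],
                   [0, 0, 1]])
s_true, t_true = 2.5, np.array([0.3, -1.0, 0.7])
Xp = (s_true * (R_true @ X.T)).T + t_true

s, R, t = absolute_orientation(X, Xp)
print(np.allclose(s, s_true), np.allclose(R, R_true), np.allclose(t, t_true))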
The critical configurations There are usually multiple solutions to the pose from the minimum of three points. All critical configurations for which multiple distinct or coinciding (unstable) solutions occur are known in [225, 251].
• There is no coplanarity case per se, since three points always define a plane. Better still, any additional point coplanar with the three points will give a unique solution to the pose.
• If the three points and the camera center lie on a specific twisted cubic or one of its degenerate forms, the ambiguity of multiple solutions cannot be resolved, regardless of any additional points lying on the same cubic. The solution becomes unique if a fourth point not lying on the specific cubic is introduced. The specific twisted cubics are called horopters; they are spatial cubic curves going through the camera center and lying on a circular cylinder known as the 'dangerous cylinder of space resection' in photogrammetry.
Remarks
• The fourth-degree polynomial derived from the resultants is different from the one derived in [54]. Many variants of the polynomial, following different orders of substitution and elimination, are reviewed in [73].
• Four points or more, not lying on the critical curves, are sufficient for a unique solution, which can be obtained directly by solving a linear system [182].
• Methods for the camera pose using line segments instead of points as image features have also been developed, mostly in computer vision. Dhome et al. [36] and Chen [23] developed algebraic solutions for three-line algorithms, and Lowe [128] used the Newton-Raphson method for any number of line segments. Liu, Huang and Faugeras [123] combined points and line segments in the same pose estimation procedure.
• Historically, the robust method RANSAC, which we will discuss in Section 5.1.2, was proposed within the context of pose estimation [54].
3.3 The uncalibrated two-view geometry

The study of the geometry of two views is fundamental as the two views are the strict minimum for us to be able to 'triangulate' the lost third dimension of the image points. It is often called stereo vision when we take the images at different viewpoints simultaneously, or motion estimation if the two images are taken sequentially. The general approach is to understand the geometric constraints for the image points in different views that come from the same physical point in space. These constraints can then be used to find the point correspondences and the triangulation.
3.3.1 The fundamental matrix

Geometrically, given an image point in the first view, this point first back-projects into a line in space; this line then reprojects onto the second view as a line along which all potential corresponding points of the given point in the first view are located (cf. Figure 3.6). Equivalently, this is to say that the corresponding points in the two views and the point in space lie on the same plane! This is the coplanarity constraint, from which all the geometry of two views is derived. A point in one view generates a line in the other view, which we call the epipolar line; the geometry is called the epipolar geometry. The line connecting the two camera centers intersects each image plane at a point, which we call the epipole. In other words, the epipole in one view is the image of the other camera center. All epipolar lines pass through the epipole and form a pencil of lines.
Fig. 3.6 The epipolar geometry of two views. A pair of corresponding points is u <-> u'. The camera centers are c and c'. The epipoles are e and e' in the first and second image. The line l' is the epipolar line of the point u.
Algebraically, without loss of generality, we can always choose the simplest coordinate representation by fixing the coordinate frame to that of the first view, so the projection matrix of the first view is P_{3x4} = (I, 0) and that of the second is a general P'_{3x4} = (A, a). Also given is a point correspondence u <-> u' in the two views of a point x in space. The back-projection line of the point u is defined by the camera center c = (0, 1)^T and the direction x_inf = (I^{-1} u, 0)^T, the point at infinity of this line. The images of these two points in the second view are respectively e' = (A, a) c = a and u'_inf = (A, a) x_inf = A u. The epipolar line is then given by l' = e' x u'_inf = a x A u. Using the anti-symmetric matrix [a]_x associated with the vector a to represent the vector product, if we define F = [a]_x A, which we call the fundamental matrix, then we have

l' = F u.
As the point u' lies on the line l', it verifies u'^T l' = 0. We come to the fundamental relationship of the epipolar geometry between two corresponding image points u and u':

u'^T F u = 0.
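The derivation can be checked numerically in a few lines. The following minimal sketch (NumPy; the second camera (A, a) and the space point are arbitrary made-up values) builds F = [a]_x A, projects a point into both views, and evaluates u'^T F u.

# Numerical check of the epipolar constraint u'^T F u = 0 with F = [a]_x A.
# Assumes NumPy; the second camera (A, a) and the space point are made up.
import numpy as np

def skew(v):
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

A = np.array([[1.0, 0.1, 0.0],
              [0.0, 1.0, 0.2],
              [0.1, 0.0, 1.0]])
a = np.array([0.5, -0.2, 0.1])

P1 = np.hstack([np.eye(3), np.zeros((3, 1))])   # first view:  (I, 0)
P2 = np.hstack([A, a.reshape(3, 1)])            # second view: (A, a)
F = skew(a) @ A                                 # fundamental matrix

X = np.array([0.3, -0.4, 2.0, 1.0])             # a space point
u = P1 @ X
up = P2 @ X
print(up @ F @ u)                               # ~0: the coplanarity constraint holds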
Properties of F
• Rank constraint. The fundamental matrix F is singular and has rank two because of its multiplicative component, an anti-symmetric matrix of rank two. Geometrically, the set of all epipolar lines passes through the same common point, the epipole; they form a pencil of dimension one.
• Epipolar geometry. The kernel of F is the epipole in the first image plane. To obtain the other epipole, it suffices to transpose F, that is, F^T e' = 0. Since u'^T F u = 0 expresses the epipolar geometry between the first and the second view, transposing the equation gives the same relation with the views exchanged. The epipolar line in the second image plane is F u, and that in the first image plane is F^T u'.
• Degrees of freedom. The number of degrees of freedom of F is seven.
  - Algebraically, F has 3 x 3 = 9 homogeneous elements, which makes up only eight degrees of freedom. In addition, it is singular, so only seven degrees of freedom remain.
  - Geometrically, each epipole accounts for two degrees of freedom, which makes four for the two epipoles. The two pencils of corresponding epipolar lines are in a homographic correspondence of dimension one, which accounts for three degrees of freedom. The epipoles and the pencils make up the total of seven degrees of freedom for F.
  - Systematically, for a system of two uncalibrated views, each view or camera has 11 degrees of freedom, so the two views amount to 22 degrees of freedom. The entire uncalibrated system is defined up to a projective transformation that accounts for 4 x 4 - 1 degrees of freedom, so the total number of degrees of freedom of a system of two uncalibrated views is 2 x 11 - (4 x 4 - 1) = 7. The crucial fact is that the fundamental matrix, having exactly seven degrees of freedom, is a perfect minimal parameterization of the two uncalibrated views!
• Projective reconstruction. The fundamental matrix, or the epipolar geometry, is equivalent to the determination of the two projection matrices of the two views up to a projective transformation, which is equivalent to a projective reconstruction of the two uncalibrated views [42, 79]. Given F, one possible choice for the pair of projective projection matrices is (I, 0) and ([e']_x F, e'), which can be verified from the definition of F. Then, each image point correspondence can be reconstructed up to a projective transformation from the two projective projection matrices.
Remarks

The notion of projective reconstruction, originated by Faugeras and Hartley in [42, 79], changed the scope of shape representation and initiated the systematic study of vision geometry in a projective setting. Together with the affine shape representation introduced by Koenderink in [103], this projective stratum complements the spectrum of shape representations. Computationally, it is still the determination of the epipolar geometry encoded by the fundamental matrix. With hindsight, the point correspondence information of rigid objects in different views, encoded by the projective structure via fundamental matrices, has a profound impact on solving the fundamental problem of structure from motion.
3.3.2 The seven-point algorithm

The fundamental matrix has seven degrees of freedom. Since each point correspondence generates one constraint, we expect seven point correspondences to suffice for a solution. Each image point correspondence u <-> u' in the two views generates one equation in the unknown entries of F from u'^T F u = 0,

(u'u, u'v, u', v'u, v'v, v', u, v, 1) f_9 = 0,

where f_9 = (f_1, ..., f_9)^T is the nine-vector of the entries of the fundamental matrix F. For seven point correspondences, we obtain a linear homogeneous system of seven equations,

A_{7x9} f_9 = 0,

which is not yet sufficient to obtain a unique solution. But a one-dimensional family of solutions parameterized by x/t is given by

f_9 = x a + t b,

where a and b are the two right singular vectors corresponding to the two smallest singular values of A_{7x9}. We now add the rank-two constraint of F, which is the vanishing of the determinant, Det(F(x, t)) = 0. By expanding this determinant, we obtain a cubic equation in x and t,

a x^3 + b x^2 t + c x t^2 + d t^3 = 0.

Each solution for x/t from the cubic equation gives one solution for f_9, and therefore one solution for the fundamental matrix F. The cubic equation has either one or three real solutions.

Algorithm 3 (The seven-point fundamental matrix) Given seven image point correspondences u_i <-> u'_i for i = 1, ..., 7, compute the fundamental matrix F between the two views.
1. Form the A_{7x9} matrix.
2. Construct a one-dimensional family of solutions for f_9 by linearly combining the two singular vectors a and b corresponding to the two smallest singular values of A.
3. Solve the cubic equation in closed form with the coefficients obtained from a and b.
4. Obtain a solution vector f_9 from each real solution of the cubic equation.
5. Convert f_9 into the matrix form F.
There are at most three possible solutions to F.
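A compact sketch of Algorithm 3 follows (NumPy; the correspondences are synthesized from made-up cameras, and the cubic is recovered by fitting four sampled determinant values with numpy.polyfit, which is exact for a cubic; the parameterization F(x) = x Fa + (1 - x) Fb misses one member of the projective family, which is harmless generically).

# A compact seven-point algorithm sketch (Algorithm 3).
# Assumes NumPy; the seven correspondences are synthesized from made-up cameras.
import numpy as np

def skew(v):
    return np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])

def same_up_to_scale(F1, F2):
    F1, F2 = F1 / np.linalg.norm(F1), F2 / np.linalg.norm(F2)
    return np.allclose(F1, F2, atol=1e-6) or np.allclose(F1, -F2, atol=1e-6)

# Synthetic data: two cameras (I, 0) and (A, a), and seven space points.
rng = np.random.default_rng(2)
A = np.eye(3) + 0.1 * rng.standard_normal((3, 3))
a = np.array([1.0, 0.2, -0.1])
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([A, a.reshape(3, 1)])
X = np.hstack([rng.uniform(-1, 1, (7, 3)) + [0, 0, 4], np.ones((7, 1))])
u = (P1 @ X.T).T
u = u / u[:, 2:]
up = (P2 @ X.T).T
up = up / up[:, 2:]

# Step 1: the 7 x 9 design matrix of u'^T F u = 0.
M = np.array([[b[0]*c[0], b[0]*c[1], b[0], b[1]*c[0], b[1]*c[1], b[1], c[0], c[1], 1.0]
              for c, b in zip(u, up)])

# Step 2: a one-dimensional family F(x) = x*Fa + (1 - x)*Fb from the 2D null space.
_, _, Vt = np.linalg.svd(M)
Fa, Fb = Vt[-1].reshape(3, 3), Vt[-2].reshape(3, 3)

# Step 3: det(F(x)) is a cubic in x; recover its coefficients from four samples.
xs = np.array([0.0, 1.0, 2.0, 3.0])
dets = [np.linalg.det(x * Fa + (1 - x) * Fb) for x in xs]
coeffs = np.polyfit(xs, dets, 3)

# Steps 4-5: each real root gives one rank-two candidate for F.
candidates = [r.real * Fa + (1 - r.real) * Fb
              for r in np.roots(coeffs) if abs(r.imag) < 1e-9]
F_true = skew(a) @ A
print(any(same_up_to_scale(F, F_true) for F in candidates))   # True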
The critical configurations • When the seven points are coplanar, there will be an infinite number of solutions to the fundamental matrix, regardless of any additional points coplanar with the seven points. • The seven points and the two camera centers make up a set of nine points, which determines a quadric surface. If the quadric surface is a proper quadric but not a ruled quadric, there will be only one solution. If the quadric surface is a ruled quadric surface, for instance, a hyperboloid of one sheet, there will be three solutions. The ambiguity of multiple solutions cannot be resolved regardless of any additional points belonging to the same ruled quadric. The solution will be unique if an eighth point not belonging to the ruled quadric is introduced [106, 137].
Remarks This formulation and solution, which appeared in several papers in the early 1990s in computer vision, is simple, elegant and practical. It is equivalent to the original Sturm’s method using algebraic geometry back in 1869 in [216], which was reintroduced into the vision community by Faugeras and Maybank [48].
3.3.3 The eight-point linear algorithm

If we add one more point correspondence and ignore the rank constraint on the matrix F, the solution can be obtained from a purely linear system

A_{8x9} f_9 = 0.

The appeal is that this eight-point algorithm is remarkably simple. It has long been known to photogrammetrists; a more recent revival can be found in [124, 44]. For a long time, however, this linear algorithm suffered from numerical instability, until Hartley [76] proposed a data normalization that makes it numerically more stable.
Algorithm 4 (The linear eight-point fundamental matrix) Given at least eight image point correspondences u_i <-> u'_i for i = 1, ..., n, and n >= 8, compute the fundamental matrix F between the two views.
1. Transform the image points in each image plane by u~ = A u and u~' = A' u' such that the image point cloud in each image plane is centered at the centroid and re-scaled to have an average distance to the centroid equal to sqrt(2).
2. Apply the linear method A~_{8x9} f~_9 = 0 to the transformed points.
3. Take the right singular vector corresponding to the smallest singular value as the solution vector f~_9.
4. Convert f~_9 to the matrix form F~.
5. Enforce the rank-two constraint by setting the smallest singular value of F~ to zero and recomposing the matrix F~_2.
6. Undo the normalization to obtain F = A'^T F~_2 A.
The solution is unique.
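A minimal sketch of Algorithm 4 follows (NumPy; u and up are assumed to be n x 3 homogeneous image point arrays with third entry 1, and the helper names are mine).

# A normalized eight-point sketch (Algorithm 4) for n >= 8 correspondences.
# Assumes NumPy; u and up are n x 3 homogeneous image points (third entry 1).
# Usage: F = eight_point(u, up).
import numpy as np

def normalizing_transform(pts):
    """Similarity centering the points with average distance sqrt(2) (step 1)."""
    c = pts[:, :2].mean(axis=0)
    s = np.sqrt(2) / np.mean(np.linalg.norm(pts[:, :2] - c, axis=1))
    return np.array([[s, 0, -s*c[0]], [0, s, -s*c[1]], [0, 0, 1]])

def eight_point(u, up):
    A, Ap = normalizing_transform(u), normalizing_transform(up)
    un, upn = (A @ u.T).T, (Ap @ up.T).T
    M = np.array([[b[0]*a[0], b[0]*a[1], b[0], b[1]*a[0], b[1]*a[1], b[1],
                   a[0], a[1], 1.0] for a, b in zip(un, upn)])
    _, _, Vt = np.linalg.svd(M)                   # steps 2-3: smallest singular vector
    F = Vt[-1].reshape(3, 3)
    U, S, Vt2 = np.linalg.svd(F)                  # step 5: enforce rank two
    F2 = U @ np.diag([S[0], S[1], 0.0]) @ Vt2
    return Ap.T @ F2 @ A                          # step 6: undo the normalization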
Remarks • The simplicity of this linear eight-point algorithm is attractive as it is a linear method for both the minimum eight points and redundant n points, but it is suboptimal. The recommended optimal method in Chapter 5 is first to initialize the solution by running the RANSAC on the seven-point algorithm, then to optimize numerically. • The idea of taking advantage of data redundancy to ‘linearize’ algorithms has been explored in other vision geometry problems [208, 122, 167, 182].
3.4 The calibrated two-view geometry

When the two views are calibrated, there remains only the relative orientation between the two views. The relative orientation has five degrees of freedom: three for the rotation and two for the direction of the translation vector. Unfortunately, the magnitude of the translation is unrecoverable. The geometry of the calibrated views is characterized by the essential matrix, which bears more constraints than its counterpart in the uncalibrated case, the fundamental matrix.
3.4.1 The essential matrix

For the two calibrated cameras in a Euclidean setting, without loss of generality, we can center and align the world coordinate frame with the first camera; the second camera is then relatively oriented by a rotation R and displaced by a translation vector t.
This amounts to assigning the projection matrices of the two cameras to (I, 0) and (R, t), which act on the calibrated image points x and x'. The calibrated image points and the original pixel points are related by u = K x and u' = K' x' through the given intrinsic parameters K and K' of the two cameras. Formally, the projection equations are exactly the same as in the uncalibrated case. If we define E := [t]_x R = TR, with T = [t]_x, the same derivation based on the coplanarity constraint (of the vectors x, x', and t) gives

x'^T E x = 0

for the calibrated image points x and x'. The matrix E is called the essential matrix of the two calibrated cameras.
Relation between E and F

By substituting u' = K' x' and u = K x into the fundamental matrix equation u'^T F u = 0, we obtain x'^T (K'^T F K) x = 0, that is,

E = K'^T F K.

The essential matrix E inherits all the properties of F. It has rank two, thus Det(E) = 0, and E^T t = 0, so the translation direction t spans the left null space of E.
The Demazure constraints

Moreover, the essential matrix E is the product of an antisymmetric matrix and a rotation matrix R with R R^T = I; the rotation matrix is what distinguishes E from the uncalibrated fundamental matrix F, whose second factor is an arbitrary 3 x 3 matrix A. It is straightforward to see that

E E^T = T R R^T T^T = -T^2 = (t^T t) I - t t^T,

so (t^T t) I - E E^T = t t^T. Noting that t_1^2 + t_2^2 + t_3^2 = t^T t = (1/2) Trace(E E^T), we multiply on the right by E, use t t^T E = t (E^T t)^T = 0, and obtain

D(E) := E E^T E - (1/2) Trace(E E^T) E = 0,

where the nine elements of the matrix give nine polynomial equations of degree three in the entries of the essential matrix [33, 41]. Together with the rank-two constraint, the decomposability of E into the product of an antisymmetric matrix and a rotation matrix is now equivalent to
D(E) = 0,  Det(E) = 0,

which are ten homogeneous polynomial equations of degree three in the essential matrix E. We call these ten polynomials the Demazure constraints. The ten equations are linearly independent, as one can verify by brute force [33]. But obviously they are not algebraically independent: indeed, it is a simple matter to verify that the vanishing-determinant equation is a polynomial consequence of the first nine trace-related equations D(E) = 0.
The Huang-Faugeras constraint

Since the squares of the singular values of E are the eigenvalues of E E^T, the nine constraints D(E) = 0 are equivalent to the Huang-Faugeras constraint that a matrix E is an essential matrix if and only if the two non-zero singular values of the rank-two matrix E are equal [89]. Thus, a real 3 x 3 matrix E is an essential matrix if and only if it has one singular value equal to zero and the other two singular values equal to each other. It is straightforward to verify that the two characterizations of the essential matrix, either by the Demazure constraints or by the Huang-Faugeras constraint, are algebraically equivalent to the two polynomial equations

Det(E) = 0  and  (1/2) Trace(E E^T)^2 - Trace((E E^T)^2) = 0.
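Both characterizations are easy to verify numerically. The following minimal sketch (NumPy; the rotation and translation are made-up values) builds E = [t]_x R and evaluates the Demazure matrix D(E), the determinant, and the Huang-Faugeras polynomial.

# Check the Demazure and Huang-Faugeras constraints on E = [t]_x R.
# Assumes NumPy; the rotation and translation below are made-up values.
import numpy as np

def skew(v):
    return np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])

angle = np.deg2rad(30.0)
R = np.array([[np.cos(angle), 0, np.sin(angle)],
              [0, 1, 0],
              [-np.sin(angle), 0, np.cos(angle)]])
t = np.array([0.6, -0.8, 0.0])
E = skew(t) @ R

D = E @ E.T @ E - 0.5 * np.trace(E @ E.T) * E          # Demazure: D(E) = 0
huang_faugeras = 0.5 * np.trace(E @ E.T)**2 - np.trace((E @ E.T) @ (E @ E.T))

print(np.allclose(D, 0), np.isclose(np.linalg.det(E), 0), np.isclose(huang_faugeras, 0))
print(np.linalg.svd(E, compute_uv=False))   # two equal singular values and one zero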
3.4.2 The five-point algorithm

There are five degrees of freedom for the essential matrix E, so five pairs of corresponding image points are sufficient and give at most ten solutions. Demazure established this by using algebraic geometry to characterize the variety of all essential matrices E embedded in P^8. The Demazure variety is defined by the ten polynomial equations and five linear equations:

D(E) = 0,  Det(E) = 0,  x'_i^T E x_i = 0, for i = 1, ..., 5.

Demazure showed that the variety of essential matrices has dimension five and degree ten. Therefore, there are at most ten distinct essential matrices for five pairs of corresponding image points. This corresponds to the old result obtained by Kruppa with a different approach; one of the intersection points should be counted twice as a double tangent point, so the eleven solutions of Kruppa reduce to ten, as pointed out by Faugeras and Maybank in [48].

The way Demazure characterizes the essential matrix plays a key role in the modern methods of solving this system of polynomial equations for the given five
points [233, 156, 212]. They all use the same characterizing polynomials of the variety of essential matrices given above, in particular the linear independence of the ten algebraically redundant polynomial equations and the subspace of co-dimension five characterized by the five linear equations.
Solving equations The idea is to use the five linear equations x_i′^T E x_i = 0, i = 1, . . . , 5, a 5 × 9 homogeneous linear system, to reduce the polynomial equations in the nine homogeneous unknowns e_ij of the essential matrix to four homogeneous unknowns. The four-dimensional null space of the system spans the solution space of E, now parametrized by the four remaining unknowns x, y, z, and w. Each of the nine components is

e_ij = a_ij x + b_ij y + c_ij z + d_ij w,

where the coefficients come from the computed null space of the 5 × 9 system. Substituting these e_ij into the ten constraints and setting w = 1, we obtain ten cubic polynomials in x, y, and z. Triggs [233] used the sparse resultant method to solve the system. Nistér [156] carried out ad hoc eliminations of these ten polynomials to first exhibit an explicit polynomial of degree ten in one variable, leading to a direct solution of the problem. Stewénius et al. [212] continued with a more systematic elimination. They chose a graded lexicographic order, with total degree first and lexicographic order second, for the twenty monomials of the system,

x = (x³, x²y, x²z, xy², xyz, xz², y³, y²z, yz², z³, x², xy, xz, y², yz, z², x, y, z, 1),  with x > y > z.

We divide x into two parts l and b such that x = (l, b). The vector l contains the first ten monomials,

l = (x³, x²y, x²z, xy², xyz, xz², y³, y²z, yz², z³),

whose elements are of degree three in x, y, and z and are the leading terms of the ten polynomials. The vector b contains the last ten monomials,

b = (x², xy, xz, y², yz, z², x, y, z, 1),

whose elements are of degree two or less in x, y, and z. Now we write the ten cubic polynomials in matrix form:

A_{10×20} x = 0.

Since these equations are in general linearly independent, so are the matrix rows, and Gauss-Jordan elimination reduces A to the form A = (I, B),
so that

$$(\mathbf{I}_{10\times10}\;\;\mathbf{B}_{10\times10})\begin{pmatrix}\mathbf{l}\\ \mathbf{b}\end{pmatrix} = \mathbf{0},$$

where I is the 10 × 10 identity matrix and B is a 10 × 10 matrix. The crucial observation of Stewénius et al. is that, knowing that there are ten solutions, the dimension of the quotient vector space A = C[x, y, z]/I is ten, where I is the ideal generated by the ten cubic polynomials. Thus, there should be exactly ten basis monomials for the vector space A. It can be verified that the ten monomials in b are effectively the only ten non-leading monomials, none of which is divisible by any of the leading terms in l. At the same time, these ten polynomials after Gauss-Jordan reduction must form a Gröbner basis for the graded lexicographic order. The fact that the Gauss-Jordan reduced form of the ten polynomial equations is a Gröbner basis for the graded lexicographic order has an important consequence: the system can be solved via eigenvalues. A polynomial f ∈ C[x, y, z] defines, through the multiplication operator from A = C[x, y, z]/I to itself, a linear mapping represented by a 10 × 10 matrix A_f (called the action matrix in [212]). We assume that the x-coordinates of the solutions are distinct, so we choose f = x. Each basis monomial from b is multiplied by x, then the product, reduced modulo the Gröbner basis, is expanded in the basis b to contribute one column to the matrix A_x. By inspecting B_{10×10}, we have

$$\mathbf{A}_x = \begin{pmatrix}-\mathbf{b}_1 & -\mathbf{b}_2 & -\mathbf{b}_3 & -\mathbf{b}_5 & -\mathbf{b}_6 & -\mathbf{b}_8 & \mathbf{1}_1 & \mathbf{1}_2 & \mathbf{1}_4 & \mathbf{1}_7\end{pmatrix},$$

where b_i is the i-th column vector of B_{10×10} and 1_k denotes a column whose only non-zero entry is a 1 in row k. Finally, the eigenvectors of A_x^T are the basis monomials evaluated at the solution points, from which the solutions can be trivially read off.
Rotation and translation from E Once an essential matrix E has been obtained from the e_ij via x, y, z, let its singular value decomposition be E = U Diag(1, 1, 0) V^T, chosen such that Det(U) > 0 and Det(V) > 0, and let

$$\mathbf{D} = \begin{pmatrix}0 & 1 & 0\\ -1 & 0 & 0\\ 0 & 0 & 1\end{pmatrix}.$$

If u_3 is the last column vector of U, then t is either u_3 or −u_3, and the rotation is R = UDV^T or UD^T V^T. Of the four possible combinations, the one that triangulates a 3D point in front of both cameras is the correct solution.
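The decomposition just described is short enough to sketch directly; the following numpy fragment (function name ours) returns the four (R, t) candidates and leaves the cheirality test, checking that a triangulated point lies in front of both cameras, to the caller.

```python
import numpy as np

def decompose_essential(E):
    """Recover the four (R, t) candidates from an essential matrix."""
    U, _, Vt = np.linalg.svd(E)
    # Enforce det(U) > 0 and det(V) > 0 so that the recovered R is a rotation.
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    D = np.array([[0.0, 1.0, 0.0],
                  [-1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])
    t = U[:, 2]                       # translation direction, up to sign
    R1 = U @ D @ Vt
    R2 = U @ D.T @ Vt
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]
```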
Algorithm 5 (The five-point relative orientation) Given five image point correspondences u_i ↔ u_i′ for i = 1, . . . , 5, compute the rotation and the translation between the two views.
1. Form the 5 × 9 linear system.
2. Compute the four-dimensional null space of the linear system by SVD.
3. Obtain the coefficients of e_ij parametrized by the three parameters x, y, z from the above null space.
4. Form the A_{10×20} matrix from the ten cubic polynomial constraints.
5. Obtain the reduced matrix B_{10×10} from A_{10×20} by Gauss-Jordan elimination.
6. Form the multiplication matrix A_x from B_{10×10}.
7. Compute the eigenvectors of A_x^T.
8. Each eigenvector is a solution for b, with x = b_7/b_10, y = b_8/b_10, and z = b_9/b_10.
9. Compute e_ij from x, y, z, and convert the e_ij into the matrix E.
10. Decompose E into R and t by SVD.
11. Choose the R and t that reconstruct a 3D point in front of the cameras.
There are at most ten solutions for R and t.
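Steps 1 to 3 of Algorithm 5 are purely linear and can be sketched in a few lines of numpy (function name ours); the symbolic expansion of the ten cubics in steps 4 to 6 is omitted, as it requires precomputed polynomial coefficients.

```python
import numpy as np

def five_point_nullspace(x1, x2):
    """Steps 1-3 of Algorithm 5.

    x1, x2: 5x3 arrays of calibrated homogeneous image points (x = K^-1 u).
    Returns four 3x3 basis matrices E1..E4 so that every candidate is
    E = x*E1 + y*E2 + z*E3 + w*E4.
    """
    A = np.zeros((5, 9))
    for i in range(5):
        # x2^T E x1 = 0 written as a linear equation in the 9 entries of E
        # (row-major vectorization of E).
        A[i] = np.kron(x2[i], x1[i])
    _, _, Vt = np.linalg.svd(A)
    basis = Vt[-4:]                   # 4-dimensional null space of the 5x9 system
    return [b.reshape(3, 3) for b in basis]
```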
The critical configurations • One important advantage of the calibrated five-point algorithm is that it does not suffer from the singularity of coplanar points. Better still, when the five points are coplanar, the solution is unique if we enforce the reconstructed points to be in front of the cameras. • If the five points and the two camera centers lie on a specific ruled quadric or one of its degenerate forms, the ambiguity of multiple solutions cannot be resolved, regardless of any additional points belonging to the same quadric. The solution becomes unique if a sixth point not belonging to the specific quadric is introduced. The specific ruled quadrics are called orthogonal ruled quadrics; they are generated by the intersection lines of corresponding planes of a pair of congruent pencils of planes [106, 251]. 'Congruent' pencils means that the angle between corresponding planes of the pencils is constant. 'Orthogonal' means that the cross sections by planes orthogonal to the line through the two camera centers are circles. One simple example is a circular cylinder with the line through the two camera centers lying on the cylinder and parallel to its axis, as illustrated in Figure 3.7. If we split the line through the two camera centers into two lines and move them apart, keeping each line through only one camera center, the circular cylinder deforms into an orthogonal hyperboloid of one sheet.
Fig. 3.7 A circular cylinder is one simple form of the critical configurations.
Remarks • Although the derivation of the five-point algorithm is somewhat long and requires some knowledge of algebraic geometry, it is remarkably easy to implement. Only a few lines of Matlab code were needed by Stewénius et al. in [212]. • It has long been known that the relative motion between two calibrated views can be estimated from five points, due to Kruppa in 1913, Demazure in 1988, and Faugeras and Maybank in 1990. Kruppa originally concluded that there are at most 11 solutions by studying the intersection of sextic curves; Demazure, and Faugeras and Maybank, both confirmed that one known intersection point is a tangent point and should be counted twice instead of once, reducing the number of solutions to 10. However, there was for a long time no method of deriving the coefficients of the degree-ten polynomial for an on-line solution of the problem, and therefore no algorithmic implementation. Triggs revived the subject in [233], reducing the problem to a 20 × 20 linear system using the sparse resultant method, but Nistér [156] first obtained such a polynomial using better elimination tools and proposed a practical solution, and Stewénius et al. [212] confirmed the Gröbner basis structure. Remarkably, these modern methods are all based on the characterization of the algebraic variety by Demazure [33]. The uncalibrated epipolar geometry had already been formulated by Chasles in 1855, and approached by Hesse in 1863 and Sturm in 1869. Kruppa in 1913 introduced the rigidity constraint into the formulation.
3.5 The three-view geometry The study of the geometry of three views naturally extends that of two views [199, 209, 78, 178, 229, 18, 19, 145]. Though two views are the minimum for depth, the point correspondence between two views is not uniquely constrained: a point in one view corresponds only to an epipolar line, not to a point, in the other. Intuitively, if we consider a set of three views as three pairs of two views, it generally suffices to have the epipolar geometry between the first and the third view, and between the second and the third view, for the corresponding point in the third image to be completely determined (see Figure 3.8).
Fig. 3.8 The geometry of three views as two epipolar geometries.
The three-view geometry is fundamental in that it is the minimum number of views for which the image point correspondence ambiguity can be removed.
3.5.1 The trifocal tensor The geometric constraints have so far been derived for points, as consequences of collinearity in a single view or of coplanarity across two views. When a third view is introduced, it is interesting to notice that the line is more natural than the point. We start with lines and end with points.
Line transfer Geometrically, a line l in a view P defines a plane p = P^T l, the back-projection plane or viewing plane of the line. There is no geometric constraint for a pair of corresponding lines in two views, as the two back-projected planes of the two lines always intersect in some space line. Consider now a triplet of corresponding lines in three views, l ↔ l′ ↔ l′′. The basic geometry is that the three back-projected planes should intersect in a common line, rather than in the single point at which three planes generically meet. This is equivalent to
$$\operatorname{Rank}\begin{pmatrix}\mathbf{p}_1^T\\ \mathbf{p}_2^T\\ \mathbf{p}_3^T\end{pmatrix} = 2.$$

Algebraically, we take the projection matrices of the three views to be (I, 0), (A, a), and (B, b), as we are free to arbitrarily fix the first one. The matrices A and B can further be decomposed into their column vectors, A = (a_1, a_2, a_3) and B = (b_1, b_2, b_3). The three planes can then be written as

$$\mathbf{p}_1 = \begin{pmatrix}\mathbf{l}\\ 0\end{pmatrix}, \quad \mathbf{p}_2 = \begin{pmatrix}\mathbf{A}^T\mathbf{l}'\\ \mathbf{a}^T\mathbf{l}'\end{pmatrix}, \quad \mathbf{p}_3 = \begin{pmatrix}\mathbf{B}^T\mathbf{l}''\\ \mathbf{b}^T\mathbf{l}''\end{pmatrix}.$$

The rank-two constraint says that the vectors representing the three planes are linearly dependent. Since the fourth entry of p_1 is zero, the coefficients of the linear combination are obtained from the last row, and we get

l = α(A^T l′) + β(B^T l′′) = (b^T l′′)(A^T l′) − (a^T l′)(B^T l′′).

Introducing T_k = a_k b^T − a b_k^T, the k-th component is l_k = l′^T T_k l′′, so that

$$\mathbf{l} = \begin{pmatrix}\mathbf{l}'^T\mathbf{T}_1\mathbf{l}''\\ \mathbf{l}'^T\mathbf{T}_2\mathbf{l}''\\ \mathbf{l}'^T\mathbf{T}_3\mathbf{l}''\end{pmatrix}.$$

This is the line transfer equation, as it transfers or predicts the line l in the first view from the given lines l′ and l′′ in the other two views. The three matrices T_k actually form a tensor T_i^{jk}. In tensorial notation, the line transfer equation is l_i = l′_j l′′_k T_i^{jk}.
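The matrices T_k are easy to assemble once the canonical projection matrices are known. The sketch below (function names ours) follows the column-vector decomposition used above and applies the line transfer equation; it assumes the first camera has already been reduced to (I | 0).

```python
import numpy as np

def trifocal_from_projections(P2, P3):
    """Trifocal tensor for cameras (I|0), P2 = (A|a), P3 = (B|b).

    Returns a 3x3x3 array with T[k] = a_k b^T - a b_k^T, where a_k and
    b_k are the k-th columns of A and B.
    """
    A, a = P2[:, :3], P2[:, 3]
    B, b = P3[:, :3], P3[:, 3]
    return np.stack([np.outer(A[:, k], b) - np.outer(a, B[:, k])
                     for k in range(3)])

def transfer_line(T, l2, l3):
    """Line transfer: predict the line in the first view from l' and l''."""
    return np.array([l2 @ T[k] @ l3 for k in range(3)])
```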
Point transfer Geometrically, the point transfer was depicted in Figure 3.8: the point in the third view is the intersection of the two epipolar lines. Algebraically, take a triplet of corresponding points in three views, u ↔ u′ ↔ u′′. For a space point x = (x, y, z, t)^T, the first view with projection matrix (I, 0) gives x ≃ (u, t)^T, where t is an unknown fourth coordinate. Projecting (u, t)^T onto the second view, we have

$$\lambda'\mathbf{u}' = (\mathbf{A},\mathbf{a})\begin{pmatrix}\mathbf{u}\\ t\end{pmatrix},$$
Fig. 3.9 The geometry of three views for line correspondence.
which gives two possible expressions for the scalar t:

t_1 = (a_1^T u − u′ a_3^T u)/(u′ a_3 − a_1)  and  t_2 = (a_2^T u − v′ a_3^T u)/(v′ a_3 − a_2),

where a_i^T denotes the i-th row of A, a_i the i-th component of a, and u′ = (u′, v′, 1)^T. Using either value of t, we project the space point, now known, onto the third view. Rearranging terms and introducing the vectors t_{ij}^T = a_i b_j^T − b_j a_i^T, where b_j^T is the j-th row of B and b_j the j-th component of b, we obtain two systems of equations, one for each value of t:

$$\lambda''\mathbf{u}'' = \begin{pmatrix}u'\mathbf{t}_{31}^T - \mathbf{t}_{11}^T\\ u'\mathbf{t}_{32}^T - \mathbf{t}_{12}^T\\ u'\mathbf{t}_{33}^T - \mathbf{t}_{13}^T\end{pmatrix}\mathbf{u}, \qquad \lambda''\mathbf{u}'' = \begin{pmatrix}v'\mathbf{t}_{31}^T - \mathbf{t}_{21}^T\\ v'\mathbf{t}_{32}^T - \mathbf{t}_{22}^T\\ v'\mathbf{t}_{33}^T - \mathbf{t}_{23}^T\end{pmatrix}\mathbf{u}.$$

These two sets of three equations are homogeneous. Eliminating the scale factor, we obtain four equations in the nine vectors t_{ij}. It is interesting to observe that, continuing this game of indices and denoting the k-th element of the vector t_{ij} by a third index k, the set of elements indexed by i, j, and k forms a tensor with three indices varying from 1 to 3. More interestingly, this tensor is the same (up to the arrangement of indices) as the one introduced for the study of lines! We call it the trifocal tensor; it plays the same role for three views as the fundamental matrix does for two views. Using the permutation tensor ε_{abc}, which is +1 when abc is an even permutation of 123, −1 when abc is an odd permutation, and 0 otherwise, the point transfer is nicely summarized as

u^i (u′^j ε_{jac}) (u′′^k ε_{kbd}) T_i^{ab} = 0_{cd}.
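With the same tensor, the point transfer can be carried out by contracting with the first-view point and a line through the second-view point, a standard use of the trifocal tensor. The sketch below (function name ours) assumes T comes from the previous fragment and that the chosen auxiliary line is not the epipolar line of u.

```python
import numpy as np

def transfer_point(T, u1, u2):
    """Transfer the homogeneous point correspondence u1 <-> u2 into view 3.

    We contract the tensor with u1 and an arbitrary line through u2 (here
    the line joining u2 to the point (1, 0, 0)); the sketch assumes this
    line is neither degenerate nor the epipolar line of u1.
    """
    l2 = np.cross(u2, np.array([1.0, 0.0, 0.0]))   # a line through u2
    u3 = np.einsum('i,j,ijk->k', u1, l2, T)        # u3^k = u1^i l2_j T_i^{jk}
    return u3 / u3[2]
```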
The properties of the trifocal tensor T • Degrees of freedom. We can count the degrees of freedom of the system. For a set of two uncalibrated views, each view or camera has 11 degrees of freedom, so the two views amount to 22 degrees of freedom. The entire uncalibrated system is defined only up to a projective transformation, which accounts for 4 × 4 − 1 = 15 degrees of freedom, so the total number of degrees of freedom of a system of two uncalibrated views is 2 × 11 − (4 × 4 − 1) = 7. The fundamental matrix, having exactly 7 degrees of freedom, is therefore a perfect parameterization of two uncalibrated views (cf. page 44). For a set of three uncalibrated views, the same counting argument gives 3 × 11 − (4 × 4 − 1) = 18 degrees of freedom. Since the tensor T has 27 − 1 = 26 inhomogeneous entries, there should be 26 − 18 = 8 independent algebraic constraints among these entries. Some of them are simple, for instance, the rank of each 3 × 3 matrix of the tensor is two, similar to the fundamental matrix. The others are known [80, 47], but they are complex and difficult to integrate into an efficient numerical scheme. • Linear algorithms. For each image point correspondence u ↔ u′ ↔ u′′, there are many trilinear equations given by the trifocal tensor. Four of them (for instance those mentioned above) are linearly independent for an image point correspondence, and two are linearly independent for an image line correspondence in three views. This immediately suggests that seven point correspondences are sufficient to solve linearly for the trifocal tensor T, and thirteen line correspondences are sufficient as well [209, 245]. More generally, and new to the three-view case, any mixture of n points and m lines is sufficient as long as 4n + 2m ≥ 26. • Relations between F and T. Now we answer the question raised at the beginning of this section. Geometrically, both the trifocal tensors and the fundamental matrices describe the 'transfer' of an image correspondence in the first two views to the third view. Are they the same? Indeed, they are equivalent for generic view configurations and generic points [177]. But when the three camera centers are aligned, the epipolar geometries clearly degenerate, so it is impossible to intersect the epipolar lines to transfer to the third view, while the trifocal tensor survives! A more general discussion of the N-view case is given in Section 3.6.
Remarks • Despite the many nice mathematical properties of the trifocal tensor, it is an over-parameterization of the geometry of three views, and its direct estimation is not yet settled. The apparently attractive linear algorithms (for instance, the seven-point linear algorithm) still suffer from instability, and the nonlinear optimization methods suffer from the complexity of integrating all the algebraic constraints.
• The geometry of three uncalibrated views can be solved with a minimum of six point correspondences, as shown in the following section, by solving a cubic equation. It is straightforward to convert the projection matrices computed with the six-point algorithm into the trifocal tensor if such a conversion is needed.
3.5.2 The six-point algorithm The number of invariants Given a set of points or any geometric configuration, the number of invariants is, roughly speaking, the difference between the dimension of the configuration and the dimension of the transformation group acting on it, provided the isotropy group of the configuration has dimension zero (cf. [71, 152]). For a set of six points in P³, there are 3 × 6 − (16 − 1) = 3 absolute invariants under the action of the projective group of P³.
The canonical representation The six points in space. For a set of six points {P_i, i = 1, . . . , 6} in space, there are three invariants. One of the simplest ways to obtain them is to use canonical projective coordinates: the three inhomogeneous projective coordinates of the sixth point with respect to a projective basis defined by the other five characterize the set of six points. Any five of the six points, with no three collinear and no four coplanar, can be assigned the canonical projective coordinates (1, 0, 0, 0)^T, (0, 1, 0, 0)^T, (0, 0, 1, 0)^T, (0, 0, 0, 1)^T, and (1, 1, 1, 1)^T. This choice of basis uniquely determines a space projective transformation A_{4×4}, which transforms the original five points into this canonical basis. The sixth point is transformed by A_{4×4} to (x, y, z, t)^T. Thus, x : y : z : t gives three independent absolute invariants of the six points, and equivalently a projective reconstruction of the six points. The six image points. Consider a set of six image points {p_i, i = 1, . . . , 6}, usually measured in inhomogeneous coordinates (ū_i, v̄_i, 1)^T, i = 1, . . . , 6. Take any four of them, with no three collinear, and assign them the canonical projective coordinates of P²: (1, 0, 0)^T, (0, 1, 0)^T, (0, 0, 1)^T, and (1, 1, 1)^T. This choice of basis uniquely determines a plane projective transformation A_{3×3}, which transforms the fifth point (ū_5, v̄_5, 1)^T and the sixth point (ū_6, v̄_6, 1)^T into (ū_5, v̄_5, w̄_5)^T and (ū_6, v̄_6, w̄_6)^T respectively.
Thus, ū_5 : v̄_5 : w̄_5 and ū_6 : v̄_6 : w̄_6 give four independent absolute invariants of the six image points.
Projection between space and image Recall that, given six point correspondences between space and image with known space points, we can calibrate the camera, i.e., estimate the parameters of the projection matrix P. Here the points in space are unknown, and we want instead to eliminate the camera parameters. Recall also that, theoretically, only 5½ points are needed for DLT calibration; the redundancy introduced by this half point is the key to the following development. Elimination of camera parameters. Consider a set of six point correspondences u_i ↔ x_i, i = 1, . . . , 6, established as

(1, 0, 0)^T ↔ (1, 0, 0, 0)^T,
(0, 1, 0)^T ↔ (0, 1, 0, 0)^T,
(0, 0, 1)^T ↔ (0, 0, 1, 0)^T,
(1, 1, 1)^T ↔ (0, 0, 0, 1)^T,
(ū_5, v̄_5, w̄_5)^T ↔ (1, 1, 1, 1)^T,
(ū_6, v̄_6, w̄_6)^T ↔ (x, y, z, t)^T.

The six point correspondences give 12 = 2 × 6 equations from the projection equation for the eleven unknowns of the uncalibrated camera P_{3×4}. One independent equation (1 = 12 − 11) therefore remains after eliminating all eleven unknown camera parameters p_ij. Substituting the canonical projective coordinates of the first four point correspondences into the projection equation reduces P to a three-parameter family of camera matrices,

$$\mathbf{P} = \begin{pmatrix}\alpha & & & \lambda\\ & \beta & & \lambda\\ & & \gamma & \lambda\end{pmatrix},$$

where α = p_11, β = p_22, γ = p_33, and λ = p_34. Using the projective coordinates of the fifth point further reduces P to a one-parameter family of camera matrices, homogeneous in μ and ν:

$$\mathbf{P} = \begin{pmatrix}\bar u_5\mu - \nu & & & \nu\\ & \bar v_5\mu - \nu & & \nu\\ & & \bar w_5\mu - \nu & \nu\end{pmatrix}. \qquad (3.2)$$

Finally, adding the sixth point, all the camera parameters μ and ν (the original p_ij) are eliminated, and we obtain the following homogeneous equation in the unknowns x, y, z, t and the known {(ū_i, v̄_i, w̄_i), i = 5, 6}:
$$\bar w_6(\bar u_5 - \bar v_5)xy + \bar v_6(\bar w_5 - \bar u_5)xz + \bar u_5(\bar v_6 - \bar w_6)xt + \bar u_6(\bar v_5 - \bar w_5)yz + \bar v_5(\bar w_6 - \bar u_6)yt + \bar w_5(\bar u_6 - \bar v_6)zt = 0. \qquad (3.3)$$
Invariant interpretation. Equation 3.3 can be arranged and interpreted as an invariant relationship between the invariants of P³ and P². Let i_j and ξ_j denote respectively

i_1 = w̄_6(ū_5 − v̄_5), i_2 = v̄_6(w̄_5 − ū_5), i_3 = ū_5(v̄_6 − w̄_6), i_4 = ū_6(v̄_5 − w̄_5), i_5 = v̄_5(w̄_6 − ū_6), i_6 = w̄_5(ū_6 − v̄_6),

and

ξ_1 = xy, ξ_2 = xz, ξ_3 = xt, ξ_4 = yz, ξ_5 = yt, ξ_6 = zt.

By the counting arguments, there are only four invariants for a set of six points in P², so the homogeneous invariants {i_j, j = 1, . . . , 6} are not independent but are subject to one (1 = 5 − 4) additional constraint,

i_1 + i_2 + i_3 + i_4 + i_5 + i_6 = 0,

whose verification is straightforward from the definition of the i_j. There are only three independent invariants for a set of six points in space, so the invariants {ξ_j, j = 1, . . . , 6} are subject to two additional constraints,

ξ_1/ξ_2 = ξ_5/ξ_6  and  ξ_2/ξ_3 = ξ_4/ξ_5,

by inspection of the definition of the ξ_j. The invariant relation 3.3 is then simply expressed as a bilinear homogeneous relation,

i_1 ξ_1 + i_2 ξ_2 + i_3 ξ_3 + i_4 ξ_4 + i_5 ξ_5 + i_6 ξ_6 = 0,   (3.4)

which is independent of any camera parameters and can therefore be used with uncalibrated images.
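The six invariants i_j of one view are simple to compute once four of the image points are mapped to the canonical basis of P². The following sketch (function name ours; the particular homography construction is one standard choice) builds the plane homography from the first four points and evaluates the i_j; their sum should vanish, a useful sanity check.

```python
import numpy as np

def image_invariants(points):
    """Coefficients i_1..i_6 of Eq. (3.4) for one view.

    `points` is a 6x2 array of inhomogeneous image coordinates; the first
    four points are sent to the canonical basis of P^2.
    """
    p = np.hstack([points, np.ones((6, 1))])      # homogeneous, 6x3
    M = p[:3].T                                   # columns p1, p2, p3
    s = np.linalg.solve(M, p[3])                  # scales so that p4 = M @ s
    H = np.linalg.inv(M * s)                      # maps p1..p3 -> e1..e3, p4 -> (1,1,1)
    u5, v5, w5 = H @ p[4]
    u6, v6, w6 = H @ p[5]
    return np.array([w6*(u5 - v5), v6*(w5 - u5), u5*(v6 - w6),
                     u6*(v5 - w5), v5*(w6 - u6), w5*(u6 - v6)])

# Sanity check: image_invariants(pts).sum() should be ~0.
```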
Invariant computation from three images There are only three independent invariants for a set of six points in space. Each view gives one invariant relation, so we can hope to solve for these three invariants when three views of the six points are available. The three homogeneous quadratic equations in x, y, z, t from the three views can be written as follows.
f_1(x, y, z, t) = 0,  f_2(x, y, z, t) = 0,  f_3(x, y, z, t) = 0.

Each equation represents a quadric surface of rank three. The quadratic forms have no square terms x², y², z², or t², which means that each quadric passes through the vertices of the reference tetrahedron with canonical coordinates (0, 0, 0, 1)^T, (0, 0, 1, 0)^T, (0, 1, 0, 0)^T, and (1, 0, 0, 0)^T; this is easily verified by substituting these points into the equations. In addition, as the coefficients i_j^(k) of the quadratic forms are subject to Σ_{j=1}^{6} i_j = 0, all the quadrics necessarily pass through the unit point (1, 1, 1, 1)^T as well. According to Bézout's theorem, three quadric surfaces meet in 2 × 2 × 2 = 8 points. Since they already pass through the five known points, only 8 − 5 = 3 common points remain. Thus, the maximum number of solutions for x : y : z : t is three. Indeed, by successively computing resultants to eliminate variables, we first obtain two homogeneous polynomials of degree three in x, y, and t, g_1(x, y, t) and g_2(x, y, t), by eliminating z between f_1 and f_3 and between f_2 and f_3. Then, by eliminating y between g_1 and g_2, we obtain a homogeneous polynomial of degree eight in x and t, which factorizes, as expected, into

x t (x − t)(b_1 x² + b_2 xt + b_3 t²)(a_1 x³ + a_2 x²t + a_3 xt² + a_4 t³).

The linear factors and the quadratic factor lead to trivial solutions. Thus, the only nontrivial solutions for x/t are those of the cubic equation

a_1 x³ + a_2 x²t + a_3 xt² + a_4 t³ = 0.   (3.5)
The implicit expressions for the a_i can easily be obtained with Maple (available, for instance, at www.cs.ust.hk/~quan/publications/proccubicxyz). The cubic equation can be solved algebraically by Cardano's formula, either for x/t or for t/x.
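Numerically, Cardano's formula can simply be replaced by a polynomial root finder; a minimal sketch, assuming the coefficients a_1, . . . , a_4 have already been evaluated:

```python
import numpy as np

def solve_cubic(a1, a2, a3, a4):
    """Real roots of a1*s^3 + a2*s^2 + a3*s + a4 = 0, with s = x/t."""
    roots = np.roots([a1, a2, a3, a4])
    return [r.real for r in roots if abs(r.imag) < 1e-9]
```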
Projective reconstruction The determination of the projective coordinates of the sixth point completely determines the geometry of the three cameras up to a projective transformation. All geometric quantities related to the three views then follow. • Given the projective coordinates (x, y, z, t) of the sixth point, the projection matrix P_i of each camera can be computed using Equation 3.2. • Given the projection matrices, any point correspondence in the three views can be reconstructed in P³ up to a collineation. • Given the projection matrices, it is also straightforward to compute any pair-wise fundamental matrix and any trifocal tensor.
Algorithm 6 (The six-point projective reconstruction) Given six image point correspondences u_i ↔ u_i′ ↔ u_i′′ for i = 1, . . . , 6 in three views, compute the projective projection matrices P, P′, and P′′ of the three views.
1. Compute a plane homography in each view to transform the points u_i, u_i′, and u_i′′ respectively into the canonical basis.
2. Compute the i_j in each view from the transformed image points.
3. Obtain the coefficients of the cubic equation (the coefficients can be downloaded from www.cs.ust.hk/~quan/publications/).
4. Solve the cubic equation 3.5 in closed form.
5. For each solution of x, obtain (x, y, z, t).
6. Compute the projection matrix for each view from (x, y, z, t).
There are at most three solutions for P, P′, and P′′.
The critical configurations • When the six points are coplanar, there is an infinite number of solutions, regardless of any additional points that are coplanar with the six points. • The six points define a twisted cubic. If any one of the three camera centers lies on this twisted cubic, the solution is ambiguous, regardless of any additional points lying on the cubic. • If six or more points and the three camera centers lie on a degree-four curve belonging to a specific six-parameter family, the solution is ambiguous [139, 80].
Remarks • The trifocal tensor over-parameterizes the geometry of the three views. The six points give a minimal parameterization, so the six-point algorithm is the method of choice for estimating three uncalibrated views. Readers should not be afraid of directly plugging the lengthy coefficients produced by a symbolic computation in Maple into a programming language like C [178, 119]. • It is no accident that the six-point algorithm in three views and the seven-point algorithm in two views both lead to the solution of a cubic equation. There is in fact an intrinsic duality between these two configurations, elucidated by Carlsson in [18, 19]. For a reduced camera

$$\mathbf{P} = \begin{pmatrix}\alpha & & & \lambda\\ & \beta & & \lambda\\ & & \gamma & \lambda\end{pmatrix}$$

with the canonical representation for the reference tetrahedron, it is interesting to see that the camera center is o = (1/α, 1/β, 1/γ, −1/λ)^T and
$$\begin{pmatrix}\alpha & & & \lambda\\ & \beta & & \lambda\\ & & \gamma & \lambda\end{pmatrix}\begin{pmatrix}x\\ y\\ z\\ t\end{pmatrix} = \begin{pmatrix}x & & & t\\ & y & & t\\ & & z & t\end{pmatrix}\begin{pmatrix}\alpha\\ \beta\\ \gamma\\ \lambda\end{pmatrix},$$

in which the roles of the camera center o and the space point x can be exchanged. The geometry of n points in v views is thus equivalent to that of v + 4 points in n − 4 views; in particular, six points in three views is equivalent to seven points in two views.
3.5.3 The calibrated three views We start with counting arguments to obtain the minimal configuration of four points in three calibrated views, then we propose a minimal parameterization of the system. The problem remains open, however, for lack of efficient algorithms.
The counting arguments For 3D reconstruction from image points, each image point gives two constraints, each 3D point introduces three degrees of freedom and each camera pose introduces six degrees of freedom, but there are seven degrees of freedom in the 3D coordinate system (six for the Euclidean coordinate frame and one for the scene scale). So a system of n points visible in m calibrated images yields 2mn constraints in 3n + 6m − 7 unknowns. To have at most finitely many solutions, we therefore need: 2mn ≥ 3n + 6m − 7.
(3.6)
Minimal cases are given by equality, so we look for integer solutions in m and n satisfying

n = 3 + 2/(2m − 3).

For m = 2, n = 5: a minimum of five points is required for two-image relative orientation and Euclidean reconstruction. For any m ≥ 3, n lies between 3 and 4, so at least four points are required for Euclidean reconstruction from m ≥ 3 views of unknown space points. In fact, four points in three images suffice to fix the 3D structure, after which just three points are needed in each subsequent image to fix the camera pose (the standard three-point pose problem [206]). So at least four points are always required for Euclidean reconstruction, and among the minimal m ≥ 3 cases, the four-point three-view problem is the most interesting. Note that for four points in three views, Eq. (3.6) becomes 2mn = 24 ≥ 3n + 6m − 7 = 23. The counting suggests that the problem is over-specified.
An over-specified polynomial system generically has no solutions, but here (in the noiseless case) we know that there is at least one (the physical one). It is tempting to conclude that the solution is unique. In an appropriate formulation this does in fact turn out to be the case, but it needs to be proved rigorously. For example in the two image case, the ‘twisted’ partner of the physical solution persists no matter how many points are used, so the system becomes redundant but always has two solutions. The issue is general. Owing to the redundancy, for arbitrary image points there is no solution at all. To have at least one solution, the image points must satisfy some (here one, unknown and very complicated) polynomial constraints saying that they are possible projections of a possible 3D geometry. When these constraints on the constraints (i.e. on the image points) are correctly incorporated, the constraint counting argument inevitably gives an exactly specified system, not an overspecified one. To find out how many roots actually occur, the only reliable method is detailed polynomial calculations.
The basic Euclidean constraint An uncalibrated image point u_i in pixels and its calibrated counterpart x̄_i are related by the known calibration matrix K such that u_i = K x̄_i. The calibrated point x̄_i = K^{-1} u_i is a 3-vector representing a 3D direction in the camera-centered coordinate frame. For convenience, we assume that each direction vector is normalized to unit length, x̄ ≡ x̄/∥x̄∥. The 3D point corresponding to the back-projection of an image point/direction x̄ is determined by a depth λ as λx̄; the depth λ is the camera-point distance. The distance between two 3D points represented by 3-vectors p and q is given by the cosine rule:

∥p − q∥² = ∥p∥² + ∥q∥² − 2p^T q.

Applying this to the normalized direction vectors representing the 3D points in the camera frame, and using ∥x̄_p∥ = 1, gives

λ_p² + λ_q² − c_pq λ_p λ_q = δ_pq²,
where c_pq = 2 x̄_p^T x̄_q = 2 cos(θ_pq) is a known constant from the image points, and δ_pq is the unknown distance between the space points. These cosine-rule constraints are exactly the same as those of calibrated camera pose from known 3D points in Section 3.2.4 and [206, 54], except that here the inter-point distances δ_pq, which are the known quantities d_pq in pose estimation, are unknowns that must be eliminated.
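The known coefficients c_pq are computed directly from the calibrated image directions; a small numpy sketch (function name ours):

```python
import numpy as np

def cosine_coefficients(K, u):
    """Known coefficients c_pq = 2 cos(theta_pq) of the cosine-rule constraints.

    K is the 3x3 calibration matrix; u is a 4x3 array of homogeneous pixel
    coordinates of the four points in one view.
    """
    x = (np.linalg.inv(K) @ u.T).T                 # calibrated directions
    x /= np.linalg.norm(x, axis=1, keepdims=True)  # unit direction vectors
    return 2.0 * x @ x.T                           # c[p, q] = 2 x_p . x_q
```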
The four-point configuration A set of four 3D points has 6 independent Euclidean invariants: 3 × 4 = 12 degrees of freedom, modulo the six degrees of freedom of a Euclidean transformation,
and it is convenient to take these to be the $\binom{4}{2} = 6$ inter-point distances δ_pq, i.e., the edge lengths of the tetrahedron in Figure 3.10. This structure parameterization is very convenient here, as the δ_pq appear explicitly in the cosine-rule polynomials. It would be less convenient with more than 4 points, however, as the inter-point distances would then no longer all be independent.
Fig. 3.10 The configuration of four non-coplanar points in space.
The polynomial system. For m images of the 4 points, we obtain a system of 6m homogeneous cosine-rule polynomials in the 4m unknowns λ_ip and the 6 unknowns δ_pq:

f(λ_ip, λ_iq, δ_pq) = 0,  i = 1, . . . , m,  p < q = 1, . . . , 4.

The unknown inter-point distances δ_pq can be eliminated by equating cosine-rule polynomials from different images,

λ_ip² + λ_iq² − c_ipq λ_ip λ_iq = λ_jp² + λ_jq² − c_jpq λ_jp λ_jq (= δ_pq²),

leaving a system of 6(m − 1) homogeneous quadratics in the 4m homogeneous unknowns λ_ip. For m = 3 images we obtain 18 homogeneous polynomials f(λ_ip, λ_iq, δ_pq) = 0 in the 18 unknowns λ_ip, δ_pq, or equivalently 12 homogeneous polynomials in the 12 unknowns λ_ip. De-homogenizing (removing the overall 3D scale factor) leaves just 11 inhomogeneous unknowns, so, as expected, equation counting suggests that the system is slightly redundant. The simulation method using random rational numbers. The polynomial system has only finitely many solutions if the dimension of the variety is zero, an infinite number of solutions for positive dimension, and no consistent solution for negative dimension. Generally, we expect only a finite number of solutions for a well-defined geometric problem, i.e., a variety of dimension zero. Combining elimination via lexicographic Gröbner bases with numerical root-finding for one-variable polynomials conceptually gives a general polynomial solver. But it is often infeasible, simply because the Gröbner bases cannot be computed with limited computer resources. This is the case for our polynomial system g(λ_ip, λ_iq, λ_jp, λ_jq) with parametric coefficients c_ipq. We choose the approach offered by Macaulay 2 (http://www.math.uiuc.edu/Macaulay2) among other computer
algebra systems, which allows computation with coefficients in modular arithmetic (a finite prime field k = Z/⟨p⟩) to speed up the computation and minimize memory requirements. Using this simulation method on random rational instances, we can currently establish the following results. • Euclidean reconstruction from four point correspondences that come from unknown coplanar points has a unique double solution in three views, but is generally unique in n > 3 views. • Euclidean reconstruction from four point correspondences that come from known coplanar points is generally unique in n ≥ 3 views.
Remarks The minimal Euclidean reconstruction in multiple views is still inconclusive due to its algebraic complexity. Longuet-Higgins [125] described an iterative method for finding the solutions to the case of four points in three perspective images, starting from the solution obtained under the simplified approximation of three scaled orthographic images. Holt and Netravali [84] proved that "there is, in general, a unique solution for the relative orientation ... However, multiple solutions are possible, and in rare cases, even when the four feature points are not coplanar" [84]. They used results from algebraic geometry to draw general conclusions regarding the number of solutions by considering a single example. This is certainly a step further toward establishing the general uniqueness of the solutions, but many questions remain unanswered, and no efficient algorithm exists for this problem as yet.
3.6 The N-view geometry 3.6.1 The multi-linearities A natural question is whether constraints similar to the fundamental matrix and the trifocal tensor exist for more than three views. In fact, everything comes from the projection equations, so all these constraints are purely algebraic consequences of the projection equations.
The formulation Based on [231, 49], for n views, we can easily re-write the projection equations into the following form. Given a space point x projected into n views:
λ_1 u_1 = P_1 x,  λ_2 u_2 = P_2 x,  . . . ,  λ_n u_n = P_n x.

We can pack these equations into matrix form:

$$\underbrace{\begin{pmatrix}\mathbf{P}_1 & \mathbf{u}_1 & & & \\ \mathbf{P}_2 & & \mathbf{u}_2 & & \\ \vdots & & & \ddots & \\ \mathbf{P}_n & & & & \mathbf{u}_n\end{pmatrix}}_{\mathbf{M}}\begin{pmatrix}\mathbf{x}\\ -\lambda_1\\ -\lambda_2\\ \vdots\\ -\lambda_n\end{pmatrix} = \mathbf{0}.$$

Since the vector (x, −λ_1, −λ_2, . . . , −λ_n)^T cannot be zero, the 3n × (n + 4) matrix M is rank deficient: its rank is less than n + 4. This implies that all its (n + 4) × (n + 4) minors have to vanish. The expansion of all these minors gives all the geometric constraints that can exist among multiple views.
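The matrix M is straightforward to assemble, and its rank deficiency can be checked numerically for a consistent correspondence; a sketch (function name ours):

```python
import numpy as np

def joint_image_matrix(Ps, us):
    """Assemble the 3n x (n+4) matrix M of the multi-view constraint.

    Ps: list of n 3x4 projection matrices; us: the corresponding homogeneous
    image points of one space point. For a consistent correspondence M is
    rank deficient (rank < n + 4), so all (n+4)x(n+4) minors vanish.
    """
    n = len(Ps)
    M = np.zeros((3 * n, 4 + n))
    for i, (P, u) in enumerate(zip(Ps, us)):
        M[3 * i:3 * i + 3, :4] = P
        M[3 * i:3 * i + 3, 4 + i] = u
    return M

# Example check: np.linalg.matrix_rank(joint_image_matrix(Ps, us)) < len(Ps) + 4
```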
The multi-linear constraints The minors can be formed from different combinations of rows. When a minor involves rows from only two different views, its expansion yields the geometric constraints for these two views. We call them the bilinear constraints; they are de facto the epipolar constraints, i.e., the fundamental matrix. Likewise, when a minor involves rows from only three different views, its expansion yields the constraints for these three views. We call them the trilinear constraints, which are de facto the trifocal tensor. When a minor involves rows from four different views, its expansion yields the constraints for these four views, called the quadrilinear or quadrifocal constraints. Since M has n + 4 columns and 3n rows, a minor cannot involve more than four projection matrices. Therefore, there are no constraints involving more than four different views.
The algebraic relations The algebraic relations among these multi-linear constraints have been investigated in [75, 231, 49, 177]. • It is easy to see that the quadrilinear constraints are not algebraically independent, and that they break up into trilinear and bilinear constraints. Furthermore, these quadrilinear constraints are redundant due to the intrinsic Grassmannian quadratic relations among the minors.
• The relation between the trilinear and the bilinear constraints is more subtle. The key is to single out the degenerate configurations of views and points. For generic view configurations and generic points, all multi-linear constraints reduce algebraically to the algebraically independent bilinear constraints. In other words, for generic views and points, all matching constraints are contained in the ideal generated by the bilinear constraints alone. As a consequence, 2n − 3 algebraically independent bilinearities from pairs of views completely describe the algebraic/geometric structure of n uncalibrated views for generic views and points. For degenerate points of generic views, each type of constraint reduces differently; the exact reduced forms of the matching constraints have also been made explicit by computer algebra.
Remarks This formulation gives a global picture of all geometric constraints for multiple views. These multi-linear constraints generally over-parameterize the geometry of n views, and their numerical exploitation has not yet been fully worked out.
3.6.2 Auto-calibration The idea of auto-calibration is to upgrade a computed projective reconstruction to a Euclidean reconstruction by taking advantage of constraints on the intrinsic parameters of the cameras in multiple views. The idea was introduced by Maybank and Faugeras [138] using the Kruppa equations with a minimal parameterization.
Basic equations Consider a projective reconstruction of 3D points x_i and camera matrices P_j. By the very definition of a projective reconstruction from uncalibrated cameras, it is defined only up to a space projective transformation, because u_i = Px = (PH)(H^{-1}x) for any projective transformation H. In auto-calibration, we seek a specific 4 × 4 transformation H such that, if x_i → H^{-1}x_i, the camera matrices are brought to the calibrated Euclidean form

P_j H = λ_j K_j (R_j, t_j).

Auto-calibration looks directly for the unknown intrinsic parameters K_j, and indirectly for the rectifying transformation matrix H. The K_j and H are not independent.
We first eliminate R_j and t_j. Taking the 3 × 3 part of the 3 × 4 matrices and using the orthogonality constraint RR^T = I, we obtain

(P_j H)_{3×3} (P_j H)_{3×3}^T = λ_j K_j K_j^T,

with λ_j an unknown scale. Rearranging the matrices leads to

$$\mathbf{P}_j\left(\mathbf{H}\begin{pmatrix}\mathbf{I}_3 & \mathbf{0}\\ \mathbf{0}^T & 0\end{pmatrix}\mathbf{H}^T\right)\mathbf{P}_j^T = \lambda_j\mathbf{K}_j\mathbf{K}_j^T.$$

The fundamental auto-calibration equation, written for each view, is therefore

P_j Q P_j^T = λ_j C_j,   (3.7)

where

Q ≡ H Diag(1, 1, 1, 0) H^T  and  C_j ≡ K_j K_j^T.   (3.8)

Auto-calibration thus consists of computing the unknowns Q and C_j from the known P_j via the five equations of Equation 3.7 for each view. Once Q and C_j are computed, H and K_j follow: K_j is obtained from the Cholesky decomposition of C_j, and H is computed as A Diag(λ_1^{1/2}, λ_2^{1/2}, λ_3^{1/2}, 1) from the eigen-decomposition Q = A Diag(λ_1, λ_2, λ_3, 0) A^T.
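The last step, recovering K_j and H from C_j and Q, is a pair of standard factorizations; the sketch below (function names ours) uses an eigen-decomposition for H and a Cholesky-style factorization for K, with small guards for numerical noise.

```python
import numpy as np

def rectifying_homography(Q):
    """H from the dual absolute quadric: Q = A diag(l1,l2,l3,0) A^T,
    H = A diag(sqrt(l1), sqrt(l2), sqrt(l3), 1). Q is assumed symmetric
    (symmetrize an estimated Q beforehand if necessary).
    """
    w, A = np.linalg.eigh(Q)
    order = np.argsort(w)[::-1]                # descending: l1, l2, l3, ~0
    w, A = w[order], A[:, order]
    d = np.sqrt(np.maximum(w[:3], 0.0))        # guard against small negatives
    return A @ np.diag(np.append(d, 1.0))

def intrinsics_from_C(C):
    """Upper-triangular K with C = K K^T, normalized so that K[2,2] = 1."""
    L = np.linalg.cholesky(np.linalg.inv(C))   # inv(C) = K^-T K^-1 = L L^T
    K = np.linalg.inv(L).T
    return K / K[2, 2]
```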
Geometric interpretations We now give a geometric interpretation of Q and C_j. Let the vector u = (u_1, u_2, u_3, u_4)^T denote dual plane coordinates in space (not an image point u = (u, v)). A similarity, i.e., a Euclidean transformation up to scale, is a projective transformation that leaves the so-called absolute conic invariant. The absolute conic is a virtual conic on the plane at infinity, given as a point locus by x_1² + x_2² + x_3² = 0 = x_4. The dual form of the absolute conic, viewed as an envelope of planes in plane coordinates, is

u_1² + u_2² + u_3² = u^T Diag(1, 1, 1, 0) u = 0.

We call the dual of the absolute conic the dual absolute quadric. It is a degenerate quadric envelope, and the unique null space of this rank-three matrix is the plane at infinity. The canonical form of the dual absolute quadric in a Euclidean frame is

$$\operatorname{Diag}(1, 1, 1, 0) = \begin{pmatrix}\mathbf{I}_{3\times 3} & \mathbf{0}\\ \mathbf{0}^T & 0\end{pmatrix}.$$

The representation by the dual absolute quadric is easier to manipulate than that of the absolute conic, as it is a single matrix with the rank constraint implicit within it. From the definition of Q in 3.8, we have
$$\mathbf{H}^{-1}\mathbf{Q}\mathbf{H}^{-T} = \begin{pmatrix}\mathbf{I}_{3\times 3} & \mathbf{0}\\ \mathbf{0}^T & 0\end{pmatrix}.$$
Immediately, Q can be interpreted as the dual absolute quadric in the projective frame before rectification, and H as the rectifying homography that brings the dual absolute quadric from an arbitrary projective basis to the canonical Euclidean basis. Using the plane-line projection relation u = P^T l, a dual quadric u^T Q u = 0 is projected onto (P^T l)^T Q (P^T l) = 0, that is, l^T (P Q P^T) l = 0. This equation describes an image conic envelope in dual line coordinates with the 3 × 3 symmetric matrix P Q P^T. Finally, the general auto-calibration equation 3.7 admits a simple geometric interpretation: the absolute quadric Q is projected onto the dual of the image of the absolute conic, C.
Auto-calibration algorithms We first use counting arguments to stress the practical difficulties of auto-calibration, then we give two common and practical methods. The degrees of freedom. A rank-three quadric Q has eight degrees of freedom and a conic C has five degrees of freedom. But Q and C, or H and K, are not independent; they are related through the first reference camera matrix K_1(I, 0):

$$\mathbf{H} = \begin{pmatrix}\mathbf{K}_1 & \mathbf{0}\\ \mathbf{v}^T & 1\end{pmatrix},$$

and the vector v introduces only three additional degrees of freedom. It is easy to see that auto-calibration is impossible if we do not impose additional constraints on the K_j. The art of auto-calibration lies in making different assumptions on the K_j and simplifying the parameterization of H so as to make the system more tractable. The constancy of intrinsic parameters. We assume constant but unknown intrinsic parameters K_j ≡ K. This was the original assumption used to introduce the concept of auto-calibration. The polynomial system can be solved iteratively starting from three views, and an initial solution for K is usually easy to provide. In this case, we can eliminate the three additional independent parameters in H or Q to obtain a quadratic polynomial system in only the five unknowns of K. This is equivalent to the so-called Kruppa equations introduced to computer vision by Faugeras and Maybank. The Kruppa equations use the minimal parameterization, but are prone to singular cases [214]; the redundant absolute quadric parameterization does not suffer from the same singular cases. The variable focal length. The only reasonable scenario is that each camera has a single unknown focal length. In this case, the equations become linear, as observed
by Pollefeys et al. in [169], based on the absolute quadric introduced by Triggs [232]. A small caveat is that we should first translate the known principal point (u_0, v_0) to (0, 0), so that K becomes the diagonal matrix Diag(f, f, 1); then C_j = Diag(f_j², f_j², 1) and

$$\mathbf{Q} = \begin{pmatrix}\operatorname{Diag}(f_1^2, f_1^2, 1) & \mathbf{a}\\ \mathbf{a}^T & \|\mathbf{a}\|^2\end{pmatrix}.$$

Using the four equations c_11 = c_22 and c_12 = c_13 = c_23 = 0 for each view leads to a linear system in the unknowns f_j², a = (a_1, a_2, a_3)^T, and ∥a∥². This indeed turns out to be a practical method, requiring only the solution of a linear system. We may further impose f_j = f, giving a linear system with fewer unknowns.
Algorithm 7 (The linear auto-calibration) Given projective projection matrices P_j for j = 1, . . . , n views, compute the rectification matrix H and the unknown focal lengths f_j for j = 1, . . . , n.
1. Transform each K_j into Diag(f_j, f_j, 1) using the known (u_0, v_0) of each camera.
2. Form the linear system for each view in the unknowns f_j², a_1, a_2, a_3, ∥a∥².
3. Solve the linear system for these unknowns, the focal lengths f_j, and the matrix Q.
4. Decompose Q into H.
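A sketch of steps 2 and 3 of Algorithm 7, with the parameterization of Q used above (the entry names, the least-squares formulation, and the assumption that the cameras are in a canonical projective frame with principal points at the origin are ours; degeneracies and data normalization are ignored):

```python
import numpy as np

def linear_autocalibration(Ps):
    """Estimate the dual absolute quadric Q with the linear method.

    Q is parametrized as [[q0,0,0,q1],[0,q0,0,q2],[0,0,1,q3],[q1,q2,q3,q4]],
    where q0 = f_1^2 and q4 stands for the ||a||^2 entry, treated here as an
    independent unknown. For each view, c11 - c22 = 0 and c12 = c13 = c23 = 0
    on C = P Q P^T are linear in q = (q0, q1, q2, q3, q4).
    """
    def Q_of(q):
        return np.array([[q[0], 0, 0, q[1]],
                         [0, q[0], 0, q[2]],
                         [0, 0, 1, q[3]],
                         [q[1], q[2], q[3], q[4]]])

    A, b = [], []
    for P in Ps:
        C0 = P @ Q_of(np.zeros(5)) @ P.T                    # constant part (the fixed 1)
        Cs = [P @ Q_of(e) @ P.T - C0 for e in np.eye(5)]    # linear parts
        for pick in [lambda C: C[0, 0] - C[1, 1],
                     lambda C: C[0, 1],
                     lambda C: C[0, 2],
                     lambda C: C[1, 2]]:
            A.append([pick(C) for C in Cs])
            b.append(-pick(C0))
    q, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    # The focal lengths then follow from C_j = P_j Q P_j^T for each view.
    return Q_of(q)
```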
The critical configurations The singularities of auto-calibration have been studied by Sturm in [214]. The critical configurations for auto-calibration are configurations of the cameras, not of the points. • If the principal axes of the n views meet in a point at infinity, auto-calibration is impossible. This is in fact a translating camera with parallel principal axes; only affine reconstruction is possible. • If the principal axes of the n views meet in a finite point, the linear auto-calibration algorithm is ambiguous. Auto-calibration becomes unstable when the camera configuration is close to these critical configurations.
Remarks • The more general formulation using the absolute quadric is due to Triggs [232]. The notion of auto-calibration is conceptually attractive. However, given that the intrinsic parameters of a camera are correlated with each other, that few independent equations exist per view, and that a statistical optimization criterion is lacking, the numerical estimation in auto-calibration is unsatisfactory and less robust than, for instance, projective reconstruction.
• In practice, most of the intrinsic parameters are well known in advance. For instance, the principal point is at the center of the image plane, there is no skew, and the pixels are square. For many non-metric applications of reconstruction, these prior specifications of the intrinsic parameters are no worse than the results obtained by an auto-calibration algorithm. Often only the focal length is worth auto-calibrating, and the recommended linear auto-calibration method does allow a variable unknown focal length. More robust results for a fixed unknown focal length are reported in [119].
3.7 Discussions To calibrate or not to calibrate? That has been the subject of discussion for a while. In a purely uncalibrated framework, the seven-point algorithm for two views and the six-point algorithm for three views are the workhorses, followed by a global bundle adjustment of the projective structure. Such a system has been developed in [119]. It is remarkable that no knowledge of the cameras is required as long as the images overlap and are free of nonlinear distortions. Nevertheless, all uncalibrated methods suffer from the singularity of coplanar scenes, while the calibrated methods do not. This is a big advantage of working in a calibrated framework. The intrinsic parameters are often available (new digital cameras record them) and are often sufficient for non-metric applications; the final bundle adjustment is compulsory in any case. These arguments support a calibrated framework in which the five-point relative orientation algorithm and the three-point pose algorithm are the workhorses, as the combination of the two gives a working algorithm for three calibrated views. Nevertheless, the seven-point algorithm remains almost indispensable for finding correspondences at the very beginning of the pipeline. If quick prototyping is needed, the eight-point algorithm can be used in lieu of the seven-point algorithm for its implementation simplicity. The open question is an efficient algorithm for the case of three calibrated views with as few as four points.
3.8 Bibliographic notes The study of camera geometry has a long history and originated in photogrammetry [206, 251], in particular for the classical tasks of pose and calibration. The 'DLT' methods were proposed in [206, 218, 65] and were further improved by Faugeras and Toscani [52] using a different constraint on the projection matrix. Lenz and Tsai [110] proposed both linear and nonlinear calibration methods. The three-point pose algorithm can be traced back to the photogrammetrists in 1841 [73], and many variants [54, 57, 73] of the basic three-point algorithm have been developed since.
The two-view geometry in the uncalibrated projective framework, summarized in the seven-point algorithm, was revived in the 1990s [48, 42, 146, 145, 43, 130, 131, 79, 76, 228, 255]. The material is now standard and can be found in standard textbooks [43, 47, 80, 59, 132]. The efforts of Nistér and Stewénius in making the five-point algorithm workable are a major contribution, as it is probably the most fundamental component of vision geometry. Remarkably, these methods are based on the characterizing polynomials presented by Demazure [33] in his proof of the number of solutions. The trifocal tensor in the calibrated case for points and lines comes from [78, 199, 209, 245]; its introduction into the uncalibrated framework and its development are due to [199, 78]. The six-point algorithm is adapted from [178] by Quan, which first appeared in a conference version [176]. The systematic study of the multi-linearities and their relations has been carried out in [231, 83, 50]. Auto-calibration originated with Maybank and Faugeras [138], but the introduction of the absolute quadric by Triggs in [232] led to the simpler and more practical methods presented in [169]. There is also a large body of literature on special auto-calibration methods with special constraints. Approximate camera models such as orthographic, weak-perspective and affine cameras have been considered; a good review can be found in [100, 198, 180]. There are also methods for specific motions such as planar motion and circular motion [2, 97, 96, 51]. They are purposely not discussed here, as they are useful only for specific applications and are absent from general structure-from-motion systems.
Part II
Computation: from pixels to 3D points
Chapter 4
Feature point
This chapter discusses the extraction of point-like features from images as ideal geometric points for 3D geometry computation. Usually, we track a point when the two images are very close with small disparities, which is typical of adjacent video frames. We match a point when the two images are taken from more distant viewpoints with larger disparities. We recognize a point when the second 'image' is a large, abstract database of image features. The key observation is that the same characteristics of these point features appear in all these different contexts.
4.1 Introduction The location (x, y) of a pixel in a digital image is just a geometric point in the image plane if we ignore its intensity value. These points need to be distinguishable so that they can be detected and matched across different images. The intensity of each pixel is therefore its identity, or its characteristics and features. Yet the intensity is a complex function of many parameters of the image formation; it is insufficient as the unique identity of a point even under the simplest Lambertian assumption. The Lambertian assumption is nevertheless our default assumption in most cases, as we are unable to handle more complex lighting models. The goal is to identify these distinguished pixels from the characteristics of their intensity values, certainly not from a single isolated pixel but from a pixel together with its vicinity. The subject of feature detection is vast in computer vision, image processing, and pattern recognition. Since we are geometrically motivated, we classify image features into one-dimensional edge features and two-dimensional point features. We leave the edge detection problem to the literature; Canny's work [17] has been the settled reference for almost two decades. Point features are prevalent in vision geometry, as they provide strong geometric constraints, while edge points or line segment features only weakly constrain the geometry: an image line provides only one constraint, in the normal direction of the line. Nevertheless, line features are still important in recognition and modeling, as advocated in [134], and they play important roles in the modeling of façades and buildings in Chapters 9 and 10.
4.2 Points of interest 4.2.1 Tracking features A pixel of an image I(x) at x is said to be of interest when it can be identified in another image I′(x′) taken from a different viewpoint. Imaging physics is usually complex, so we have no choice but to adopt the Lambertian assumption for simplicity: the two points in the two images come from the same space point if I(x) = I′(x′). If the transformation between the two images is parameterized by T(·), the fundamental equation becomes I(T(x)) = I′(x). For a very small or infinitesimal displacement, T : x → x + Δx. Moreover, if we assume that the image intensity function I is smooth, we can Taylor expand it using only the first-order term,

I(x + Δx) ≈ I(x) + g^T Δx,
where g is the gradient of the image intensity function. The intensity function is usually pre-smoothed by a Gaussian, I_σ(x, y) = I(x, y) ∗ G(x, y; σ). The scale σ of G(x, y; σ) is fixed for the time being; we will return to it when we discuss scale. We then obtain

g^T Δx = e = I′(x) − I(x),

where g^T = (∂(I ∗ G)/∂x, ∂(I ∗ G)/∂y). This is one scalar equation, which is impossible to solve for the unknown vector Δx^T = (d_x, d_y). In other words, only the component along the gradient direction can be computed, not the component perpendicular to the gradient. This is the well-known aperture problem in perception [134]. This is true for a single pixel, but we can track or match an image patch, a collection of pixels in a neighborhood W, against another image I′ with a common unknown transformation T(·) such that I(T(x)) = I′(x) for all x ∈ W. Now that there are more equations than unknowns, we can minimize a (weighted) squared-error functional over the unknown transformation T,

e(T) = ∫_{x∈W} ∥I(T(x)) − I′(x)∥² w(x) dx,

where the integral is over a window W around the point. The weight function w(·) can be either the constant 1 or a Gaussian-like function. For the simplest transformation, a pure displacement T : x → x + d, we have

e(d) = ∫_{x∈W} ∥I(x + d) − I′(x)∥² w(x) dx.

Taylor expanding the term I(x + d) to first order and differentiating e(d) with respect to d for the minimum, we obtain

H d = e,   (4.1)

where e is the residual error vector ∫_{x∈W} (I′ − I) g w(x) dx, g is the gradient of the image I, and

H = ∫_{x∈W} g g^T w(x) dx.

This is a system of two linear equations in the two unknowns of the vector d. As in [226, 201], a feature point is defined a posteriori as a pixel at which there is a reliable solution to Equation 4.1, that is, at which

$$\mathbf{H} = \begin{pmatrix}\int w\, g_x^2 & \int w\, g_x g_y\\ \int w\, g_x g_y & \int w\, g_y^2\end{pmatrix}$$

is well-conditioned.
Intuitively, H measures the intensity variation, or textureness, in the vicinity of the pixel. The two eigenvalues λ_min and λ_max of H characterize the stability of the system. There are two implications. First, the conditioning of H in Equation 4.1 defines an a posteriori criterion for deciding which pixels are good feature points. Tomasi and Kanade [226] suggest choosing all pixels whose smallest eigenvalue λ_min is sufficiently large. This is de facto a feature detector, or detector of points of interest. A small image patch centered at the feature point can be taken as its local descriptor, and a distance between two such descriptors, such as ZNCC, can be used to determine the correspondence of two feature points from two different images. Second, for each of these feature points, we iteratively solve Equation 4.1, each iteration updating the displacement d in a gradient-descent style minimization. This is the tracking of feature points in successive, usually quite close, images, known as the popular KLT tracker.
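One KLT update on a single window can be written down directly from Equation 4.1; the sketch below (function name ours) uses a constant weight, integer window placement, and no pyramid or iteration, all of which a real tracker would add.

```python
import numpy as np

def klt_step(I, Ip, x, y, half=7):
    """One Lucas-Kanade update: solve H d = e on a window.

    I, Ip: the two grayscale images as float arrays; (x, y) = (column, row)
    of the window center. Returns the displacement and the eigenvalues of H.
    """
    ys = slice(y - half, y + half + 1)
    xs = slice(x - half, x + half + 1)
    gy, gx = np.gradient(I)                       # image gradients
    gxw, gyw = gx[ys, xs].ravel(), gy[ys, xs].ravel()
    H = np.array([[np.sum(gxw * gxw), np.sum(gxw * gyw)],
                  [np.sum(gxw * gyw), np.sum(gyw * gyw)]])
    r = (Ip[ys, xs] - I[ys, xs]).ravel()          # residual I' - I
    e = np.array([np.sum(r * gxw), np.sum(r * gyw)])
    # The point is a good feature when H is well conditioned.
    d = np.linalg.solve(H, e)
    return d, np.linalg.eigvalsh(H)               # displacement, (lmin, lmax)
```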
4.2.2 Matching corners Now suppose that we are matching the image against itself, so I′ = I, and the functional is the weighted least squares of matching errors,

e(d) = ∫_{x∈W} ∥I(x + d) − I(x)∥² w(x) dx.

The purpose now is not to search for an unknown transformation d, since e = 0 when d = 0, but to look at the variation of e in the vicinity of d = 0. Taylor expanding e, and noting that both the function and its first-order term vanish at 0, we obtain

e(d) ≈ (1/2) d^T H d.

For a fixed small value of e, this equation defines an ellipse at x in the image plane. The two eigenvalues of H are inversely related to the lengths of the major and minor axes of the ellipse: the smaller the eigenvalue, the longer the axis. When λ_min and λ_max are both small, the image intensity at x is almost constant; when both are large, it varies significantly in all directions; when λ_max ≫ λ_min, it changes most dramatically in the direction of the eigenvector of λ_max, and the pixel is an 'edge' point. The second-order Taylor approximation above leads to the exact second-derivative Hessian, which is usually approximated by the quasi-Hessian, the positive semi-definite g g^T, because of the quadratic form of the functional:

H ≈ ∫_{x∈W} g g^T w(x) dx.

These two formulations both lead to the same H for each point x. The Hessian matrix is also called the information matrix, whose inverse is the covariance matrix.
Though they are of the same form, they arise differently: one comes from the first-order development of the image, while the other is an approximation to the second-order development of self-matching. Harris and Stephens [74] proposed to compute for each pixel the response

$$\det(\mathbf{H}) - k\, \mathrm{trace}^2(\mathbf{H}) = \lambda_{\min}\lambda_{\max} - k(\lambda_{\min} + \lambda_{\max})^2$$

instead of the explicit eigenvalues. This response is then thresholded to keep only the sufficiently strong ones. Finally, non-maxima of the responses are suppressed within the eight-neighborhood. This is the Harris-Stephens corner detector, which gained popularity during the development of vision geometry. Usually these points are not located exactly at corners, so they are better called points of interest, defined a priori to maximize the auto-correlation score. A corner usually takes the simple intensity patch around the point as its descriptor, and a simple distance between two descriptors can be used to match two corners in two different images.
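As an illustration, the Harris-Stephens response can be computed per pixel from the smoothed structure tensor as in the following hedged sketch; the derivative scale, the integration scale and the constant k = 0.04 are customary choices, not values prescribed by the text.

import numpy as np
from scipy.ndimage import gaussian_filter

def harris_response(I, sigma_d=1.0, sigma_i=2.0, k=0.04):
    # Harris-Stephens response det(H) - k * trace(H)^2 at every pixel.
    # sigma_d: derivative (natural) scale, sigma_i: integration (artificial) scale.
    Is = gaussian_filter(I.astype(float), sigma_d)     # pre-smoothing
    gy, gx = np.gradient(Is)
    # entries of H, integrated with a Gaussian weight
    hxx = gaussian_filter(gx * gx, sigma_i)
    hyy = gaussian_filter(gy * gy, sigma_i)
    hxy = gaussian_filter(gx * gy, sigma_i)
    det = hxx * hyy - hxy * hxy
    tr = hxx + hyy
    return det - k * tr * tr    # thresholding and non-maxima suppression follow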
4.2.3 Discussions

The Lucas-Kanade tracking equation has several ramifications. First, it leads to one definition of good feature points [226, 201], namely those with a well-conditioned H. This is essentially the same detector of points of interest or corners as in [74], which was motivated by improving Moravec's points of interest [147], which in turn come from a discretized version of the self-matching functional e(T), evaluated at the displacements d ∈ {(±1|0, ±1|0)} with I' = I. Förstner [58] proposed similar operators with more statistical motivation. He distinguishes two scales: the natural scale σ1, used to describe the blurring process I ∗ Gσ1, and the artificial scale σ2, used to integrate the nonlinear function ∇I∇Iᵀ of I, leading to ∇I∇Iᵀ ⊗ Gσ2. All these detectors of points of interest [58, 74] differ only in their computational treatment of the eigensystem of H: Harris and Stephens used det − k trace² = λ1λ2 − k(λ1 + λ2)²; Tomasi and Kanade suggested min(λ1, λ2) > λ; and Förstner's measure also took the scale σ into account. Second, it suggests that a dense correspondence between frames is inherently ill-posed, since H is not well-conditioned everywhere. In other words, there are pixels at which H is close to singular. The impossibility of dense correspondence justifies the introduction of the quasi-dense correspondence [119] as a more achievable goal in the coming chapters. Thirdly, it implies that the feature-point based approach, now the mainstream, is, by its very definition of the infinitesimal expansion d for tracking/matching and self-matching, only good for close frames. This indeed reflects the actual practice in the representative systems implemented in [55, 169, 155]. Although a more general transformation than the translational displacement could be introduced into the development of the matching and self-matching
equations without theoretical difficulty [234], it often leads to inconclusive trade-offs between the invariance and the rarity of the descriptors [234].
4.3 Scale invariance

4.3.1 Invariance and stability

By their definition and derivation in both tracking and matching, the points of interest are invariant to displacement at the fixed scale at which the detection is carried out. Although the transformation does not include any rotation in the derivation, H is isotropic and therefore rotation invariant within minor angular variations. Since the computation of H involves only derivatives of the image intensity, it is also invariant to a linear transformation of the intensity function. This is better than we could have expected. It remains that H is sensitive to the scale: a corner at a smaller scale may become an edge at a larger scale. When the scale matters in the image collection, we need different points of interest, which leads to SIFT.
4.3.2 Scale, blob and Laplacian

The detection of scale-invariant features can be performed by searching for stable features across all possible scales, i.e., in the scale-space. The scale-space representation of an image is a one-parameter family of smoothed images obtained by convolving the image with Gaussians G(x, y; σ) of different sizes σ. The choice of the Gaussian function is motivated and justified by many desired properties of the scale-space [101]. The feature points can be detected in this scale-space representation, but one key issue is how to select the scale of each feature point. The Laplace operator is a second-order differential operator; the Laplacian is the trace of the (true) Hessian, not of the quasi-Hessian of the squared function. The Laplacian of a pre-smoothed image Iσ is

$$\nabla^2 I_\sigma(x, y) = \frac{\partial^2 I_\sigma}{\partial x^2} + \frac{\partial^2 I_\sigma}{\partial y^2} = \nabla^2 G * I(x, y).$$
The Laplacian of Gaussian at a given scale σ is a blob detector, because it gives strong positive/negative responses for dark/bright blobs of size σ. We see that the size of a blob matches our intuition of the scale of a point. If the Laplacian of Gaussian is computed at a single scale only, we are not able to differentiate the true size of the blob structures of the image from the size of the pre-smoothing Gaussians. But it is possible to make it scale-invariant with the scale-normalized Laplacian operator σ²∇²G. We can automatically detect both the scale and the blobs if we search for the points that are simultaneously local maxima or
minima in the scale-space of the image [121]. Note that it is the zero-crossing of the Laplacian of Gaussian, not its direct response, that has traditionally been used to detect edge points. In practice, we build a three-dimensional discrete scale-space volume of the Laplacian of Gaussian LOG(x, y; σ) for an input image I(x, y). A point in this scale-space volume is regarded as a dark/bright blob if its response is a local minimum or maximum among its 26 neighbors. The Laplacian operator is covariant with translations, rotations and rescalings of the image. Thus, if a scale-space maximum is attained at a point (x, y, t), then under a rescaling of the image by a scale factor s there will be a scale-space maximum at (sx, sy, s²t) in the rescaled image [121]. This practically useful property implies that, beyond blob detection by the Laplacian, local maxima and minima of the scale-normalized Laplacian can more importantly be used for scale selection in different contexts, such as corner detection, scale-adaptive feature tracking [13], the scale-invariant feature transform [127], and other image descriptors for image matching and object recognition.
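A minimal sketch of this blob detection is given below: it stacks scale-normalized LoG responses into a (scale, y, x) volume and keeps the 26-neighbor extrema; the particular σ values and the response threshold are illustrative assumptions, not values taken from the text.

import numpy as np
from scipy.ndimage import gaussian_laplace, maximum_filter, minimum_filter

def log_blobs(I, sigmas=(1.6, 2.3, 3.2, 4.5, 6.4), thresh=0.03):
    # Detect dark/bright blobs as 26-neighbour extrema of sigma^2 * LoG.
    I = I.astype(float)
    # scale-normalised Laplacian of Gaussian, stacked into a (s, y, x) volume
    vol = np.stack([s * s * gaussian_laplace(I, s) for s in sigmas])
    is_max = (vol == maximum_filter(vol, size=3)) & (vol > thresh)
    is_min = (vol == minimum_filter(vol, size=3)) & (vol < -thresh)
    s_idx, ys, xs = np.nonzero(is_max | is_min)
    # each detection carries its position and the selected scale
    return [(x, y, sigmas[s]) for s, y, x in zip(s_idx, ys, xs)]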
4.3.3 Recognizing SIFT

Scale-invariant keypoints

The scale-invariant keypoints developed in the Scale-Invariant Feature Transform (SIFT) by Lowe [127] are essentially the scale-space maxima or minima of the scale-normalized Laplacian of Gaussian. But several innovations make it more like a point of interest than a blob. The first is the adoption of the well-known approximation of the Laplacian of Gaussian by the difference of Gaussians to speed up the computation. Second, neither LOG nor DOG blobs necessarily make highly selective features, since these operators also respond to edges; indeed, they have long served as a family of edge detectors. To improve the selectivity of the detected keypoints in SIFT, Lowe introduces an additional post-processing stage, in which the eigenvalues of the Hessian of the response image at the detection scale are examined in a similar way as in the Harris-Stephens operator. If the ratio of the eigenvalues is too high, the local image is regarded as too edge-like and the feature is rejected. Third, the keypoint position is refined by fitting a quadratic function.
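The edge-rejection test can be written without computing the eigenvalues explicitly, since only their ratio matters. The sketch below uses a finite-difference Hessian of the response image D; the ratio bound r = 10 is the value suggested by Lowe, and both are stated here as assumptions rather than as the book's implementation.

import numpy as np

def is_edge_like(D, x, y, r=10.0):
    # Reject a keypoint if the 2x2 Hessian of the response image D at (x, y)
    # has an eigenvalue ratio above r (too edge-like). Uses
    # trace^2/det >= (r+1)^2/r, equivalent to lam_max/lam_min >= r.
    dxx = D[y, x + 1] - 2.0 * D[y, x] + D[y, x - 1]
    dyy = D[y + 1, x] - 2.0 * D[y, x] + D[y - 1, x]
    dxy = 0.25 * (D[y + 1, x + 1] - D[y + 1, x - 1]
                  - D[y - 1, x + 1] + D[y - 1, x - 1])
    tr, det = dxx + dyy, dxx * dyy - dxy * dxy
    if det <= 0:                     # eigenvalues of different signs: reject
        return True
    return tr * tr / det >= (r + 1.0) ** 2 / r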
Descriptor

Finally, each keypoint is assigned a dominant orientation from the orientation histogram of a region centered on the keypoint. For locations with multiple peaks of similar magnitude, multiple keypoints are created at the same location and scale but with different orientations. The keypoint is rotated to align with the dominant orientation at the given scale. A local image descriptor is then computed
from the gradient histograms of a 16 × 16 neighborhood partitioned into 4 × 4 = 16 patches; with the histograms quantized into 8 directions, this gives a local descriptor in a 16 × 8 = 128 dimensional feature space. The matching is a search for the nearest neighbor using the Euclidean distance between two descriptors, or, if the database is very large, any approximate nearest neighbor searching method [3].
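A minimal brute-force matcher over such 128-dimensional descriptors might look as follows; the ratio check against the second-nearest neighbor is Lowe's usual filter and is included here as an optional assumption, while a large database would call for the approximate nearest neighbor search mentioned above.

import numpy as np

def match_descriptors(desc1, desc2, ratio=0.8):
    # Match each row of desc1 (n1 x 128) to its Euclidean nearest neighbour in
    # desc2 (n2 x 128); keep a match only if the best distance is clearly
    # smaller than the second best (Lowe's ratio test).
    matches = []
    for i, d in enumerate(desc1):
        dist = np.linalg.norm(desc2 - d, axis=1)   # distances to all candidates
        j, j2 = np.argsort(dist)[:2]
        if dist[j] < ratio * dist[j2]:
            matches.append((i, j))
    return matches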
4.4 Bibliographic notes

We mainly followed the original formulation of image registration by Lucas and Kanade [129], and the later one by Tomasi and Kanade [227], to characterize points of interest. In the early days of computer vision, point features, which are more corner-like features, were overshadowed by edges, possibly because of the primal sketches of the vision paradigm advocated by Marr in [134]. Point features took over edges when many vision researchers focused on 3D vision geometry, for which points are natural. In this period, Harris-like points of interest from [74] became popular. The scale was not a major challenge as far as 3D reconstruction was concerned, since the scales of overlapping images used for 3D reconstruction usually do not change abruptly. The stability of points of interest has been evaluated thoroughly in [143], which stimulated much of the recent work on invariant feature detection and description. The traditional descriptor is simply the intensities of the image patch around the point; the stability studies led to the use of more complex features in the patch, such as Koenderink's image jets [102, 143]. This suggested that feature points and descriptors can handle images related by more severe transformations. Detectors and descriptors were then introduced for databases of images for object recognition purposes. The SIFT developed by Lowe [127] is probably the most successful development in this category. The success of SIFT in structure from motion from a collection of unstructured pictures in [207] confirmed its effectiveness in image matching with significant scale variations. Nevertheless, feature points are still more efficient for most images captured for reconstruction purposes with only mild scale changes.
Chapter 5
Structure from Motion
This chapter presents the computational framework of 3D reconstruction from 2D image points or structure from motion. We first present the standard sparse approach, which is sufficient for computing camera poses or camera tracking, but insufficient for broad modeling and visualization applications that require a higher density of reconstructed points. We then introduce a resampling scheme to obtain point correspondences at sub-pixel level, and develop a unique quasi-dense approach to structure from motion. The quasi-dense approach results in a more robust and accurate geometry estimation using fewer images than the sparse approach by delivering a higher density of 3D points. The modeling applications presented in Part III of the book are based on this quasi-dense reconstruction.
5.1 Introduction

'Structure from Motion' (SFM) designates the computation of the camera poses and 3D points from a sequence of at least two images. It is also called 3D reconstruction. Note that both the structure, i.e., the 3D points, and the motions, i.e., the camera poses, are computed simultaneously from image points; it does not mean that the unknown structure is computed from known motions, as the term might suggest. It is more precisely 'structure and motion' rather than 'structure from motion'. Structure from motion simply starts from algebraic solutions using the minimal data studied in Chapter 3, then ends with a numerical optimization using all the data. The first challenge is to combat the inevitable noise and errors of image measurements. 'Small' noise is absorbed by the numerical optimization methods, and 'large' errors are handled by robust statistics methods. The second challenge is that we want more and better feature points, which is the motivation for the development of the unique quasi-dense approach.
5.1.1 Least squares and bundle adjustment

A Newton-like method for minimization is based on a quadratic model of the function to be minimized, f(x) = f(x₀) + gᵀd + (1/2)dᵀHd, and iteratively solves the linear system Hd = −g. An excellent textbook on the subject is [56]. Most of our vision geometry problems are formulated as the optimization of a distance or residual function r(x), for which a maximum likelihood solution to the parameters x is sought. This is nonlinear least squares, minimizing f(x) = ||r(x)||². If J denotes the Jacobian matrix of first derivatives of r with respect to x, then g = −2Jᵀr and H ≈ 2JᵀJ, so it is equivalent to iteratively solving the linear normal equation JᵀJd = Jᵀr. The approximation of the Hessian by JᵀJ is valid only when r is small. This amounts to making a linear approximation to r, but not a linear approximation to the objective function f = rᵀr, so the method is in between first-order and second-order. The Levenberg-Marquardt method replaces JᵀJ by (1 + λᵢ)JᵀJ, where typically at each iteration λᵢ = λᵢ₋₁/10 and λ₀ = 10⁻³. The factor 1 + λᵢ improves the stability of the normal equation. This regularization can be used as is for most of the problems encountered in this book when the problem size is small, see for instance
the implementation provided in [172]. But for a real structure from motion with thousands of images and millions of points, the linear solver becomes too expensive at each iteration. A sparse linear solver that takes the structure of the given parameter space into account has to be used. In structure from motion, we optimize all the 3D point coordinates xᵢ and the camera parameters Pⱼ by minimizing the image re-projection errors

$$f(\mathbf{x}_i, \mathbf{P}_j) = \sum_{ij} \|\mathbf{u}_{ij} - \mathbf{P}_j \mathbf{x}_i\|^2,$$
which is called bundle adjustment, as the method of least squares has been named adjustment theory in the applied sciences [213]. Bundle adjustment as a global optimization procedure for Euclidean structure from motion has long been established as the standard in photogrammetry [15, 206], and it is adopted as a standard in computer vision as well. Bundle adjustment for projective reconstruction was introduced in [146, 145]. The implementation strategies of bundle adjustment for the cost function, the parametrization and the linear solver vary a lot [235]. The cost function is always some quadratic of the total re-projection error, with an infinite cost for temporary outliers. The optimization procedure is the standard Levenberg-Marquardt method with a Gauss-Newton step [172].
Parametrization

The global parametrization is a gauge-free over-parametrization [77, 140]. The state vector is partitioned into camera and structure parameters. In the projective case, a camera is parametrized by 11 of its 12 homogeneous entries, locally normalizing its largest absolute value to 1. In the Euclidean case, each camera is parametrized by its six extrinsic parameters (the camera center tx, ty, tz, and the three local Euler angles δx, δy, δz). In addition, all cameras are parametrized by their common focal length f and one optional common radial distortion ρ. For both the projective and the Euclidean cases, each 3D point is parametrized by three of its four homogeneous entries, normalizing its largest absolute value to one.
Sparseness pattern

In our quasi-dense reconstruction case, the points overwhelmingly outnumber the cameras. Camera reduction by eliminating the structure parameters, as suggested in photogrammetry [15, 206] and computer vision [77, 235, 80, 47], is therefore a natural choice for the implementation of both projective and Euclidean bundle adjustments. A typical sparseness pattern of such a bundle adjustment is illustrated in Figure 5.1.
Fig. 5.1 A typical sparseness pattern in a bundle adjustment (courtesy of M. Lhuillier).
Algorithm 8 (Bundle-adjustment) Given the point correspondences in n views, u^1 ↔ u^2 ↔ · · · ↔ u^n, and an initial estimate of xᵢ and Pⱼ, refine xᵢ and Pⱼ by minimizing the re-projection errors.
1. Choose a parametrization for xᵢ and Pⱼ.
• For projective reconstruction, the homogeneous entries are re-scaled each time by the largest component of the current values.
• For Euclidean reconstruction, a point is the inhomogeneous (xᵢ, yᵢ, zᵢ), and a camera matrix is decomposed into the intrinsic parameters and the extrinsic rotation and translation. The rotation can be parametrized by the three Euler angles.
2. Solve the entire system using the Levenberg-Marquardt algorithm, in which the linear solver has to take the sparseness pattern into account. There are a few publicly available bundle-adjustment libraries [126].
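The sketch below illustrates one possible way to set up such a sparse bundle adjustment in Python with SciPy's trust-region least-squares solver; it over-parametrizes each projective camera by all 12 entries and ignores the normalization and gauge issues discussed above, so it is an illustration of the sparsity handling rather than the implementation of the libraries referred to in [126].

import numpy as np
from scipy.optimize import least_squares
from scipy.sparse import lil_matrix

def reproj_residuals(params, n_cam, n_pts, cam_idx, pt_idx, uv):
    # Stacked re-projection errors u_ij - proj(P_j x_i) for all observations.
    P = params[:12 * n_cam].reshape(n_cam, 3, 4)
    X = params[12 * n_cam:].reshape(n_pts, 3)
    Xh = np.hstack([X[pt_idx], np.ones((len(pt_idx), 1))])   # homogeneous points
    x = np.einsum('nij,nj->ni', P[cam_idx], Xh)              # project
    return (uv - x[:, :2] / x[:, 2:3]).ravel()

def bundle_adjust(P0, X0, cam_idx, pt_idx, uv):
    n_cam, n_pts = len(P0), len(X0)
    x0 = np.hstack([P0.ravel(), X0.ravel()])
    # Jacobian sparsity: each observation touches one camera and one point only
    A = lil_matrix((2 * len(uv), x0.size), dtype=int)
    for k, (j, i) in enumerate(zip(cam_idx, pt_idx)):
        A[2 * k:2 * k + 2, 12 * j:12 * j + 12] = 1
        A[2 * k:2 * k + 2, 12 * n_cam + 3 * i:12 * n_cam + 3 * i + 3] = 1
    res = least_squares(reproj_residuals, x0, jac_sparsity=A, method='trf',
                        args=(n_cam, n_pts, cam_idx, pt_idx, uv))
    return (res.x[:12 * n_cam].reshape(n_cam, 3, 4),
            res.x[12 * n_cam:].reshape(n_pts, 3))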
5.1.2 Robust statistics and RANSAC

In least squares (LS) analysis, we assume that the data are good but subject to a Gaussian-like error distribution. For example, the point correspondences in different images are assumed to be correct, and only the point locations may be slightly perturbed. Obviously, a bad datum, an outlier, is a serious threat to least squares methods. There are two solutions to outliers [188]. 'The first approach is to construct
so-called regression diagnostics. Diagnostics are certain quantities computed from the data with the purpose of pinpointing influential points, after which these outliers can be removed or corrected, followed by an LS analysis on the remaining cases. The other approach is robust regression, which tries to devise estimators that are not so strongly affected by outliers. Diagnostics and robust regression really have the same goals, only in the opposite order: When using diagnostic tools, one first tries to delete the outliers and then to fit the "good" data by LS, whereas a robust analysis first wants to fit a regression to the majority of the data and then to discover the outliers as those points which possess large residuals from that robust solution.' from [188]. The breakdown point for LS is 1/n, which tends to zero with increasing n. The LS method is thus extremely sensitive to outliers, i.e., not robust. Replacing least squares by the Least Median of Squares (LMS) increases the breakdown point, so that the method becomes more robust.
Least median of squares

The computation of the LMS is not straightforward: there is no formula for the LMS estimators like those for the LS estimators. The algorithm proceeds by repeatedly drawing subsamples of p different observations. For each such subsample, we determine a trial solution θ_J through the p data points. For each θ_J, we obtain the corresponding LMS objective function with respect to the whole data set; that is, the squared residual (yᵢ − xᵢθ_J)² is computed for every i and the median medᵢ(yᵢ − xᵢθ_J)² is taken. Finally, we retain the trial estimate for which this value is minimal. But the key question is how many subsamples J₁, ..., J_m we should draw. In principle, we should repeat the above procedure for all possible subsamples of size p, of which there are C_n^p. Unfortunately, C_n^p increases quickly with n and p, and this easily becomes infeasible for many applications. Instead, we perform a certain number of random samplings such that the probability that at least one of the m subsamples is free of outliers is close to one. A subsample is 'good' if it consists of p good data points of the sample, which may be contaminated by at most a fraction ε of bad data points. Assuming that n/p is large, this probability is

1 − (1 − (1 − ε)^p)^m.
(5.1)
By requiring that this probability be close to one, say at least 0.99, we can determine the value of m for given values of p and ε.
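For illustration, inverting Equation 5.1 for m gives the following small helper; the numbers it produces (e.g., 588 samples for p = 7 and ε = 0.5 at confidence 0.99) match the seven-point example given later in this section.

import math

def num_samples(p, eps, prob=0.99):
    # Smallest m with 1 - (1 - (1 - eps)^p)^m >= prob (Equation 5.1).
    good = (1.0 - eps) ** p            # chance that one sample is outlier-free
    return math.ceil(math.log(1.0 - prob) / math.log(1.0 - good))

# e.g. num_samples(7, 0.5) == 588 for the seven-point algorithm with 50% outliers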
RANSAC

RANdom SAmple Consensus (RANSAC), which originated in computer vision for robust pose estimation [54], is very close to LMS in spirit. 'The RANSAC procedure is opposite to that of conventional smoothing techniques: Rather than using
as much of the data as possible to obtain an initial solution and then attempting to eliminate the invalid data points, RANSAC uses as small an initial data set as feasible and enlarges this set with consistent data when possible.' [54]. The algorithm proceeds by repeatedly drawing subsamples of p different observations. For each such subsample, we determine a trial solution θ_J through the p points. Up to this point it is similar to LMS, but it differs afterwards. For each θ_J, we compute the error residual of every data point, then, with a pre-selected threshold, we determine whether each data point is an inlier or an outlier. The final solution is the sample that has the largest number of inliers!

Algorithm 9 (RANSAC algorithm) Given a solver of input size p (one of the algorithms in Chapter 3) and n data points.
1. Determine the number of samples to draw from the input size p using Equation 5.1.
2. Compute a trial solution θ_J for each sample with the given solver.
3. Classify each data point as an inlier or an outlier according to its residual error with respect to a chosen threshold.
4. Retain the solution that has the largest number of inliers.

Example 6 We use the seven-point Algorithm 3 to establish point correspondences in two views. The input size of the seven-point algorithm is seven, so p = 7. If we assume that the proportion of outliers is ε = 50% and we require a probability of 0.99, we need to draw 588 samples.
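A generic skeleton of Algorithm 9 might look as follows; `solver` and `residual` stand for one of the minimal algorithms of Chapter 3 and its error measure, the threshold is application dependent, and `num_samples` is the helper from the previous sketch, so this is an illustrative outline rather than a prescribed implementation.

import numpy as np

def ransac(data, solver, residual, p, eps=0.5, prob=0.99, thresh=1.0, rng=None):
    # Generic RANSAC: draw minimal samples of size p and keep the model with
    # the largest consensus set of inliers (residual below `thresh`).
    # `data` is assumed to be a numpy array with one datum per row.
    rng = rng or np.random.default_rng()
    m = num_samples(p, eps, prob)                  # from Equation 5.1 (see above)
    best_model, best_inliers = None, np.zeros(len(data), dtype=bool)
    for _ in range(m):
        sample = data[rng.choice(len(data), size=p, replace=False)]
        model = solver(sample)                     # trial solution from minimal data
        if model is None:
            continue
        inliers = residual(model, data) < thresh   # classify every datum
        if inliers.sum() > best_inliers.sum():
            best_model, best_inliers = model, inliers
    return best_model, best_inliers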
Remarks

• RANSAC is similar to LMS both in idea and in implementation, except that RANSAC needs a threshold to be set by the user for the consistency check, whereas this threshold is automatically computed in LMS.
• LMS cannot handle the case where the percentage of outliers is higher than 50%. The solution is to use the proportion of outliers as a prior and to replace the median by a more suitable rank.
• In practice, RANSAC works better than LMS in most applications because of the integration of the data prior into the threshold. Computationally, RANSAC is more efficient than LMS because it can exit the sampling loop once a consensus has been reached.
5.2 The standard sparse approach

It is now standard to compute both the cameras and a sparse set of 3D points from a given sequence or collection of overlapping images. We first describe the two building blocks, the sparse two-view and three-view algorithms, in this section, and then describe the algorithms for a sequence and for a collection in the following sections.
Algorithm 10 (Sparse two-view) Given two overlapping images captured at different viewpoints, compute a list of point correspondences and the fundamental matrix of the two views.
1. Detect feature points in each image.
2. Compute an initial list of point correspondences between the two images, either by normalized correlation, or by an approximate nearest neighbor search of the SIFT descriptors associated with the feature points.
3. Randomly draw seven points and use RANSAC Algorithm 9 with the seven-point Algorithm 3 plugged in to determine the inliers and outliers of the point correspondences.
4. Optimize the fundamental matrix with the inliers of the point correspondences.
• Parametrize the fundamental matrix with seven parameters, where one row is a linear combination of the other two, initialized by the current estimate of the fundamental matrix.
• Define the geometric error function d(F), which is the sum of the distances between a point and its corresponding epipolar line, taken symmetrically in the two images, that is, d(F) = Σᵢ (dᵢ² + d′ᵢ²), where dᵢ = dist(u′ᵢ, Fuᵢ) and d′ᵢ = dist(uᵢ, Fᵀu′ᵢ) (a numerical sketch of this distance follows Algorithm 11 below).
• Minimize d(F) using the Levenberg-Marquardt method.
5. Retain the inliers with respect to the optimized fundamental matrix as the final point correspondences.

For the calibrated two-view algorithm, we replace the seven-point algorithm by the five-point algorithm.

Algorithm 11 (Sparse three-view) Given three overlapping images, compute a projective reconstruction of the three cameras and the corresponding points.
1. Detect feature points in each image.
2. Compute an initial list of point correspondences between the first and second images, and another between the second and third images. The correspondences can be computed either by normalized correlation, or by an approximate nearest neighbor search of the SIFT descriptors associated with the feature points.
3. Establish an initial list of point correspondences for the triplet as the intersection of the correspondences of the two pairs of images via the common second image.
4. Randomly draw six points to run RANSAC Algorithm 9 with the six-point Algorithm 6 plugged in.
5. Optimize the three-view geometry with the bundle-adjustment Algorithm 8 to obtain a complete projective reconstruction of the three camera matrices and the inlier point correspondences.
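The symmetric point-to-epipolar-line distance used as the geometric error in step 4 of Algorithm 10 can be sketched as follows, assuming inhomogeneous n × 2 point arrays; this illustrates only the error measure, not the full optimization.

import numpy as np

def symmetric_epipolar_distance(F, u1, u2):
    # Sum over i of dist(u2_i, F u1_i)^2 + dist(u1_i, F^T u2_i)^2 for the
    # fundamental matrix F and corresponding points u1, u2 (n x 2 each).
    x1 = np.hstack([u1, np.ones((len(u1), 1))])   # homogeneous coordinates
    x2 = np.hstack([u2, np.ones((len(u2), 1))])
    l2 = x1 @ F.T                                 # epipolar lines in image 2
    l1 = x2 @ F                                   # epipolar lines in image 1
    d2 = np.sum(x2 * l2, axis=1) ** 2 / (l2[:, 0] ** 2 + l2[:, 1] ** 2)
    d1 = np.sum(x1 * l1, axis=1) ** 2 / (l1[:, 0] ** 2 + l1[:, 1] ** 2)
    return np.sum(d1 + d2)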
Remarks

• The optimization of the two-view geometry could use the general bundle adjustment for n views with n = 2, but it is more efficient to optimize only the fundamental matrix and the point correspondences instead of the 3D reconstruction, thanks to the nice minimal parametrization of the fundamental matrix.
• Two views are the minimum for a 3D reconstruction, but the intrinsic point-to-line geometric constraint between two views leaves potential ambiguities in the point correspondences, which can only be resolved heuristically from local image information. Three views are the minimum number of views required for unique point correspondences. It is therefore recommended to use the three-view algorithm rather than the two-view algorithm whenever a third view is available.
• There is no algorithm for the calibrated three views per se. For calibrated three views, one common practice is to first compute a calibrated two-view reconstruction with the five-point Algorithm 5, and then resect the third view with the three-point pose Algorithm 2.
5.2.1 A sequence of images

There are two strategies for obtaining an optimal reconstruction of a sequence. The first is a batch solution used when the entire sequence is available; the second is an incremental updating solution, which accepts the arrival of a new view.

Algorithm 12 (Sparse SFM) Given a sequence of overlapping images, compute a reconstruction of the camera matrices and the corresponding points of the sequence.
1. Decomposition. Decompose the sequence of n images into all consecutive n − 1 pairs (i, i + 1) and n − 2 triplets (i, i + 1, i + 2).
2. Pair. Compute a list of point correspondences for each pair with the sparse two-view Algorithm 10.
3. Triplet. Compute a reconstruction for each triplet with the sparse three-view Algorithm 11.
4. Merge. Reconstruct a longer sequence [i..j] by merging two shorter sequences [i..k + 1] and [k..j] with two overlapping frames, k and k + 1, where k is the median of the index range [i..j].
a. Merge the point correspondences of the two subsequences using the two overlapping images.
b. Estimate the space transformation (a homography for the uncalibrated case and a similarity for the calibrated case) between the two common cameras using linear least squares.
c. Apply the space transformation to one of the two subsequences to bring both of them into the same coordinate frame.
d. Optimize the sequence [i..j] with all merged corresponding points using the bundle-adjustment Algorithm 8.
Algorithm 13 (Incremental sparse SFM) Given a sequence of overlapping images, compute a reconstruction of the camera matrices and corresponding points of the sequence.
1. Initialize a 3D reconstruction from the first two (Algorithm 10) or the first three images (Algorithm 11).
2. For each new ith image in the sequence, compute the point correspondences using Algorithm 11 with the triplet i − 2, i − 1 and i.
3. Merge the triplet with the current 3D reconstruction by estimating a space transformation (a homography for the uncalibrated case and a similarity for the calibrated case).
4. Bundle-adjust the current system of cameras and points.
Remarks

The presented incremental sparse SFM is one of many possible variants. We usually favor merging two subsequences with two overlapping views, as the transformation between the two subsequences can then be established from the camera matrices. If we merge two subsequences with only one overlapping view, the transformation has to be established through resection with the reconstructed points, which leaves many different choices of points.
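One standard way to realize the similarity transformation of the calibrated merge steps is the closed-form Procrustes/Umeyama alignment of the common reconstructed 3D points or camera centers, sketched below as an assumption about the implementation rather than the book's exact procedure.

import numpy as np

def similarity_from_points(A, B):
    # Find s, R, t with B ~ s * R @ A + t for corresponding 3D points
    # A, B (n x 3), using the closed-form Procrustes/Umeyama solution.
    ca, cb = A.mean(axis=0), B.mean(axis=0)
    A0, B0 = A - ca, B - cb
    U, S, Vt = np.linalg.svd(B0.T @ A0)                   # cross-covariance
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # keep a proper rotation
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / np.sum(A0 ** 2)
    t = cb - s * R @ ca
    return s, R, t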
5.2.2 A collection of images

If the given set of images is not naturally ordered in a linear sequence, we can extend the incremental sparse SFM to a collection of unstructured images.

Algorithm 14 (Unstructured sparse SFM)
1. Generate all pairs of images in the collection and process each with the two-view Algorithm 10, in which the SIFT keypoint detector and descriptor are used with an approximate nearest neighbor searching method.
2. Initialize a 3D reconstruction from the best pair in the collection. The criteria involve the number of point correspondences, the baseline, and the closeness to the degenerate coplanar configuration.
3. Add the most suitable image to the 3D reconstruction. The criterion involves the number of common point correspondences.
4. Bundle-adjust the entire system of cameras and points.
Remarks

The SFM from a collection of unstructured images was first introduced in [207]; in theory, it subsumes SFM from a sequence of images. It is an incremental
scheme, but with the computational expense and the difficulty of choosing an initial reconstruction and the next image to be added. The robustness and efficiency of the algorithm are still subject to further research efforts. Nevertheless, the advantage of SIFT over points of interest, due to the important scale variations in the collection, is clearly demonstrated.
5.3 The match propagation

This section presents a quasi-dense matching algorithm between images based on the match propagation principle. The algorithm starts from a set of sparse seed matches, then propagates to neighboring pixels with a best-first strategy, and produces a quasi-dense disparity map. The quasi-dense matching aims at a broad range of modeling and visualization applications that rely heavily on matching information. The algorithm is robust to outliers in the initial sparse matches thanks to the best-first strategy; it is efficient in time and space as it is only output sensitive; and it handles half-occluded areas because of the simultaneous enforcement of the newly introduced discrete 2D disparity gradient limit and the uniqueness constraint. The properties of the algorithm are discussed and empirically demonstrated, and the quality of the quasi-dense matching is validated on extensive real examples.
5.3.1 The best-first match propagation

Seed selection and initial matching

We start with a traditional sparse matching algorithm between two images, applied to the points of interest detected in each image [228, 255]. Points of interest are naturally reliable two-dimensional point features [74, 201], therefore they can handle cases of large disparity. The Zero-mean Normalized Cross-Correlation (ZNCC) is used to match the points of interest in the two images, as it is invariant to local linear radiometric changes. The correlation at point x = (x, y)ᵀ with shift Δ = (Δx, Δy)ᵀ is defined as
$$ZNCC_{\mathbf{x}}(\Delta) = \frac{\sum_i \big(I(\mathbf{x}+i) - \bar{I}(\mathbf{x})\big)\big(I'(\mathbf{x}+\Delta+i) - \bar{I}'(\mathbf{x}+\Delta)\big)}{\Big(\sum_i \big(I(\mathbf{x}+i) - \bar{I}(\mathbf{x})\big)^2 \sum_i \big(I'(\mathbf{x}+\Delta+i) - \bar{I}'(\mathbf{x}+\Delta)\big)^2\Big)^{1/2}},$$
where Ī(x) and Ī′(x + Δ) are the means of the pixel luminances over the given windows centered at x and x + Δ. After the correlation step, a simple cross-consistency check [61] is used to retain a one-to-one matching between the two images. The cross-consistency check consists of correlating pixels of the first image to the second image and, inversely, correlating those of the second to the first. Only the best matches consistent in both directions are retained. From many examples, we have found that a good compro-
mise is to definitively reject a match if ZNCC < 0.8 using 11 × 11 windows. The choice of the region of interest will be described in the experimental Section 6.6.
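For reference, the ZNCC score between two windows can be computed as in the following sketch; the 11 × 11 window (half-size 5) and the 0.8 seed threshold follow the text, while border handling is left out for brevity.

import numpy as np

def zncc(I1, I2, x1, y1, x2, y2, half=5):
    # Zero-mean normalised cross-correlation between the (2*half+1)^2 windows
    # centred at (x1, y1) in I1 and (x2, y2) in I2 (11 x 11 for half=5).
    w1 = I1[y1 - half:y1 + half + 1, x1 - half:x1 + half + 1].astype(float)
    w2 = I2[y2 - half:y2 + half + 1, x2 - half:x2 + half + 1].astype(float)
    a, b = w1 - w1.mean(), w2 - w2.mean()
    denom = np.sqrt(np.sum(a * a) * np.sum(b * b))
    return np.sum(a * b) / denom if denom > 0 else 0.0

# seed matches are kept only if zncc(...) >= 0.8 with 11 x 11 windows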
Propagation

All initial cross-checked sparse matches are sorted by decreasing correlation score and serve as seed points for concurrent propagations. At each step, the match (x, x′) composed of two corresponding pixels x and x′ with the best ZNCC score is removed from the current set of seed matches. Then we search for possible new matches in the immediate spatial neighborhood N(x, x′) precisely defined below. The ZNCC is still used for match propagation, as it is more conservative than alternatives such as the sum of absolute or squared differences in uniform regions, and more tolerant in textured areas where noise is important. We also simultaneously enforce a smoothness matching constraint called the 'discrete 2D disparity gradient limit', defined precisely below. Neighboring pixels that satisfy this limit, the uniqueness constraint, and a confidence measure defined below are considered as potential new matches. The uniqueness of the matching and the termination of the process are guaranteed by choosing only new matches that have not yet been selected.

Discrete 2D disparity gradient limit. The usual 1D disparity gradient limit along the epipolar lines has been widely used in rectified stereo matching for disambiguation [168]. We propose here a 2D extension, the discrete 2D disparity gradient limit, to deal with uncalibrated pairs of images, including rigid scenes with an inaccurate epipolar constraint as well as non-rigid scenes. This limit is said to be 'discrete' because only integer values are allowed for the disparities, which lets us impose the uniqueness constraint directly while propagating. Let

N(x) = {u, u − x ∈ [−N, N]²},  N(x′) = {u′, u′ − x′ ∈ [−N, N]²}

denote the (2N + 1) × (2N + 1) neighboring pixels of x and x′. The possible matches limited by the discrete 2D disparity gradient are given by

N(x, x′) = {(u, u′), u ∈ N(x), u′ ∈ N(x′), ||(u′ − u) − (x′ − x)||∞ ≤ ε},

and are illustrated in Figure 5.2 for N = 2, ε = 1. Our choice is to impose the most conservative non-zero disparity gradient limit, i.e., ε = 1 for integer pixel coordinates u, u′, x, x′. In the case of rectified images of a rigid scene, the usual 1D disparity gradient limit for two neighboring pixels x and u = x + (1, 0) in the same line is directly deduced from our 2D one: writing the disparities d(x) = x′ − x and d(u) = u′ − u of the matches (x, x′) and (u, u′), we obtain |d(x) − d(x + 1)| ≤ 1. The newly introduced discrete 2D disparity gradient limit therefore extends the usual 1D disparity gradient limit to uncalibrated image pairs with an inaccurate epipolar constraint and to non-rigid scenes. A minimal neighborhood size N should be used to limit bad matches at the occluding contours. The smallest neighborhood size for which the definition of the gradient limit makes sense is N = 2, i.e., the 5 × 5 neighborhood.
Fig. 5.2 Possible matches (u, u′) and (v, v′) around a seed match (x, x′) come from its 5 × 5 neighborhoods N(x) and N(x′), the smallest size for the discrete 2D disparity gradient limit. The match candidates for u (resp. v′) are within the 3 × 3 window (black framed) centered at u′ (resp. v).
Confidence measure. There is a wide variety of confidence measure definitions in the literature, and the choice depends on the application. We use a simple difference-based confidence measure,

s(x) = max{|I(x + Δ) − I(x)|, Δ ∈ {(1, 0), (−1, 0), (0, 1), (0, −1)}}.

This is less restrictive than the Moravec operator [148] and allows the propagation to walk and match along edges in spite of the aperture problem, while avoiding matching uniform areas. We forbid propagation in areas that are too uniform, i.e., with s(u) ≤ t, where t = 0.01 assuming 0 ≤ I(u) ≤ 1. More conservative measures such as those suggested for optical flow [5] are also possible.
Propagation algorithm

The propagation algorithm can be described as follows. The input of the algorithm is the set Seed of the current seed matches. The set is implemented with a heap data structure for both fast selection of the ZNCC-best match and incremental additions of seeds. The output is an injective displacement mapping Map.

Algorithm 15 (Match Propagation)
Input: Seed
Output: Map

Map ← ∅
while Seed ≠ ∅ do
  pull the ZNCC-best match (x, x′) from Seed
  Local ← ∅
  (Local stores new candidate matches enforcing the disparity gradient limit)
  for each (u, u′) in N(x, x′) do
    if s(u) > t and s(u′) > t and ZNCC(u, u′) > z then
      store (u, u′) in Local
    end-if
  end-for
  (Seed and Map store final matches enforcing the uniqueness constraint)
  while Local ≠ ∅ do
    pull the ZNCC-best match (u, u′) from Local
    if (u, *) and (*, u′) are not in Map then
      store (u, u′) in Map and Seed
    end-if
  end-while
end-while

The time complexity of this algorithm is O(n log n), where n is the final number of matched pixels, assuming that the number of initial seeds is negligible. The algorithm depends only on the number of final matches and is independent of any disparity bound: it is therefore output sensitive. The memory complexity is linear in the image size. As the search space of potential correspondences for a given pixel reduces to a very small 3 × 3 region, as shown in Figure 5.2, the matching criterion is relaxed to a smaller window and a weaker correlation score than for the seed points. We found by experiment that a match is rejected if ZNCC < z, with z = 0.5, within a 5 × 5 window. There are two benefits to the smaller ZNCC window size: minor perspective distortions are acceptable, and matching artifacts are limited around occluding contours.
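A hedged Python transcription of Algorithm 15 using a heap for the seed set is given below; `zncc` is the helper sketched earlier, `confidence` stands for the measure s(·), the parameters N = 2, ε = 1, t = 0.01 and z = 0.5 follow the text, and image-border checks are omitted for brevity.

import heapq

def propagate(I1, I2, seeds, zncc, confidence, t=0.01, z=0.5, N=2, eps=1):
    # seeds: list of (score, (x, y), (x2, y2)) initial matches.
    # Returns a dict mapping pixels of image 1 to pixels of image 2.
    heap = [(-s, p, q) for s, p, q in seeds]        # max-heap via negated scores
    heapq.heapify(heap)
    map12, map21 = {}, {}                           # enforce one-to-one matching
    while heap:
        _, (x, y), (x2, y2) = heapq.heappop(heap)
        local = []
        # candidate matches limited by the discrete 2D disparity gradient limit
        for dx in range(-N, N + 1):
            for dy in range(-N, N + 1):
                u = (x + dx, y + dy)
                for ex in range(-eps, eps + 1):
                    for ey in range(-eps, eps + 1):
                        if abs(dx + ex) > N or abs(dy + ey) > N:
                            continue                # u2 must stay inside N(x2, y2)
                        u2 = (x2 + dx + ex, y2 + dy + ey)
                        if confidence(I1, u) > t and confidence(I2, u2) > t:
                            s = zncc(I1, I2, u[0], u[1], u2[0], u2[1], half=2)
                            if s > z:
                                local.append((-s, u, u2))
        heapq.heapify(local)
        while local:                                # uniqueness constraint
            ns, u, u2 = heapq.heappop(local)
            if u not in map12 and u2 not in map21:
                map12[u], map21[u2] = u2, u
                heapq.heappush(heap, (ns, u, u2))
    return map12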
5.3.2 The properties of match propagation

The match propagation algorithm has many fine properties.
Robustness with respect to false seed matches

The robustness and stability of the algorithm are considerably improved by the global best-first propagation strategy with respect to sparse matching. Though the seed selection step seems very similar to many existing methods [228, 255] for matching points of interest using correlation, the key difference is that the propagation can rely on only a few of the most reliable seeds rather than taking a maximum number of them. This makes the algorithm much less vulnerable to the presence of match outliers in the seed selection stage. The risk of bad propagation is significantly diminished for two reasons: bad seed points have no chance of developing if they are not ranked at the top of the sorted list; and propagation from bad seed points is stopped very quickly, due to the lack of consistency in their neighborhoods, even if the bad seed points occasionally rank high in the list. In some extreme cases, one single good match of points of interest is sufficient to pro-
voke an avalanche of correct matches over the entire textured image, while keeping all other seed points, including the bad ones, undeveloped.
Fig. 5.3 The disparity maps produced by propagation with different seed points and without the epipolar constraint. Left column: automatic seed points, with the match outliers marked with a square instead of a cross. Middle column: four seed points manually selected. Right column: four seed points manually selected plus 158 match outliers with strong correlation scores (ZNCC > 0.9).
Figure 5.3 shows the disparity maps obtained from different selections of seed points to illustrate the stability of the propagation. The first example, in the left column, is produced using automatic seed points from the sparse matching of points of interest described above. The set of automatic seed points still contains match outliers, marked as a square instead of a cross for a good match. The second example, in the middle, is produced with four seed points manually selected from the set of automatic seed points. Each seed match is sufficient to provoke an avalanche of correct matches in each of the four isolated and textured areas. All the matched areas cover roughly the same surface as that obtained with the automatic seeds, and 78% of the matched areas are common between the two results. The third example, on the right, is produced with the four seed points of the second example plus 158 more seed points, all of which are match outliers but still have strong correlation scores (ZNCC > 0.9). This very severely corrupted set of seed points still gives 70% of the matched areas in common with the automatic case. This robust scenario of good seeds developing faster than bad ones is more dominant in matched areas than in occluded regions. Further, we notice from these examples that minor geometric distortion is well tolerated in the garden flowers of Figure 5.3, thanks to the small 5 × 5 ZNCC-window size used during propagation. In the untextured sky area, the propagation is stopped by the confidence measure. The near-periodic thin unmatched gaps on the trunk illustrate the effect of the uniqueness constraint.
Fig. 5.4 Two images with low geometric distortion of a small wooden house.
Fig. 5.5 Examples of propagation for the low-textured images in Figure 5.4. (a) The disparity map automatically produced with the epipolar constraint. (b) The disparity map produced by a single manual seed at the bottom, without the epipolar constraint. (c) The common matching areas (in black) between the two maps in (a) and (b).
Stability in low textured scenes

For low-textured images, such as typical polyhedral scenes, one might expect the matched areas to be reduced to a small neighborhood of the seeds due to the poor texture. Interestingly, however, the propagation grows nicely along the gradient edges, and we can show that the distance along the edges covered by the propagation is similar in both images if the perspective distortion is moderate. Figure 5.5 illustrates an interesting consequence of this property using two views of a wooden house. A single manually selected seed propagates globally over the whole image and covers most of the result normally obtained from many seeds distributed over the whole image. This global stability of the propagation for low-textured images is very useful for interpolation or morphing applications.
Handling half-occluded areas

The algorithm provides satisfactory results in half-occluded areas, mainly due to the simultaneous enforcement of the global best-first strategy and the uniqueness constraint. The principle can be illustrated with the help of Figure 5.6, in which we assume a foreground object B over a background A, and half-occluded areas C and D. The global best-first propagation strategy first accepts the matches with the best
correlation scores before trying the majority of matches with mediocre scores. As the background A and the foreground B in Figure 5.6 are both visible, it is expected that they are matched before bad matches for pixels in the half-occluded areas C and D have a chance. Consequently, the propagation stops in the half-occluded areas, as it is always blocked there by the uniqueness constraint.
Fig. 5.6 Two views of a scene with background A, foreground B and half-occluded areas C and D. Assume that correct matches within A and B have better scores than bad ones within C and D. A and B are first filled in by propagation from seeds a and b before the algorithm attempts to grow into C or D. Once A and B are matched, the procedure is stopped by the uniqueness constraint at the boundary of C in the first view (resp. D in the second view) because the corresponding boundary in the second view (resp. the first one) encloses an empty area.
We show in Figure 5.7 that, with at least 4 seed points, the disparity maps obtained handle the occluded areas very well. When we remove one important seed point from the tree in the foreground, which plays the role of region B in Figure 5.6, the tree is not matched at all, as expected, and many match outliers have invaded the background area.
Fig. 5.7 The disparity maps produced by propagation with different seed points and without the epipolar constraint. Left and middle columns: four manually selected seed points, marked by a cross, between the 1st and the 20th frame of the flower garden. Right column: the same with the manual seed located on the front tree removed; there are more match outliers in the occluded regions.
Another example with half-occluded areas around a thin object is given in Figure 5.8, which shows an electric post of 2 to 3 pixels in width and its disparity maps
obtained by propagation with and without the epipolar constraint. These fine details and their backgrounds have been successfully matched. As in the previous case, the usual fattening artifact around the occluding contours is limited because of the small 5 × 5 ZNCC-window and the local propagation.
Fig. 5.8 (a) Two sub-images of an electric post. The disparity maps without (b) and with (c) the epipolar constraint.
5.3.3 Discussions

The match propagation presented in this section is adapted from [115], which is based on the earlier idea of region growing by Lhuillier in [111].
Imposing simultaneous matching constraints

Matching constraints [37, 105] such as uniqueness, limit on disparity, disparity gradient and ordering constraints are always necessary to reduce matching ambiguity. These constraints are often implemented in stereo algorithms along the corresponding epipolar lines.
The uniqueness constraint is often imposed by a cross-consistency check, i.e., by correlating pixels of the first image to the second image and, inversely, correlating those of the second to the first; only the best matches consistent in both directions are retained. The error rate of the cross-consistency check is low [61], but the resulting disparity map is less dense [92] unless multiple resolutions or additional images are used [61]. An alternative consists of evaluating a set of possible correspondences in the second image for each pixel in the first image. The final correspondences are established using relaxation techniques such as the PMF algorithm [168] or a search for disparity components [10]. However, these methods impose the disparity gradient limit and the uniqueness constraint sequentially. In our approach we impose them simultaneously, which considerably improves the matching results and allows the efficient handling of half-occluded areas. Unlike stereo matching algorithms working along 1D epipolar lines, we extended the definition of the 1D disparity gradient limit to the discrete 2D disparity gradient limit, which naturally handles uncalibrated images, including rigid scenes with inaccurate epipolar geometry and non-rigid scenes. Reducing the search space by adding a direct limit on disparity greatly improves the performance of the majority of existing methods; in contrast, the result of our method is much less sensitive to this constraint and its complexity is independent of it.
Using a best-first match-growing strategy

A related region-growing algorithm was previously introduced in the photogrammetry domain in [163]. Deformable windows and patch-to-patch propagation are used instead of our disparity constraints and pixel-to-pixel propagation strategy. The main advantage of that approach is that the matching can reach sub-pixel accuracy thanks to the Adaptive Least Squares Correlation patch optimization. However, this patch-based optimization and propagation are the source of two drawbacks: first, a uniqueness constraint can no longer be defined for the overlapping patches, and second, window sizes larger than ours are unavoidable for stable adaptive least squares, especially if gray-level distortions are considered. Moreover, the patch propagation cannot deal with fine texture details (like electric posts) unless the optimization is done for each pixel. The optimization process suffers from over-parametrization when gray-level distortions are considered and is hardly workable for matching different scenes. The most serious shortcoming is the poor performance around occluding contour points due to the lack of a uniqueness constraint and the larger window size. A progressive scheme for stereo matching was also presented in [254]. It starts from robustly matched interest points [255], then densifies the matching using a growing principle. It simultaneously considers multiple current matches and propagates in a larger area, instead of one seed match in a very small predefined area as in our approach. This tends to produce smoother disparity maps, but more outliers in half-occluded areas. Its performance for non-rigid scenes is also unknown.
5.4 The quasi-dense approach

5.4.1 The quasi-dense resampling

Now we introduce and define the concept of quasi-dense point correspondences as our 'point' features, and describe the computation procedures. The meaning of 'quasi-dense' is twofold: first, the pixel disparity map is not fully dense; second, the locally dense map is resampled into 'sparse' points.
Motivations

On the one hand, the resampling is motivated by the fact that the quasi-dense pixel correspondences give an irregular distribution of clusters of pixels, which is not suitable for geometry computation. Many clustered pixels do not create strong geometric constraints while making the estimation costly. Resampling produces not only a reduced and more uniformly distributed set of matched points in the images, but it also creates matching points with sub-pixel accuracy. On the other hand, the resampling is equally motivated by the necessity of post-match regularization to improve the match reliability by integrating local geometric constraints, since the quasi-dense pixel correspondences may still be corrupted by wrong correspondences. We assume that the scene or object surface is at least locally smooth. Therefore, instead of directly using the global geometric constraint encoded by a fundamental matrix, we first use local smoothness constraints encoded by local planar homographies: the quasi-dense pixel correspondences are regularized by locally fitting homographies.
Resampling by homographies

The first image plane is initially divided into a regular square grid of 8 × 8 pixels. This size is a trade-off between sampling resolution and regularization stability. For each square patch, all the pixel correspondences inside it, taken from the quasi-dense pixel correspondences, are used to tentatively fit a plane transformation. The most general linear plane transformation is a homography represented by a homogeneous 3 × 3 non-singular matrix. Four matched pixels, no three of them collinear, are sufficient to estimate a plane homography. In practice, an affine transformation encoded by 6 d.o.f. and estimated from three matched pixels is preferred to a homography, as the local perspective distortion is often mild between images. Because of unavoidable matching errors and points not lying on the dominant local plane (e.g., at the occluding contours), the putative transformation of a patch cannot be estimated using standard least squares estimators. The Random Sample Consensus (RANSAC) [54] is used for a robust estimation of the transformation H. Finally, for each confirmed patch correspondence, a pair of corresponding points,
uᵢ ↔ u′ᵢ ≡ Hᵢuᵢ, with sub-pixel accuracy is created by selecting a representative center point uᵢ of the patch in the first image and its homography-induced corresponding point Hᵢuᵢ in the second image. The set of all corresponding points created in this way is called the 'quasi-dense correspondences'. In practice, we also add to the quasi-dense point correspondences all corresponding points of interest that lie within the patch and are validated by the homography of the patch. Usually these points of interest have longer tracks along the sequence than the other points obtained by propagation. This definition of quasi-dense correspondences is illustrated in Figure 5.9, and an example from a real image pair is given in Figure 5.10(c).
Fig. 5.9 For each corresponding patch, the resampled points include the center point of the patch and all points of interest within the patch and their homography-induced correspondences in the second image.
These resampled corresponding points are not only more suitable for geometric computation thanks to their more uniform distribution in the images, but they are also more reliable, as the robust local homography fitting significantly singles out the match errors in the original quasi-dense pixel correspondences, as illustrated in the middle and on the left of Figure 5.11.
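For illustration, the per-patch fit can be written as a small least-squares affine estimation, with the robust RANSAC wrapper of Algorithm 9 omitted; the layout of the match array is an assumption of this sketch, and the resampled point is simply the patch center mapped by the fitted transformation.

import numpy as np

def fit_affine(src, dst):
    # Least-squares 2D affine transform A (2 x 3) with dst ~ A @ [src; 1],
    # from n >= 3 matched pixels (n x 2 arrays).
    X = np.hstack([src, np.ones((len(src), 1))])   # n x 3 design matrix
    A, *_ = np.linalg.lstsq(X, dst, rcond=None)    # solves X @ A = dst
    return A.T                                     # 2 x 3

def resample_patch(matches_in_patch, center):
    # matches_in_patch: n x 4 array of (x1, y1, x2, y2) pixel correspondences.
    # Map the patch centre by the affine fit to get one sub-pixel correspondence.
    src, dst = matches_in_patch[:, :2], matches_in_patch[:, 2:]
    A = fit_affine(src, dst)
    return center, A @ np.append(center, 1.0)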
5.4.2 The quasi-dense SFM

The geometric estimation for a sequence of uncalibrated images, including both the camera positions and the 3D reconstruction of scene points, is now standard for sparse points of interest [80, 47]. Mostly, we will apply some of the standard algorithms to our new "point" features, the quasi-dense point correspondences. But we will also propose two new algorithms. The first is the core two-view quasi-dense correspondence and geometry method, which turns out to be much more robust and accurate than the sparse methods. The second is a fast gauge-free uncertainty estimation, which is necessary for our development.
Quasi-dense two views

The two-view geometry of a rigid scene is entirely encoded by the fundamental matrix. The standard strategy is to recover the geometry using sparse matching [255, 228] within a random sampling framework. We propose two propagation procedures for the fundamental matrix estimation in our quasi-dense approach. The first is the constrained propagation, which grows only those matches satisfying the current epipolar constraint. The second is the unconstrained propagation, which is motivated by the fact that the estimation might otherwise be local, biased toward the areas with a high density of matches (for example, merely the background, the foreground, or a dominant plane) if the initial distribution of the matched points is not uniform across the images. This bias due to the irregular distribution of points in image space is well known and discussed in [80]. The final strategy, which combines these procedures and overcomes the disadvantages of each, is given as follows.

Algorithm 16 (Quasi-dense two-view) Given two overlapping images, compute a list of quasi-dense sub-pixel correspondences satisfying the epipolar constraint.
1. Detect the feature "points of interest" in each image; establish the initial correspondences between the images by normalized correlation; sort the validated correspondences by correlation score and use them to initialize a list of seed matches for match propagation.
2. Propagate from all the seed points using a best-first strategy without the epipolar constraint to obtain quasi-dense pixel correspondences represented as a disparity map.
3. Resample the quasi-dense disparity map by local homographies to obtain the quasi-dense (sub-pixel) point correspondences; estimate the fundamental matrix using a standard robust algorithm (Algorithm 9 and Algorithm 3) on the resampled points, i.e., the quasi-dense correspondences.
4. Propagate again from the same initial list of seeds using a best-first strategy, this time with the epipolar constraint given by the computed fundamental matrix.
5. Resample the obtained quasi-dense disparity map once more to get the final quasi-dense point correspondences; re-estimate the fundamental matrix with the final quasi-dense correspondences.

For the calibrated quasi-dense two-view algorithm, we replace the seven-point algorithm by the five-point algorithm. These quasi-dense sub-pixel matches are usually more reliable, denser, and more evenly distributed over the whole image space than the standard sparse points of interest. Figure 5.10 shows the more robust nature of the quasi-dense method compared to the standard sparse method. Figure 5.11 illustrates the incremental robustification of the correspondences at the different steps. Figure 5.12 shows the computation for a typical pair of images. The computation for a pair of images is more costly than with a standard sparse method, but it is limited to about 10 to 15 s for 512 × 512 images on a P4 2.4 GHz. We have also
Fig. 5.10 (a) Initial sparse correspondences by cross-correlation for a pair of images with large disparities. Only 31 matches, shown in white, out of 111 are correct. (b) Failure of the standard sparse method: many correspondence points, shown in black, are obviously incorrect. (c) Successful quasi-dense estimation.
Fig. 5.11 The improvements made by the constrained propagation and local homography fitting: the matching results from the first unconstrained propagation in (a), from the second constrained propagation in (b), and from the robust local homography fitting applied to the second constrained propagation in (c).
tested the strategy of combining a sparse geometry and a constrained propagation, and have found that the domain of the final propagation tends to be reduced and results in the undesirable local estimates discussed earlier.
Fig. 5.12 Quasi-dense two-view computation. (a) The quasi-dense disparity map from the two propagations and the estimated epipolar geometry. (b) The resampled quasi-dense correspondence points.
Another advantage of this strategy is that it works for more widely separated image pairs than those handled by the standard sparse approach, for the simple reason that the number of matched interest points decreases dramatically with increasing geometric distortion between views. However, we do not compare our approach with specific sparse methods such as affine-invariant region [237, 93] and point [173, 224] matching methods. Figure 5.10 shows comparative results between the sparse and quasi-dense methods for a widely separated pair for which the standard sparse method computes a wrong fundamental matrix. Also, in our experiments we used as few as about 20 images to make a full turn around the object, which would likely be impossible for the standard sparse approach.
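The overall two-view procedure can be sketched in code. The following Python fragment is a hedged illustration only: `propagate` and `resample` stand for the best-first match propagation and the local-homography resampling of Algorithm 16, and ORB features with a brute-force matcher are used here as a stand-in for the normalized-correlation matching of points of interest in step 1; all function and parameter names are illustrative, not from the book.

```python
# Hedged sketch of the two-view quasi-dense pipeline (Algorithm 16).
import cv2
import numpy as np

def seed_matches(img1, img2):
    """Interest points + descriptor matching to initialize the seed list."""
    orb = cv2.ORB_create(2000)                     # stand-in for points of interest + ZNCC
    k1, d1 = orb.detectAndCompute(img1, None)
    k2, d2 = orb.detectAndCompute(img2, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(d1, d2), key=lambda m: m.distance)
    p1 = np.float32([k1[m.queryIdx].pt for m in matches])
    p2 = np.float32([k2[m.trainIdx].pt for m in matches])
    return p1, p2

def quasi_dense_two_view(img1, img2, propagate, resample):
    p1, p2 = seed_matches(img1, img2)               # step 1: seeds
    q1, q2 = propagate(img1, img2, p1, p2, F=None)  # step 2: unconstrained propagation
    r1, r2 = resample(q1, q2)                       # step 3: local-homography resampling
    F, _ = cv2.findFundamentalMat(r1, r2, cv2.FM_RANSAC, 1.0, 0.99)
    q1, q2 = propagate(img1, img2, p1, p2, F=F)     # step 4: constrained propagation
    r1, r2 = resample(q1, q2)                       # step 5: final resampling
    F, _ = cv2.findFundamentalMat(r1, r2, cv2.FM_RANSAC, 1.0, 0.99)
    return F, r1, r2
```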
Quasi-dense three views

The quasi-dense correspondences computed from a two-view geometry still contain outliers, as the two-view geometry only defines a one-to-many geometric relationship. The three-view geometry [178, 47, 80] plays the most important role in 3D reconstruction, as three is the minimum number of images that provides sufficient geometric constraints to resolve the correspondence ambiguity.

Algorithm 17 (Quasi-dense three-view) Given the quasi-dense correspondences of all pairs, compute the quasi-dense correspondences of all triplets of images.
1. Merge the quasi-dense correspondences between the pair i − 1 and i and the pair i and i + 1 via the common ith frame, taking the intersection set, to obtain the initial quasi-dense correspondences of the image triplet.
2. Run RANSAC on the entire set of points by randomly drawing six points. This further removes match outliers using the reprojection errors of points: for the six randomly selected points, compute the canonical projective structure of these points and the camera matrices [178]; the other image points are reconstructed using the current camera matrices and reprojected back onto the images to evaluate their consistency with the current estimate (see the sketch after this algorithm).
3. Optimize the three-view geometry with all inliers of the triplet correspondences by minimizing the reprojection errors of all image points, fixing one of the initial camera matrices.

For the calibrated quasi-dense three-view algorithm, we replace the six-point algorithm by a combination of the five-point algorithm and the three-point pose algorithm. An example of the quasi-dense three-view computation is illustrated in Figure 5.13.
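As a minimal illustration of the inlier test in step 2, the following hedged Python sketch evaluates the reprojection-error consistency of triplet correspondences against a candidate set of camera matrices; the six-point projective computation itself [178] is assumed to be available elsewhere, and the pixel tolerance is illustrative.

```python
# Hedged sketch of the reprojection-error test used in the three-view RANSAC.
# Ps: list of three 3x4 camera matrices; X: 4xN homogeneous 3D points;
# xs: list of three 2xN arrays of observed image points.
import numpy as np

def reprojection_inliers(Ps, X, xs, tol=1.0):
    """Boolean mask of points whose maximum reprojection error over the
    three views is below `tol` pixels."""
    errs = []
    for P, x in zip(Ps, xs):
        proj = P @ X                      # 3xN homogeneous projections
        proj = proj[:2] / proj[2]         # dehomogenize
        errs.append(np.linalg.norm(proj - x, axis=0))
    return np.max(np.stack(errs), axis=0) < tol
```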
Fig. 5.13 Quasi-dense three-view computation. The inliers of quasi-dense correspondences in a triplet of images are in white, and the outliers are in gray.
Quasi-dense N views

We merge all pairs and triplets of quasi-dense correspondences into a consistent N-view quasi-dense geometry.

Algorithm 18 (Quasi-dense SFM) Given a sequence of overlapping images, compute a reconstruction of the camera matrices and the quasi-dense corresponding points of the sequence.
1. Compute an uncalibrated quasi-dense reconstruction using the sparse SFM algorithm 12, in which the sparse two-view and three-view algorithms are replaced by the quasi-dense two-view and three-view algorithms.
2. Initialize the Euclidean structure of the quasi-dense geometry by auto-calibration. We assume a constant unknown focal length, as we rarely change the focal length when capturing the same object. A one-dimensional exhaustive search over a table of possible focal lengths is then performed to optimize the focal length (see the sketch after the remarks below). The initial value of the focal length may either be computed by a linear auto-calibration method [232, 169, 155] or obtained from the digital camera.
3. Transform the projective reconstruction into its metric representation using the estimated camera parameters.
4. Reparametrize each Euclidean camera by its six individual extrinsic parameters and the one common intrinsic focal length. This natural parametrization allows us to treat all cameras equally when estimating uncertainties, but it leaves the seven-d.o.f. scaled Euclidean transformation as the gauge freedom [235]. Finally, apply a Euclidean bundle adjustment over all cameras and all quasi-dense correspondence points.
5. Run a second Euclidean bundle adjustment that adds one radial distortion parameter shared by all cameras when the non-linear distortions of the cameras are non-negligible, for instance for image sequences captured with a very short focal length. In practice, for an object of the size of a human face, this is unnecessary.
Remarks

The algorithm is described in its most general form, from a projective to a Euclidean reconstruction via auto-calibration. If any information on the camera parameters and poses is available, some steps of the algorithm can be skipped. The hierarchical merging strategy for a sequence is used by default, but incremental or unstructured SFM strategies can be applied instead if necessary. A typical example of the final quasi-dense results is illustrated in Figure 5.14.
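The one-dimensional focal-length search of step 2 can be sketched as follows. This is a hedged illustration: `metric_error` is a hypothetical callback that upgrades the projective reconstruction for a candidate focal length and returns a self-calibration residual, and the table of candidate values (expressed as multiples of the image width) is illustrative.

```python
# Hedged sketch of the exhaustive one-dimensional focal-length search.
import numpy as np

def search_focal(metric_error, width_pixels):
    # candidate focal lengths from a small table of plausible values
    candidates = width_pixels * np.array([0.5, 0.7, 0.9, 1.1, 1.4, 1.8, 2.5, 3.5])
    errors = [metric_error(f) for f in candidates]
    return candidates[int(np.argmin(errors))]   # focal length with smallest residual
```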
Fig. 5.14 (a) A top view of the Euclidean quasi-dense geometry. (b) A close-up view of the face in point cloud.
5.4.3 Results and discussions

We implemented a first sparse method that simply tracks all points of interest detected in each individual image, and a second sparse method that is a mixture of the sparse and quasi-dense methods: it validates the points of interest of the individual images using the geometry computed by the quasi-dense algorithm and re-estimates the whole geometry from only these matched points of interest. In the following, SPARSE denotes the better result of these two methods.
Reconstruction accuracy

The reconstruction accuracy is measured by considering the bundle adjustment as a maximum likelihood estimation, assuming that the image points are normally distributed around their true locations with an unknown standard deviation σ. The confidence regions for a given probability can therefore be computed from the covariance matrix of the estimated parameters. The covariance matrix is defined only up to the choice of the gauge [235, 140, 149] and the common unknown noise level σ². The noise level σ² is estimated from the residual error as σ² = r²/(2e − d), where r² is the sum of the e squared reprojection errors and d is the number of independent parameters of the minimization: d = 1 + 6c + 3p − 7 (1 for the common focal length, c is the number of cameras, p is the number of reconstructed points, and 7 is the gauge freedom). The covariance matrix can be estimated as the inverse of the Hessian, H⁻¹ = (JᵀJ)⁻¹, up to the common noise level, if H is not singular. However, the current bundle optimization has been carried out with an over-parametrized free gauge. We need to solve two major problems: the first is that H is singular due to the free gauge; the second is that H is excessively large. The singularity of H
could have been easily handled by a direct SVD-based pseudo-inverse if H were not excessively large.

Normal covariance matrix. When H is non-singular and has the specific sparse structure of our problem, it can be block-diagonalized as H = T A Tᵀ. The pseudo-inverse of H can then be efficiently computed as H⁺ = H⁻¹ = (T A Tᵀ)⁺ = T⁻ᵀ A⁺ T⁻¹, which is exactly the basic reduction technique used in photogrammetry [15, 80, 47, 149, 235]. For a singular H, it is still formally possible to compute H* = T⁻ᵀ A⁺ T⁻¹, but it is no longer the pseudo-inverse H⁺ of H, and we need to clarify its underlying statistical meaning. The choice of coordinate fixing rules is a gauge fixing [235]. Each choice of gauge, locally characterized by its tangent space, determines an oblique covariance matrix. The matrix H* computed above is therefore an oblique covariance matrix at the particular solution point that we have chosen, obtained by a first-order perturbation analysis around the maximum likelihood solution [149]. It has been shown that all these oblique covariance matrices at a given solution point, for different gauges, are geometrically equivalent in the sense that they all have the same 'normal' component in the space orthogonal to the gauge orbit. This normal component is called the normal covariance [149]. We choose this uncertainty description as it is convenient and does not require the specification of gauge constraints. It also gives a lower bound on all covariances defined at that point on the gauge orbit.

Fast computation. To compute this normal covariance, we need to project any oblique covariance onto the space orthogonal to the tangent space of the gauge orbit, i.e., Cov = P H* Pᵀ, where P is the projector onto Im(H) along Ker(H). The difficulty is handling the very large size of H to make this projection computable. If N is an orthonormal basis of Ker(H), then P = I − N Nᵀ. Since dim(Ker(H)) = 7, N is a thin matrix of dimensions O(p + c) × 7, where p is the number of points and c is the number of cameras. Using the approximated Hessian

$$H = \begin{pmatrix} C & M \\ M^\top & S \end{pmatrix},$$

where C (resp. S) is an invertible block-diagonal sub-Hessian of the camera parameters (resp. the structure), we have

$$H = T A T^\top = \begin{pmatrix} I & Y \\ 0 & I \end{pmatrix} \begin{pmatrix} Z & 0 \\ 0 & S \end{pmatrix} \begin{pmatrix} I & 0 \\ Y^\top & I \end{pmatrix},$$

where Y = M S⁻¹ and Z = C − M S⁻¹ Mᵀ. Thus,

$$\mathrm{Ker}(H) = \begin{pmatrix} I \\ -Y^\top \end{pmatrix} \mathrm{Ker}(Z),$$

and N is efficiently computed by an SVD of the matrix Z of small dimensions O(c) × O(c). Using the notation K = H* N, the normal covariance Cov is given by

Cov = P H* Pᵀ = H* − K Nᵀ − N Kᵀ + N (Nᵀ K) Nᵀ.
The matrices N and K are very thin, with only seven columns, and the matrix NᵀK is therefore 7 × 7. The computational complexity of all diagonal blocks of Cov for the camera and point covariances is O(c + p), provided that K and the corresponding diagonal blocks of H* have been computed. The diagonal blocks of H* are computed in O(pc² + c³) time and O(i + c²) memory, with i being the number of 2D points [80, 15]. For K, split N into Nc and Ns such that

$$N = \begin{pmatrix} N_c \\ N_s \end{pmatrix},$$

where the height of Nc (resp. Ns) is the same as that of C (resp. S). It is easy to verify that

$$K = \begin{pmatrix} 0 \\ S^{-1} N_s \end{pmatrix} + \begin{pmatrix} I \\ -Y^\top \end{pmatrix} Z^{+} (N_c - Y N_s).$$

This computation is feasible because of the small size of Z⁺ (O(c) × O(c)) and the block-diagonal structure of S. Both the time and space complexities are only O(i + c²).
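The final projection step can be summarized in a few lines of NumPy. This is a hedged sketch assuming that the oblique covariance H* and the orthonormal null-space basis N (seven columns) have already been formed as described above; for a real problem one would of course exploit the sparse block structure rather than form dense matrices.

```python
# Hedged NumPy sketch of the normal covariance Cov = P H* P^T with P = I - N N^T.
import numpy as np

def normal_covariance(H_star, N):
    K = H_star @ N                      # thin: only seven columns
    NtK = N.T @ K                       # 7 x 7
    # expand P H* P^T to avoid ever forming the projector P explicitly
    return H_star - K @ N.T - N @ K.T + N @ NtK @ N.T
```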
Fig. 5.15 The 90% confidence ellipsoids, magnified by a scale of 4, for the reconstructed points and camera centers. One out of every 50 point ellipsoids is displayed. (a) A top view of the cameras and the object. (b) A close-up view of the object only.
Figure 5.15 illustrates one example of the computed uncertainty ellipsoids.

Uncertainties. We first compute the normal covariance matrix H⁺ from the oblique covariance matrix H* in the coordinate system of the camera in the middle of the sequence, with the scale unit equal to the maximum distance between camera centers. We choose a 90% confidence ellipsoid for any 3D position vector, either a camera position or a 3D point, and take the maximum semi-axis of this ellipsoid as the uncertainty bound of the 3D position. The camera uncertainty is characterized by the mean of the uncertainty bounds of the camera positions x_{c_i}, as the number of cameras is moderate. The point uncertainty is characterized by the rank 0 (the smallest uncertainty bound x₀), rank 1/4 (x_{1/4}), rank 1/2 (the median x_{1/2}), rank 3/4 (x_{3/4}), and rank 1 (the largest uncertainty bound x₁) values of the sorted uncertainty bounds, as the number of points is very high. The uncertainty of the focal length f is given by its standard deviation σ_f.
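The following hedged NumPy/SciPy sketch shows how the noise level σ² and the 90% uncertainty bound of one 3D position could be computed; the variable names are illustrative, and the 3 × 3 covariance block is assumed to be already scaled by σ².

```python
# Hedged sketch: noise level and 90% confidence-ellipsoid bound of one 3D position.
import numpy as np
from scipy.stats import chi2

def noise_level(residuals, n_points_2d, n_cams, n_pts3d):
    # sigma^2 = r^2 / (2e - d), with d = 1 + 6c + 3p - 7
    d = 1 + 6 * n_cams + 3 * n_pts3d - 7
    return np.sum(residuals**2) / (2 * n_points_2d - d)

def uncertainty_bound(C3, prob=0.90):
    """Largest semi-axis of the `prob` confidence ellipsoid of a 3D position
    with covariance block C3 (already scaled by the noise level)."""
    k = chi2.ppf(prob, df=3)             # chi-square quantile for 3 d.o.f.
    eigvals = np.linalg.eigvalsh(C3)
    return np.sqrt(k * eigvals.max())
```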
Examples. The Lady 1 sequence of 20 images at 768 × 512 has a more favorable lateral motion at close range. The uncertainties given in Table 5.1 and Figure 5.16 are smaller for QUASI than for SPARSE; for the focal length and camera positions, the QUASI uncertainties are three to six times smaller than the SPARSE ones.
Fig. 5.16 Three images of the Lady 1 sequence from D. Taylor on the top row. QUASI (left) and SPARSE (right) reconstruction and their 90% ellipsoids (magnified by a scale of 4) viewed on a horizontal plane.
Table 5.1 Uncertainties for Lady 1 from the oblique (top) and normal (bottom) covariance matrix.
The Lady 2 sequence of 43 images at 408 × 614 is captured with an irregular but complete tour around a person. The results are given in Table 5.2 and Figure 5.15.
Table 5.2 Uncertainties for Lady 2 from the oblique (top) and normal (bottom) covariance matrix.
The Garden-cage sequence of 34 images at 640 × 512 was captured by a hand-held still camera (Olympus C2500L) in an irregular but complete tour, using a rather short focal length to increase the field of view for the large background. The sequence contains a close-up of a bird cage and a background of a house and trees, with a very deep viewing field. The SPARSE methods failed because some triplets of consecutive images do not have sufficiently many matched points of interest. The QUASI method gives the uncertainties listed in Table 5.3 and the 90% ellipsoids shown in Figure 5.17. As the images were captured with the smallest focal length available, the camera's non-linear distortion is non-negligible. After a first round of Euclidean bundle adjustment, a second adjustment adding one radial distortion parameter ρ for all cameras is carried out, giving ρ = −0.086. This is similar to the value obtained with a very different method in [35] for the same camera but different images: ρ = −0.084.
Fig. 5.17 Left: three of the 34 Garden-cage images. Right: top view of the 90% confidence ellipsoids. The small square-shaped connected component at the center is the reconstructed bird cage while the visible crosses forming a circle are camera positions.
The Corridor sequence of 11 images at 512 × 512 resolution from Oxford University has a lateral and forward motion along the scene, which does not provide strong geometry but favors the SPARSE method, as it is a low-textured polyhedral scene
Table 5.3 Uncertainties for Garden-cage from the oblique (top) and normal (bottom) covariance matrix.
in which matched points of interest are abundant and spread well over the scene. With almost 40 times redundancy in the number of points, the camera position and focal length uncertainties for QUASI are two to four times smaller than for SPARSE. However, the point uncertainties are almost of the same order of magnitude for the majority of points. As the camera direction and path are almost aligned with the scene points, many points on the far background of the corridor are almost at infinity. Not surprisingly, with the chosen coordinate fixing rules, these points have an extremely high uncertainty bound along the camera direction for both methods, as illustrated in Figure 5.18.
Fig. 5.18 From left to right: three of the 10 Corridor images from Oxford University, QUASI and SPARSE reconstructions for Corridor and their 90% confidence ellipsoids viewed on a horizontal plane. One out of every 10 ellipsoids for QUASI is displayed.
According to the previous discussion, H⁺ gives the normal covariance matrix, while H* is an oblique covariance matrix at a given solution point, as previously used in [149, 80]. The normal covariance matrix should be the 'smallest' one in terms of matrix trace, i.e., a lower bound for all oblique covariance matrices. We want to demonstrate this empirically by comparing these different covariance matrices. In all cases, the main changes are in the uncertainties of the camera centers, which are bigger for the normal covariance matrix than for the oblique one, while the trace of the whole normal covariance matrix is slightly smaller than that of the oblique
Table 5.4 Uncertainty measures for the Corridor sequence: the mean of the uncertainty bounds of camera centers and the rank-k of the sorted uncertainty bounds of points, calculated from the oblique covariance matrix on the top of the table and from the normal covariance matrix in the middle and on the bottom of the table.
one as expected. This suggests that the normal covariance matrix describes a better distribution of uncertainties between cameras and structures.
Reconstruction robustness and efficiency

To measure the reconstruction robustness, we consider the success rate of reconstruction for all sequences tested in this chapter, as illustrated in Table 5.5. The robustness of QUASI with respect to the sampling rate of the sequence is also tested. For the Lady 2 sequence, which makes a complete tour around the object, SPARSE fails for a sequence of 43 images, but succeeds for a sequence of 86 images with only 1827 sparse points. QUASI succeeds down to a subsequence of 28 images with 25339 quasi-dense points; a typical pair of this subsequence is shown in Figures 5.10 and 5.11. It is clear that QUASI is more robust: whenever a sequence is successful for SPARSE, it is equally successful for QUASI, while SPARSE fails for many sequences (including some not shown in this chapter). Furthermore, even when SPARSE is successful, it is sometimes only the mixed SPARSE that is successful; recall that the SPARSE method was defined as the better result of a pure sparse and a mixed sparse-quasi method. We also notice that our QUASI method requires only about 30 to 35 frames for a complete tour around an object. This is far fewer than the 50 to 100 frames necessary for SPARSE methods such as those in [155, 170], which are more suitable for video sequences. The computation times are summarized in Table 5.6 for the examples given in this chapter.
5.5 Bibliographic notes

The quasi-dense matching approach by propagation was first introduced in [111] by Lhuillier in search of an efficient and robust dense matching; the method was then improved in [114, 115] for view interpolation applications [112, 113, 117]. The section on match propagation, Section 5.3.1, is adapted from [115] by Lhuillier and Quan. The
Table 5.5 Automatic success rate of reconstruction between Q(uasi) and S(parse).
Table 5.6 Computation time in minutes for the QUASI method with a P4 2.4 GHz.
section of the quasi-dense approach for SFM is adapted from [119] by Lhuillier and Quan, which is based on a previous conference publication [116]. The standard sparse SFM, or 3D reconstruction from a linear sequence of images, is based on the sparse points of interest developed in the 1990s [47, 80, 109, 82, 6, 55, 170, 155] within the uncalibrated approach to computer vision geometry. SFM from a collection of unstructured images was proposed more recently in [207]. Dense matching has been developed in the traditional calibrated stereo framework [168, 37, 163, 61, 105, 160, 91, 40, 10]. These methods usually search along epipolar lines, but mismatches are still frequent around occluding and textureless areas. More recent stereo methods use a volumetric representation to simultaneously reconstruct the object and compute the dense correspondences [195, 107, 46, 104]. Matching based on optical flow computation (e.g., [1, 5, 220, 186]) is restricted to closely spaced images and assumes a smooth and well-behaved intensity function. Dense matching is intrinsically ill-posed due to the aperture problem, and a regularization based on smoothness constraints is always necessary [5]. It is therefore difficult to expect reliable dense matching, which may in any case be unnecessary for view synthesis, as discussed in [192, 204] for computational efficiency: homogeneous areas are difficult to match, but usually do not create visual artifacts if their boundaries are correctly matched. A common practical pipeline combines the sparse and dense methods for handling uncalibrated sequences of images [155, 170]. For instance, starting from a sparse geometry, dense stereo matching algorithms are run for some selected pairs in [169] and for all pairs in [155]. The dense reconstruction is triangulated and texture-mapped to obtain the final models. The surface models obtained are often partial, and the surface triangulation is simply inherited from a 2D triangulation in one image plane, which means that the surface topology cannot be properly handled.
Part III
Modeling: from 3D points to objects
Chapter 6
Surface modeling
This chapter proposes an automatic surface reconstruction method from a sequence of images via the reconstructed quasi-dense 3D points. The surface-based representation is a natural extension of the point-based representation for smooth objects, and a surface geometry is indispensable for most applications. We use a dynamic implicit surface to represent object surfaces and propose a unified functional integrating both 3D points and 2D image information. The intrinsic nature of the functional makes a level-set implementation of the surface evolution possible. The integrated functional results in a more robust approach than those using only 2D information or only 3D data. Intensive experiments and examples validate the proposed approach.
Fig. 6.1 One input image and the reconstructed surface geometry (Gouraud-shaded and texture-mapped).
6.1 Introduction

Surface reconstruction from a set of 3D points is a classical topic in computational geometry, computer graphics, and computer vision. The density of the point cloud should be sufficient to allow the reconstruction of a smooth surface of arbitrary topology. Vision researchers have been more ambitious in trying to derive a surface representation using only 2D images from multiple views. This is challenging and, due to the ill-posedness of the reconstruction problem, only works for topologically simple objects that are sufficiently textured. The insufficiency of the 3D reconstruction from images and the difficulty of obtaining surface data from images alone have motivated us to develop an approach that integrates both 3D points and 2D image information. Our formulation is different from surface reconstruction from a set of calibrated images as addressed in [46, 195, 107], in which only 2D images are used without any intermediate 3D information. It is also different from surface reconstruction from scanned 3D data, which does not use any 2D image information [85, 221, 29, 257, 142, 223].
Overview

The surface reconstruction method is limited to smooth and closed objects; outdoor scenes such as the Garden-cage example are not handled. We usually make a full turn around the object, capturing about 30 to 35 images, to compute a quasi-dense geometry of the sequence.
Fig. 6.2 Overview of our surface reconstruction method.
The reconstructed 3D points are first segmented into the foreground object and the background. The background includes obvious outliers such as isolated points and points distant from the majority. The points of the foreground object are obtained as the largest connected component of the neighborhood graph of all points, in which two points are connected by an edge if their distance is smaller than a multiple of the uncertainty median of the points. The object silhouettes are
interactively extracted from each input image in order to compare the reconstructed surface with the visual hull reconstructed from the object silhouettes. The general methodology for surface evolution is a variational approach. Intrinsic functionals, as a kind of weighted minimal surface, are defined to integrate both 3D point data and 2D image data. The object surface is represented as a dynamic implicit surface u(x) = 0 in R³, which evolves in the direction of steepest descent provided by the variational calculus of the functional we minimize. The level-set implementation of the surface evolution naturally handles changes of surface topology, and the level-set evolution is accelerated by a bounded regularization method.
6.2 Minimal surface functionals

By analogy to 2D geodesic active contours [20], whose mathematical properties have been established, the weighted minimal surface formulation was introduced by Caselles et al. [21] and Kichenassamy et al. [98] for 3D segmentation from 3D images: the 3D surfaces they seek are those minimizing the functional ∫ w ds with the weight w = g(∇I), where ds is the infinitesimal surface element and g is a positive and decreasing function of the 3D image gradient ∇I. Faugeras and Keriven [46] developed a surface reconstruction from multiple images by minimizing the functional ∫ w ds with a weighting function w that measures the consistency of the reconstructed objects reprojected onto the 2D images. This measure is usually taken as a function of the correlation ρ(x, n) between pairs of 2D images, i.e., w(x, n) = g(ρ(x, n)). The correlation function depends not only on the position x of the object surface, but also on its orientation n. A potentially general and powerful reconstruction approach was thereby established, but the existence and uniqueness of the solution of the proposed functional have not yet been elucidated. In the different context of surface reconstruction from sufficiently dense and regular sets of scanned 3D point data, Zhao et al. [257] proposed to minimize the functional ∫ w ds with a weighting function w taken as a distance function from any surface point x to the set of 3D data points: given a set of data points P and the Euclidean distance d(x, P) of the point x to P, the weighting function is simply w(x) = dᵖ(x, P). The method gives interesting results with good 3D data points.
6.3 A unified functional

In our surface reconstruction, we have both 3D data points and 2D image data. It is interesting to observe that the variational formulations mentioned above, though developed in different contexts, are all based on the minimal surface. This makes it possible to define a unifying functional taking into account data of a different nature. We therefore first propose to minimize the functional ∫ w ds with a new weighting function consisting of two terms,

w(x, n) = dᵖ(x, P) + λ e(x, n, I),

where the first term, dᵖ(x, P), is the 3D data attachment term that allows the surface to be attracted directly onto the 3D points, and the second, e(x, n, I), is a consistency measure of the reconstructed object in the original 2D image space. The consistency measure may be taken to be any photo-consistency or correlation function. The functional to be minimized is

p = ∫ (dᵖ(x, P) + λ e(x, n, I)) ds.

Silhouette information can also be a useful source of information for surface reconstruction [219]. It is not sufficient on its own, as it gives only an approximate visual hull, but it is complementary to the other sources of information. If used, it amends the distance function of the weighting function as

d(x, P ∪ S) = min(d(x, P), ε + d(x, S)),

where d is the 3D Euclidean distance function, P is the set of 3D points, S is the surface of the intersection of the cones defined by the silhouettes, i.e., the visual hull, and ε is a small constant favoring the 3D points over the visual hull in the neighborhood of the 3D points. An adequate initialization is also proposed to optimize the functional derived from this weighting function.
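The combined weighting function can be evaluated on a voxel grid as in the following hedged Python sketch: the distance term is obtained from a Euclidean distance transform, and `photo_consistency` is a placeholder for the image consistency measure e; the parameters λ and p are illustrative.

```python
# Hedged sketch: evaluate w(x) = d(x,P)^p + lambda * e(x) on a voxel grid.
import numpy as np
from scipy.ndimage import distance_transform_edt

def weight_volume(point_mask, photo_consistency, lam=1.0, p=1):
    """point_mask: boolean voxel grid, True where a reconstructed 3D point falls.
    photo_consistency: array of the same shape holding e(x) per voxel."""
    d = distance_transform_edt(~point_mask)   # Euclidean distance to the 3D points
    return d**p + lam * photo_consistency
```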
6.4 Level-set method

The solutions of the minimization are given by a set of PDEs: the Euler-Lagrange equation, ∇p = 0, obtained from the functional p = ∫ w ds to be minimized. The Euler-Lagrange equation is often impossible to solve directly. One common way is to use an iterative, steepest-descent method by considering a one-parameter family of smooth surfaces

x(t) : (u, v, t) → (x(u, v, t), y(u, v, t), z(u, v, t))
as a time-evolving surface x parametrized by the time t. The surface moves in the direction of the gradient of the functional p with velocity −∇p, according to the flow

x_t = ∂x(u, v, t)/∂t = −∇p.

This is the Lagrangian formulation of the problem, which describes how each point on the dynamic surface moves in order to decrease the weighted surface area. The final surface is then given by the steady-state solution x_t = 0. The problem with this approach is that it does not handle topology changes [197]. However, it is important to notice that, although the derivation has been based on a parametrization, the various quantities, including the velocity of the steepest-descent flow, are intrinsic, i.e., independent of the chosen parametrization, which makes the computation possible. This paves the way for the well-known and powerful level-set formulation [162, 197] that regards the surface as the zero level-set of a higher-dimensional function. As the flow velocity −∇p is intrinsic (this has been demonstrated for a general w depending also on the surface normal in [46]), we may embed the surface into a higher-dimensional smooth hyper-surface u(t, x) = 0, which evolves according to

u_t = −(∇p · n) ||∇u||₂, with the normal n = −∇u / ||∇u||₂.

Topological changes, accuracy, and stability of the evolution are handled using the proper numerical schemes developed by Osher and Sethian [162].
6.5 A bounded regularization method

The bounded regularization method

The Euler-Lagrange expression ∇p may be complicated if the weighting function w(x, n) also depends on the surface normal [46]. It seems that the complication introduced by this dependency on the surface normal is rather unnecessary in practice [69]. We therefore assume a weighting function independent of the surface normal. The expression ∇p · n then consists simply of two terms, as in the geodesic active contour case,

∇w · n + w ∇ · n,

in which the first is the data attachment term and the second the regularization term. Using n = −∇u/||∇u||₂ for the level-set function u, the surface evolves according to

∂u/∂t = ∇w · ∇u + w ||∇u||₂ H,

where H = div(∇u/||∇u||₂) is the sum of the two principal curvatures (twice the mean curvature). When w is taken to be the correlation function, this is the simplified version of [46] presented in [69]; when w is taken to be the 3D distance function, it is the first method proposed in [257]. However, the curvature-based regularization
w ||∇u||₂ H over-smooths, resulting in a loss of geometric detail and in slow convergence, as the time step has to be Δt = O(Δx²) for a stable solution. In [257], a convection model is also proposed that ignores the regularization term w||∇u||₂H to speed up the procedure, but this is only envisageable for applications where the data quality is sufficient, for instance for synthetic and high-quality scanned data [257]. Motivated by the need to regularize noisy data and by the inefficiency of the curvature-based regularization, we propose an intermediate, bounded regularization method. It has a "bounded" regularization term, min(w, w_max)||∇u||₂H, instead of the "full" regularization term w||∇u||₂H. The corresponding evolution equation is

∂u/∂t = ∇w · ∇u + min(w, w_max) ||∇u||₂ H.

The following remarks can be made (a sketch of one update step of this evolution is given after this list):
• The fully regularized surface evolution is obtained when w_max ≥ ||w||∞.
• The unregularized surface evolution is obtained when w_max = 0.
• As 0 ≈ w ≤ w_max in the vicinity of the steady surface for any w, the fully regularized and the bounded regularized evolutions are expected to behave in the same manner in this region.
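The following hedged Python sketch shows one explicit update step of this bounded-regularization evolution on a regular grid; the upwind schemes and the periodic reinitialization of u that a practical implementation needs are omitted for brevity.

```python
# Hedged sketch of one explicit step of
#   du/dt = grad(w).grad(u) + min(w, w_max) |grad(u)| H
# on a regular grid with spacing dx.
import numpy as np

def evolve_step(u, w, w_max, dt, dx=1.0):
    gw = np.gradient(w, dx)
    gu = np.gradient(u, dx)
    norm_gu = np.sqrt(sum(g**2 for g in gu)) + 1e-12
    # mean-curvature term H = div(grad(u)/|grad(u)|)
    H = sum(np.gradient(g / norm_gu, dx, axis=i) for i, g in enumerate(gu))
    convection = sum(a * b for a, b in zip(gw, gu))      # grad(w).grad(u)
    return u + dt * (convection + np.minimum(w, w_max) * norm_gu * H)
```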
Efficiency of the bounded regularization method

The efficiency of the proposed bounded regularization method is evaluated by estimating the maximum time step Δt_max for a stable computation. We are currently unable to quantify Δt_max of the bounded regularization method for the general curvature-based regularization, but we are able to prove it for a simplified isotropic regularization using a Laplacian operator. Replacing the curvature-based regularization by the isotropic regularization for the Δt_max calculation is a heuristic motivated by the fact that the curvature/anisotropic regularization term, ||∇u||₂ H = ||∇u||₂ div(∇u/||∇u||₂), and the Laplacian/isotropic one, Δu, are equal when ||∇u||₂ = 1 is enforced periodically, which is often the case in practice to avoid too flat and too steep variations of u. It is therefore tempting to simplify the evolution equation to

∂u/∂t = ∇w · ∇u + min(w, w_max) Δu.

Assuming that the stability condition is the same for the curvature-based and Laplacian-based regularizations, the stability ||u^{n+1}||∞ ≤ ||u^n||∞ is achieved if Δt ≤ Δt_max with

Δt_max = Δx² / (6 w_max + ||Δx(|d⁰_x w| + |d⁰_y w| + |d⁰_z w|)||∞),

where d⁰_x, d⁰_y, and d⁰_z are the centered differences at the grid point along the three axes. We choose w_max to be proportional to Δx for our bounded regularization method, i.e., we fix w₀ = w_max/Δx, and obtain

Δt_max = Δx / (6 w₀ + || ||∇w||₁ ||∞).
Under this condition, Δt_max = Θ(Δx), the same as for the unregularized evolution and much better than the Δt_max = Θ(Δx²) of the fully regularized evolution. To our knowledge, no previous work provides such a stability analysis for an evolution equation with both convection and regularization terms. In practice, the time step Δt = Δt_max is always used for surface evolution in all our examples with the bounded, curvature-based regularization.
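The time-step bound can be evaluated directly from the weighting volume, as in the following hedged sketch (with w_max = w₀Δx, and the centered differences approximated by np.gradient).

```python
# Hedged sketch of dt_max = dx / (6*w0 + || |d_x w| + |d_y w| + |d_z w| ||_inf).
import numpy as np

def max_time_step(w, dx, w0):
    gw = np.gradient(w, dx)
    grad_w_l1 = sum(np.abs(g) for g in gw)   # |d_x w| + |d_y w| + |d_z w| per voxel
    return dx / (6.0 * w0 + grad_w_l1.max())
```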
6.6 Implementation

Initialization

The foreground object points are regularly sliced into sections along the major direction of the point cloud. A 2D convex hull is computed for each section, and these convex hulls define the successive sections of a truncated cone used as the bounding volume of the object. The initialization of all examples shown in this chapter is obtained automatically using this method; one example of the initialization for the Bust sequence is shown on the left of Figure 6.7. We note that the initialization procedures proposed in [257] cannot be applied here because of the big holes without 3D points, especially at the bottom of the object. Also, in all examples, the 3D points are rescaled into a 150 × 150 × 150 voxel space by a similarity transformation; the resulting voxel size is of the same order of magnitude as the uncertainty median of the 3D points.
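The slicing-and-convex-hull initialization can be sketched as follows. This is a hedged illustration: the major direction is approximated here by the axis of largest variance, and the construction of the truncated-cone bounding volume from the per-slice hulls is left out.

```python
# Hedged sketch of the bounding-volume initialization by sliced 2D convex hulls.
import numpy as np
from scipy.spatial import ConvexHull

def sliced_convex_hulls(points, n_slices=20):
    """points: (N,3) foreground 3D points.  Returns, per slice, the 2D hull
    vertices in the plane orthogonal to the major (largest-variance) axis."""
    axis = np.argmax(points.var(axis=0))          # proxy for the major direction
    other = [i for i in range(3) if i != axis]
    t = points[:, axis]
    edges = np.linspace(t.min(), t.max(), n_slices + 1)
    hulls = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sl = points[(t >= lo) & (t < hi)][:, other]
        if len(sl) >= 3:                          # need at least a triangle per slice
            hulls.append(sl[ConvexHull(sl).vertices])
    return hulls
```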
Different methods

The following surface evolution methods are tested and compared in Section 6.7.
• BR3D is the bounded regularization method taking the weighting function w to be the 3D distance to the set P of reconstructed 3D points: w(x) = d(x, P). The number of iterations is always 100, with w₀ = 0.1.
• BR2D is the bounded regularization method taking the weighting function w to be the image correlation function ρ. More details are given in Section 6.7.
• BR3D+BR2D sequentially applies the BR3D and BR2D methods. Fifty iterations are used for BR2D.
• BR3D2D uses the weighting function w as a combination of a 3D distance function and a 2D image consistency measure using a bounded regularization method:
w(x) = d(x, P) + e(x, I), where e(x) = 0.2 ε √(σ_r²(x) + σ_g²(x) + σ_b²(x)) and σ(x) is the standard deviation of the reprojected voxel in each of the three color channels in [0, 1]. The consistency measure e is similar to the photo-consistency of the space-carving method. The basic idea is to avoid surface evolution in the immediate neighborhood of the reconstructed points, where the surface previously obtained by BR3D is assumed to be correct, while inflating the surface elsewhere and stopping it in the surface portions with inconsistent reprojections, mainly due to the difference between the object and background colors. Thus, we use the evolution equation

∂u/∂t = ∇w · ∇u + min(w, w_max) ||∇u||₂ (c + H),

where c is an inflating constant as introduced and used in segmentation works [133, 21]. Note that the term min(w, w_max)||∇u||₂ c is negligible in areas where w ≈ 0, i.e., in the close neighborhood of the reconstructed points, which is a much desired outcome. We choose
– ε = 0 in the immediate neighborhood of the reconstructed points, d(x, P) < 2Δx in the unit cube [0, 1]³, and ε = 1 elsewhere;
– c = −5 and w₀ = 0.1, with u < 0 inside the current surface u = 0.
• BR3D+2D sequentially applies the BR3D and BR3D2D methods. Fifty iterations are used for BR3D2D.
• BR3DS is a mixed method combining the 3D points and the silhouette information using the weighting function w(x) = min(d(x, P), ε + d(x, S)). We choose ε = 2Δx to favor the 3D points P over the visual hull S in the immediate neighborhood of the reconstructed points. This method should be initialized by BR3D; otherwise, the evolving surface may never reach the concave parts of the object.
• BR3D+S sequentially applies the BR3D and BR3DS methods. Fifty iterations are used for BR3DS. First, the surface is attracted only by the 3D points, including those in the object concavities. Second, the surface does not move in the immediate ε-neighborhoods of the 3D points, but it moves toward the visual hull in the areas closer to the visual hull than to the 3D points.

To avoid the convergence of the dynamic surface to the empty surface, a freeze plane is often introduced to stop/freeze the surface evolution in one of the two delimited half-spaces. The freeze plane is manually placed to fill in the biggest gap, often at the bottom or at the back of the object if the sequence is not complete.
6.7 Results and discussions

The reconstructed surfaces of human faces and objects are shown systematically in Figures 6.1, 6.3, 6.4 and 6.5 for image sequences taken by a hand-held still digital camera. More comparative studies are detailed in the following paragraphs.
Fig. 6.3 Two examples of men's faces. (a) Parameters of the experiment: #C the number of cameras, #P the number of points, R the image resolution, M the surface reconstruction method, and F the location of the freeze plane. (b) Two frames of the sequence. (c) Reconstructed quasi-dense 3D points. (d) Gouraud-shaded surface geometry. (e) Texture-mapped surface geometry.
Fig. 6.4 One example of a woman's face and one of a statue. (a) Parameters of the experiment: #C the number of cameras, #P the number of points, R the image resolution, M the surface reconstruction method, and F the location of the freeze plane. (b) Two frames of the sequence. (c) Reconstructed quasi-dense 3D points. (d) Gouraud-shaded surface geometry. (e) Texture-mapped surface geometry.
Fig. 6.5 Two examples of objects. (a) Parameters of the experiment: #C the number of cameras, #P the number of points, R the image resolution, M the surface reconstruction method, and F the location of the freeze plane. (b) Two frames of the sequence. (c) Reconstructed quasi-dense 3D points. (d) Gouraud-shaded surface geometry. (e) Texture-mapped surface geometry.
BR3D vs. BR3D+2D

Combining 2D image information using BR3D2D can significantly improve the final reconstruction results, as using only a 3D distance function may fail when there are not enough reconstructed points on some parts of the surface. This is illustrated in Figure 6.6.
Fig. 6.6 Surface geometry obtained by BR3D (b) and by BR3D+2D (c). There are many missing 3D points on the low-textured cheeks, so BR3D, using only 3D information, gives poor results, while BR3D+2D gives good results by adding 2D information.
Fig. 6.7 (a) Initial surface. (b) BR3D method.
Fig. 6.8 (a) BR3D+BR2D with w0 = 0.1. (b) BR3D+BR2D with w0 = 0.5. (c) BR3D+BR2D with w0 = 1.
3D distance vs. image correlation

Using only image correlation, as suggested in [46, 69], makes convergence very difficult for low-textured objects. Here we take a reasonably textured object, the bust,
to test the BR2D method and compare it with the others. The surface initialization is shown on the left of Figure 6.7 and is obtained with the method described in Section 6.6. Figure 6.9 shows the results of the BR2D method with w = 0.1(1 − ρ), ρ ∈ [−1, 1], after 400 and 1000 iterations with w₀ = 0.1, w₀ = 0.5, and w₀ = 1, using a 9 × 9 ZNCC window. The lower bound w₀ = 0.1 gives a noisy surface (see the pyramid part). The upper bound w₀ = 1 gives an over-smoothed surface (see the flattened nose). The intermediate bound gives a compromise between the two. The original correlation method [46, 69] with full regularization is even smoother than the upper bound w₀ = 1 case, and its convergence is extremely slow: it is still not finished around the concave intersection between the cube and the pyramid after 800 iterations. We have also found that the original correlation method is actually slower than BR2D, since its time step Δt_max is 380 times smaller. We experimented with the two-step method BR3D+BR2D, shown in Figure 6.8; the results are similar to the previous case and not very satisfactory. However, this method is more efficient: the 100 BR3D iterations take only about 5 minutes on a P4 2.4 GHz (including initialization), compared with 20 (resp. 50) minutes for 400 (resp. 1000) BR2D iterations. Figure 6.10 shows the difference between the BR3D and BR3D+BR2D methods, with the best previous bound w₀ = 0.5 and only 50 iterations of BR2D. Still, the nose is too smooth for BR3D+BR2D, and the chin is also degraded.
Fig. 6.9 Surfaces obtained with the BR2D method after 400 and 1000 iterations with w0 = 0.1 (the first two), w0 = 0.5 (the middle two), and w0 = 1 (the last two).
Isotropic vs. anisotropic smoothing

Using the Laplacian/isotropic smoothing Δu instead of the curvature/anisotropic smoothing ||∇u||₂ H = ||∇u||₂ div(∇u/||∇u||₂) leads to a faster evolution, as the level-set reinitialization ||∇u||₂ = 1 has to be done twice as frequently for the anisotropic smoothing as for the isotropic one, which has a smaller discretization neighborhood. It is also important to observe that no apparent difference occurs between these two smoothings in the final surface geometry, as shown in Figure 6.10 (the same conclusion holds for w₀ = 0.5 and w₀ = 1). This suggests that the benefit of using curvature-based smoothing is negligible in our context.
Fig. 6.10 Surfaces computed using different smoothing methods. (a) One original image. (b) BR3D+BR2D with w0 = 0.5. (c) Curvature-based smoothing BR3D. (d) Laplacian-based smoothing BR3D.
With vs. without silhouettes

Figure 6.11 shows the results obtained by BR3D, BR3D+2D, BR3D+S, and the pure silhouette method S for the Man 3 sequence. Using only the 3D points with BR3D misses the low-textured cheeks, and using only the visual hull with S misses many important concavities of the surface, such as the areas of the ears and the nose. Combining the two gives excellent final results.
Adding silhouette information improves the pure 3D results; both automatic and interactive extraction of silhouettes from unknown backgrounds have been used for different cases. Note that silhouette information is only optional in our approach and that the majority of our results presented here do not use it.
Fig. 6.11 Surfaces computed using (a) BR3D method with only quasi-dense 3D points, (b) BR3D+S method with a combination of the quasi-dense 3D points and the silhouettes, (c) BR3D+2D method with a combination of the quasi-dense 3D points and the image photoconsistency, and (d) S method with only the silhouettes.
Conclusion

This chapter proposed new surface reconstruction algorithms integrating both 3D data points and 2D images, made possible by a unified functional based on a minimal surface formulation. We believe that the new functionals have far fewer local minima than those derived from 2D data alone, and that this results in more stable and more efficient algorithms. For the efficient evolution of surfaces, we also proposed a bounded regularization method based on level-set methods and proved its stability. The main limitation of our system is the choice of the surface evolution approach, which assumes a closed and smooth surface; the surface reconstruction module is not designed for outdoor scenes or polyhedral objects. The impact of the proposed approach is threefold. First, we show that the accuracy of the reconstructed 3D points is sufficient for the 3D modeling application.
Second, we introduce new intrinsic functionals that take into account both 3D data points and 2D original image information, unlike previous works that consider either only 2D image information [46] or only scanned 3D data [257]. By doing this, we compensate for the lack of reconstructed 3D points with 2D information. The new functionals are also expected to have a much smaller number of local minima and better convergence than a pure 2D approach [46]. Third, we propose a bounded regularization method that is more efficient than the usual full regularization methods and give a proof of its stability.
6.8 Bibliographic notes

This chapter is adapted from [119] by Lhuillier and Quan, which was an extension of the previous publications [115, 118]. Many surface reconstruction algorithms have been proposed for different kinds of data. For 2D images and camera geometry alone, the recent volumetric methods [195, 107, 46, 104] are the most general image-based approaches, but they are not robust enough. For densely scanned 3D point data, Szeliski et al. [221] used a particle-based model of deformable surfaces; Hoppe et al. [85] presented a signed distance for implicit surfaces; Curless and Levoy [29] described a volumetric method; and Tang and Medioni [223] introduced a tensor voting method. Most recently, Zhao et al. [257] developed a level-set method based on a variational formulation minimizing a weighted minimal surface; similar work was reported by Whitaker [247] using a MAP framework. Depth data obtained from stereo systems are more challenging than scanned 3D data, as they are usually much sparser and less regular. Fua [63] used a system of particles to fit the stereo data. Kanade et al. [154] and Fua and Leclerc [62] proposed deformable mesh representations to match multiple dense stereo data. These methods, which perform reconstruction by deforming an initial model or by tracking discretized particles to fit the data points, are both topologically and numerically limited compared to modern dynamic implicit surface approaches. Last but not least, specific face reconstruction using a generic face model was introduced by Zhang et al. [256].
Chapter 7
Hair modeling
Hair is a prominent feature of a character's look, and hair modeling is necessary for character animation. This chapter proposes an image-based approach to modeling hair geometry from images taken at multiple viewpoints. A hair fiber is unfortunately too fine to be discernible at typical image resolutions when the entire hair is covered in one image. But a hair fiber has a simple generic shape: a space curve approximated by a poly-line. A hair fiber is therefore reconstructed by synthesis, segment by segment, following the local image orientations. The hair volume is also estimated from a visual-hull-like rough reconstruction from the multiple views. We demonstrate the method on a variety of hair styles.
Fig. 7.1 From left to right: one of the 40 images captured by a handheld camera under natural conditions; the recovered hair rendered with the recovered diffuse color; a fraction of the longest recovered hair fibers rendered with the recovered diffuse color to show the hair threads; the recovered hair rendered with an artificial constant color.
7.1 Introduction

Many computer graphics applications, such as animation, computer games or 3D teleconferencing, require 3D models of people. Hair is one of the most important characteristics of human appearance, but the capture of hair geometry remains elusive. The geometry of hair, with hundreds of thousands of thin fibers, is complex and hardly perceivable at normal image resolutions, and the intriguing reflectance properties of hair make active scanning methods unreliable. Most prior dedicated modeling tools [99] required costly and tedious user interaction. The recent work in [164, 70] demonstrates the potential of an image-based hair modeling approach combining synthesis and analysis, and impressive results are produced by [164]. But the image capture in this approach has to be done in a controlled environment with a known moving light, which limits its applicability: the subject must remain still during the entire capture, which excludes capturing a subject in motion, and the approach cannot analyze already-captured videos. Inspired by the approach in [164, 70], we propose a more practical and natural method of modeling from images of a given subject taken from different viewpoints. A hand-held camera or video camera is used under natural lighting conditions without any specific setup. This offers greater flexibility for image capture; moreover, it may produce more complex and accurate models of sophisticated hair styles, mainly thanks to the inherently strong and accurate geometric constraints from multiple views. Our approach involves little user interaction and could be fully automated. This makes it a useful starting point for other interactive modeling systems, reducing the amount of work needed to generate a final model.
Overview

The hair structure consists of fibers at a spatial frequency much higher than what can actually be captured at the relatively low image resolution of cameras. This makes the direct recovery of individual fibers ill-posed, even impossible. Yet, it is observed in [164, 70] that neighboring hair fibers tend to have similar directions, and that a group of such closely tied hair fibers, called a strand, generates structured and detectable 2D orientation information in the images. Thus, a dense per-pixel orientation map can be computed in each image using an oriented filtering method. Our approach is to recover the observed hair geometry from multiple views. We use a hand-held camera under uncontrolled natural conditions to capture images from multiple viewpoints. We then automatically compute the camera geometry using the quasi-dense approach presented in Section 5.4 and in [119]. Next, we detect a local orientation per pixel in each image. We represent each hair fiber by a sequence of chained line segments. The key question is then how to triangulate each fiber segment from the image orientations in multiple views. If a hair fiber segment were always visible in the images, the recovery would be geometrically equivalent to a 3D line reconstruction from multiple 2D image lines [43, 181], as an image orientation
Fig. 7.2 Overview of our hair modeling system.
at a given pixel position defines, locally, a 2D image line. But since a fiber segment is invisible at normal image resolutions, a synthetic method is necessary. Therefore, each fiber segment is first synthesized in space, then validated by orientation consistency in at least three views, and finally optimally triangulated from all its visible views. Our multi-view setting naturally allows a robust and accurate estimation of the hair volume inside which the hair synthesis takes place. This is done by computing a visual hull from the multiple views to approximate the hair surface. The approach is illustrated in Figure 7.2. The major difference between our method and the most relevant work, [164], lies in our multiple-view approach versus their per se single-view approach. This results in a different reconstruction method, different capturing techniques and different results. The practicality of our capturing method and the strong 3D geometric information inherent to the multi-view setting are clear advantages of our approach over a fixed-view one.
7.2 Hair volume determination

The hair volume is the volume bounded by two surfaces, the hair surface S_hair and the scalp surface S_scalp, as illustrated by the violet area in Figure 7.3.b. It is important to have a good estimate of the hair volume for the synthesis of the hair fibers. Although this is difficult for the fixed-viewpoint approach [164], it can be reliably estimated through a multiple-view approach. It is now possible to compute a surface representation of the subject, including the hair surface portion S_hair, using, for example, the quasi-dense approach developed in [119]. A visual hull [108, 136, 219] of the subject, denoted S_max, usually gives a good approximation of the real surface of the subject, as we are making a full turn around the subject, and it can be computed automatically from the silhouettes of the subject in a more practical and robust manner than the real surface. We therefore approximate the surface of the subject by the visual hull S_max. The hair area H(S) of a given surface S is defined as the union of all surface patches whose projection onto any view falls in the hair mask of that view. The extraction of the hair mask is described in Section 7.4. The
hair surface is approximated by the portion of S_max restricted to the hair area, H(S_max). The scalp surface S_scalp underneath the hair is totally invisible and thus cannot be obtained from the cameras. It might be possible to use a generic head model or simply an ellipsoid as an approximation of the scalp surface; however, a more systematic way of approximating it is to use an inward offset of S_max, denoted S_min. More precisely, the scalp surface is given by H(S_min). The hair volume is then approximated as the volume between the two surfaces H(S_max) and H(S_min). The synthesized hair fibers start from a regular distribution over H(S_min) and terminate on H(S_max). All these quantities are illustrated in Figure 7.3.a.
Fig. 7.3 a. Hair volume, marked in violet, is approximated from a visual hull Smax and an inward offset surface Smin of the visual hull. b. The visibility of a fiber point P is determined from that of the closest point Pmax on the hair surface. The projection of P onto an image is not a point, but a small image area that is the projection of a ball around P with radius P Pmax .
7.3 Hair fiber recovery

7.3.1 Visibility determination

Given a space point P through which we would like to synthesize a hair fiber, we first need to determine its visibility with respect to all views. The visibility of a point on the hair surface S_max is determined by the surface itself and can be stored off-line. A point P inside the hair volume, but not on the surface, is obviously invisible; its visibility is taken to be that of the closest point P_max on the hair surface (see Figure 7.3.b). This may be considered a convenient heuristic. On the other hand, when we compute the 3D orientation at P, we do not take the 2D orientation at the exact projected position of P, but an averaged orientation over the small image area that is the projection of the ball defined in Figure 7.3.b. Intuitively, the observed image information is smoothly diffused from the visible surface to the inside of the hair volume, and the direction of the inside, invisible fibers is interpolated from the outside, visible fibers on the hair surface. This procedure naturally results in more smoothed hair
fibers inside the volume: the further away from the visible hair surface the fibers are, the more smoothed the reconstructed directions of the fibers are.
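The closest-point visibility heuristic can be sketched with a k-d tree, as below; `surface_points` and `surface_visibility` are assumed to be precomputed samples of the hair surface H(S_max) and their per-view visibility, and the names are illustrative.

```python
# Hedged sketch of the visibility heuristic: an inner point P inherits the
# visibility of its closest point P_max on the hair surface, and its image
# support is the projection of the ball of radius |P P_max|.
import numpy as np
from scipy.spatial import cKDTree

def inner_point_visibility(P, surface_points, surface_visibility):
    """surface_points: (M,3) samples of H(S_max); surface_visibility: (M,V)
    boolean visibility of each sample in each of the V views."""
    tree = cKDTree(surface_points)
    radius, idx = tree.query(P)          # distance |P P_max| and closest sample
    return surface_visibility[idx], radius
```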
7.3.2 Orientation consistency

Let P be a point in space. In each view, the 2D orientation at the projection of P defines a line in the image, a, represented as a homogeneous 3-vector. Back-projecting this line into space from the viewpoint gives a plane passing through the camera center and the line. The normal vector of this plane is n = Aᵀa, where A is the 3 × 3 left submatrix of the 3 × 4 projection matrix (A, t). If P is a valid fiber point in space, its 3D orientation should be consistent with the 2D orientations obtained at its projections in the views where it is visible. The underlying geometry is that all back-projected planes should intersect in the common supporting line of the fiber segment at P. This constraint can be expressed with a minimum of three views, resulting in two independent algebraic constraints. One of them, important to our application, is expressed only in terms of the orientation quantities [43]: n · (n′ × n″) = 0. Therefore, a space point P is a valid fiber point if the orientations a, a′, and a″ at the projections of P in three views, represented by A, A′, and A″, satisfy

Aᵀa · (A′ᵀa′ × A″ᵀa″) = 0.    (7.1)
This is indeed one of the trilinear constraints [80, 47] and can be used to efficiently validate the positions of the fiber points in space.
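As an illustration of how Eq. 7.1 can be evaluated in practice, the following minimal Python sketch computes the triple-product residual for a candidate fiber point from three projection matrices and the detected 2D orientation lines. The function names are illustrative and this is not the implementation used in the book; in practice the residual would be compared against a small threshold rather than tested for exact zero.

```python
import numpy as np

def backprojected_normal(P, a):
    """Normal of the plane through the camera center and the image line a.

    P is a 3x4 projection matrix (A | t); a is the homogeneous 3-vector of
    the detected 2D orientation line.  The normal is n = A^T a.
    """
    A = P[:, :3]
    return A.T @ a

def orientation_residual(P1, P2, P3, a1, a2, a3):
    """Triple product n1 . (n2 x n3); close to zero for a valid fiber point."""
    n1 = backprojected_normal(P1, a1)
    n2 = backprojected_normal(P2, a2)
    n3 = backprojected_normal(P3, a3)
    # Normalize so the residual is comparable across candidate points.
    n1, n2, n3 = (v / np.linalg.norm(v) for v in (n1, n2, n3))
    return float(np.dot(n1, np.cross(n2, n3)))
```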
7.3.3 Orientation triangulation

A hair fiber is a space curve that is approximated by a sequence of chained line segments. The position of one end-point of a fiber segment is synthesized in space, and the direction of the fiber segment is optimally triangulated from all visible views if the fiber segment is validated in multiple views. Let nj be the normal vector to the plane defined by the line with orientation aj in the jth view. Assume that the direction vector in space of the fiber segment is d. We have d · nj = 0. The direction vector d can therefore be recovered from at least two views by solving this linear homogeneous equation system. When more than two views are available, the direction vector can be optimized over all visible views j as follows. A line through the point P can be parameterized by any other point Q of the line. The position of Q is optimized by minimizing the sum of all distances, Σj d(Q, πj)², from Q to each plane πj back-projected from the corresponding image line. Taking the uncertainty of orientation estimation into account, we solve for
d by minimizing

Σj (1/σj²) (nj · d)² / ‖nj‖,   subject to ‖d‖ = 1,    (7.2)
where σj is the variance of the response curve over the different orientations of the given filter at that position. Its inverse encodes the strength of the detected orientation edge. This is a linear optimization problem that can be efficiently solved by singular value decomposition. A small line segment from P to P ± λd can be created as a portion of the fiber. The scalar, λ, is determined such that the segment is of unit length, i.e., the projection of ‖λd‖ onto an image should be the size of a pixel. The sign is fixed such that we choose the current segment that forms a larger angle with the previous segment to keep the fiber as smooth as possible. Given the fact that the computed 2D orientations are noisy, we first systematically discard those with low confidence. Second, instead of directly using the algebraic constraint given by Eq. 7.1 to validate a synthesized fiber point P, we use the average unweighted residual error of Eq. 7.2, computed from at least three views, (1/n) Σj (nj · d)² / (‖nj‖ ‖d‖), for the validation.
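The minimization of Eq. 7.2 is a small homogeneous least-squares problem whose solution is the right singular vector associated with the smallest singular value of the weighted normal matrix. A minimal sketch is given below, assuming the plane normals n_j and response variances σ_j have already been gathered from the visible views; it is an illustration, not the book's code.

```python
import numpy as np

def triangulate_fiber_direction(normals, sigmas):
    """Solve min_d sum_j (1/sigma_j^2) (n_j . d)^2 / ||n_j||  s.t.  ||d|| = 1.

    normals: (m, 3) plane normals n_j = A_j^T a_j from the visible views.
    sigmas:  (m,)   response-curve variances sigma_j.
    The minimizer is the right singular vector of the weighted normal matrix
    associated with the smallest singular value.
    """
    N = np.asarray(normals, dtype=float)
    s = np.asarray(sigmas, dtype=float)
    w = 1.0 / (s * np.sqrt(np.linalg.norm(N, axis=1)))
    M = N * w[:, None]              # row j is n_j scaled so that ||M d||^2 equals the objective
    _, _, Vt = np.linalg.svd(M)
    d = Vt[-1]                      # unit vector minimizing ||M d||^2
    return d / np.linalg.norm(d)
```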
7.4 Implementation

Image capture and camera calibration

The image capturing in our approach is conveniently achieved using a hand-held digital camera under natural conditions. Typically, about 30 to 40 images are taken to cover the whole head of the subject. We either make a full turn around the subject, which remains fixed, or turn the subject in front of the camera. This acquisition process does not require any ad hoc setup. The necessary precautions are that a short shutter time should be used to avoid motion blurring of the hair images, and that the environment should be sufficiently illuminated so that all parts of the subject are visible, particularly very dark hair areas. The camera geometry is fully automatically recovered through correspondences and auto-calibration using the quasi-dense approach developed in [119]. A more standard sparse approach using points of interest as described in [47, 80] may require more images for camera geometry determination.
2D orientation map computation

We adapted the filtering method developed in [164] to compute a dense orientation map at each pixel for each image. We choose only one oriented filter, a Canny-like first derivative of Gaussian, and apply it at discrete angles for each pixel. The angle at which the filter gives the highest response is then retained as the orientation of
that pixel. The confidence of the orientation at each pixel is taken as the inverse of the variance of the response curve. We have only one image under fixed natural lighting for each viewpoint. The 2D orientation estimate per pixel obtained in [164] is usually better than that from our images, as their estimate for each pixel is taken to be the most reliable response among all images captured under different lighting conditions at a given viewpoint. But we do not use the 2D orientation information to directly constrain one degree of freedom of the fiber direction in space. Instead, we take advantage of the redundancy in multiple views: a direction of a fiber segment is typically triangulated from about 10 views. We also found that using multiple filters to choose the best response, or bilateral filtering, does not significantly improve the quality of the results.
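As a rough illustration of this filtering step, the sketch below computes a per-pixel orientation and confidence using a steerable first derivative of Gaussian evaluated at discrete angles. It is a simplified stand-in for the oriented filtering of [164]; the function name, the parameters, and the 90-degree shift to the strand direction are assumptions of the sketch.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def orientation_map(gray, sigma=1.5, n_angles=32):
    """Per-pixel 2D orientation and confidence from oriented filtering (sketch).

    A first derivative of Gaussian is steered to n_angles discrete directions;
    the angle of maximum absolute response (rotated by 90 degrees, i.e., along
    the strand) is retained, and the confidence is the inverse of the variance
    of the response curve, as in the text.
    """
    gx = gaussian_filter(gray, sigma, order=(0, 1))   # d/dx of the smoothed image
    gy = gaussian_filter(gray, sigma, order=(1, 0))   # d/dy of the smoothed image
    thetas = np.linspace(0.0, np.pi, n_angles, endpoint=False)
    responses = np.stack([np.abs(np.cos(t) * gx + np.sin(t) * gy) for t in thetas])
    best = np.argmax(responses, axis=0)
    orientation = (thetas[best] + np.pi / 2) % np.pi   # strand direction per pixel
    confidence = 1.0 / (np.var(responses, axis=0) + 1e-12)
    return orientation, confidence
```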
Hair mask extraction

An approximated binary hair mask for each image is necessary to determine the hair area of a surface H(S) so that the synthesized hair fibers are inside the actual hair volume. Silhouettes of the subject are required as well to compute a visual hull as an approximation of the hair surface. In our current implementation, we roughly draw these masks manually. This is essentially the only interactive part of our approach, and it can be automated. For example, the silhouettes separating the foreground subject and the background could be recovered using techniques such as space carving [107] in the general case, or chroma keying or background accumulation in the case of a moving head and a static camera. Separating the boundary between the hair and the face could be achieved by advanced skin tone analysis.
Hair fiber synthesis

We fix an object-centered coordinate frame at the center O of the volume bounded by Smax and align it with the vertical direction of Smax. We use spherical coordinates for points on the surface, with angular coordinates θ ∈ [0, 180] and φ ∈ [0, 360] for surface parametrization. We also define a mapping between these surfaces by drawing a line through the center O and intersecting it with a surface at points P, Pmax, and Pmin, as illustrated in Figure 7.3.a. This mapping may not be one-to-one if one of the surfaces is not star-shaped. In that case, we systematically choose the intersection point that is closest to the center. The starting points of the hair fibers are uniformly sampled points in the hair area of the scalp surface H(Smin). A point on Smin is in the hair area H(Smin) if its projection in the most fronto-parallel view is in the hair mask of that view. It is sufficient to discretize the parameter φ from 0 to 360 degrees in one-degree steps, and the parameter θ, with θ = 0 pointing to the top of the head, from 0 to 120 degrees in one-degree steps.
Each hair fiber is a sequence of chained small line segments, each roughly corresponding to the size of an image pixel. The segments are generated one by one. The procedure terminates when the total accumulated length of all segments exceeds a pre-defined threshold or the segment reaches the boundary of the hair volume. The fiber segment direction is optimally triangulated from all visible views as described in Section 7.3.3, but the visible views that are at extreme viewing angles are discarded to avoid the uncertainty caused by occlusion and blurring around the silhouettes. When the angle between the current and previous segments is too sharp, we prefer to keep the direction of the previous segment for a smooth fiber. However, repeated failure to obtain a newly triangulated direction terminates the fiber growth. The final fibers are smoothed by curvature, and very short ones are discarded.
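The growth of a single fiber can be summarized by the following sketch. Here triangulate_direction and inside_hair_volume are assumed callbacks standing in for the triangulation of Section 7.3.3 and for the test against the volume between H(Smin) and H(Smax); the failure limit and step size are illustrative.

```python
import numpy as np

def grow_fiber(start, step, max_length, triangulate_direction, inside_hair_volume,
               max_failures=5):
    """Grow one hair fiber as a chain of roughly pixel-sized segments (sketch)."""
    fiber = [np.asarray(start, dtype=float)]
    prev_dir = None
    length, failures = 0.0, 0
    while length < max_length and inside_hair_volume(fiber[-1]):
        d = triangulate_direction(fiber[-1])
        if d is None:                       # no reliable triangulation at this point
            failures += 1
            if prev_dir is None or failures > max_failures:
                break                       # repeated failure terminates growth
            d = prev_dir                    # keep the previous direction for smoothness
        else:
            failures = 0
            if prev_dir is not None and np.dot(d, prev_dir) < 0:
                d = -d                      # fix the sign to keep the fiber as straight as possible
        fiber.append(fiber[-1] + step * d)
        prev_dir = d
        length += step
    return np.array(fiber)
```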
7.5 Results and discussions

Examples

We tested different hair styles and hair colors to demonstrate the flexibility and the accuracy of our hair geometry recovery. For a typical man's example with very dark hair, shown in Figure 7.4, 38 images of resolution 1024 × 768 were captured in a corner of our lab. For the woman's styles shown in Figures 7.1, 7.6 and 7.7, we used short and long wavy wigs worn by a real person and captured 40 images of resolution 1280 × 960. We also tested a difficult curly hair style, shown in Figure 7.5. For that, a wig was put on a dummy head on a turntable with a black background, and 36 images of resolution 1280 × 960 were captured. All these results, computed fully automatically except for the extraction of the hair masks, clearly demonstrate the robustness and the quality of the reconstruction with a multi-view approach. Many long strands of the long wavy hair style have been successfully recovered in good order. In the curly example, the recovered density is slightly lower due to the complex curls. The flexibility of the image capturing and the high quality of the recovered hair geometry confirm the advantages of our method.
Hair rendering

The recovered hair fibers are rendered as OpenGL lines without anti-aliasing to better visualize the separate fibers. A diffuse color for each fiber segment is collected from all its visible views. The median of all sampled colors is then retained to remove directional illumination, although some specularity, originating from our full turn around the subject in an indoor environment, remains. If a better estimation of the colors is required, capturing from additional viewpoint directions or modeling the original illumination structure could be used. When a generic head model, scaled to fit the subject's head, replaces it, some geometric discrepancies are visible.
Fig. 7.4 Example of a typical man’s short and dark hair. In each row, one of the original images on the left, the recovered hair rendered with the recovered diffuse colors in the middle, and rendered with a curvature map on the right.
Fig. 7.5 Example of a difficult curly hair style. One of the original images on the left. The recovered hair rendered with the recovered diffuse colors in the middle, and rendered with a curvature map on the right.
To emphasize the structure of the recovered fibers, we use other rendering methods. The curvature rendering uses a curvature map in which a redder color encodes a higher curvature. The sparse rendering only renders a fraction of the longest recovered fibers so that the hair threads are better illustrated. The long wavy hair style is also rendered, shown in Figure 7.7, with the self-shadowing algorithm developed in [8].
Fig. 7.6 Example of a typical woman’s style. One of the original images on the top left. The recovered hair rendered with the recovered diffuse colors on the top right, rendered with a fraction of long fibers on the bottom left, rendered with a curvature map on the bottom right.
Running performance

The image capturing takes only a few minutes. The registration is done automatically within 5 minutes on a 1.9GHz P4 PC. The orientation map computation is implemented in Matlab and takes a few hours. Extraction of the hair masks, which could be automated, is currently done interactively for each view. The visual hull computation takes about 1 minute. We typically synthesize about 40K fibers, consisting of 4 to 8 million line segments, in about 15 minutes.
Limitations

As we use natural images, we can model only what we can see. Some parts of the hair might be saturated or shadowed in the images, so the orientation cannot be reliably estimated there. Interpolation techniques such as in-painting could be used to fill
Fig. 7.7 Example of a typical long wavy hair style. One of the original images on the top left. The recovered hair rendered with recovered diffuse colors on the top right, rendered with a constant hair color using the self-shadowing algorithm on the bottom left, rendered with the recovered diffuse colors using the self-shadowing algorithm on the bottom right.
in these areas. The invisible fibers are currently synthesized from a lower-resolution orientation map or from neighboring fibers, but there exist multi-layered hair styles in which one layer completely occludes the others. Since each layer already contains invisible strands, it is impossible to recover the occluded parts of the inner layer. Heuristics or user interaction are necessary for the improvement and enhancement of the recovered hair.
Conclusion

We presented a method for modeling the hair of a real person by taking images from multiple viewpoints. It first synthesizes fiber segments representing the hair geometry, then validates the synthesized segments in the visible views by orientation consistency, and finally reconstructs each fiber from all visible views. It typically recovers a complete hair model from about 30 to 40 views. The method offers many practical advantages. Images can be captured at any place and any time as long as the subject is sufficiently illuminated. It opens up possibilities for capturing hair dynamics using a setup of multiple cameras. It is highly automated, with little user interaction. The only user interaction in our current implementation is the extraction of hair masks from the images, which could be automated with a more advanced image analysis method. The method also produces high-quality results for complex hair styles. We will focus our future research on the following issues: the immediate extension to dynamic modeling of hair with a multi-camera setting; improvement of orientation filtering in scale space, particularly for areas in which no reliable orientation information is available; and development of simple interactive tools for enhancement and improvement of the modeled hair style.
7.6 Bibliographic notes

This chapter is adapted from [243] by Wei, Ofek, Quan and Shum. Active 3D laser scanners cannot handle hair reconstruction due to hair's complex geometry and reflection properties. Traditional image-based vision techniques, such as stereo or photometric stereo approaches [43, 250, 119], at best generate a smoothed surface boundary of the hair volume without the fiber geometry. Most existing methods [66, 99, 22, 241] are highly interactive and take huge amounts of operator time to produce hair models. Nakajima et al. [153] started to use images for bounding the hair volume, yet the synthesis of the hair fibers is heuristic and does not follow the observed images. More recently, Grabli et al. [70] and Paris et al. [164] pioneered a new image-based hair modeling approach. The key observation is that detectable local orientation information in the images reflects the direction in space, not of an individual fiber, which is invisible at image resolution, but of a group of hair fibers. The fiber direction in space, first constrained by the local orientation at one viewpoint, is finally determined from the scattering properties [135] observed from multiple images at the same viewpoint with known light positions. Paris et al. [164] proposed an efficient filtering method based on oriented filters to produce a robust, dense and reliable orientation map of the images. Repeating the procedure from several different viewpoints to cover the entire hair, they produced excellent results. The main disadvantage of the method is that the capturing procedure is rather complex, requiring a controlled environment, and the fixed-viewpoint approach per se restricts the visibility and geometric constraints.
Chapter 8
Tree modeling
This chapter proposes a semi-automatic technique for modeling trees and plants directly from images. Our image-based approach has the distinct advantage that the resulting model inherits the realistic shape and complexity of a real tree or plant. This is possible because the leaves of a tree or a plant all have the same generic shape, and the branches have a natural generative model of self-similarity like a fractal. We designed our modeling system to be interactive, automating the process of shape recovery while relying on the user to provide simple hints on segmentation. Segmentation is performed in both image and 3D spaces, allowing the user to easily visualize its effect immediately. Using the segmented image and 3D data, the geometry of each leaf is then automatically recovered from the multiple views by fitting a deformable leaf model. Our system also allows the user to easily reconstruct branches in a similar manner. We show realistic reconstructions of a variety of trees and plants.
Fig. 8.1 Nephthytis plant. An input image (out of 35 images) on the left, and the recovered model rendered at the same viewpoint on the right.
8.1 Introduction

Trees and plants remain among the most difficult kinds of objects to model due to their complex geometry and wide variation in appearance. While techniques have been proposed to synthetically generate realistic-looking plants, they either require expertise to use (e.g., [174]) or are highly manual-intensive. Current image-based techniques that use images of real plants have either produced models that are not easily manipulated (e.g., [187]) or models that are just approximations (e.g., [202]). Our approach is image-based as well, but we explicitly extract geometry and we strictly enforce geometric compatibility across the input images. Image acquisition is simple: the camera need not be calibrated, and the images can be freely taken around the plant of interest. Our modeling system is designed to take advantage of the robust structure from motion algorithms developed in the computer vision community. It is also designed to allow the user to quickly recover the remaining details in the form of individual leaves and branches. Furthermore, it does not require any expertise in botany to use. We show how plants with complicated geometry can be reconstructed with relative ease. One of the motivations for developing an image-based approach to plant modeling is that geometry computation from images tends to work remarkably well for textured objects [47, 80], and plants are often well textured. Once the realistic geometry of a plant has been extracted, it can be used in a number of ways, for example, as part of an architectural design, in games, or even for the scientific study of plant growth. Furthermore, since geometry is available, it can be easily manipulated or edited. We show examples of plant reconstruction and editing in this chapter.
Overview

We use a hand-held camera to capture images of the plant at different views. We then apply a standard structure from motion technique to recover the camera parameters and a 3D point cloud.
Fig. 8.2 Overview of our tree modeling system.
For trees, the branching structure is more important than the leaves. Our tree modeling system consists of three main parts: image capture and 3D point recovery, branch recovery, and leaf population, as illustrated in Figure 8.2. It is designed to reduce the amount of user interaction required by using as much data from the images as possible. The recovery of the visible branches is mostly automatic, with the user given the option of refining their shapes. The subsequent recovery of the occluded branches and leaves is automatic, with only a few parameters to be set by the user. As was done by researchers in the past, we capitalize on the structural regularity of trees, more specifically the self-similarity of the structural patterns of branches and the arrangement of leaves. However, rather than extracting rule parameters (which is very difficult to do in general), we use the extracted local arrangement of visible branches as building blocks to generate the occluded ones. This is done using the recovered 3D points as hard constraints and the matte silhouettes of the trees in the source images as soft constraints. To populate the tree with leaves, the user first provides the expected average image footprint of the leaves. The system then segments each source image based on color. Smaller segments are merged while larger ones are split. We then keep about 10 to 20 clusters after K-means clustering of these regions. All regions of these clusters are considered leaf candidates. The 3D position of each leaf segment is determined either by its closest 3D point or by its closest branch segment. The orientation of each leaf is approximated from the shape of the region relative to the leaf model or from the best-fit plane of the leaf points in its vicinity. For plants with relatively large leaves, the leaves need to be individually modeled. Our model-based leaf extraction and reconstruction for plants is summarized in Figure 8.3.
Fig. 8.3 Overview of our model-based leaf extraction and reconstruction.
We first segment the 3D data points and 2D images into individual leaves. To facilitate this process, we designed a simple interface that allows the user to specify the segmentation jointly using 3D data points and 2D images. The data to be partitioned is implemented as a 3D undirected weighted graph that gets updated on-the-fly. For a given plant to model, the user first segments out a leaf; this is used as a deformable generic model. This generic leaf model is subsequently used to fit the other segmented data to model all the other visible leaves.
The resulting model of the plant very closely resembles the appearance and complexity of the real plant. Just as important, because the output is a geometric model, it can be easily manipulated or edited. In this chapter, we differentiate between plants and trees—we consider ’plants’ as terrestrial flora with large discernible leaves (relative to the plant size), and ’trees’ as large terrestrial flora with small leaves (relative to the tree size). The spectrum of plants and trees with varying leaf sizes is shown in Figure 8.4.
[Figure 8.4 labels: "Plants with large discernible leaves" (left) and "Trees with small undiscernible leaves" (right).]
Fig. 8.4 Spectrum of plants and trees based on relative leaf size: On the left end of the spectrum, the size of the leaves relative to the plant is large. On the right end, the size of the leaves is small with respect to the entire tree.
Data capturing

We use a hand-held camera to capture the appearance of the tree of interest from a number of different overlapping views. In all but one of our experiments, only 10 to 20 images were taken for each tree, with coverage between 120° and 200° around the tree. The exception is the potted flower tree shown in Figure 8.13, where 32 images covering 360° were taken. In principle, we can use as few as three images, which is just enough to carry out structure from motion; however, the reconstructed geometry will be limited in size and realism. Prior to any user-assisted geometry reconstruction, we extract point correspondences and run structure from motion on them to recover the camera parameters and a 3D point cloud. We also assume the matte for the tree has been extracted in each image, so that we know the extracted 3D point cloud is that of the tree and not the background. In our implementation, matting is achieved with automatic color-based segmentation and some user guidance. We use the approach described in [119] to compute a semi-dense cloud of reliable 3D points in space. This technique is used because it has been shown to be robust and capable of providing sufficiently dense point clouds for object modeling, and it is well-suited here because plants tend to be highly textured. The quasi-dense feature points used in [119] are not points of interest [80], but regularly re-sampled image points from disparity maps. Examples are shown in Figures 8.2 and 8.3. We are typically able to obtain about a hundred thousand
3D points, which unsurprisingly tend to cluster in textured areas. These points help by delineating the shape of the plant. Each 3D point is associated with the images where it was observed; this book-keeping is useful in segmenting leaves and branches during the modeling cycle.
8.2 Branch recovery

Once the camera poses and the 3D point cloud have been extracted, we next reconstruct the tree branches, starting with the visible ones. The local structures of the visible branches are subsequently used to reconstruct those that are occluded, in a manner similar to non-parametric texture synthesis in 2D [39] (and later 3D [12]), using the 3D points as constraints. To make the branch recovery stage robust, we make three assumptions. First, we assume that the cloud of 3D points has been partitioned into points belonging to the branches and to the leaves (using color and position). Second, the tree trunk and its branches are assumed to be un-occluded. Finally, we expect the structures of the visible branches to be highly representative of those that are occluded (modulo some scaled rigid transform).
8.2.1 Reconstruction of visible branches

The cloud of 3D points associated with the branches is used to guide the reconstruction of visible branches. Note that these 3D points can be in the form of multiple point clusters due to occlusion of branches. We call each cluster a branch cluster; each branch cluster has a primary branch, with the rest being secondary branches. The visible branches are reconstructed using a data-driven, bottom-up approach with a reasonable amount of user interaction. The reconstruction starts with graph construction, with each sub-graph representing a branch cluster. The user clicks on a 3D point of the primary branch to initiate the process. Once the reconstruction is done, the user iteratively selects another branch cluster to be reconstructed in very much the same way, until all the visible branches are accounted for. The very first branch cluster handled consists of the tree trunk (primary branch) and its branches (secondary branches). There are two parts to the process of reconstructing visible branches: graph construction to find the branch clusters, followed by sub-graph refinement to extract structure from each branch cluster.
Graph construction

Given the 3D points and the 3D-2D projection information, we build a graph G by taking each 3D point as a node and connecting it to its neighboring points with edges. The neighboring points are all those points whose distance to a given point is smaller than a threshold set by the user. The weight associated with each edge between a pair of points is a combined distance d(p, q) = (1 − α) d3D + α d2D, with α = 0.5 by default. The 3D distance d3D is the 3D Euclidean distance between p and q, normalized by its variance. For each image Ii that p and q project to, let li be the resulting line segment in the image joining their projections Pi(p) and Pi(q). Also, let ni be the number of pixels in li and {xij | j = 1, ..., ni} be the set of 2D points in li. We define a 2D distance function

d2D = Σi (1/ni) Σj |∇Ii(xij)|,
normalized by its variance, with ∇I(x) being the gradient in image I at 2D location x. The 2D distance function accounts for the normalized intensity variation along the projected line segments over all observed views. This function is set to infinity if any line segment is projected outside the tree silhouette. Each connected component, or sub-graph, is considered as a branch cluster. We now describe how each branch cluster is processed to produce geometry, which consists of the skeleton and its thickness distribution.
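A minimal sketch of this combined distance for a single candidate edge is given below. The helper project and the per-image gradient-magnitude maps are assumptions of the sketch, and the normalization of both terms by their variances is left to the caller, as in the text.

```python
import numpy as np

def edge_weight(p, q, grad_images, project, alpha=0.5):
    """Combined distance d(p,q) = (1 - alpha) d3D + alpha d2D for one branch-graph edge.

    grad_images[i] is the gradient-magnitude image |grad I_i|, and project(i, x) is an
    assumed helper returning the (col, row) projection of the 3D point x in image i,
    or None if it falls outside the tree silhouette.
    """
    d3d = float(np.linalg.norm(np.asarray(p) - np.asarray(q)))
    d2d = 0.0
    for i, grad in enumerate(grad_images):
        pi, qi = project(i, p), project(i, q)
        if pi is None or qi is None:
            return np.inf                           # segment leaves the silhouette
        n = max(int(np.ceil(np.linalg.norm(qi - pi))), 1)
        ts = np.linspace(0.0, 1.0, n)
        cols = np.round(pi[0] + ts * (qi[0] - pi[0])).astype(int)
        rows = np.round(pi[1] + ts * (qi[1] - pi[1])).astype(int)
        d2d += float(np.mean(grad[rows, cols]))     # (1/n_i) sum_j |grad I_i(x_ij)|
    return (1 - alpha) * d3d + alpha * d2d
```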
Conversion of sub-graph into branches

We start with the branch cluster that contains the lowest 3D point (the "root" point), which we assume to be part of the primary branch. (For the first cluster, the primary branch is the tree trunk.) The shortest paths from the root point to all other points are computed by a standard shortest path algorithm. The edges of the sub-graph are kept if they are part of the shortest paths and discarded otherwise. This step results in 3D points linked along the surface of branches. Next, to extract the skeleton, the lengths of the shortest paths are divided into segments of a pre-specified length. The centroid of the points in each segment is computed and is referred to as a skeleton node. The radius of this node (or the radius of the corresponding branch) is the standard deviation of the points in the same bin. This procedure is similar to those described in [252] and [14].
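The binning step above can be sketched as follows, assuming SciPy's shortest-path routine and a sparse adjacency matrix built from the graph of the previous paragraph; the per-branch bookkeeping needed to separate primary and secondary branches is omitted.

```python
import numpy as np
from scipy.sparse.csgraph import dijkstra

def extract_skeleton(points, adjacency, root, seg_len):
    """Skeleton nodes and radii from one branch cluster (binning sketch).

    points:    (N, 3) 3D points of the cluster.
    adjacency: sparse (N, N) matrix of edge weights of the cluster graph.
    root:      index of the lowest point, assumed to lie on the primary branch.
    seg_len:   pre-specified segment length used to bin the shortest-path lengths.
    """
    dist = dijkstra(adjacency, directed=False, indices=root)
    finite = np.isfinite(dist)
    bins = np.floor(dist[finite] / seg_len).astype(int)
    pts_f = np.asarray(points)[finite]
    nodes, radii = [], []
    for b in np.unique(bins):
        pts = pts_f[bins == b]
        nodes.append(pts.mean(axis=0))   # skeleton node = centroid of the bin
        radii.append(pts.std())          # branch radius from the spread of the bin
    return np.array(nodes), np.array(radii)
```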
User interface for branch refinement

Our system allows the user to refine the branches through simple operations that include adding or removing skeleton nodes, inserting a node between two adjacent nodes, and adjusting the radius of a node (which controls the local branch thickness). In addition, the user can also connect different branch clusters by clicking on one skeleton node of one cluster and a root point of another cluster. The connection is used to guide the creation of occluded branches that link these two clusters (see Section 8.2.2). Another feature of our system is that all these operations can be specified at a view corresponding to any one of the source images; this allows user interaction to occur with the appropriate source image superimposed for reference. A result of branch structure recovery is shown for the bare tree example in Figure 8.5. This tree was captured with 19 images covering about 120°. The model was automatically generated from only one branch cluster.
Fig. 8.5 Bare tree example. From left to right: one of the source images, superimposed branch-only tree model, and branch-only tree model rendered at a different viewpoint.
8.2.2 Synthesis of occluded branches

The recovery of visible branches serves two purposes: portions of the tree model are reconstructed, and the reconstructed parts are used to replicate the occluded branches. We make the important assumption that the tree branch structure is locally self-similar. In our current implementation, any subset, i.e., a subtree, of the recovered visible branches is a candidate replication block. This is illustrated in Figure 8.6 for the final branch results shown in Figures 8.18 and 8.13. The next step is to recover the occluded branches given the visible branches and the library of replication blocks. We treat this problem as texture synthesis, with the visible branches providing the texture sample and boundary conditions. There is
Fig. 8.6 Branch reconstruction for two different trees. The left column shows the skeletons associated with visible branches, while the right column shows representative replication blocks. (a,b) are for the fig tree in Figure 8.18, and (c,d) are for the potted flower tree in Figure 8.13. (Only the main branch of the flower tree is clearly visible.)
a major difference from conventional texture synthesis: the scaling of a replication block is spatially dependent. This is necessary to ensure that the generated branches are geometrically plausible given the visible branches. In a typical tree with dense foliage, most of the branches in the upper crown are occluded. To create plausible branches in this area, the system starts from the existing branches and "grows" them to occupy part of the upper crown using our synthesis approach. The cut-off boundaries are specified by the tree silhouettes from the source images. The growth of the new branches can also be influenced by the reconstructed 3D points on the tree surface, which serve as branch endpoints. Depending on the availability of reconstructed 3D points, the "growth" of occluded branches can be unconstrained or constrained.
Unconstrained growth

In areas where 3D points are unavailable, the system randomly picks an endpoint or a node of a branch structure and attaches a random replication block to it. Although the branch selection is mostly random, priority is given to thicker branches or those closer to the tree trunk. In growing the branches, two parameters associated with the replication block are computed on the fly: a random rotation about its primary branch, and a scale determined by the ratio between the length of the end-branch of the parent sub-tree and that of the primary branch of the replication block. Once scaled, the primary branch of the replication block replaces the end-branch. This growth is capped by the silhouettes of the source images to ensure that the reconstructed overall shape is as close as possible to that of the real tree.
Constrained growth

The reconstructed 3D points, by virtue of being visible, are considered to be very close to the branch endpoints. By branch endpoints, we mean the exposed endpoints of the last generation of the branches. These points are thus used to constrain the extents of the branch structure. By comparison, in the work of [184], the 3D points are primarily used to extract the shapes of leaves. This constrained growth of branches (resulting in Tree) is computed by minimizing

Σi D(pi, Tree)
over all the 3D points {pi | i = 1, ..., n3D}, with n3D being the number of 3D points. D(p, Tree) is the smallest distance between a given point p and the branch endpoints of Tree. Unfortunately, the space of all possible subtrees with a fixed number of generations to be added to a given tree is exponential. Instead, we solve this optimization in a greedy manner. For each node of the current tree, we define an influence cone with its axis along the current branch and an angular extent of 90° side to side. For that node, only the 3D points that fall within its influence cone are considered. This restricts the number of points and the set of subtrees considered in the optimization. Our problem reduces to minimizing Σ_{pi ∈ Cone} D(pi, Subtree) for each subtree, with Cone being the set of points within the influence cone associated with Subtree. If Cone is empty, the branches for this node are created using the unconstrained growth procedure described earlier. The order in which subtrees are computed follows the size of Cone, and growth proceeds generation by generation. The number of generations of branches to be added at a time can be controlled. In our implementation, for speed considerations, we add one generation at a time and set a maximum of 7 generations. The branches generated by this constrained growth are shown in Figure 8.7, starting from three subtrees drawn by the user. Once the skeleton and thickness distribution have been computed, the branch structure can be converted into a 3D mesh, as shown in Figures 8.18, 8.5, and 8.13. The user has the option to perform the same basic editing functions as described in Section 8.2.1.
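A small sketch of the influence-cone test used by this greedy step is given below; the half-angle of 45 degrees (90 degrees side to side) and the function name are assumptions of the sketch. Only the returned points are used when scoring candidate replication blocks for that node, which keeps each greedy subproblem small.

```python
import numpy as np

def points_in_influence_cone(apex, axis, points, half_angle_deg=45.0):
    """Return the 3D points inside a node's influence cone.

    The cone has its apex at the branch node and its axis along the current
    branch direction; half_angle_deg is the assumed half-angle of the cone.
    """
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    pts = np.asarray(points, dtype=float)
    v = pts - np.asarray(apex, dtype=float)
    norms = np.maximum(np.linalg.norm(v, axis=1), 1e-12)
    cos_angle = (v @ axis) / norms
    return pts[cos_angle >= np.cos(np.radians(half_angle_deg))]
```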
8.2.3 Interactive editing

In some cases, particularly for plants, the branching structure is totally occluded by the leaves, which makes its reconstruction impossible. One possible approach would be to use rules or grammars (L-systems). However, it is not clear how these can be used to fit partial information closely, or how a novice user could exercise local control to edit the 3D branches (say, in occluded areas). Our solution is to design a data-driven editor that allows the user to easily recover the branch structure.
Fig. 8.7 An example of a totally occluded branch structure recovered by constrained growth. It starts from the vertical segment shown on the bottom left, with the three subtrees given on the bottom right.
We model each branch as a generalized cylinder, with the skeleton being a 3D spline curve. The cylindrical radius can be spatially varying: it is specified at each endpoint of each branch and linearly interpolated between the endpoints. While this simple scheme may not follow specific botanical models, it is substantially more flexible and easier to handle, and it can be used to closely model the observed plant. The user is presented with an interface with two areas: one showing the current synthesized tree (optionally with 3D points and/or leaves as overlay), and the other showing the synthetic tree superimposed on an input image. The image can be switched to any other input image at any time. User operations can be performed in either area, with changes propagated to the other area in real-time as feedback. There are four basic operations the user can perform:
• Draw curve. The user can draw a curve from any point of an existing branch to create the next-level branch. If the curve is drawn in 3D, the closest existing 3D points are used to trace the branch. If drawn in 2D, at each point of the curve, its 3D coordinate is assigned to be that of the 3D point whose projection is the closest. If the closest points having 3D information are too far away, the 3D position is inherited from the parent branch.
• Move curve. The user can move a curve by clicking on any point of the current branch and moving it to a new position.
• Edit radius. The radius is indicated as a circle (in 2D) or a sphere (in 3D). The user can enlarge or shrink the circle or sphere directly on the interface, effectively increasing or reducing the radius, respectively.
• Specify leaf. The user can specify whether each branch may grow a leaf at its endpoint, as illustrated by the green branches in Figure 8.8. A synthesized leaf will be the average leaf over all reconstructed leaves, scaled by the thickness of the branch.
Fig. 8.8 Branch structure editing. The editable areas are shown: 2D area (left), 3D space (right). The user modifies the radii of the circles or spheres (shown in red) to change the thicknesses of branches.
Once the branching structure is finalized, each leaf is automatically connected to the nearest branch. The orientation of each leaf, initially determined using a heuristic as described in Section 8.3, is also automatically refined at this stage. The plant model is produced by assembling all the reconstructed branches and leaves.
8.3 Leaf extraction and reconstruction

Given the branches, we next proceed to recover the tree leaves. When the leaf size is relatively small, the individual leaf shape is of no importance. One could always just add the leaves directly on the branches using simple guidelines, such as making the leaves point away from branches. While this approach would have the advantage of not requiring the use of the source images, the result may deviate significantly from the look of the real tree we wish to model. Instead, we choose to analyze the source images by segmentation and clustering and use the results of this analysis to guide the leaf population process in Section 8.3.1. For plants, however, the leaf size becomes significant and the individual leaf shape can no longer be ignored, so we propose a generic-shape-based extraction and reconstruction in Section 8.3.2.
8.3.1 Leaf texture segmentation

Segmentation and clustering

Since the leaves appear relatively repetitive, one could conceivably use texture analysis for image segmentation. Unfortunately, the spatially-varying amounts of foreshortening and mutual occlusion of leaves significantly complicate the use of texture analysis. However, we do not require very accurate leaf-by-leaf segmentation to produce models of realistic-looking trees.
We assume that the color for a leaf is homogeneous and there are intensity edges between adjacent leaves. We first apply the mean shift filter [24] to produce homogeneous regions, with each region tagged with a mean-shift color. These regions undergo a split or merge operation to produce new regions within a prescribed range of sizes. These new regions are then clustered based on the mean-shift colors. Each cluster is a set of new regions with similar color and size that are distributed in space, as can be seen in Figure 8.9(c,d). These three steps are detailed below.
Fig. 8.9 Segmentation and clustering. (a) Matted leaves from source image. (b) Regions created after the mean shift filtering. (c) The first 30 color-coded clusters. (d) 17 textured clusters (textures from source images).
Mean shift filtering. The mean shift filter is performed on color and space jointly. We map the RGB color space into LUV space, which is more perceptually meaningful. We define our multivariate kernel as the product of two radially symmetric kernels:

K_{hs,hr}(x) = (C / (hs² hr²)) kE(‖xs/hs‖²) kE(‖xr/hr‖²),

where xs is the spatial vector (2D coordinates), xr is the color vector in LUV, and C is the normalization constant. kE(x) is the profile of the Epanechnikov kernel: kE(x) = 1 − x if 0 ≤ x ≤ 1, and 0 for x > 1. The bandwidth parameters hs and hr are interactively set by the user.
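The kernel above can be written directly from its definition. The following sketch is an illustration with the Epanechnikov profile, not an optimized mean-shift implementation.

```python
import numpy as np

def epanechnikov_profile(x):
    """k_E(x) = 1 - x for 0 <= x <= 1, and 0 otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where((x >= 0) & (x <= 1), 1.0 - x, 0.0)

def joint_kernel(x_s, x_r, h_s, h_r, C=1.0):
    """Product kernel K_{h_s,h_r}(x) for joint spatial/range mean shift.

    x_s: 2D spatial offset, x_r: LUV color offset; h_s and h_r are the
    user-set bandwidths and C a normalization constant.
    """
    ks = epanechnikov_profile(np.dot(x_s, x_s) / h_s**2)   # ||x_s / h_s||^2
    kr = epanechnikov_profile(np.dot(x_r, x_r) / h_r**2)   # ||x_r / h_r||^2
    return C / (h_s**2 * h_r**2) * ks * kr
```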
where xs is the spatial vector (2D coordinates), and xr is the color vector in LUV, and C is the normalization constant. kE (x) the profile of Epanechnikov kernel, kE (x) = 1 − x if 0 ≤ x ≤ 1, and 0 for x > 1. The bandwidths parameters hs and hr are interactively set by the user. Region split or merge. After applying mean-shift filtering, we build a graph on the image grid with each pixel as a node; edges are established between 8neighboring nodes if their (mean-shift) color difference is below a threshold (1 in our implementation). Prior to the split or merge operation, the user specifies the
range of valid leaf sizes. Connected regions that are too small are merged with neighboring ones until a valid size is reached. On the other hand, connected regions that are too large are split into smaller valid ones. Splitting is done by seeding and region growing; the seeds can either be placed automatically by even distribution or be set interactively. This split or merge operation produces a set of new regions.

Color-based clustering. Each new region is considered a candidate leaf. We use K-means clustering to obtain about 20 to 30 clusters. We only keep about 10 clusters associated with the brightest colors, as they are much more likely to represent visible leaves. Each new region in the kept clusters is fitted to an ellipse through singular value decomposition (SVD).

User interaction. The user can click to add a seed for splitting and creating a new region, or click on a specific cluster to accept or reject it. With the exception of the fig tree shown in Figure 8.18, the leaves were fully automatically segmented. (For the fig tree, the user manually specified a few boundary regions and clusters.)
Adding leaves to branches

There are two types of leaves that are added to the tree model: leaves that are created from the source images using the results of segmentation and clustering (Section 8.3.1), and leaves that are synthesized to fill in areas that either are occluded or lack coverage from the source viewpoints.

Creating leaves from segmentation. Once we have produced the clusters, we proceed to compute their 3D locations. Recall that each region in a cluster represents a leaf. We also have a user-specified generic leaf model for each tree example (usually an ellipse, but a more complicated model is possible). For each region in each source image, we first find the closest pre-computed 3D point (Section 8.1) or branch (Section 8.2) along the line of sight of the region's centroid. The 3D location of the leaf is then snapped to the closest pre-computed 3D point or to the nearest 3D position on the branch. Using branches to create leaves is necessary to make up for the possible lack of pre-computed 3D points (say, due to using a small number of source images). The orientation of the generic leaf model is initially set to be parallel to the source image plane. In the case where more than three pre-computed 3D points project onto a region, SVD is applied to all these points to compute the leaf's orientation. Otherwise, its orientation is such that its projection is closest to the region shape. This approach to leaf generation is simple and fast, and it is applied to each of the source images. However, since we do not compute explicit correspondences of regions across different views, it typically results in multiple leaves for a given corresponding leaf region. (Correspondence is not computed because our automatic segmentation technique does not guarantee consistent segmentation across the source images.) We simply use a distance threshold (half the width of a leaf) to remove redundant leaves.

Synthesizing missing leaves. Because of the lack of coverage by the source images and occlusion, the tree model that has been reconstructed thus far may be missing a
significant number of leaves. To overcome this limitation, we synthesize leaves on the branch structure to produce a more evenly distributed leaf density. The leaf density on a branch is computed as the ratio of the number of leaves on the branch to the length of the branch. We synthesize leaves on branches with the lowest leaf densities (bottom third) using the branches with the highest leaf densities (top third) as exemplars.
8.3.2 Graph-based leaf extraction

Recovering the geometry of the individual leaves of plants is a difficult problem, due to the similarity of color between different overlapping leaves. To minimize the amount of user interaction, we formulate leaf segmentation as an interactive graph-based optimization aided by 3D and 2D information. The graph-based technique simultaneously partitions the 3D points and image pixels into discrete sets, with each set representing a leaf. We bring the user into the loop to make the segmentation process more efficient and robust. One unique feature of our system is the joint use of 2D and 3D data to allow simple user assistance, as all our images and 3D data are perfectly registered. The process is not labor-intensive, as we only need one image segmentation for each leaf. This is because the leaf reconstruction algorithm (see Section 8.3.3) needs only one image segmentation boundary per leaf. This is sub-optimal but substantially more efficient. There are two main steps to the segmentation process: automatic segmentation by a graph partition algorithm, followed by user interaction to refine the segmentation. Our system responds to user input by immediately updating the graph and image boundaries.
Graph partition

The weighted graph G = {V, E} is built by taking each 3D point as a node and connecting it to its K-nearest neighboring points (K = 3) with edges. The K-nearest neighbors are computed using the 3D Euclidean distance, and each connecting edge should be visible in at least one view. The weight on each edge reflects the likelihood that the two points being connected belong to the same leaf. We define a combined distance function for a pair of points (nodes) p and q as

d(p, q) = (1 − α) d3D(p, q) / (√2 σ3D) + α d2D(p, q) / (√2 σ2D),

where α is a weighting scalar set to 0.5 by default. The 3D distance d3D(p, q) is the 3D Euclidean distance, with σ3D being its variance. The 2D distance measurement, computed over all observed views, is
d2D(p, q) = maxi { max_{ui ∈ [pi, qi]} gi(ui) }.
The interval [pi, qi] specifies the line segment joining the projections of p and q in the ith image, and ui is an image point on this line segment. The function g(·) is the color gradient along the line segment; it is approximated using the color difference between adjacent pixels. σ2D is the variance of the color gradient. The weight of a graph edge is defined as w(p, q) = e^{−d²(p,q)}. The initial graph partition is obtained by thresholding the weight function w(·), with k set to 3 by default:

w(p, q) = e^{−d²(p,q)} if d3D < k σ3D and d2D < k σ2D, and w(p, q) = 0 otherwise.

This produces groups that are clearly different. However, at this point, the partitioning is typically coarse, requiring further subdivision of each group. We use the normalized cut approach described in [200, 185]. The normalized cut computation is efficient in our case, as the initial graph partition significantly reduces the size of each subgraph, and the weight matrix is sparse: it has at most (2K + 1)N non-zero elements for a graph of N nodes with K-nearest neighbors. This approach is effective due to the joint use of 2D and 3D information. Figure 8.10 shows that if only the 3D distance is used (α = 0), collections of leaves are segmented but not individual ones. When the images are used as well (α = 0.5), each collection is further partitioned into individual leaves using edge point information.
Fig. 8.10 Benefit of jointly using 3D and 2D information. (a) The projection of visible 3D points (in yellow) and connecting edges (in red) are superimposed on an input image. Using only 3D distance resulted in coarse segmentation of the leaves. (b) The projection of segmented 3D points with only the connecting edges superimposed on the gradient image (in white). A different color is used to indicate a different group of connecting edges. Using both 3D and 2D image gradient information resulted in segmentation of leaflets. (c) Automatically generated leaflets are shown as solid-colored regions. The user drew the rough boundary region (thick orange line) to assist segmentation, which relabels the red points and triggers a graph update. The leaflet boundary is then automatically extracted (dashed curve).
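As a small illustration of the weight definition and the default cut-off described above, the sketch below computes w(p, q) for one edge; the distances and variances are assumed to be precomputed.

```python
import numpy as np

def leaf_edge_weight(d3d, d2d, sigma3d, sigma2d, alpha=0.5, k=3.0):
    """Edge weight w(p,q) for the leaf graph, with the default thresholding.

    d3d is the 3D Euclidean distance between the two points and d2d the
    maximum color gradient along their projected segments over all views;
    sigma3d and sigma2d are the corresponding variances.
    """
    if d3d >= k * sigma3d or d2d >= k * sigma2d:
        return 0.0                          # clearly different leaves: zero weight
    d = (1 - alpha) * d3d / (np.sqrt(2) * sigma3d) + alpha * d2d / (np.sqrt(2) * sigma2d)
    return float(np.exp(-d * d))
```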
User interface

In general, the process of segmentation is subjective; the segments that represent different objects depend on the person's perception of what an object is. In the case of partitioning the image into leaves, while the interpretation is much clearer, the problem is nonetheless still very difficult. This is because leaves of the same plant look very similar, and boundaries between overlapping leaves are often very subtle. We designed simple ways for the user to help refine areas where automatic segmentation fails. In the interface, each current 3D group is projected into the image as a feedback mechanism. The user can modify the segmentation and obtain the image boundary segmentation in any of the following ways:
• Click to confirm segmentation. The user can click on an image region to indicate that the current segmentation group is acceptable. This operation triggers 2D boundary segmentation, described in Section 8.3.2.
• Draw to split and refine. The user can draw a rough boundary to split the group and refine the boundary. This operation triggers two actions. First, it cuts off the connecting edges crossing the marked boundary, so that points inside the boundary are relabelled, which in turn causes an update of the graph partition by splitting the group into two subgroups. The graph updating method is described in Section 8.3.2. Second, it triggers 2D boundary segmentation.
• Click to merge. The user can click on two points to create a connecting edge that merges the two subgroups.
Graph update

The graph update for the affected groups is formulated as a two-label graph-cut problem [11] that minimizes the following energy function:

E(l) = Σ_{(p,q)} (1 − δ(lp, lq)) · 1 / (d²(p, q) + ε) + Σ_p D(lp),

where δ(lp, lq) is 1 if lp = lq and 0 if lp ≠ lq, with lp, lq ∈ {0, 1}. ε is a very small positive constant, set to 0.0001. The data term D(·) encodes the user-confirmed labels: for a point confirmed with label 0, D(0) = 0 and D(1) = ∞; for a point confirmed with label 1, D(0) = ∞ and D(1) = 0. The minimization is implemented as a min-cut algorithm that produces a global minimum [11]. The complexity of the min-cut is O(N³), with N nodes and at most 5N edges in our graph. Since each group is usually rather small (a few thousand nodes), the update is immediate, allowing the interface to provide real-time visual feedback.
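The update can be solved with any s-t min-cut implementation. The sketch below uses the PyMaxflow library as an assumed dependency (it wraps a max-flow algorithm of the kind referenced by [11]); the capacity encoding of the pairwise and data terms follows the energy above, with a large finite constant standing in for the infinite costs.

```python
import numpy as np
import maxflow  # PyMaxflow, assumed available; any s-t min-cut solver would do

def update_labels(n_nodes, edges, dists, confirmed, eps=1e-4, big=1e9):
    """Two-label relabelling of one affected group by a min-cut (sketch).

    edges:     list of (p, q) node index pairs.
    dists:     combined distances d(p, q) for those edges.
    confirmed: dict {node: label} of user-confirmed labels (0 or 1).
    Label 0 is mapped to the source side and label 1 to the sink side.
    """
    g = maxflow.Graph[float]()
    nodes = g.add_nodes(n_nodes)
    for (p, q), d in zip(edges, dists):
        cap = 1.0 / (d * d + eps)           # pairwise cost (1 - delta) / (d^2 + eps)
        g.add_edge(nodes[p], nodes[q], cap, cap)
    for p, label in confirmed.items():
        # D(l_p): zero cost for the confirmed label, a huge cost for the other one
        g.add_tedge(nodes[p], big if label == 0 else 0.0, big if label == 1 else 0.0)
    g.maxflow()
    return np.array([g.get_segment(nodes[i]) for i in range(n_nodes)])
```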
Boundary segmentation

The image segmentation for a given group of 3D points in a given image is also solved as a two-way graph-cut problem, this time using a 2D graph (not the graph for our 3D points) built with pixels as nodes. Our segmentation algorithm is similar to that of [120]. However, in our algorithm, the foreground and background are automatically computed, as opposed to being supplied by the user in [120]. The foreground is defined as the entire region covered by the projected 3D points of a group. The background consists of the projections of all other points not in the group currently being considered. As was done in [120], we oversegment each image using the watershed algorithm in order to reduce the complexity of processing. Any reference to the image is actually a pointer to a color segment rather than to a pixel.
8.3.3 Model-based leaf reconstruction

Since leaves in the same plant are typically very similar, we adopt the strategy of extracting a generic leaf model from a sample leaf and using it to fit all the other leaves. This strategy turns out to be more robust as it reduces uncertainty due to noise and occlusion by constraining the shapes of leaves.
Fig. 8.11 Leaf reconstruction for poinsettia (Figure 8.18). (a) Reconstructed flat leaves using 3D points. The generic leaf model is shown at top right. (b) Leaves after deforming using image boundary and closest 3D points.
Extraction of a generic leaf model

To extract a generic leaf model, the user manually chooses an example leaf from its most fronto-parallel view, as shown in Figure 8.11. The texture and boundary
associated with the leaf are taken to be the flat model of the leaf. The leaf model consists of three polylines: two for the leaf boundary and one for the central vein. Each polyline is represented by about 10 vertices. The leaf model is expressed in a local Euclidean coordinate frame with the x−axis being the major axis. The region inside the boundary is triangulated; the model is automatically subdivided to increase the accuracy of the model, depending on the density of points in the group.
Leaf reconstruction

Leaf reconstruction consists of four steps: generic flat leaf fitting, 3D boundary warping, and shape deformation, followed by texture assignment.

Flat leaf fit. We start by fitting the generic flat leaf model to the group of 3D points. This is done by computing the principal components of the data points of the group via SVD. A flat leaf is reconstructed in the local coordinate frame determined by the first two components of the 3D points. Then, the flat leaf is scaled in two directions by mapping it to the model. The recovered flat leaves are shown in Figure 8.11(a). There is, however, an orientation ambiguity for the flat leaf. Note that this orientation ambiguity does not affect the geometry reconstruction procedure described in this section, so the ambiguity is not critical at this point. A leaf usually faces up and away from a branch. We use this heuristic to find the orientation of the leaf, i.e., facing up and away from the vertical axis going through the centroid of the whole plant. For a complicated branching structure, the leaf orientations may be incorrect. Once the branches have been reconstructed (Section 8.2), the system automatically recomputes the leaf orientations using information from the branching structure.

Leaf boundary warping. While the 3D points of a leaf are adequate for locating its pose and approximate shape, they do not completely specify the leaf boundary. This is where boundary information obtained from the images is used. Each group of 3D points is associated with 2D image segmentations in different views (if such segmentations exist). A leaf boundary will not be refined if there is no corresponding image segmentation for the leaf. On the other hand, if multiple regions (across multiple views) exist for a leaf, the largest region is used. Once the 3D leaf plane has been recovered, we back-project the image boundary onto the leaf plane. We look for the best affine transformation that maps the flat leaf to the image boundary by minimizing the distances between the two sets of boundary points on the leaf plane in space. We adapted the ICP (Iterative Closest Point) algorithm [9] to compute the registration of the two boundary curves. We first compute a global affine transformation using the first two components of the SVD decomposition as the initial transformation. Correspondences are established by assigning each leaf boundary point to the closest image boundary point; these correspondences are used to re-estimate the affine transformation. The procedure stops when the mean distance between the two sets of points falls below a threshold.
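The flat leaf fit described above is essentially a principal component analysis of the group of 3D points. A minimal sketch is given below; scaling to the generic model and the subsequent ICP boundary refinement are omitted.

```python
import numpy as np

def fit_flat_leaf(points):
    """Fit a plane and a local frame to a group of leaf 3D points via SVD.

    Returns the centroid, an orthonormal frame whose first two axes span the
    leaf plane (first axis = major axis, third axis = plane normal), and the
    2D coordinates of the points in that frame.
    """
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    _, _, Vt = np.linalg.svd(pts - centroid, full_matrices=False)
    frame = Vt                                  # principal directions as rows
    coords2d = (pts - centroid) @ frame[:2].T   # projection onto the leaf plane
    return centroid, frame, coords2d
```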
Shape deformation. The final shape of the leaf is obtained by locally deforming the flat leaf in directions perpendicular to the plane to fit the nearest 3D points. This adds shape detail to the leaf (Figure 8.11(b)). Texture reconstruction. The texture of each leaf first inherits that of the generic model. The texture from the image segmentation is subsequently used to overwrite the default texture. This is done to ensure that occluded or slightly misaligned parts are textured.
8.4 Results and discussions

For the rendering of trees, since trees are usually not created with as detailed a geometry as ours, standard software packages such as Maya do not handle the specific inter-reflection and subsurface scattering of leaves. The discrepancy between the rendered images and the original images is mainly due to the rendering. We first show reconstruction results for a variety of trees with relatively small leaves. The recovered models have leaf counts ranging from about 3,000 to 140,000, all of which are generic leaves from segmentation and clustering.
• The tree in Figure 8.12 is large with relatively tiny leaves. It was captured with 16 images covering about 120°. We spent five minutes editing the branches after automatic reconstruction to clean up the appearance of the tree. Since the branches are extracted by connecting nearby points, branches that are close to each other may be merged. The rightmost visible branch in Figure 8.12 is an example of a merged branch.
• Figure 8.15 shows a medium-sized tree, which was captured with 16 images covering about 135°. The branches took 10 minutes to modify, and the leaf segmentation was fully automatic. The rightmost image in Figure 8.15 shows a view not covered by the source images; here, synthesized leaves are shown as well.
• The potted flower tree shown in Figure 8.13 is an interesting example: the leaf size relative to the entire tree is moderate, not small as in the other examples. Here, 32 source images were taken along a complete 360° path around the tree. Its leaves were discernible enough that our automatic leaf generation technique produced only moderately realistic leaves, since larger leaves require more accurate segmentation. The other challenge is the very dense foliage, dense to the extent that only the trunk is clearly visible. In this case, the user supplied only three simple replication blocks, shown in Figure 8.6(d); our system then automatically produced a very plausible-looking model. About 60% of the reconstructed leaves relied on the recovered branches for placement. Based on the leaf/tree size ratio, this example falls in the middle of the plant/tree spectrum shown in Figure 8.4.
• The fig tree shown in Figure 8.18 was captured using 18 images covering about 180°. It is a typical but challenging example, as there are substantial missing points in the crown. Nevertheless, its shape has been recovered reasonably well,
Fig. 8.12 Large tree example. Two of the source images are in the left column; the reconstructed model rendered at the same viewpoints is in the right column.
with a plausible-looking branch structure. The process was automatic, with the exception of the manual addition of a branch and a few adjustments to branch thickness.
We then show results for a variety of plants with different shapes and densities of foliage: nephthytis, poinsettia, schefflera, and an indoor tree. Because of the variation in leaf shape, size, and density, the level of difficulty differs for each example. We typically capture about 35 images, more for plants with smaller leaves (specifically the indoor tree). The captured image resolution is 1944 × 2592 (except for the poinsettia, which is 1200 × 1600). For the efficiency of structure from motion, we down-sampled the images to 583 × 777 (for the poinsettia, to 600 × 800). It took approximately 10 minutes for about 40 images on a
Fig. 8.13 Potted flower tree example. (a) One of the source images, (b) reconstructed branches, (c) complete tree model, and (d) model seen from a different viewpoint.
1.9GHz P4 PC with 1 GB of RAM. On average, we reconstructed about 30,000 3D points for the foreground plant. The statistics associated with the reconstructions are summarized in Table 8.1. About 80 percent of the leaves were automatically recovered by our system. Figure 8.16 shows a simple example of inserting the reconstructed model into a synthetic scene. We also show examples of plant editing: texture replacement (Figure 8.18), and branch and leaf cut-and-paste (Figure 8.17).
• The nephthytis plant has large broad leaves, which makes modeling easy. The 3D points are accurate, and leaf extraction and recovery are fairly straightforward. Only the extraction of 6 leaves was assisted by the user; the rest of the reconstruction was fully automatic. The 3D points were sufficient to characterize the spatial detail of the leaf shapes, as shown in Figure 8.1.
• The poinsettia and schefflera plants have medium-sized leaves; as a result, the recovered 3D points on the leaves were still of high quality. Occlusion becomes a problem
Fig. 8.14 Image-based modeling of a tree. From left to right: A source image (out of 18 images), reconstructed branch structure rendered at the same viewpoint, tree model rendered at the same viewpoint, and tree model rendered at a different viewpoint.
when the foliage is dense, and the foliage of the poinsettia (see Figure 8.18) and the schefflera is denser than that of the nephthytis. Leaf segmentation of the schefflera is complicated by the overlapping leaves and the small leaves at the tips of branches. We recovered about two-thirds of all possible leaves, and synthesized some of them (on the topmost branch) using the branching structure, as shown in Figure 8.17.
• The indoor tree, with its small leaves, was the most difficult to model. First, the 3D points are less accurate because they were typically recovered from 2D points on the occluding boundaries of leaves (which are not reliable). In addition, much more user interaction was required to recover the branching structure due to its
Fig. 8.15 Medium-sized tree example. From left to right: (a) One of the source images. (b) Reconstructed tree model. (c) Model rendered at a different viewpoint.
Fig. 8.16 An indoor tree. (a) an input image (out of 45). (b) the recovered model, inserted into a simple synthetic scene.
high complexity. The segmentation was fully automatic, using a smaller k = 2. For each group containing more than 3 points, the same automatic leaf-fitting procedure as in the other examples is used, except that the generic leaf model is much simpler. However, the orientations of the recovered leaves are noticeably less reliable than those of the large leaves in the other examples. If a group contains fewer than 3 points, it is no longer possible to compute the pose of the leaf; in this case, the pose of each such leaf was heuristically determined from the geometry of its nearest branch.
             # Images  # 3D pts  # FG pts  # Leaves  (α,k)    # AL  # UAL  # ASL  All leaves  BET (min)
nephthytis      35     128,000    53,000       30    (0,3)      23      6      0          29          5
poinsettia      35     103,000    83,000     ≈120    (0.5,3)    85     21     10         116          2
schefflera      40     118,000    43,000     ≈450    (0.5,3)   287     69     18         374         15
indoor tree     45     156,000    31,000    ≈1500    (0.3,2)   509     35    492        1036         40
Table 8.1 Reconstruction statistics. The foreground points are automatically computed as the largest connected component in front of the cameras; they include the plant, the pot, and sometimes part of the floor. The segmentation parameters α and k are defined in Section 8.3.2. Note: FG = foreground, AL = automatic leaves, UAL = user-assisted leaves, ASL = additional synthetic leaves, BET = branch edit time.
Fig. 8.17 Schefflera plant. Top left: an input image (out of 40 images). Top right: recovered model with synthesized leaves rendered at the same viewpoint as the top left. Bottom right: recovered model from images only. The white rectangles show where the synthesized leaves were added in the top right. Bottom left: recovered model after some geometry editing.
Fig. 8.18 Image-based modeling of poinsettia plant. (a) An input image out of 35 images, (b) recovered model rendered at the same viewpoint as (a), (c) recovered model rendered at a different viewpoint, (d) recovered model with modified leaf textures.
Discussions
We have described a system for constructing realistic-looking tree models from images. Our system was designed with the goal of minimizing user interaction. To this end, we devised automatic techniques for recovering the visible branches, generating plausible-looking branches that were originally occluded, and populating the tree with leaves. Our technique for reconstructing occluded branches is akin to image-based hair growing [243]. However, there are major differences: for example, branches have a tree structure, while hair is basically a set of independent lines. In addition, plausible-looking branches can be created to fill occluded spaces using the self-similarity assumption; there is no such analog for hair. There are certainly ways to improve our system. For one, the replication block need not necessarily be restricted to being part of the visible branches of the same tree. It is possible to generate a much larger, tree-specific database of replication blocks; the observed set of replication blocks can then be used to fetch appropriate blocks from the database for branch generation, thus providing a richer set of branch shapes.
The key idea of our plant modeling is to combine the available reconstructed 3D points with the images to more effectively segment the data into individual leaves. To increase robustness, we use a generic leaf model (extracted from the same image dataset) to fit all the other leaves. We also developed a user-friendly branch-structure editor that is likewise guided by 3D and 2D information. The results demonstrate the effectiveness of our system. We designed the system to be easy to use; specialized knowledge about plant structure, while helpful, is not required.
There are several straightforward improvements to our current implementation. For instance, the graph-based segmentation algorithm could be made more efficient by incorporating more priors based on real examples. Our current leaf reconstruction involves shape interpolation using the pre-computed 3D points; better estimates may be obtained by referring to the original images during this process. Also, we use only one 2D boundary to refine the shape of the 3D leaf model; it may be more robust to incorporate the boundaries from multiple views instead. However, occlusions are still a problem, and accounting for multiple views substantially complicates the optimization. A more complex model for handling complex-looking flowers could be built as suggested in [90]. Finally, for enhanced realism, one can use specialized algorithms for rendering specific parts of the plant, e.g., leaf rendering [240].
Our system currently requires that the images be pre-segmented into tree branches, tree leaves, and background. Although there are many automatic techniques for segmentation, the degree of automation for reliable segmentation is highly dependent on the complexity of the tree and the background. We currently do not see an adequate solution to this issue.
8.5 Bibliographic notes
This chapter is adapted from [184] by Quan, Tan, Zeng, Yuan, Wang and Kang, and [222] by Tan, Zeng, Wang, Kang and Quan. Many approaches have been proposed to model plants and trees; we roughly classify the literature into rule-based and image-based methods.
Rule-based methods
Rule-based methods use compact rules or grammars for creating models of plants and trees. As a prime example, Prusinkiewicz et al. [174] developed a series of approaches based on the idea of the generative L-system. Weber and Penn [242] use a series of geometric rules to produce realistic-looking trees. De Reffye et al. [31] also use a collection of rules, but the rules are motivated by models of plant growth. There are also a number of techniques that take into account various kinds of tree interaction with the environment (e.g., [141, 175, 238, 157, 158]).
While these methods are capable of synthesizing impressive-looking plants, trees and forests, they are based on rules and parameters that are typically difficult for a non-expert to use. Plants have specific shapes that are formed naturally (inherent growth patterns), caused by external biological factors (such as disease), a result of human activity (such as localized pruning), or shaped by other external factors (such as fire, flood, or nearby structures). Generating a model that closely resembles an actual plant under such a variety of real-world conditions would not be easy with this type of approach.
Image-based methods
Image-based methods model the plant directly from image samples (our proposed technique is one such method). Han et al. [72] described a Bayesian approach to modeling tree-like objects from a single image using good priors; the prior models are ad hoc, and the type of recovered 3D model is rather limited. The approaches described in [189, 202] mainly use the visual hull of the tree, computed from silhouettes, to represent a rough shape of the tree; the tree volume is then used to create the branching structure for synthesizing leaves. Shlyakhter et al.'s system [202] starts with an approximation of the medial axis of the estimated tree volume and ends with a simple L-system fit. Sakaguchi et al. [189, 190] use simple branching rules in voxel space instead of an L-system to build the branching structure. All these methods generate only approximate shapes with limited realism. More recently, Reche et al. [187] proposed a technique for computing a volumetric representation of the tree with opacity. While their results look impressive, their approach does not recover explicit geometries of the branches and leaves; as a result, their technique is limited to visualization, with no direct means for animation or editing.
Data-driven methods
Xu et al. [252] used a laser scanner to acquire range data for tree modeling. Part of our work, the generation of the initial visible branches, is inspired by their work. The major difference is that they use only a 3D point cloud for modeling; no registered source images are used. It is not easy to generate complete tree models from 3D points alone because of the difficulties in determining what is missing and in filling in the missing information. Our experience has led us to believe that adapting models to images is a more intuitive means for realistic modeling. The image-based approach is also more flexible for modeling a wide variety of trees at different scales.
Chapter 9
Façade modeling
This chapter proposes a semi-automatic image-based approach that uses images captured along the streets and relies on structure from motion to automatically recover the camera positions and point clouds as the initial stage of the modeling. We initialize a building façade as a flat rectangular plane or a developable surface, and the texture image of the flat façade is composited from the multiple visible images, with occluding objects handled explicitly. A façade is then decomposed and structured into a Directed Acyclic Graph of rectilinear elementary patches. The decomposition is carried out top-down by a recursive subdivision, followed by a bottom-up merging with detection of the architectural bilateral symmetry and repetitive patterns. Each subdivided patch of the flat façade is augmented with a depth that is optimized from the 3D points. Our system also allows the user to easily provide feedback in the 2D image space on the proposed decomposition and augmentation. Finally, our approach is demonstrated on a large number of façades from a variety of street-side images.
Fig. 9.1 A few façade modeling examples: some input images in the bottom row, the recovered model rendered in the middle row and on the top left, and two zoomed sections of the recovered model rendered in the top middle and on the top right.
9.1 Introduction
There is a tremendous demand for photo-realistic modeling of cities for games, movies and map services such as Google Earth and Microsoft Virtual Earth. However, most of the effort has so far been spent on large-scale city modeling from aerial photography. When zoomed to ground level, the viewing experience is often disappointing, with blurry models and little detail. On the other hand, many potential applications require street-level representations of cities, where most of our daily activities take place. Because of spatial constraints, the coverage of ground-level images is close-range, so more data needs to be captured and processed; this makes street-side modeling much more technically challenging. The current state of the art ranges from purely synthetic methods, such as the artificial synthesis of buildings based on grammar rules [150], to 3D scanning of street façades [60] and image-based approaches [32]. Müller et al. [151] need manual assignment of depths to the façade, as they have only one image; we, in contrast, have the information from the reconstructed 3D points to automatically infer the critical depth of each primitive. Früh and Zakhor [60] need tedious 3D scanning, while Debevec et al. [32] is dedicated to small sets of images and cannot scale up well to large-scale building modeling. We propose in this chapter a semi-automatic method to reconstruct 3D façade models of high visual quality from multiple ground-level street-view images. The key innovation of our approach is the introduction of a systematic and automatic decomposition schema of the façade for both analysis and reconstruction. The decomposition is achieved through a recursive subdivision that preserves the architectural structure, yielding a Directed Acyclic Graph representation of the façade through both top-down subdivision and bottom-up merging, with local bilateral symmetries and repetitive patterns handled. This representation naturally encodes the architectural shape prior of a façade and enables the depth of the façade to be optimally computed on the surface and at the level of the subdivided regions. We also introduce a simple and intuitive UI that assists the user in providing feedback on the façade partition.
Overview
Our approach is schematized in Figure 9.2.
SFM. From the captured sequence of overlapping images, we first automatically compute structure from motion to obtain a set of semi-dense 3D points and all camera positions. We then register the reconstruction with an existing approximate model of the buildings (often recovered from aerial images), using GPS data if provided, or manually if geo-registration information is not available.
Façade initialization. We initialize a building façade as a flat rectangular plane or a developable surface, obtained either automatically from the geo-registered approximate building model or manually, by marking up a line segment or a curve on the 3D points projected onto the ground plane. The texture image of the flat façade is
Fig. 9.2 Overview of the semi-automatic approach to image-based façade modeling.
computed from the multiple visible images of the façade. The detection of occluding objects in the texture composition is possible thanks to the multiple images with parallax.
Façade decomposition. A façade is then systematically decomposed into a partition of rectangular patches based on the horizontal and vertical lines detected in the texture image. The decomposition is carried out top-down by a recursive subdivision, followed by a bottom-up merging with detection of the architectural bilateral symmetry and repetitive patterns. The partition is finally structured into a Directed Acyclic Graph of rectilinear elementary patches. We also allow the user to easily edit the partition by simply adding and removing horizontal and vertical lines.
Façade augmentation. Each subdivided patch of the flat façade is augmented with a depth obtained from the MAP estimation of a Markov Random Field, with a data cost defined on the 3D points from structure from motion.
Façade completion. The final façade geometry is automatically re-textured from all input images.
Our main technical contribution is the introduction of a systematic decomposition schema of the façade, structured into a Directed Acyclic Graph and implemented as a top-down recursive subdivision and a bottom-up merging. This representation strongly embeds the architectural prior of the façades and buildings into the different stages of modeling. The proposed optimization for façade depth is also unique in that it operates on the façade surface and at the super-pixel level of whole subdivision regions.
Image Collection
We use a camera that usually faces the building façade and moves laterally along the street. The camera should preferably be held level, and neighboring views should overlap sufficiently so that feature correspondences can be computed. Depending on the distance between the camera and the objects, and the distance between neighboring viewing positions, the density and accuracy of the reconstructed points vary; we focus on the building façades when adjusting the distance between viewing positions and the camera parameters. We first compute point correspondences and structure from motion for a given sequence of images. There are standard computer vision techniques for structure from motion [47, 80]. We used the approach described in [119] to compute the camera poses and a semi-dense set of 3D point clouds in space. This technique is used because it has been shown to be robust and capable of providing sufficiently dense point clouds for object modeling purposes. The results of structure from motion for each sequence of images are shown in the companion video.
9.2 Façade initialization
In this chapter, we consider a façade to have a dominant planar structure; a façade is therefore a flat plane plus a depth field on the plane. We also expect and assume that the depth variation within a simple façade is moderate. A real building façade with complex geometry and topology can therefore be broken down into multiple simple façades; a building is merely a collection of façades, and a street is a collection of buildings. The dominant plane of the majority of façades is flat, but it can sometimes be curved as well, so we also allow the dominant surface to be a cylinder portion or any developable surface that can be swept by a straight line, as illustrated in Figure 9.3. To ease the description, but without loss of generality, we use a flat façade in the remainder of the chapter. Cylindrical façade examples are given in the experiments.
Fig. 9.3 A simple façade can be initialized from a flat rectangle (a), a cylindrical portion (b), or a developable surface (c).
9.2.1 Initial flat rectangle
The reference system of the 3D reconstruction can be geo-registered using the GPS data of the camera, if available, or using an interactive technique. As illustrated in Figure 9.2, the façade modeling process can begin with an existing approximate model of the buildings, often reconstructed from aerial images, such as those served publicly by Google Earth and Microsoft Virtual Earth. Alternatively, if no such approximate model exists, a simple manual process in the current implementation is used to segment the façades, based on the projections of the 3D points onto the ground plane. We draw a line segment or a curve on the ground to mark up a façade plane as a flat rectangle or a developable surface portion. The plane or surface position is automatically fitted to the 3D points, or manually adjusted if necessary.

Algorithm 19 (Photo-consistency check for occlusion detection)
Given a set of N image patches P = {p_1, p_2, ..., p_N} and a threshold η ∈ [0, 1] indicating the similarity of two patches, compute the index set of visible projections V and the index set of occluded projections O.
  for all p_i ∈ P do
    s_i ← 0   (accumulated similarity for p_i)
    for all p_j ∈ P do
      s_ij ← NCC(p_i, p_j)
      if s_ij > η then s_i ← s_i + s_ij
  n ← arg max_i s_i   (n is the patch with the best support)
  V ← ∅, O ← ∅
  for all p_i ∈ P do
    if s_in > η then V ← V ∪ {i} else O ← O ∪ {i}
  return V and O
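A direct transcription of Algorithm 19 in Python may help make the bookkeeping concrete. It assumes the projections of a façade point into the N cameras have already been sampled as equal-size grayscale patches; the NCC helper and the default η = 0.45 (the value reported in Section 9.6) are the only additions of this sketch.

import numpy as np

def ncc(p, q):
    """Normalized cross correlation of two equal-size image patches."""
    p = (p - p.mean()) / (p.std() + 1e-8)
    q = (q - q.mean()) / (q.std() + 1e-8)
    return float((p * q).mean())

def occlusion_check(patches, eta=0.45):
    """Photo-consistency check of Algorithm 19.
    patches: list of N patches of the same façade point, one per camera.
    Returns (V, O): indices of visible and occluded projections."""
    N = len(patches)
    s = np.zeros(N)                        # accumulated similarity per patch
    sim = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            sim[i, j] = ncc(patches[i], patches[j])
            if sim[i, j] > eta:
                s[i] += sim[i, j]
    n = int(np.argmax(s))                  # patch with the best support
    V = [i for i in range(N) if sim[i, n] > eta]
    O = [i for i in range(N) if sim[i, n] <= eta]
    return V, O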
9.2.2 Texture composition
The geometry of the façade is initialized as a flat rectangle. Usually a façade is too big to be entirely observable in one input image, so we first compose a texture image for the entire rectangle of the façade from the input images. This process is different from image mosaicking, as the images have parallax, which is helpful for removing undesired occluding objects, such as pedestrians, cars, trees, telegraph poles and rubbish bins, in front of the target façade. Furthermore, the façade plane position is known, in contrast to the unknown spatial positions in stereo algorithms. Hence, the photo-consistency constraint is more efficient and robust for occluding object removal, yielding a better texture image than a pure mosaic.
Multi-view occlusion removal
As in many multi-view stereo methods, photo consistency is defined as follows. Consider a 3D point X = (x, y, z, 1) with color c. If it has a projection x_i = (u_i, v_i, 1) = P_i X in the i-th camera P_i, then under the Lambertian surface assumption the projection x_i should also have the color c. However, if the point is occluded by some other object in this camera, the color of the projection is usually not the same as c. Note that c is unknown and is what we want to estimate. Assume the point X is visible from a set of cameras I = {P_i} and occluded by some objects in the remaining cameras Ī = {P_j}. The colors c_i of the projections in I should then all be equal to c, while they may differ from the colors c_j of the projections in Ī. Now, given the set of projection colors {c_k}, the task is to identify the set O of occluded cameras. In most situations, we can assume that the point X is visible from the majority of the cameras, so that c ≈ median_k{c_k}. Given the estimated color c of the 3D point, it is then easy to identify the occluded set O according to the distance of each projection color to c. To improve robustness, image patches centered at the projections are used instead of a single color, and the patch similarity, normalized cross correlation (NCC), is used as the metric. The details are presented in Algorithm 19. In this way, under the assumption that the façade is almost planar, each pixel of the reference texture corresponds to a point lying on the flat façade, and for each pixel we can identify whether it is occluded in a particular camera. Now, for a given planar façade in space, all visible images are first sorted according to their fronto-parallelism with respect to the façade: an image is considered more fronto-parallel if the projected area of the façade in that image is larger. The reference texture is first warped from the most fronto-parallel image, then from the less fronto-parallel ones according to the visibility of each point.
Inpainting
In each step, because regions occluded by other objects are not pasted onto the reference image, some regions of the reference texture image are left empty. In a later step, if an empty region is not occluded and is visible from a new camera, the region is filled. In this multi-view inpainting manner, the occluded regions are filled from the individual cameras. At the end of the process, if some regions are still empty, a standard image inpainting technique is used to fill them, either automatically [28] or interactively with guidance from the user, as described in Section 9.2.3. Since we have adjusted the cameras according to the image correspondences, this simple mosaic without explicit blending can already produce very visually pleasing results.
Fig. 9.4 Interactive texture refinement: (a) strokes drawn on the object to indicate removal, (b) the object is removed, (c) automatic inpainting, (d) green lines drawn to guide the structure, (e) better result achieved with the guide lines.
9.2.3 Interactive refinement
As shown in Figure 9.4, if the automatic texture composition result is not satisfactory, a two-step interactive user interface is provided for refinement. In the first step, the user can draw strokes to indicate which object or part of the texture is undesirable, as in Figure 9.4(a). The corresponding region is automatically extracted based on the input strokes, as in Figure 9.4(b), using the method in [120]. The removal operation can be interpreted as follows: the most fronto-parallel and photo-consistent texture selection returned by Algorithm 19 is not what the user wants. For each pixel, the best-support patch n and the visible set V returned by Algorithm 19 are therefore considered wrong, and P is updated to exclude V: P ← O. Then, if P ≠ ∅, Algorithm 19 is run again; otherwise, image inpainting [28] is used to fill the region automatically, as in Figure 9.4(c). In the second step, if the automatic texture filling is poor, the user can manually specify important missing structure information by extending a few curves or line segments from the known into the unknown regions, as in Figure 9.4(d). Then, as in [217], image patches are synthesized along these user-specified curves in the unknown region, using patches selected around the curves in the known region; Loopy Belief Propagation is used to find the optimal patches. After completing structure propagation, the remaining unknown regions are filled using patch-based texture synthesis, as in Figure 9.4(e).
Fig. 9.5 Structure-preserving subdivision. The hidden structure of the façade is extracted to form a grid in (b). These hypotheses are evaluated according to the edge support in (c), and the façade is recursively subdivided into several regions in (d). Since there is not enough support to separate regions A, B, C, D, E, F, G, H, they are all merged into a single region M in (e).
9.3 Façade decomposition
Decomposing a façade also means analyzing its structure, in the hope of reconstructing it with a smaller number of elements. The façades that we consider inherit natural horizontal and vertical directions by construction. As a first approximation, we may take all visible horizontal and vertical lines to construct an irregular partition of the façade plane into rectangles of various sizes. This partition captures the global rectilinear structure of the façades and buildings and also keeps all discontinuities of the façade substructures. It usually gives an over-segmentation of the image into patches, but this over-segmentation has several advantages: the over-segmenting lines can be regarded as auxiliary lines that regularize the compositional units of the façades and buildings, and some 'hidden' rectilinear structure introduced during construction can also be re-discovered by this over-segmentation process.
9.3.1 Hidden structure discovery
To discover the structure inside the façade, edges in the reference texture image are first detected with the Canny detector [17]. With such edge maps, the Hough transform is used to recover lines. To improve robustness, the Hough transform is constrained to the horizontal and vertical directions only, which holds for most architectural façades. The detected lines form a grid that partitions the whole reference image; this grid contains many non-overlapping small line segments, obtained by taking the intersections of the Hough lines as endpoints, as in Figure 9.5(b). These small line segments are the hypotheses for partitioning the façade. The Hough transform is well suited to structure discovery since it can extract hidden global information from the façade and align small line segments to this hidden structure. However, some small line segments of the grid may not really be a partition boundary between two different regions. Hence, a weight w_e is defined on each small line segment e to indicate the likelihood that it is a boundary between two different regions, as shown in Figure 9.5(c). This weight is computed as the number of edge points of the Canny edge map covered by the line segment.
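The following sketch illustrates the axis-constrained line detection and the segment weighting on a binary Canny edge map, using a simple per-row and per-column vote accumulation in place of a full Hough transform. The thresholds and function names are illustrative assumptions, not the system's actual parameters.

import numpy as np

def detect_grid_lines(edge_map, min_votes=50, nms_radius=5):
    """Detect horizontal and vertical line hypotheses from a binary edge map
    by accumulating edge votes per row and per column (an axis-aligned
    Hough transform), followed by greedy non-maximum suppression."""
    def peaks(votes):
        kept = []
        for i in np.argsort(votes)[::-1]:
            if votes[i] < min_votes:
                break
            if all(abs(int(i) - k) > nms_radius for k in kept):
                kept.append(int(i))
        return sorted(kept)

    rows = peaks(edge_map.sum(axis=1))   # y positions of horizontal lines
    cols = peaks(edge_map.sum(axis=0))   # x positions of vertical lines
    return rows, cols

def segment_weights(edge_map, rows, cols):
    """Weight w_e of each small horizontal grid segment: the number of Canny
    edge points it covers (vertical segments are handled analogously; a small
    band around the line could be used for extra robustness)."""
    weights = {}
    xs = [0] + cols + [edge_map.shape[1]]
    for y in rows:
        for x0, x1 in zip(xs[:-1], xs[1:]):
            weights[(y, x0, x1)] = int(edge_map[y, x0:x1].sum())
    return weights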
Remark on over-segmented partition
It is true that the current partition schema is sensitive to the segmentation parameters. It is important to note, however, that a slightly over-segmented partition is usually not harmful for the purpose of modeling. A perfect partition certainly eases the regularization of the façade augmentation by depth presented in the next section; nevertheless, an imperfect partition, particularly a slightly over-segmented one, does not affect the modeling results, especially when the 3D points are dense and the optimization works well.
9.3.2 Recursive subdivision
Given a region D of the texture image, it is divided into two rectangular subregions D_1 and D_2, such that D = D_1 ∪ D_2, by the line segment L with the strongest support from the edge points. After D is subdivided into two separate regions, the subdivision procedure continues recursively on D_1 and D_2. The recursion stops if either the target region D is too small to be subdivided, or there is no sufficiently strong hypothesis, i.e., the region D is very smooth. For a façade, bilateral symmetry about a vertical axis may not exist for the whole façade, but it often exists locally and can be used for more robust subdivision. First, for each region D, the NCC score s_D of the two halves D_1 and D_2, vertically divided at the center of D, is computed. If s_D > η, region D is considered to have bilateral symmetry. Then, the edge maps of D_1 and D_2 are averaged, and subdivision is done recursively on D_1 only. Finally, the subdivision of D_1 is reflected across the axis to become the subdivision of D_2, and the two subdivisions are merged into the subdivision of D.
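A minimal recursive-subdivision sketch along these lines is shown below; it splits on the row or column with the strongest edge support and stops on small or smooth regions. The local-symmetry handling and the bottom-up merging described below are omitted, and the stopping thresholds are illustrative.

import numpy as np

def subdivide(region, edge_map, min_size=20, min_support=30):
    """Recursively split a rectangular region (x0, y0, x1, y1) by the
    horizontal or vertical line with the strongest edge support.
    Returns either the leaf region or ('v'|'h', split, left, right)."""
    x0, y0, x1, y1 = region
    if (x1 - x0 < 2 * min_size) and (y1 - y0 < 2 * min_size):
        return region                                   # too small to split

    sub = edge_map[y0:y1, x0:x1]
    col_votes = sub.sum(axis=0).astype(float)           # support of vertical splits
    row_votes = sub.sum(axis=1).astype(float)           # support of horizontal splits
    col_votes[:min_size] = 0; col_votes[-min_size:] = 0  # keep children >= min_size
    row_votes[:min_size] = 0; row_votes[-min_size:] = 0

    cx, cy = int(col_votes.argmax()), int(row_votes.argmax())
    if max(col_votes[cx], row_votes[cy]) < min_support:
        return region                                   # smooth region: stop
    if col_votes[cx] >= row_votes[cy]:                  # vertical split at x = x0 + cx
        s = x0 + cx
        return ('v', s,
                subdivide((x0, y0, s, y1), edge_map, min_size, min_support),
                subdivide((s, y0, x1, y1), edge_map, min_size, min_support))
    s = y0 + cy                                         # horizontal split at y = y0 + cy
    return ('h', s,
            subdivide((x0, y0, x1, s), edge_map, min_size, min_support),
            subdivide((x0, s, x1, y1), edge_map, min_size, min_support))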
Fig. 9.6 Merging support evaluation: (a) edge weight support; (b) region statistics support.
Recursive subdivision is good at preserving the boundaries of man-made structures. However, it may produce unnecessary fragments for depth computation and rendering, as in Figure 9.5(d). Hence, as a post-processing step, if two leaf subdivision regions A and B are neighbors and there is not enough support s_AB to separate them, they are merged into one region. The support s_AB for separating two neighboring regions A and B is first defined as the strongest weight of all the small line segments on the border between A and B: s_AB = max_e{w_e}. However, the weights of small line segments only offer a local image statistic on the border. To improve robustness, the region statistics of A and B can be used as a dual, more global source of information. As shown in Figure 9.6, since regions A and B may not have the same size, this region statistic similarity is defined as follows. First, an axis is defined on the border between A and B, and region B is reflected across this axis to give a region $\overrightarrow{B}$. The overlap region $A \cap \overrightarrow{B}$ is defined to be the pixels of A whose locations lie inside $\overrightarrow{B}$. In a similar way, $\overleftarrow{A} \cap B$ contains the pixels of B whose locations lie inside $\overleftarrow{A}$, and it is then reflected across the same axis to become $\overrightarrow{\overleftarrow{A} \cap B}$. The normalized cross correlation (NCC) between $A \cap \overrightarrow{B}$ and $\overrightarrow{\overleftarrow{A} \cap B}$ is used to define the region similarity of A and B. In this way, only the symmetric parts of A
and B are used for region comparison, so the effect of distant parts of the regions is avoided, which would occur if the sizes of A and B were dramatically different and a global statistic such as a color histogram were used. Weighted by a parameter κ, the support s_AB for separating two neighboring regions A and B is now defined as

s_{AB} = \max_e \{ w_e \} - \kappa \, \mathrm{NCC}\left( A \cap \overrightarrow{B},\ \overrightarrow{\overleftarrow{A} \cap B} \right).
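To make the merging criterion concrete, the sketch below evaluates s_AB for two horizontally adjacent rectangular regions sharing a vertical border, using the simpler equal-width overlap so that reflecting B across the border aligns it directly with A's strip. The default κ = 0.75 is the value reported in Section 9.6; everything else (function names, grayscale input) is an assumption of this example. In this sketch, two regions would be merged whenever the returned support falls below a chosen threshold.

import numpy as np

def ncc(p, q):
    p = (p - p.mean()) / (p.std() + 1e-8)
    q = (q - q.mean()) / (q.std() + 1e-8)
    return float((p * q).mean())

def merge_support(texture, A, B, border_weights, kappa=0.75):
    """Support s_AB for separating two horizontally adjacent regions
    A = (x0, y0, xb, y1) and B = (xb, y0, x1, y1) sharing the vertical
    border x = xb; border_weights are the w_e of the segments on that border."""
    x0, y0, xb, y1 = A
    _,  _,  x1, _ = B
    w = min(xb - x0, x1 - xb)               # width of the symmetric overlap
    strip_A = texture[y0:y1, xb - w:xb]     # part of A next to the border
    strip_B = texture[y0:y1, xb:xb + w]     # part of B next to the border
    # reflecting B across the border aligns it with A's strip
    region_sim = ncc(strip_A, strip_B[:, ::-1])
    return max(border_weights) - kappa * region_sim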
Note that the representation of the façade is a binary recursive tree before merging and a Directed Acyclic Graph (DAG) after region merging. The DAG representation naturally supports level-of-detail rendering: when great detail is demanded, the rendering engine goes down the graph to expand all detail leaves and renders them accordingly; conversely, when less detail is needed, an intermediate node is rendered and all its descendants are pruned at rendering time.
9.3.3 Repetitive pattern representation
Repetitive patterns exist locally in many façades, and most of them are windows. The method in [151] uses a rather complicated technique for synchronizing the subdivision between different windows. To save storage space and to ease the synchronization task, our method maintains only one subdivision representation for windows of the same type. Precisely, a window template is first detected by a trained model [7] or manually indicated on the texture image. The templates are matched across the reference texture image using NCC as the measure. If good matches exist, they are aligned to the horizontal or vertical direction by hierarchical clustering, and the Canny edge maps of these regions are averaged. One example is illustrated in Figure 10.12. During the subdivision, each matched region is isolated by shrinking a bounding rectangle on the averaged edge map until it snaps to strong edges, and is regarded as a single leaf region. The edges inside these isolated regions should not affect the global structure, and hence these edge points are not used during the global subdivision procedure. Then, as in Figure 9.7, all the matched leaf regions are linked to the root of a common subdivision DAG for that type of window, by introducing 2D translation nodes for the pivot positions. Recursive subdivision is then executed on the averaged edge maps of all matched regions. To preserve photo-realism, the textures in these regions are not shared; only the subdivision DAG and the respective depths are shared. Furthermore, to improve the robustness of the subdivision, vertical bilateral symmetry is taken as a hard constraint for windows.
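A sketch of the template-matching step is given below, using OpenCV's normalized-correlation template matcher with a simple greedy non-maximum suppression. The score threshold and suppression window are illustrative, and the subsequent hierarchical clustering into aligned rows and columns, as well as the bounding-rectangle shrinking, are omitted.

import cv2
import numpy as np

def match_window_template(texture_gray, template, thresh=0.6):
    """Find repeated instances of a window template in the rectified façade
    texture using normalized cross correlation. Returns (x, y, w, h) boxes."""
    res = cv2.matchTemplate(texture_gray, template, cv2.TM_CCOEFF_NORMED)
    th, tw = template.shape[:2]
    matches = []
    res = res.copy()
    while True:
        _, max_val, _, max_loc = cv2.minMaxLoc(res)
        if max_val < thresh:
            break
        x, y = max_loc
        matches.append((x, y, tw, th))
        # suppress this neighbourhood so the next best match lies elsewhere
        res[max(0, y - th // 2):y + th // 2, max(0, x - tw // 2):x + tw // 2] = -1.0
    return matches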
Fig. 9.7 A DAG for repetitive pattern representation.
9.3.4 Interactive subdivision refinement
In most situations, the automatic subdivision works satisfactorily. If the user wants to further refine the subdivision layout, three line operations and two region operations are provided. The automatic subdivision operates only in the horizontal and vertical directions for robustness and simplicity; the fifth, 'carve', operation allows the user to manually sketch arbitrarily shaped objects, which appear less frequently, so that they can be included in the façade representation.
• Add. In an existing region, the user can sketch a stroke to indicate the partition, as in Figure 9.8(a). The edge points near the stroke are forced to become salient, so that the subdivision engine can extract the line segment and partition the region.
• Delete. The user can sketch a 'Z'-shaped stroke to cross out a line segment, as in Figure 9.8(b).
• Change. The user can first delete the partition line segments and then add a new line segment. Alternatively, the user can directly sketch a stroke: the line segment crossed by the stroke is deleted, and a new line segment is constructed accordingly, as in Figure 9.8(c). After the operation, all descendants of the target region in the DAG representation are recomputed.
• Group. The user can draw a stroke to cover several regions, in order to merge them into a single group, as in Figure 9.8(d).
• Carve. The user can draw line segments or NURBS curves to carve and split the existing subdivision layout, as in Figure 9.8(e). In this way, any shape can be extruded and hence supported.
Example results before and after editing are given in Figure 9.10.
Fig. 9.8 Five operations for subdivision refinement: in each pair, the left figure shows the original subdivision layout in red and the user-sketched stroke in green, while the right figure shows the resulting subdivision layout.
9.4 Façade augmentation
In the previous stage, we obtained a subdivision of the façade plane. In this section, each subdivision region is assigned a depth. If the 3D points are not dense, there might be subdivision regions that cannot be directly assigned a valid depth; these depths can be obtained from the MAP estimation of a Markov Random Field. In traditional stereo methods [193], a reference image is selected and a disparity or depth value is assigned to each of its pixels. The problem is often formulated as the minimization of a Markov Random Field (MRF) [67] energy function, providing a clean and computationally tractable formulation. However, a key limitation of these solutions is that they can only represent depth maps with a unique disparity per pixel, i.e., depth is a function of the image point. Capturing complete objects in this manner requires further processing to merge multiple depth maps, a complicated and error-prone procedure. A second limitation is that the smoothness term imposed by the MRF is viewpoint dependent: if a different view were chosen as the reference image, the results could be different. With our representation of the façade by subdivision regions, we can extend the MRF techniques by recovering a surface for each façade, i.e., a depth map on the flat façade plane instead of a depth map on an image plane. We can thus lay an MRF over the façade surface and define an image- and viewpoint-independent smoothness constraint.
9.4.1 Depth optimization
Consider the graph G = ⟨V, E⟩, where V = {s_k} is the set of all sites and E is the set of all arcs connecting adjacent nodes. The labeling problem is to assign a unique label h_k to each node s_k ∈ V. The solution H = {h_k} can be obtained by minimizing a Gibbs energy [67]:
E(H) = \sum_{s_k \in V} E^d_k(h_k) \; + \; \lambda \sum_{(s_k, s_l) \in E} E^s_{(k,l)}(h_k, h_l)          (9.1)
where E^d_k(h_k) is the data cost (likelihood energy), encoding the cost when the label of site s_k is h_k, and E^s_{(k,l)}(h_k, h_l) is the prior energy, denoting the smoothness cost when the labels of adjacent sites s_k and s_l are h_k and h_l, respectively.
Fig. 9.9 Markov Random Field on the façade surface: (a) Layout 1; (b) Layout 2; (c) data cost.
Assume we have a set of M sample points {s_k}_{k=1}^M, and denote the normal direction of the initial base surface of the façade by n. The sites of the MRF correspond to height values h_1, ..., h_M measured from the sample points s_1, ..., s_M along the normal n. The labels H^k_1, ..., H^k_L are a set of possible height values that the variable h_k can take. If the k-th site is assigned label h_k, then the relief surface passes through the 3D point s_k + h_k n. Different from [239], where the base surface sample points are uniformly and densely defined, our graph layout is based on the subdivision regions, and is thus much more efficient for approximate optimization and less likely to fall into a local optimum. There are two possible choices for defining the MRF graph on the subdivision, as shown in Figures 9.9(a) and 9.9(b). Layout 1 is easy to understand since it is a regular grid. However, several sites may then represent the same subdivision region, which complicates the definition and unnecessarily scales up the optimization. Hence, we prefer to represent each subdivision region by a single site centered at the region, as in Layout 2. Different subdivision regions may, however, have very different areas, a situation not addressed in the original MRF definition. Hence, the data cost vector c^d_k of each site s_k is weighted by the area a_k of the k-th subdivision region, i.e., E^d_k = a_k c^d_k, and the smoothness cost matrix c^s_{(k,l)} is weighted by the length l_{(k,l)} of the border edge between the k-th and the l-th subdivision regions, i.e., E^s_{(k,l)} = l_{(k,l)} c^s_{(k,l)}.
9.4.2 Cost definition
Traditionally, the data cost is defined as the photo consistency between multiple images. Here, we instead measure photo consistency by means of the point set obtained from structure from motion. This reverse way of using photo consistency is more robust, and is the key reason for the accuracy of the top-performing multi-view stereo method [64]. As shown in Figure 9.9(c), the 3D points close to the working façade (within 0.8 meter) whose projections lie inside the subdivision region corresponding to the k-th site s_k are projected onto the normal direction n to obtain a normalized height histogram θ_k over the bins H^k_1, ..., H^k_L. The cost vector is then defined as c^d_k = exp{−θ_k}. Note that if no 3D point exists for a particular region, for example due to occlusion, a uniform distribution is chosen for θ_k. The smoothness cost is defined as

c^s_{(k,l)} = \exp\left( z_{(k,l)} \, \lVert (s_k + h_k n) - (s_l + h_l n) \rVert \right),

where z_{(k,l)} is the inverse symmetric Kullback-Leibler divergence between the normalized color histograms of the k-th and the l-th regions in the reference texture image. This definition penalizes the Euclidean distance between neighboring regions with similar texture, and favors minimal-area surfaces. Note that s_k is always placed at the center of the subdivision region, with its depth set to the peak of the height histogram θ_k, and H^k_1, ..., H^k_L are adaptively defined to span four standard deviations of the projection heights of the point set. Once all costs are defined on the graph, the energy is minimized by max-product Loopy Belief Propagation [244].
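The following sketch shows how the two cost terms might be assembled for one region and one pair of adjacent regions, following the definitions above. The histogram binning, the function names and the exact form of the exponent follow my reading of the text and are assumptions of this example, not the system's code.

import numpy as np

def data_cost(point_heights, labels, area):
    """Data cost E^d_k for one region: a normalized height histogram of the
    nearby 3D points over the candidate labels (assumed sorted ascending),
    turned into a cost vector exp(-theta) and weighted by the region area."""
    L = len(labels)
    if len(point_heights) == 0:
        theta = np.full(L, 1.0 / L)                      # uniform if no points (occlusion)
    else:
        hist, _ = np.histogram(point_heights, bins=L,
                               range=(labels[0], labels[-1]))
        theta = hist / max(hist.sum(), 1)
    return area * np.exp(-theta)

def smoothness_cost(s_k, s_l, labels, normal, z_kl, border_len):
    """Smoothness cost matrix E^s_{(k,l)}: penalizes the Euclidean distance
    between the displaced region centers, scaled by the texture-similarity
    term z_kl and by the shared border length."""
    L = len(labels)
    C = np.empty((L, L))
    for a in range(L):
        for b in range(L):
            pa = s_k + labels[a] * normal                # 3D position of site k at label a
            pb = s_l + labels[b] * normal
            C[a, b] = np.exp(z_kl * np.linalg.norm(pa - pb))
    return border_len * C

The total energy of Eq. (9.1) is then the sum of these data costs plus λ times the sum of the pairwise terms over adjacent regions, which is the quantity that the max-product loopy belief propagation minimizes.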
9.4.3 Interactive depth assignment
In most situations, the automatically reconstructed depth is already good enough for visual inspection. For buildings that need more detail, such as landmarks, our workflow also provides a user interface to facilitate interactive depth assignment, entirely in the 2D image space for ease of manipulation.
• Transfer from another region. If it is not easy to directly paint the desired depth, the depth can be transferred from another region by dragging an arrow line to indicate the source and target regions.
• Relative depth constraint. The relative depth between two regions can also be constrained by dragging a line with a circle at each end. The sign symbols in the circles indicate the order, and the radius of the circles, controlled by the + and − keys on the keyboard, represents the depth difference. The difference is taken as a hard constraint in the MRF optimization by merging the two nodes of the layout into one and updating the data and smoothness costs accordingly.
The depth maps of different façade examples are given in Figure 9.10.
Fig. 9.10 Three typical façade examples: (a) One input view. (b) The 3D points from SFM. (c) The automatic façade partition (the group of repetitive patterns is color-coded) on the initial textured flat façade. (d) The user-refined final partition. (e) The re-estimated smoothed façade depth. (f) The user-refined final depth map. (g) The façade geometry. (h) The textured façade model.
9.5 Façade completion
Parametrization and texture atlas
After the augmentation of the façade with the appropriate depths, each 2D region of the façade partition has evolved from a flat rectangle into a box on the dominant plane of the façade. The parametrization on the façade plane can only be used to represent the front faces of the augmented subdivision regions; the textures of all front faces are stored in one map, the front texture map. The discontinuities between regions due to differences in depth create additional side faces: a typical protruding rectangle will have left and right faces and top and bottom faces. The textures of all these side faces are stored in a separate side texture map. All textures, for both front and side faces, are automatically computed from the original registered images using the same method as in Section 9.2.2.
Re-modeling
So far, we have approximated each elementary unit of the façade as a cuboid box, which is sufficient for the majority of architectural objects at the scale of interest. Obviously, some façades may have elements of different geometries. Each element can be manually re-modeled by replacing the given object with a pre-defined generic model of type cylinder, sphere, or polygonal solid; the texture is then re-computed automatically from the original images. Columns, arches and pediments can be modeled this way. Our decomposition approach makes this replacement convenient, particularly for a group of elements, with automatic texture re-mapping. Figure 9.13(b) shows a re-modeling example. Again, all textures for the re-modeled objects are automatically computed from the original registered images using the same algorithm as in Section 9.2.2.
9.6 Results and discussions
Three representative large-scale data sets captured under different conditions were chosen to show the flexibility of our approach: video cameras on a vehicle for the first data set, a digital camera on a vehicle for the second, and a handheld camera for the third. For the computation of structure from motion, long sequences were broken down into subsequences of about 50 images, which were down-sampled to a resolution below 1000 × 1000. Semi-dense SFM is automatically computed for each subsequence with auto-calibration in about 10 minutes on a PC (Intel Core 2 6400 CPU at 2.13GHz with 3GB RAM). The subsequences are merged into a global sequence using only one fifth of the reconstructed points from the subsequences and the GPS/INS (Inertial Navigation System) data if available. To capture tall buildings in full,
an additional camera captures views looking upwards at 45 degrees, with little or no overlap between the viewing fields of the cameras. The cameras are mounted on a rigid rig that can be pre-calibrated, so that viewing positions can be transferred between the cameras if the computation for one camera is difficult.
Fig. 9.11 The modeling of a Chapel Hill street from 616 images: two input images on the top left, the recovered model rendered in the bottom row, and two zoomed sections of the recovered model rendered in the middle and on the right of the top row.
• Dishifu road, Canton. Shown in Figure 9.1, this data set was captured by a handheld Canon 5D camera with a 24mm wide-angle lens at an image resolution of 2912 × 4368. A total of 343 views were captured for one side (forward) of the street and 271 views for the opposite side (backward). The main difficulty for SFM is the drift of camera poses due to the lack of GPS/INS data. The nonlinear distortion was automatically corrected by rectifying long line segments in the first 10 images. The drift of the camera poses required us to manually break some façade planes into sub-façades for modeling. The arcade is modeled using two façades: a front façade for the top part of the arcade and the columns in front, and a back façade for its lower part. These two façades are simply merged together, and the invisible ceiling connecting the two parts is manually added. The interactive segmentation on the ground took about 1 hour, and the reconstruction took about 2 hours for both sides of the street.
• Baity Hill drive, Chapel Hill. Shown in Figure 9.11, this data set was captured by two video cameras [171] of resolution 1024 × 768 mounted on a vehicle with a GPS/INS. We sampled a sequence of 308 images from each camera. The resulting clouds of 3D points were geo-registered with the GPS/INS data. The video image quality is mediocre; however, the richness of the building textures was excellent for SFM. It took about 20 minutes for the segmentation on the ground.
The geometry of the building blocks is rather simple, and it was reconstructed in about 1 hour.
• Hennepin avenue, Minneapolis. Shown in Figure 9.12, this data set was captured by a set of cameras mounted on a vehicle equipped with a GPS/INS system. Each camera has a resolution of 1024 × 1360. The set included three cameras looking toward the street side: two at the same height and one higher, to create both a 0.3-meter horizontal parallax and a 1.0-meter vertical parallax between the cameras. An additional camera was pointed 45 degrees up to capture the top parts of the buildings. The main portion of Hennepin avenue is covered by a sequence of 130 views using only one of the side-looking cameras. An additional sequence of 7 viewing positions from the other side cameras was used to process the structure of the Masonic temple in order to recover finer details. To generate a more homogeneous texture layer from multiple images taken in different directions, the images were white-balanced using the diagonal model of illumination change [86]. The Hennepin avenue in Figure 9.12 was modeled in about 1 hour. The Masonic temple is the most difficult one and took about 10 minutes, including re-modeling. We assigned different reflective properties to the windows for the rendering results in the video.
Our approach has been found to be efficient: most of the manual post-editing was needed for visually important details near the roof tops of the buildings, where the common coverage of the images is small and the quality of the recovered point cloud is poor. The re-modeling with generic models for clusters of patches was done only for the Hennepin avenue example. It is obvious that the accuracy of the camera geometry and the density of the reconstructed points are key to the modeling. The GPS/INS data did help to improve the registration of the long sequences and to avoid the drift associated with SFM.
Typical façades
Some typical façade examples from each data set are shown in Figure 9.10; an example from the Minneapolis data is also shown in the flowchart. We show the automatic partition both before and after editing, and it can be seen that the majority of the façade partition can be computed automatically, with an over-segmentation followed by minor user adjustment. On average, the automatic computation takes about one minute per façade, followed by about another minute of manual refinement per façade, depending on the complexity and the desired reconstruction quality. All parameters were manually tuned on a small set of example façades and then fixed; specifically, η = 0.45, κ = 0.75, λ = 0.2 and L = 17 in our prototype system.
Fig. 9.12 Modeling of the Hennepin avenue in Minneapolis from 281 images: some input images in the bottom row, the recovered model rendered in the middle row, and three zoomed sections of the recovered model rendered in the top row.
Difficult façades
The Masonic temple façade in the third row of Figure 9.10 is the most difficult case that we encountered, mainly due to the specific capturing conditions. We have two lower cameras and only one upper camera for each vehicle position. The overlap of the images for the upper part of the building is small, and we reconstruct only a few points there. The depth map for the upper part is almost constant after optimization, and more intensive user interaction is required to re-assign the depths for this building.
Atypical façades
Figure 9.13 shows some special façade examples that are also handled nicely by our approach.
• Cylindrical façades. An example with two cylindrical façades is illustrated in Figure 9.13(a). The cylindrical façade with the letters is modeled first, then the cylindrical façade with the windows, and finally the background façade touching them.
• Re-modeling. This option was tested on the Hennepin avenue example in Figure 9.12. The re-modeling results with 12 re-modeled objects, shown in Figure 9.13(b), can be compared with the results obtained without re-modeling, shown on the right of Figure 9.10.
• Multiple façades. For topologically complex building façades, we can use multiple façades. The whole Canton arcade street in Figure 9.1 systematically
used two façades, one front and one back, where the back façade uses the front façade as its occluder.
(a) Two cylindrical façades.
(b) Re-modeling by replacing a cube with a cylinder or a sphere.
Fig. 9.13 Atypical façade examples: the geometry on the left and the textured model on the right.
Discussions
We have presented an image-based street-side modeling approach that takes a sequence of overlapping images captured along the street and produces complete photo-realistic 3D façade models. The system has been applied to three different large data sets, demonstrating the applicability of the approach to large-scale city modeling tasks. Our approach has several limitations in its current implementation that we would like to address in future work. The automatic depth reconstruction technique is based on a view-independent appearance assumption and may fail when modeling highly reflective, mirror-like buildings. The reflectance properties of the models, and in particular of the windows, are currently set manually for rendering, but they might be automatically recovered from multiple views, as we know both the point correspondences and the plane orientations. Furthermore, it would
be useful to register the façade model with an existing vector map. The automatic detection and reconstruction of non-rectilinear objects and features in the partition is another direction for future work; for instance, conic features can be reconstructed from two views [179].
9.7 Bibliographic notes
This chapter is adapted from [94] by Xiao, Fang, Tan, Zhao, Ofek and Quan. We classify the related literature into rule-based, image-based and vision-based modeling approaches.
Rule-based methods
The procedural modeling of buildings specifies a set of rules in the spirit of the L-system. The methods in [249, 150] are typical examples of procedural modeling. In general, procedural modeling needs expert specification of the rules and may be limited in the realism of the resulting models and their variations. Furthermore, it is very hard to define the rules needed to generate exact replicas of existing buildings.
Image-based methods
Image-based methods use images as guidance to interactively generate models of architecture. Façade, developed by Debevec et al. [32], is a seminal work in this category. However, the required manual selection of features and correspondences in different views is tedious and cannot be scaled up well. Müller et al. [151] use the limited domain of regular façades to single out the importance of windows in the architectural setting from a single image, and create an impressive result for the building façade, with the depth manually assigned. Although this technique is good for modeling regular buildings, it is limited to simple repetitive façades and is not applicable to street-view data such as in Figure 9.1. Oh et al. [159] presented an interactive system to create models from a single image; they also manually assign depths based on a painting metaphor. van den Hengel et al. [81] used a sketching approach in one (or more) images. Although this method is quite general, it is also difficult to scale up for large-scale reconstruction due to the heavy manual interaction. There are also a few manual modeling solutions on the market, such as Adobe Canoma, RealViz ImageModeler, Eos Systems PhotoModeler and The Pixel Farm PFTrack, which all require tedious manual model parametrization and point correspondences.
Vision-based methods
Vision-based methods automatically reconstruct urban scenes from images. Typical examples are the work in [207, 68], [25] and the dedicated urban modeling work pursued at UNC and UKT [171], which results in meshes from dense stereo reconstruction. Proper modeling with man-made structure constraints from the reconstructed point clouds and stereo data is not yet addressed there. Werner and Zisserman [246] used line segments for reconstructing buildings. Dick et al. [38] developed 3D architectural modeling from short image sequences; the approach is Bayesian and model-based, but relies on many specific architectural rules and model parameters. Zebedin et al. [253] developed a complete system for urban scene modeling based on aerial images. The result looks good from a top view, but not from ground level; our approach is therefore complementary to their system in that the street-level details are added. Früh and Zakhor [60] also used a combination of aerial imagery, ground color images and LIDAR scans to construct models of façades. However, like stereo methods, it suffers from the lack of a representation for man-made architectural structure.
Chapter 10
Building modeling
This chapter proposes an automatic approach that generates 3D photo-realistic building models from images captured along the streets at ground level. We first develop a multi-view semantic segmentation method that recognizes and segments each image at pixel level into semantically meaningful areas, each labeled with a specific object class, such as buildings, sky, ground, vegetation and cars. A partition scheme is then introduced to separate buildings into independent blocks using the major line structures of the scene. Finally, for each block, we propose an inverse patch-based orthographic composition and structure analysis method for fac¸ade modeling that efficiently regularizes the noisy and missing reconstructed 3D data. Our system has the distinct advantage of producing visually compelling results by imposing strong priors of building regularity. We demonstrate the fully automatic system on a typical city example to validate our methodology.
Fig. 10.1 An example of automatically reconstructed buildings in Pittsburgh from the images shown on the bottom.
10.1 Introduction
Current models of cities are often obtained from aerial images, as demonstrated by the Google Earth and Microsoft Virtual Earth 3D platforms. However, these methods using aerial images cannot produce photo-realistic models at ground level. As a transitional solution, Google Street View, Microsoft Live Street-Side and the like display captured 2D panorama-like images at fixed viewpoints. Obviously, this is insufficient for applications that require true 3D photo-realistic models to enable user interaction with the 3D environment. Researchers have proposed many methods to generate 3D models from images. Unfortunately, the interactive methods [32, 151, 94, 205] typically require significant user interaction, which cannot be easily deployed in large-scale modeling tasks, while the automatic methods [171, 25, 246] have focused on the early stages of the modeling pipeline and have not yet been able to produce regular meshes for buildings.
Overview
We propose in this chapter an automatic approach to reconstruct 3D models of buildings and façades from street-side images. The image sequence is reconstructed using a structure from motion algorithm to produce a set of semi-dense points and camera poses.

Approach. From the reconstructed sequence of input images, there are three major stages of city modeling. First, in Section 10.3, each input image is segmented per pixel by a supervised learning method into semantically meaningful regions labeled as building, sky, ground, vegetation and car. The classified pixels are then optimized across multiple registered views to produce a coherent semantic segmentation. Then, in Section 10.4, the whole sequence is partitioned into building blocks that can be modeled independently. The coordinate frame is further aligned with the major orthogonal directions of each block. Finally, in Section 10.5, we propose an inverse orthographic composition and shape-based analysis method that efficiently regularizes the missing and noisy 3D data with strong architectural priors.

Interdependency. Each step provides very helpful information for later steps. The semantic segmentation in Section 10.3 helps remove line segments that fall outside the recognized building regions in Section 10.4. It also identifies the occluding regions of the façade by filtering out non-building 3D points for the inverse orthographic depth and texture composition in Section 10.5.1. After the semantic segmentation results are mapped from the input image space to the orthographic space in Section 10.5.2, they are the most important cues for the boundary regularization, especially the upper boundary optimization in Section 10.5.4. When the model is produced, the texture re-fetching from the input images in Section 10.6 can also use the segmented regions to filter out occluding objects. On the other hand, the block partition in Section 10.4 provides a way to divide the data to the block level for further processing, and also gives accurate boundaries for the façades.
Fig. 10.2 Overview of our automatic street-side modeling approach.
Assumptions. For the city and the input images, our approach only assumes that building façades have two major directions, vertical and horizontal, which is true for most buildings except some special landmarks. We utilize generic features from filter banks to train a recognizer from examples, optimized across multiple views, to recognize sky regions without a blue-sky assumption. The final boundary between the sky and the buildings is robustly regularized by optimization. Buildings may be attached to each other or separated by a long distance, as the semantic segmentation can indicate the presence and absence of buildings. It is unnecessary to assume that buildings are perpendicular to the ground plane in our approach, as buildings are automatically aligned to the reconstructed vertical line direction.
10.2 Pre-processing The street-side images are captured by a camera mounted on a moving vehicle along the street and facing the building fac¸ades. The vehicle is equipped with GPS/INS (Inertial Navigation System) that is calibrated with the camera system. Figure 10.3 shows a few samples of input images.
Points
The structure from motion for a sequence of images is now standard [80, 47]. We use the semi-dense structure from motion presented in Section 5.4 or in [119] in our current
Fig. 10.3 Some samples of input images captured at ground level.
implementation to automatically compute semi-dense point clouds and camera positions. The advantage of the quasi-dense approach is that it provides a sufficient density of points that are globally and optimally triangulated in a bundle-like approach. The availability of pose data from the GPS/INS for each view further improves the robustness of the structure from motion and facilitates the large-scale modeling task. In the remainder of the chapter, we assume that a reconstructed sequence is a set of semi-dense reconstructed 3D points and a set of input images with registered camera poses. One example is shown in Figure 10.4.
Fig. 10.4 Reconstructed quasi-dense 3D points and camera poses.
Lines
Canny edge detection [17] is performed on each image, and connected edge points are linked together to form line segments. We then identify two groups of line segments: vertical ones and horizontal ones. The grouping [205, 183] is carried out by checking whether they go through a common vanishing point using a RANSAC method. Since we have semi-dense point matching information between each pair of images from the previous computation of the SFM, the matching of the detected line segments can be obtained. The pairwise matching of line segments is then extended to the whole sequence. As the camera is moving laterally on the ground, it is difficult to reconstruct the horizontal lines in 3D space due to the lack
of horizontal parallax. Therefore, we only reconstruct vertical lines that can be tracked over more than three views. Finally, we keep the 3D vertical lines whose directions are consistent with each other inside a RANSAC framework, and remove the outlier vertical lines.
Fig. 10.5 Reconstructed 3D vertical lines in red with the 3D points.
10.3 Building segmentation
For a reconstructed sequence of images, we are interested in recognizing and segmenting the building regions in all images. First, in Section 10.3.1, we train discriminative classifiers to learn the mapping from features to object classes. Then, in Section 10.3.2, multiple-view information is used to improve the segmentation accuracy and consistency.
10.3.1 Supervised class recognition We first train a pixel-level classifier from a labeled image database to recognize and distinguish five object classes, including building, sky, ground, vegetation and car.
Features
To characterize the image features, we use textons, which have proved effective in categorizing materials and general object classes [248]. A 17-dimensional filter bank, including 3 Gaussians, 4 Laplacians of Gaussians (LoG) and 4 first-order derivatives of Gaussians, is used to compute the responses on both training and testing images at the pixel level. The textons are then obtained as the centroids of K-means clustering on the responses of the filter bank. Since nearby images in the testing sequence are very similar, to save computation time and memory space,
we do not run the texton clustering over the whole sequence, but currently pick only one out of every six images for obtaining the clustered textons. After the textons are identified, the texture-layout descriptor [203] is adopted to extract the features for classifier training, because it has proved successful in recognizing and segmenting images of general classes. Each dimension of the descriptor corresponds to a pair $[r, t]$ of an image region $r$ and a texton $t$. The region $r$, relative to a given pixel location $i$, is a rectangle chosen at random within a rectangular window of $\pm 100$ pixels. The response $v_{[r,t]}(i)$ at the pixel location $i$ is the proportion of pixels under the region $r+i$ that have the texton $t$, i.e.
$$ v_{[r,t]}(i) = \sum_{j \in (r+i)} [T_j = t] \,/\, \mathrm{size}(r). $$

Classifier
We employ the Joint Boost algorithm [230, 203], which iteratively selects discriminative texture-layout filters as weak learners and combines them into a strong classifier of the form $H(l,i) = \sum_m h_i^m(l)$. Each weak learner $h_i(l)$ is a decision stump based on the response $v_{[r,t]}(i)$ of the form
$$ h_i(l) = \begin{cases} a\,[v_{[r,t]}(i) > \theta] + b & l \in C, \\ k_l & l \notin C. \end{cases} $$
For those classes that share the feature ($l \in C$), the weak learner gives $h_i(l) \in \{a+b, b\}$ depending on the comparison of the feature response to a threshold $\theta$. For classes not sharing the feature ($l \notin C$), the constant $k_l$ makes sure that unequal numbers of training examples of each class do not adversely affect the learning procedure. We use sub-sampling and random feature selection techniques for the iterative boosting [203]. The estimated confidence value can be reinterpreted as a probability distribution using the softmax transformation:
$$ P_g(l, i) = \frac{\exp(H(l,i))}{\sum_k \exp(H(l,k))}. $$
For performance and speed, the classifier is not trained from the full labeled data, which might be huge. We choose a subset of labeled images that are closest to the given testing sequence to train the classifier, in order to guarantee the learning of reliable and transferable knowledge. We use the gist descriptor [161] to characterize the distance between an input image and a labeled image, because the descriptor has been shown to work well for retrieving images of similar scenes in semantics, structure, and geo-location. We create a gist descriptor for each image with a 4 by 4 spatial resolution, where each bin contains the average response to steerable filters at 4 scales with 8, 8, 4 and 4 orientations respectively in that image region. After the distances between the labeled images and the input images of the testing sequence are computed, we choose the 20 closest labeled images from the database as the training data for the sequence by nearest-neighbor classification.
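To make the shared decision stump and the softmax transformation concrete, the following sketch (Python with NumPy; all function and variable names are hypothetical, not the authors' implementation) evaluates one texture-layout weak learner and converts accumulated boosted scores into class probabilities.

```python
import numpy as np

def weak_learner(v_rt, theta, a, b, k, sharing):
    """Evaluate one texture-layout decision stump h_i(l) for all classes l.

    v_rt    : response v_[r,t](i) of the pixel to the stump's region/texton pair
    theta   : decision threshold
    a, b    : stump parameters for the classes sharing this feature
    k       : per-class constants k_l for the non-sharing classes
    sharing : boolean mask, True for classes l in the sharing set C
    """
    return np.where(sharing,
                    a * float(v_rt > theta) + b,   # l in C: a[v > theta] + b
                    k)                             # l not in C: constant k_l

def softmax_confidence(H):
    """Turn accumulated boosted scores H(l, i) into P_g(l, i)."""
    H = H - H.max(axis=0, keepdims=True)           # numerical stability
    e = np.exp(H)
    return e / e.sum(axis=0, keepdims=True)

# toy usage: 5 classes, one pixel, one boosting round accumulated into H
H = np.zeros(5)
sharing = np.array([True, True, False, False, False])
k = np.array([0.0, 0.0, -0.2, 0.1, -0.1])
H += weak_learner(v_rt=0.37, theta=0.25, a=1.2, b=-0.3, k=k, sharing=sharing)
print(softmax_confidence(H[:, None]))              # P_g(l, i) for this pixel
```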
Fig. 10.6 Recognition and segmentation. (a) One input image. (b) The over-segmented patches. (c) The recognition per pixel. (d) The segmentation.
Location prior
A camera is usually kept approximately upright during capture. It is therefore possible to learn the approximate location priors of each object class. In a street-side image, for example, the sky always appears in the upper part of the image, the ground in the lower part, and the buildings in between. Thus, we can use the labeled data to compute the accumulated frequencies of the different object classes, $P_l(l, i)$. Moreover, the camera moves laterally along the street when capturing street-side images, so a pixel at a given height in the image space should have the same chance of belonging to the same class. With this observation, we only need to accumulate the frequencies in the vertical direction of the image space.
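A minimal sketch of such a row-wise location prior, assuming a set of hand-labeled images of equal height; the function name and the Laplace smoothing constant are illustrative choices, not the authors' code.

```python
import numpy as np

def location_prior(label_maps, num_classes, eps=1.0):
    """Row-wise location prior P_l(l, y) from hand-labeled images.

    label_maps : list of HxW integer label images (same height H assumed)
    Frequencies are accumulated only along the vertical image direction,
    giving one class distribution per image row.
    """
    H = label_maps[0].shape[0]
    counts = np.full((num_classes, H), eps)        # Laplace smoothing
    for lab in label_maps:
        for l in range(num_classes):
            counts[l] += (lab == l).sum(axis=1)    # accumulate per row
    return counts / counts.sum(axis=0, keepdims=True)
```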
10.3.2 Multi-view semantic segmentation
The per-pixel recognition produces a semantic segmentation of each input image. But the segmentation is noisy and needs to be optimized in a coherent manner for the entire reconstructed sequence. Since the testing sequence has been reconstructed by SFM, we utilize the point matching information between multiple views to impose segmentation consistency.
Graph topology
Each image $I_i$ is first over-segmented using the method of [53]. Then we build a graph $G_i = \langle V_i, E_i \rangle$ on the over-segmentation patches of each image. Each vertex $v \in V_i$ in the graph is an image patch, or super-pixel, of the over-segmentation, while the edges $E_i$ denote the neighboring relationships between
super-pixels. Then, the graphs $\{G_i\}$ from multiple images in the same sequence are merged into a large graph G by adding edges between two super-pixels that are in correspondence but come from different images. The super-pixels $p_i$ and $p_j$ in images $I_i$ and $I_j$ are in correspondence if and only if there is at least one feature track $t = \langle (x_u, y_u, i), (x_v, y_v, j), \dots \rangle$ with projection $(x_u, y_u)$ lying inside the super-pixel $p_i$ in image $I_i$ and $(x_v, y_v)$ inside $p_j$ in $I_j$. To limit the graph size, there is at most one edge $e_{ij}$ between any super-pixels $p_i$ and $p_j$ in the final graph $G = \langle V, E \rangle$, which is shown in Figure 10.7.
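The construction of the inter-image edges can be sketched as follows; the data structures (tracks as lists of projections, a super-pixel lookup function) are assumptions made for illustration.

```python
import itertools
from collections import defaultdict

def cross_view_edges(tracks, superpixel_of):
    """Inter-image edges of the merged graph G from feature tracks.

    tracks        : list of tracks, each a list of (x, y, image_id) projections
    superpixel_of : callable (image_id, x, y) -> global super-pixel id
    Returns {(sp_i, sp_j): |T|}, i.e. at most one edge per super-pixel pair,
    weighted by the number of tracks T linking the two super-pixels.
    """
    support = defaultdict(int)
    for track in tracks:
        sps = sorted({superpixel_of(i, x, y) for (x, y, i) in track})
        for a, b in itertools.combinations(sps, 2):
            support[(a, b)] += 1
    return dict(support)

# usage sketch: merge these edges with the intra-image adjacency edges E_i of
# each view to obtain the full edge set E of G before the MRF optimization.
```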
Adaptive features
For object segmentation with fine boundaries, we prefer to use color cues to characterize the local appearance. In our model, the color distribution of all pixels in the image is approximated by a mixture model of $m$ Gaussians in the color space with means $u_k$ and covariances $\Sigma_k$. At the beginning, all pixel colors in all images of the same sequence are taken as input data points, and K-means is used to initialize a mixture of 512 Gaussians in RGB space. Let $\gamma_{kl}$ denote the probability that the $k$-th Gaussian belongs to class $l$. The probability of vertex $p_i$ having label $l$ is
$$ P_a(l, i) = \sum_{k=1}^{m} \gamma_{kl}\, \mathcal{N}(c_i \,|\, u_k, \Sigma_k). $$
To compute $\gamma$, the probability $P_g(l,i)$ is used solely in a greedy way to obtain an initial segmentation $\{l_i\}$, as shown in Figure 10.6(c). This initial segmentation $\{l_i\}$ is then used to train a maximum likelihood estimate of $\gamma$ from
$$ \gamma_{kl} \propto \sum_{p_i \in V} [l_i = l]\; p(c_i \,|\, u_k, \Sigma_k) $$
under the constraint $\sum_k \gamma_{kl} = 1$. Now, combining the costs from both the local adaptive feature and the global feature, we define the data cost as
$$ \psi_i(l_i) = -\log P_a(l, i) - \lambda_l \log P_l(l, i) - \lambda_g \log P_g(l, i). $$
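A sketch of how the adaptive probability and the combined data cost might be evaluated; the Gaussian-mixture bookkeeping and the weights lam_l, lam_g are illustrative assumptions, not values prescribed by the text.

```python
import numpy as np

def adaptive_probability(color, gamma, means, covs):
    """P_a(l, i): mixture of Gaussians in RGB, weighted per class by gamma[k, l]."""
    diffs = color[None, :] - means                         # (K, 3)
    inv = np.linalg.inv(covs)                              # (K, 3, 3)
    mahal = np.einsum('ki,kij,kj->k', diffs, inv, diffs)   # squared Mahalanobis
    norm = 1.0 / np.sqrt(((2 * np.pi) ** 3) * np.linalg.det(covs))
    dens = norm * np.exp(-0.5 * mahal)                     # N(c_i | u_k, Sigma_k)
    return gamma.T @ dens                                  # sum_k gamma_kl * N(...)

def data_cost(color, row, gamma, means, covs, P_loc, P_glob,
              lam_l=0.5, lam_g=1.0, eps=1e-8):
    """psi_i(l) = -log P_a - lam_l * log P_l - lam_g * log P_g for one super-pixel."""
    Pa = adaptive_probability(color, gamma, means, covs)
    return (-np.log(Pa + eps)
            - lam_l * np.log(P_loc[:, row] + eps)
            - lam_g * np.log(P_glob + eps))
```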
Smoothing terms
For an edge $e_{ij} \in E_k$ in the same image $I_k$, the smoothness cost is
$$ \psi_{ij}(l_i, l_j) = [l_i \neq l_j] \cdot g(i, j) \quad \text{with} \quad g(i, j) = \frac{1}{\zeta \|c_i - c_j\|^2 + 1}, $$
where $\|c_i - c_j\|$ is the $L_2$-norm of the RGB color difference of the two super-pixels $p_i$ and $p_j$. Note that the indicator $[l_i \neq l_j]$ captures
Fig. 10.7 Graph topology for multi-view semantic segmentation.
the gradient information only along the segmentation boundary. In other words, $\psi_{ij}$ penalizes the assignment of different labels to adjacent nodes. For an edge $e_{ij} \in E$ across two images, the smoothness cost is
$$ \psi_{ij}(l_i, l_j) = [l_i \neq l_j] \cdot \lambda\, |T|\, g(i, j), $$
where $T = \{t = \langle (x_u, y_u, i), (x_v, y_v, j), \dots \rangle\}$ is the set of all feature tracks with projection $(x_u, y_u)$ inside the super-pixel $p_i$ in image $I_i$ and $(x_v, y_v)$ inside $p_j$ in $I_j$. This definition favors two super-pixels with more matching tracks having the same label, as the cost of having different labels is higher when $|T|$ is larger.
Optimization
With the constructed graph $G = \langle V, E \rangle$, the labeling problem is to assign a unique label $l_i$ to each node $p_i \in V$. The solution $L = \{l_i\}$ can be obtained by minimizing a Gibbs energy [67]:
$$ E(L) = \sum_{p_i \in V} \psi_i(l_i) + \rho \sum_{e_{ij} \in E} \psi_{ij}(l_i, l_j). $$
Since the cost terms we defined satisfy the metric requirement, Graph Cut alpha expansion [11] is used to obtain a locally optimized label configuration L within a constant factor of the global minimum.
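For reference, a small sketch that evaluates the Gibbs energy of a given labeling; the minimization itself would be delegated to a graph-cut alpha-expansion implementation, and the dictionary-based graph encoding here is an assumption for illustration.

```python
def gibbs_energy(labels, unary, edges, g, rho=1.0, lam=1.0):
    """E(L) = sum_i psi_i(l_i) + rho * sum_ij psi_ij(l_i, l_j).

    labels : {node: label}
    unary  : {node: sequence of psi_i over labels}
    edges  : {(i, j): num_tracks}, where num_tracks = 0 for intra-image edges
    g      : callable (i, j) -> 1 / (zeta * ||c_i - c_j||^2 + 1)
    """
    energy = sum(unary[i][labels[i]] for i in labels)
    for (i, j), n_tracks in edges.items():
        if labels[i] != labels[j]:                     # Potts-style indicator
            weight = g(i, j) * (lam * n_tracks if n_tracks > 0 else 1.0)
            energy += rho * weight
    return energy
```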
10.4 Building partition The reconstructed sequence needs to be partitioned into independent building blocks for each block to be individually modeled. However, the definition of building block
is not unique, in that a block may contain a fraction or any number of physical buildings as long as they share a common dominant plane. As an urban scene is characterized by plenty of man-made structures of vertical and horizontal lines, we use the vertical lines to partition the sequence into blocks because they are stable and distinct separators for our purpose. Moreover, a local alignment process is used to place building blocks regularly. Such local alignment makes the analysis and implementation of the model reconstruction algorithm more straight-forward.
Fig. 10.8 Building block partition. Different blocks are shown by different colors.
10.4.1 Global vertical alignment We first remove the line segments that are projected out of the segmented building regions from the previous section. From all the remaining vertical line segments, we compute the global vertical direction of gravity by taking the median direction of all reconstructed 3D vertical lines, found during the preprocessing stage in Section 10.2. Then, we align the y-axis of coordinate system for the reconstructed sequence with the estimated vertical direction.
10.4.2 Block separator To separate the entire scene into natural building blocks, we find that the vertical lines are an important cue as a block separator, but there are too many vertical lines that tend to yield an over-partition. Therefore, we need to choose a subset of vertical lines as the block separators using the following heuristics.
Intersection
A vertical line segment is a block separator if its extended line does not meet any horizontal line segments within the same façade. This heuristic holds only if the façade is flat. We compute a score for each vertical line segment $L$ by accumulating the number of intersections with all horizontal line segments in each image,
$$ N(L) = \frac{1}{m} \sum_{k=1}^{m} \Gamma_k, $$
where $\Gamma_k$ is the number of intersections in the $k$-th image and $m$ is the number of correspondences of the line $L$ in the sequence.
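A direct sketch of the intersection score; the segment representation and the intersection test are left abstract and are assumptions of this illustration.

```python
def intersection_score(vertical_line, horizontal_segments_per_image, intersects):
    """N(L) = (1/m) * sum_k Gamma_k: average number of horizontal segments
    crossed by the extended vertical line over its m corresponding images.

    horizontal_segments_per_image : {image_id: [segment, ...]} restricted to
                                    the images where the line is matched
    intersects : callable (vertical_line, segment, image_id) -> bool
    """
    m = len(horizontal_segments_per_image)
    if m == 0:
        return 0.0
    total = sum(sum(intersects(vertical_line, seg, k) for seg in segs)
                for k, segs in horizontal_segments_per_image.items())
    return total / m
```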
Height
A vertical line is a potential block separator if its left block and right block have different heights. We calculate the heights of its left and right blocks as follows: (1) in every image where the vertical line is visible, we fit two line segments to the upper and lower boundaries of the corresponding building region respectively; (2) a height is estimated as the Euclidean distance between the mid-points of these two best-fit line segments; (3) the height of a block is taken as the median of all its estimated heights in the corresponding images.
Texture
Two buildings often have different textures, so a good block separator is one that gives very different texture distributions for its left and right blocks. For each block, we build an average color histogram $h_0$ from multiple views. To make use of the spatial information, each block is further downsampled $t-1$ times to compute several normalized color histograms $h_1, \dots, h_{t-1}$. Thus, each block corresponds to a vector of multi-resolution histograms. The dissimilarity of two histogram vectors $h^{\mathrm{left}}$ and $h^{\mathrm{right}}$ of neighboring blocks is defined as
$$ D_t\!\left(h^{\mathrm{left}}, h^{\mathrm{right}}\right) = \frac{1}{t} \sum_{k=0}^{t-1} d\!\left(h_k^{\mathrm{left}}, h_k^{\mathrm{right}}\right), $$
where $d(\cdot, \cdot)$ is the Kullback-Leibler divergence.

With these heuristics, we first sort all vertical lines by increasing number of intersections $N$, and retain the first half of the vertical lines. Then, we select all vertical lines that result in more than a 35% height difference between their left and right blocks. After that, we sort the remaining vertical lines again, now by decreasing texture difference $D_t$, and select only the first half. This selection procedure is repeated until each block is within a pre-defined size range (from 6 to 30 meters in the current implementation).
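A sketch of the multi-resolution histogram dissimilarity; the bin count and the number of levels are illustrative parameters, not values prescribed by the text.

```python
import numpy as np

def block_histograms(pixels, t=3, bins=16):
    """Normalized multi-resolution color histograms h_0 ... h_{t-1} of a block.

    pixels : (H, W, 3) block texture; each level halves the resolution.
    """
    hists, img = [], pixels.astype(float)
    for _ in range(t):
        h, _ = np.histogramdd(img.reshape(-1, 3), bins=bins, range=[(0, 256)] * 3)
        hists.append(h.ravel() / h.sum())
        img = img[::2, ::2]                                  # downsample by 2
    return hists

def kl_divergence(p, q, eps=1e-8):
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def block_dissimilarity(h_left, h_right):
    """D_t(h_left, h_right) = (1/t) * sum_k d(h_k_left, h_k_right)."""
    return sum(kl_divergence(a, b) for a, b in zip(h_left, h_right)) / len(h_left)
```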
10.4.3 Local horizontal alignment After the global vertical alignment in the y-axis, the desired fac¸ade plane of the block is vertical, but it may not be parallel to the xy-plane of the coordinate frame.
We automatically compute the vanishing point of the horizontal lines in the most fronto-parallel image of the block sequence to obtain a rotation around the y-axis that aligns the x-axis with the horizontal direction. Note that this is done locally for each block if there are sufficient horizontal lines in the chosen image. One example of this rectification is illustrated in Figure 10.9. After these operations, each independent façade faces the negative z-axis, with the x-axis as the horizontal direction from left to right and the y-axis as the vertical direction from top to bottom in its local coordinate system.
Fig. 10.9 The rectification results. (a) Global vertical rectification. (b) Local horizontal rectification.
10.5 Façade modeling
Since the semantic segmentation has identified the regions of interest, and the block partition has separated the data to the façade level, the remaining task is to model each façade. The reconstructed 3D points are often noisy or missing due to varying texture as well as matching and reconstruction errors. Therefore, we introduce a building regularization method in the orthographic view of the façade for structure analysis and modeling. We first filter out the irrelevant 3D points using the semantic segmentation and the block separators. An orthographic depth map and texture image are composed from multiple views in Section 10.5.1, and provide the working image space for the later stages. In Section 10.5.2, the structure elements on each façade are identified and modeled. When the identification of structure elements is not perfect, a backup solution is introduced in Section 10.5.3 to rediscover these elements if they appear repetitively. Finally, after the details inside each façade have been modeled, the boundaries of the façade are regularized in Section 10.5.4 to produce the final model.
10.5.1 Inverse orthographic composition
Each input image of the building block is over-segmented into patches using [53]. The patch size is a trade-off between accuracy and robustness. We choose 700 pixels as the minimum patch size for our images at a resolution of 640×905 pixels to favor relatively large patches, since the 3D points reconstructed from the images are noisy.

Algorithm 20 (Inverse orthographic patching)
for each image I_k visible to the façade do
  for each super-pixel p_i ∈ I_k do
    if the normal of p_i is parallel to the z-axis then
      for each pixel (x, y) in the bounding box do
        X ← (x, y, z_i)^T, where z_i is the depth of p_i
        compute the projection (u, v) of X in camera k
        if the super-pixel of (u, v) in I_k is p_i then
          accumulate z_i, color, and segmentation label
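The same procedure can be sketched in Python; the patch interface used here (bounding-box enumeration, membership test, color and label lookup) is hypothetical and merely stands in for whatever data structures an implementation would use.

```python
import numpy as np
from collections import defaultdict

def inverse_orthographic_patching(superpixels, patch_depth, is_parallel,
                                  project, to_ortho):
    """Accumulate depth/color/label samples per orthographic pixel (cf. Algorithm 20).

    superpixels : {image_id: [patch, ...]} fronto-parallel candidate patches
    patch_depth : callable patch -> fronto-parallel depth z_i
    is_parallel : callable patch -> True if its normal is parallel to the z-axis
    project     : callable (image_id, X) -> (u, v) projection of 3D point X
    to_ortho    : callable (x, y) -> orthographic pixel index, or None if outside
    Returns {ortho_pixel: [(z, color, label), ...]} to be fused by per-pixel median.
    """
    samples = defaultdict(list)
    for img_id, patches in superpixels.items():
        for patch in patches:
            if not is_parallel(patch):
                continue
            z = patch_depth(patch)
            for (x, y) in patch.bounding_box_pixels():     # sweep the bounding box
                uv = project(img_id, np.array([x, y, z]))
                if patch.contains_projection(uv):          # (u, v) falls back in the patch
                    op = to_ortho(x, y)
                    if op is not None:
                        samples[op].append((z, patch.color_at(uv), patch.label_at(uv)))
    return samples
```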
Patch reconstruction
The normal vector and center position of each $p_i$ are estimated from the set of 3D points $P_i = \{(x_k, y_k, z_k)\}$ which have projections inside $p_i$. As the local coordinate frame of the block is aligned with the three major orthogonal directions of the building, the computation is straightforward. Let $\sigma_x^i$, $\sigma_y^i$ and $\sigma_z^i$ be the standard deviations of all 3D points in $P_i$ in the three directions. We first compute the normalized standard deviations $\hat{\sigma}_x^i = \frac{\bar{s}_x}{s_x^i}\sigma_x^i$ and $\hat{\sigma}_y^i = \frac{\bar{s}_y}{s_y^i}\sigma_y^i$, where $s_x^i$ and $s_y^i$ are the horizontal and vertical sizes of the bounding box of the patch in the input images, and $\bar{s}_x = \mathrm{median}_i\, s_x^i$ and $\bar{s}_y = \mathrm{median}_i\, s_y^i$ are their medians across all patches. The normalization avoids bias toward small patches. The patch $p_i$ is regarded as parallel to the façade base plane if $\sigma_z^i$ is smaller than $\sigma_x^i$ and $\sigma_y^i$, and all these parallel patches with small $\sigma_z^i$ contribute to the composition of an orthographic view of the façade. The orientation of such a patch $p_i$ is aligned with the z-axis, and its position is set at the depth $z_i = \mathrm{median}_{(x_j, y_j, z_j) \in P_i}\, z_j$. One example is shown in Figure 10.10(a).

Orthographic composition
To simplify the representation of the irregular shapes of the patches, we deploy a discrete 2D orthographic space on the xy-plane to create an orthographic view O of the façade. The size and position of O on the xy-plane are determined by the bounding box of the 3D points of the block, and the resolution of O is a parameter set not to exceed 1024 × 1024. Each patch is mapped from its original image space onto this orthographic space, as illustrated in Figure 10.10 from (a) to (b). We use the inverse orthographic mapping algorithm shown in Algorithm 20 to avoid
Fig. 10.10 Inverse orthographic composition. (a) Depth map in input image space. (b) Partial orthographic depth map from one view. (c) Partial orthographic texture from one view. (d) Composed orthographic depth map (unreliably estimated pixels are in yellow). (e) Composed orthographic texture. (f) Composed orthographic building region.
gaps. Theoretically, the warped textures of all patches create a true orthoimage O, as each used patch has a known depth and is parallel to the base plane. For each pixel $v_i$ of the orthoimage O, we accumulate a set of depth values $\{z_j\}$, a corresponding set of color values $\{c_j\}$ and a set of segmentation labels $\{l_j\}$. The depth of this pixel is set to the median of $\{z_j\}$, whose index is $\kappa = \arg\mathrm{median}_j\, z_j$. Since the depth determines the texture color and segmentation label, we take $c_\kappa$ and $l_\kappa$ as the estimated color and label for the pixel. In practice, we accept a small set of estimated points around $z_\kappa$ and take their mean as the color value in the texture composition. As the contents of the images highly overlap, if a pixel is observed only once from one image, it is very likely that it comes from an incorrect reconstruction; it is thus rejected in the depth fusion process. Moreover, all pixels $\{v_i\}$ with multiple observations $\{z_j\}_i$ are sorted in non-decreasing order according to the standard deviation $\varsigma_i = \mathrm{sd}(\{z_j\})$ of their depth sets. After that, we define $\varsigma(\eta)$ to be the $\eta|\{v_i\}|$-th element in the sorted $\{\varsigma_i\}$. We declare the pixel $v_i$ to be unreliable if $\varsigma_i > \varsigma(\eta)$. The value of $\eta$ comes from the estimated confidence
of the depth measurements. We currently scale the value by the ratio of the number of 3D points to the total pixel number of O. Note that when we reconstruct the patches, we do not use the semantic segmentation results in the input image space, for two reasons. The first is that the patches used in reconstruction are much larger than those used for semantic segmentation, which may lead to inconsistent labeling; though it is possible to estimate a unique label for a patch, doing so may degrade the semantic segmentation accuracy. The second is that possible errors in the semantic segmentation may over-reject patches, which compromises the quality of the depth estimation. Therefore, we reconstruct the depth first and then transfer the segmentation results from the input image space to the orthographic view with pixel-level accuracy, as shown in Figure 10.10(f). After that, we remove the non-building pixels from the orthoimage according to the segmentation labels. Our composition algorithm for the orthographic depth map is functionally close to depth map fusion techniques such as [29], but our technique is robust as we use the architectural prior of orthogonality, which preserves structural discontinuities without over-smoothing.
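A minimal sketch of the per-pixel median fusion and the reliability test described above; the sample layout and the value of eta are assumptions for illustration.

```python
import numpy as np

def fuse_orthographic_pixels(samples, eta=0.75):
    """Median-based fusion of the per-pixel samples gathered by Algorithm 20.

    samples : {ortho_pixel: [(z, color, label), ...]}
    eta     : fraction of pixels (sorted by depth spread) kept as reliable
    Returns depth, color and label maps as dictionaries plus the reliable set.
    """
    depth, color, label, spread = {}, {}, {}, {}
    for p, obs in samples.items():
        if len(obs) < 2:
            continue                                   # single observation: rejected
        zs = np.array([z for z, _, _ in obs])
        kappa = int(np.argsort(zs)[len(zs) // 2])      # index of the median depth
        depth[p], color[p], label[p] = obs[kappa]
        spread[p] = float(np.std(zs, ddof=1))          # depth standard deviation
    if not spread:
        return depth, color, label, set()
    cutoff = sorted(spread.values())[int(eta * (len(spread) - 1))]
    reliable = {p for p, s in spread.items() if s <= cutoff}
    return depth, color, label, reliable
```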
10.5.2 Structure analysis and regularization
From the composed orthographic depth map and texture image of each façade, we want to identify the structural elements at different depths of the façade to enrich the façade geometry. To cope with the irregular, noisy and missing depth estimates on the façade, a strong regularization from architectural priors is therefore required. Most buildings are governed by vertical and horizontal lines and naturally form rectangular shapes. We restrict the prior shape of each distinct structure element to be a rectangle, such as the typical extruding signboard in Figure 10.10.
Joint segmentation
We use a bottom-up, graph-based segmentation framework [53] to jointly segment the orthographic texture and depth maps into regions, where each region is considered as a distinct element within the façade. The proposed shape-based segmentation method jointly utilizes texture and depth information, and enables fully automatic façade structure analysis. Xiao et al. [94] also proposed a functionally equivalent top-down recursive subdivision method; however, it has been shown in [94] to be inefficient at producing satisfactory results without any user interaction. A graph $G = \langle V, E \rangle$ is defined on the orthoimage O with all pixels as vertices V and edges E connecting neighboring pixels. To encourage horizontal and vertical cuts, we use a 4-neighborhood system to construct E. The weight function for an edge connecting two pixels with reliable depth estimates is based both on the color distance and on the normalized depth difference,
Fig. 10.11 Structure analysis and regularization for modeling. (a) The fac¸ade segmentation. (b) The data cost of boundary regularization. The cost is color-coded from high at red to low at blue via green as the middle. (c) The regularized depth map. (d) The texture-mapped fac¸ade. (e) The texture-mapped block. (f) The block geometry.
$$ w((v_i, v_j)) = \|c_i - c_j\|^2 \cdot \left( \frac{z_i - z_j}{\varsigma(\eta)} \right)^2, $$
where $\|c_i - c_j\|$ is the $L_2$-norm of the RGB color difference of the two pixels $v_i$ and $v_j$. We slightly pre-filter the texture image using a Gaussian of small variance before computing the edge weights. The weight for an edge connecting two pixels without reliable depth estimates is set to 0 to force them to have the same label. We do not construct an edge between a pixel with a reliable depth and a pixel without a reliable depth, as the weight cannot be defined. We first sort E by non-decreasing edge weight $w$. Starting with an initial segmentation in which each vertex $v_i$ is in its own component, the algorithm repeats the following process for each edge $e_q = (v_i, v_j)$ in order: if $v_i$ and $v_j$ are in disjoint components $C_i \neq C_j$, and $w(e_q)$ is small compared with the internal difference of both those components, $w(e_q) \leq MInt(C_i, C_j)$, then the two components are merged. The minimum internal difference is defined as
$$ MInt(C_1, C_2) = \min\big( Int(C_1) + \tau(C_1),\; Int(C_2) + \tau(C_2) \big), $$
where the internal difference of a component C is the largest weight in the minimum spanning tree of the component,
$$ Int(C) = \max_{e \in MST(C, E)} w(e). $$
The non-negative threshold function $\tau(C)$ is defined on each component C. The difference between two components must exceed their internal difference by this threshold to provide evidence of a boundary between them. Since we favor a rectangular shape for each region, the threshold function $\tau(C)$ is defined via the divergence $\vartheta(C)$ between the component C and a rectangle, which is the ratio of the bounding box $B_C$ to the component C, $\vartheta(C) = |B_C| / |C|$. For small components, $Int(C)$ is not a good estimate of the local characteristics of the data. Therefore, we let the threshold function adapt to the component size,
$$ \tau(C) = \kappa\, \frac{\vartheta(C)}{|C|}, $$
where $\kappa$ is a constant, set to 3.2 in our prototype. $\tau$ is large for components that do not fit a rectangle, and two components with large $\tau$ are more likely to be merged. A larger $\kappa$ favors larger components, as we require stronger evidence of a boundary for smaller components. Once the segmentation is accomplished, the depth values of all pixels in each reliable component $C_i$ are set to their median. The depth of the largest region is regarded as the depth of the base plane of the façade. Moreover, an unreliable component $C_i$ smaller than a particular size, set to 4% of the current façade area, is merged into its only reliable neighboring component if such a neighboring component exists.
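The rectangle-divergence threshold and the merge predicate can be sketched as follows; the pixel-set representation of components is an assumption, and kappa is the constant reconstructed above.

```python
import numpy as np

def rectangle_divergence(component_pixels):
    """theta(C) = |B_C| / |C|: bounding-box area over component area."""
    xs = np.array([x for x, _ in component_pixels])
    ys = np.array([y for _, y in component_pixels])
    box = (xs.max() - xs.min() + 1) * (ys.max() - ys.min() + 1)
    return box / len(component_pixels)

def tau(component_pixels, kappa=3.2):
    """Adaptive threshold tau(C) = kappa * theta(C) / |C|."""
    return kappa * rectangle_divergence(component_pixels) / len(component_pixels)

def should_merge(w_eq, int_c1, int_c2, c1_pixels, c2_pixels):
    """Merge C1 and C2 when w(e_q) <= MInt(C1, C2)."""
    m_int = min(int_c1 + tau(c1_pixels), int_c2 + tau(c2_pixels))
    return w_eq <= m_int
```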
Shape regularization
Except for the base plane of the façade, we fit a rectangle to each element on the façade. For an element $C = \{v_i = (x_i, y_i)\}$, we first obtain the median position $(x_{med}, y_{med})$ by $x_{med} = \mathrm{median}_i\, x_i$ and $y_{med} = \mathrm{median}_i\, y_i$. We then remove outlier points with $|x_i - x_{med}| > 2.8\,\sigma_x$ or $|y_i - y_{med}| > 2.8\,\sigma_y$, where $\sigma_x = \sum_i |x_i - x_{med}| / |C|$ and $\sigma_y = \sum_i |y_i - y_{med}| / |C|$. Furthermore, we reject the points that lie in the 1% region of the left, right, top and bottom according to the ranking of their x and y coordinates in the remaining point set. In this way, we obtain a reliable subset $C_{sub}$ of C. We define the bounding box $B_{C_{sub}}$ of $C_{sub}$ as the fitting rectangle of C. The fitting confidence is then defined as
$$ f_C = \frac{|B_{C_{sub}} \cap C_{sub}|}{|B_{C_{sub}} \cup C_{sub}|}. $$
In the end, we only retain the rectangles as distinct fac¸ade elements if the confidence fC > 0.72 and the rectangle size is not too small. The rectangular elements are automatically snapped into the nearest vertical and horizontal mode positions of the accumulated Sobel responses on the composed texture image, if their distances are less than 2% of the width and height of the current fac¸ade. The detected rectangles can be nested within each other. When producing the final 3D model, we first pop up the larger element from the base plane and then the smaller element within the larger element. If two rectangles overlap but do not contain each other, we first pop up the one that is closest to the base plane.
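A sketch of the rectangle fitting and its confidence score; the exact way the 1% rank-based trimming is applied here is an interpretation of the text, not the authors' code.

```python
import numpy as np

def fit_rectangle(component, k_sigma=2.8, trim=0.01):
    """Fit an axis-aligned rectangle to a facade element and score the fit.

    component : (N, 2) integer pixel coordinates (x_i, y_i) of the element
    Returns ((xmin, xmax, ymin, ymax), f_C), or (None, 0.0) for degenerate input.
    """
    xs, ys = component[:, 0], component[:, 1]
    xm, ym = np.median(xs), np.median(ys)
    sx = np.abs(xs - xm).mean()                    # sigma_x = sum|x_i - x_med| / |C|
    sy = np.abs(ys - ym).mean()
    keep = (np.abs(xs - xm) <= k_sigma * sx) & (np.abs(ys - ym) <= k_sigma * sy)
    sub = component[keep]
    lo = int(np.ceil(trim * len(sub)))             # drop ~1% on each side by rank
    sub = sub[np.argsort(sub[:, 0])][lo:len(sub) - lo]
    sub = sub[np.argsort(sub[:, 1])][lo:len(sub) - lo]
    if len(sub) == 0:
        return None, 0.0
    xmin, xmax = sub[:, 0].min(), sub[:, 0].max()
    ymin, ymax = sub[:, 1].min(), sub[:, 1].max()
    box = {(x, y) for x in range(xmin, xmax + 1) for y in range(ymin, ymax + 1)}
    csub = {tuple(p) for p in sub}
    f_c = len(box & csub) / len(box | csub)        # f_C = |B ∩ C_sub| / |B ∪ C_sub|
    return (int(xmin), int(xmax), int(ymin), int(ymax)), f_c
```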
10.5.3 Repetitive pattern rediscovery Structure elements are automatically reconstructed in the previous section. However, when the depth composition quality is not good enough due to poor image matching, reflective materials or low image quality, only a few of them could be successfully recovered. For repetitive elements of the fac¸ade, we can now systematically launch a re-discovery process using the discovered elements as templates in the orthographic texture image domain. The idea of taking advantage of repetitive nature of the elements has been explored in [151, 94].
(a) The fac¸ade segmentation
(b) The results using the template in (a)
Fig. 10.12 Repetitive pattern rediscovery.
We use the Sum of Squared Differences (SSD) on the RGB channels for template matching. Unlike [94], which operates in a 2D search space, we use a two-step method that searches twice in 1D, as shown in Figure 10.12. We first search in the horizontal direction for a template $B_i$ and obtain a set of matches $B_i'$ by extracting the local minima under a threshold. Then, we use both $B_i$ and $B_i'$ together as the template to search for the local minima along the vertical direction. This leads to more efficient and robust matching, and to automatic alignment of the elements. A re-discovered element found by template matching inherits the depth of the template. When more than one structure element discovered previously by joint segmentation represents the same kind of structure element, we also need to cluster the re-discovered elements using a bottom-up hierarchical merging mechanism. Two templates $B_i$ and $B_j$ obtained by joint segmentation, with sets of matching
candidates $M_i$ and $M_j$, are merged into the same class if one template is sufficiently similar to any element of the candidates of the other template. Here, the similarity between two elements is defined as the ratio of the intersection area to the union area of the two elements. The merging process consists of averaging element sizes between $M_i \cup \{B_i\}$ and $M_j \cup \{B_j\}$, as well as computing the average positions for overlapping elements in $M_i \cup \{B_i\}$ and $M_j \cup \{B_j\}$.
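A simplified sketch of the 1D SSD searches and the local-minima extraction; in the actual method the horizontally found matches are used jointly with the template for the vertical pass, which this sketch does not reproduce.

```python
import numpy as np

def ssd_profile(ortho, template, axis, fixed):
    """1D SSD profile of a template slid along one direction of the orthoimage.

    axis  : 1 to slide horizontally at row `fixed`, 0 to slide vertically at column `fixed`
    """
    O, T = ortho.astype(float), template.astype(float)
    th, tw = T.shape[:2]
    H, W = O.shape[:2]
    if axis == 1:
        return np.array([np.sum((O[fixed:fixed + th, x:x + tw] - T) ** 2)
                         for x in range(W - tw + 1)])
    return np.array([np.sum((O[y:y + th, fixed:fixed + tw] - T) ** 2)
                     for y in range(H - th + 1)])

def candidate_matches(scores, threshold):
    """Local minima of the SSD profile that fall below the threshold."""
    return [i for i in range(1, len(scores) - 1)
            if scores[i] < threshold
            and scores[i] <= scores[i - 1]
            and scores[i] <= scores[i + 1]]
```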
10.5.4 Boundary regularization
The boundaries of the façade of a block are further regularized to favor sharp changes and penalize serration. We use the same method as for the shape regularization of structure elements to compute the bounding box $[x_{min}, x_{max}] \times [y_{min}, y_{max}]$ of the façade. Finally, we further optimize the upper boundary of the façade, as we cannot guarantee during block partition that a building block is indeed a single building with the same height. As illustrated in Figure 10.13, we lay out a 1D Markov random field along the horizontal direction of the orthoimage. Each $x_i \in [x_{min}, x_{max}]$ defines a vertex, and an edge is added between two neighboring vertices. The label $l_i$ of $x_i$ corresponds to the position of the boundary, and $l_i \in [y_{min}, y_{max}]$ for all $x_i$. Therefore, one label configuration of the MRF corresponds to one façade boundary. Now, we utilize all texture, depth and segmentation information to define the cost. The data cost is defined according to the horizontal Sobel responses,
$$ \phi_i(l_i) = 1 - \frac{\mathrm{HorizontalSobel}(i, l_i)}{2 \max_{x,y} \mathrm{HorizontalSobel}(x, y)}. $$
Furthermore, if $l_i$ is close to the top boundary $r_i$ of the reliable depth map, $|l_i - r_i| < \beta$, where $\beta$ is empirically set to $0.05(y_{max} - y_{min} + 1)$, we update the cost by multiplying it by $(|l_i - r_i| + \epsilon)/(\beta + \epsilon)$ for a small constant $\epsilon$. Similarly, if $l_i$ is close to the top boundary $s_i$ of the segmentation, $|l_i - s_i| < \beta$, we update the cost by multiplying it by $(|l_i - s_i| + \epsilon)/(\beta + \epsilon)$. For the façades whose boundaries are not in the viewing field of any input image, we snap the façade boundary to the top boundary of the bounding box, and empirically update $\phi_i(y_{min})$ by multiplying it by 0.8. Figure 10.11(b) shows one example of the defined data cost. The height of the façade upper boundary usually changes in regions with strong vertical edge responses. We thus accumulate the vertical Sobel responses at each $x_i$ into $V_i = \sum_{y \in [y_{min}, y_{max}]} \mathrm{VerticalSobel}(i, y)$, and define the smoothness term to be
$$ \phi_{i,i+1}(l_i, l_{i+1}) = \mu\, |l_i - l_{i+1}| \left( 1 - \frac{V_i + V_{i+1}}{2 \max_j V_j} \right), $$
where $\mu$ is a controllable parameter. The boundary is optimized by minimizing a Gibbs energy [67],
Fig. 10.13 An example of the MRF used to optimize the façade upper boundary.
$$ E(L) = \sum_{x_i \in [x_{min}, x_{max}]} \phi_i(l_i) + \sum_{x_i \in [x_{min}, x_{max}-1]} \phi_{i,i+1}(l_i, l_{i+1}), $$
where $\phi_i$ is the data cost and $\phi_{i,i+1}$ is the smoothing cost. Exact inference with a global optimum can be obtained by methods such as belief propagation [165].
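Because the chain is one-dimensional, the exact optimum can also be obtained by a standard dynamic-programming (Viterbi-style) sweep; the following sketch assumes the data costs are precomputed as an array and is not the authors' belief-propagation implementation.

```python
import numpy as np

def optimize_boundary(data_cost, mu, V):
    """Exact minimization of the 1D boundary MRF by dynamic programming.

    data_cost : (N, L) array, data_cost[i, l] = phi_i(l) for column x_i and height l
    mu        : smoothness weight
    V         : (N,) accumulated vertical Sobel responses, one per column
    Returns the optimal boundary height index per column.
    """
    N, L = data_cost.shape
    labels = np.arange(L)
    vmax = max(float(V.max()), 1e-12)
    cum = data_cost[0].copy()                      # best energy ending at (0, l)
    back = np.zeros((N, L), dtype=int)
    for i in range(1, N):
        # pairwise cost mu * |l - l'| * (1 - (V_i + V_{i+1}) / (2 max V))
        w = mu * (1.0 - (V[i - 1] + V[i]) / (2.0 * vmax))
        trans = cum[None, :] + w * np.abs(labels[:, None] - labels[None, :])
        back[i] = np.argmin(trans, axis=1)
        cum = data_cost[i] + trans[labels, back[i]]
    path = np.empty(N, dtype=int)
    path[-1] = int(np.argmin(cum))
    for i in range(N - 1, 0, -1):                  # backtrack the optimal labels
        path[i - 1] = back[i, path[i]]
    return path
```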
10.6 Post-processing After the model for each fac¸ade is computed, the mesh is produced, and the texture is optimized.
Model production Each fac¸ade is the front side of the building block. We can extend a fac¸ade in the z-direction into a box with a constant depth (the default constant is set to 18 meters in the current implementation) to represent the geometry of the building block, as illustrated in Figure 10.15(f). All the blocks of a sequence are then assembled into the street side model. The texture mapping is done by visibility checking using z-buffer ordering. The side face of each block can be automatically textured as illustrated in Figure 10.15 if it is not blocked by the neighboring buildings.
Texture optimization
The orthographic texture of each front façade produced by Algorithm 20 is a true orthographic texture map. But as a texture image, it suffers from the artifacts of color discontinuities, blur and gaps, as each pixel has been independently computed as the color at the median depth from all visible views. However, it does provide very robust and
Fig. 10.14 Texture optimization. (a) The original orthographic texture image. (b) The optimized texture image. (c) A direct texture composition. The optimized texture image in (b) is more clear than the original orthographic texture image in (a), and has no texture from occluding objects, such as the one contained in (c).
reliable information for the true texture, and contains almost no outliers from occluding objects. Therefore, we re-compute an optimized texture image for each front façade, regarding the original orthographic texture image as a good reference. Suppose that each façade has N visible views. Each visible view is used to compute a partial texture image for all visible points of the façade, so we obtain N partial texture images for the façade. Next, we define a difference measurement as the squared sum of differences between each pixel of the partial texture images and the original orthographic texture image at the same coordinate. This is the data term for a Markov random field on the orthographic texture image grid. The smoothing term is defined to be the reciprocal of the color difference between each neighboring pair of pixels in the original orthographic texture image. The desired orthographic texture image is computed using Graph Cut alpha expansion [11]. If seam artifacts are serious, Poisson blending [166] can be used as a post-process. Figure 10.14 shows the comparative results of this process. Figure 10.14(c) also shows a direct texture warping from the most fronto-parallel image as in [94], which fails to remove the occluding objects, i.e. the telegraph pole in this case.
10.7 Results and discussions
We have implemented our system and tested it on street-side images of downtown Pittsburgh. These images have been used in Google Street View to create seamless panoramic views. The same kind of images are therefore currently available for a huge number of cities around the world; they have been captured without online human control and contain noise and glare. The image resolution is 640×905. Some example images are shown in Figure 10.18.
Fig. 10.15 Modeling examples of various blocks (Part 1). (a) The orthographic texture. (b) The orthographic color-coded depth map (yellow pixel is unreliable). (c) The fac¸ade segmentation. (d) The regularized depth map. (e) The geometry. (f) The textured model.
We break down the entire sequence of 10,498 images of Pittsburgh into 202 sequences corresponding to 202 blocks. Each sequence is reconstructed using the structure from motion algorithm to produce a set of quasi-dense points and camera poses. The cameras are then geo-registered back to the GPS coordinate frame using the available GPS data, and all sequences of a scene are merged using the overlapping camera poses. The current implementation is unoptimized C++ code, and the parameters are manually tuned on a set of 5 façades. The whole system consists of three major components, SFM, segmentation, and modeling, and is completely modular. We use the code from [161] for gist feature extraction, the code from [203] for Joint Boost classification, the code from [11] for Graph Cut alpha expansion in the MRF optimization, and the code from [53] for joint graph-based segmentation in the structure analysis and for over-segmentation. Note that after block partition, the façade analysis and modeling component works in a rectified orthographic space, which involves only simple 2D array operations in implementation. The resulting model is represented by pushed-and-popped rectangles. As shown in Figure 8(f) and Figure 12(e), the parallelepiped-like division is the quadrilateral tessellation of the base plane mesh by a 'rectangle map', which is inspired by the trapezoid map algorithm [30]. A rectangle is then properly extruded according to the depth map. The results may seem to contain some 'parallelepipeds', while some of them may have the same depth as their neighbors, depending precisely upon their reconstructed depth values. For this portion of Pittsburgh, we used 10,498 images and reconstructed 202 building blocks. On a small cluster composed of 15 normal desktop PCs, the results are produced automatically in 23 hours, including approximately 2 hours for SFM, 19 hours for segmentation, and 2 hours for partition and modeling. Figure 10.15 shows different examples of blocks and the intermediate results. Figures 10.17 and 10.1 show a few close-up views of the final model. All presented results and those in the accompanying video are "as is" without any manual touch-up. For rendering, each building block is represented at two levels of detail: the first level has only the façade base plane, and the second level contains the augmented elements of the façade.

For the semantic segmentation, we hand-labeled 173 images by uniformly sampling images from our data set to create the initial database of labeled street-side images. Some example labeled data is shown in the accompanying video. Each sequence is recognized and segmented independently. For testing, we do not use any labeled images that come from the same sequence, in order to fairly demonstrate the real performance on unseen sequences. Our method is remarkably robust for modeling, as minor errors or failure cases do not create visually disturbing artifacts. The distinct elements such as windows and doors within the façade may not always be reconstructed due to a lack of reliable 3D points. They are often smoothed to the façade base plane with satisfactory textures, as the depth variation is small. Most of the artifacts are from the texture. Many of the trees and people are not removed from the textures on the first floor of the buildings seen in Figures 10.16, 10.17, and 10.1. These could be corrected if
interactive segmentation and inpainting were used. There are some artifacts on the façade boundaries when background buildings are not separated from the foreground buildings, as shown in the middle of Figure 10.1. Some other modeling examples are also shown in Figure 10.15. Note that there are places where the tops of the buildings are chopped off, because they are clipped in the input images.
Fig. 10.16 The top figure is a street-side view automatically generated from the images shown below. The bottom figure is the same street-side view interactively generated using the method presented in the previous chapter for comparison.
Fig. 10.17 One view of a city street model automatically generated from the images shown on the bottom.
Fig. 10.18 Another example of automatically reconstructed buildings in Pittsburgh from the images shown on the bottom.
Possible interactive editing
Our approach is fully automatic for all presented results. However, [94] does provide a very convenient user interface for manual operations. Since our rectangular representation is just a special case of the DAG representation used in [94], our method can be seamlessly combined with the user interface provided there for later manual processing if necessary. Figure 10.16 shows the improved results after interactive editing of the automatic results.
Conclusion
We have proposed a completely automatic image-based modeling approach that takes a sequence of overlapping images captured along the street and produces complete photo-realistic 3D models. The main contributions are: a multiple-view semantic segmentation method to identify the object classes of interest, a systematic partition of buildings into independent blocks using the man-made vertical and horizontal lines, and a robust façade modeling with pushed and pulled rectangular shapes. More importantly, the components are assembled into a robust and fully automatic system. The approach has been successfully demonstrated on a large amount of data. There are a few limitations to the current implementation of the system, but they can be addressed within the same framework. For example, we could incorporate the 3D information in the semantic segmentation. Furthermore, using grammar rules extracted from the reconstructed models to synthesize missing parts procedurally is also an interesting further direction. One preliminary example along this direction is shown in Figure 1.2 of the introduction chapter.
10.8 Bibliographic notes This chapter is adapted from [95] by Xiao, Fang, Zhao, Lhuillier and Quan. There is a large literature on image-based city modeling. We classify the literature according to the input images and the user interaction without being exhaustive.
Single-view methods
Oh et al. [159] presented an interactive system to create models from a single image by manually assigning the depth based on a painting metaphor. Müller et al. [151] relied on repetitive patterns to discover façade structures, and obtained depth from manual input. Generally, these methods need intensive user interaction to produce visually pleasing results. Hoiem et al. [34] proposed surface layout estimation for modeling. Saxena et al. [191] learned the mapping between image features and depth directly. Barinova et al. [4] made use of the Manhattan structure of man-made buildings to divide the model fitting problem into chain-graph inference. However, these approaches can only produce a rough shape of the modeled objects without much detail.
Interactive multi-view methods
Façade, developed by Debevec et al. [32], is a seminal work in this category. They used line segment features in images and polyhedral blocks as 3D primitives to interactively register images and to reconstruct blocks with view-dependent texture mapping. However, the required manual selection of features and correspondences in different views is tedious, which makes it difficult to scale up when the number of images grows. Van den Hengel et al. [81] used a sketching approach in one or more images to model a general object, but it is difficult to use this approach for detailed modeling even with intensive interaction. Xiao et al. [94] proposed to approximate the orthographic image of each façade by a fronto-parallel reference image during automatic detail reconstruction and interactive user refinement. Therefore, their approach requires an accurate initialization and boundary for each façade as input, possibly specified manually by the user. Sinha et al. [205] used registered multiple views and extracted the major directions by vanishing points. The significant user interaction required by these two methods for good results makes them difficult to adopt in large-scale city modeling applications.
Automatic multi-view methods
Dick et al. [38] developed an architectural modeling method for short image sequences. The user is required to provide intensive architectural rules for the Bayesian inference. Many researchers have realized the importance of line features
in man-made scenes. Werner and Zisserman [246] used line segments for building reconstruction from images registered by sparse points. Schindler et al. [194] proposed the use of line features for both structure from motion and modeling. However, line features tend to be sparse and geometrically less stable than points. A more systematic approach to modeling urban environments using video cameras has been investigated by several teams [171, 25]. They have been very successful in developing real-time video registration and focused on the global reconstruction of dense stereo results from the registered images. Pollefeys et al. [171] proposed a system for automatic, geo-registered, real-time 3D reconstruction from video of urban scenes. The system collects video streams, as well as GPS and inertia measurements, in order to place the reconstructed models in geo-registered coordinates. It is designed using state-of-the-art real-time modules for all processing steps, and employs commodity graphics hardware and standard CPUs to achieve real-time performance. However, the final modeling results still present many irregularities due to the lack of architectural constraints. Cornelis et al. [25] also included a recognition component for cars and pedestrians to enhance the modeling process. Our approach does not try to obtain a dense stereo reconstruction, but focuses on the identification and modeling of buildings from the semi-dense reconstructions. Zebedin et al. [253] is representative of city modeling from aerial images, which is complementary to our approach from street-level images.
Scanner-based methods
Using 3D scanners for city modeling is definitely an important alternative [60, 210]. One representative work is [60], which generates 3D models of façades and street scenery as seen from street level from ground-based laser scanning data. The positions of the cameras and scanners are estimated using scan matching and Monte-Carlo localization, and the 3D points are then segmented and triangulated to produce mesh models. Actually, as one important feature of our approach, any technique that computes a 3D point cloud registered with the images can be used together with our methods for city modeling. A 3D range finder usually produces more accurate 3D points at a finer level than image-based approaches, but also results in a much larger data size for city modeling.
List of Algorithms
1 The DLT calibration ... 38
2 The three-point pose ... 41
3 The seven-point fundamental matrix ... 45
4 The linear eight-point fundamental matrix ... 46
5 The five-point relative orientation ... 52
6 The six-point projective reconstruction ... 61
7 The linear auto-calibration ... 71
8 Bundle-adjustment ... 87
9 RANSAC algorithm ... 90
10 Sparse two-view ... 91
11 Sparse three-view ... 91
12 Sparse SFM ... 92
13 Incremental sparse SFM ... 93
14 Unstructured sparse SFM ... 93
15 Match propagation ... 96
16 Quasi-dense two-view ... 105
17 Quasi-dense three-view ... 109
18 Quasi-dense SFM ... 110
19 Photo-consistency check for occlusion detection ... 181
20 Inverse orthographic patching ... 211
List of Figures
1.1 A small-scale modeling example. (a) One of the twenty-five images captured by a handheld camera. (b) The reconstructed quasi-dense 3D points and their segmentation. (c) A desired segmentation of an input image. (d) A rendered image at a similar viewpoint of the reconstructed 3D model based on the segmentation results in (b) and (c). . . . 2
1.2 A large-scale city modeling example in downtown Pittsburgh areas from images (shown on the bottom) captured at ground level by a camera mounted on a vehicle. . . . 3
2.1 Left: affine coordinates as coefficients of linear combinations. Right: affine coordinates as ratios. . . . 9
2.2 The invariance of a cross-ratio. . . . 12
2.3 Definition of (inhomogeneous) projective coordinates (α, β) of a point P on a plane through four reference points A, B, C and D. . . . 13
2.4 Affine classification of a conic: an ellipse, a parabola, and a hyperbola is a conic intersecting the line at infinity at 0 point, 1 (double) point, and 2 points. . . . 16
3.1 The central projection in the camera-centered coordinate frame. . . . 31
3.2 The transformation within the image plane between the Cartesian frame xy − p and the affine frame uv − o. . . . 32
3.3 A Chasles conic defined by a constant cross-ratio of a pencil of lines. . . . 35
3.4 Determining the view line associated with a known point A. . . . 36
3.5 The geometric constraint for a pair of points. . . . 40
3.6 The epipolar geometry of two views. A pair of corresponding points is u ↔ u′. The camera centers are c and c′. The epipoles are e and e′ in the first and second image. The line l′ is the epipolar line of the point u. . . . 43
3.7 A circular cylinder is one simple form of the critical configurations. . . . 53
3.8 The geometry of three views as two epipolar geometries. . . . 54
3.9 The geometry of three views for line correspondence. . . . 56
3.10 The configuration of four non-coplanar points in space. . . . 65
5.1 A typical sparseness pattern in a bundle adjustment (courtesy of M. Lhuillier). . . . 88
5.2 Possible matches (u, u′) and (v, v′) around a seed match (x, x′) come from its 5 × 5-neighbor N(x) and N(x′) as the smallest size for discrete 2D disparity gradient limit. The match candidates for u (resp. v′) are within the 3 × 3 window (black framed) centered at u (resp. v). . . . 96
5.3 The disparity maps produced by propagation with different seed points and without the epipolar constraint. Left column: automatic seed points with the match outliers marked with a square instead of a cross. Middle column: four seed points manually selected. Right column: four seed points manually selected plus 158 match outliers with strong correlation score (ZNCC > 0.9). . . . 98
5.4 Two images with low geometric distortion of a small wooden house. . . . 99
5.5 Examples of propagation for low textured images in Figure 5.4. (a) The disparity map automatically produced with the epipolar constraint. (b) The disparity map produced by a single manual seed on the bottom without the epipolar constraint. (c) The common matching areas (in black) between the two maps in (a) and (b). . . . 99
5.6 Two views of a scene with background A, foreground B and half occluded areas C and D. Assume that correct matches within A and B have better scores than bad ones within C and D. A and B are first filled in by propagation from seeds a and b before the algorithm attempts to grow in C or D. Once A and B are matched, the procedure is stopped by the uniqueness constraint at the boundary of C in the first view (resp. D in the second view) because the corresponding boundary in the second view (resp. the first one) encloses an empty area. . . . 100
5.7 The disparity maps produced by propagation with different seed points and without the epipolar constraint. Left and Middle columns: four manually selected seed points marked by a cross between the 1st and the 20th frame of the flower garden. Right column: remove one manual seed located on the front tree. It has more match outliers in the occluded regions. . . . 100
5.8 (a) Two sub-images of an electric post. The disparity maps without (b) and with (c) the epipolar constraint. . . . 101
5.9 For each corresponding patch, the resampled points include the center point of the patch and all points of interest within the patch and their homography-induced correspondences in the second image. . . . 104
5.10 Seed matches and the resulting propagation for a Lady 2 image pair. . . . 106
5.11 The improvements made by the constrained propagation and local homography fitting. The matching results from the second constrained propagation in (b) compared to the first unconstrained propagation in (a). (c) The matching results from the robust local homography fitting with respect to the second constrained propagation. . . . 107
5.12 Quasi-dense two-view computation. (a) The quasi-dense disparity map by two propagations and the estimated epipolar geometry. (b) The resampled quasi-dense correspondence points. . . . 108
5.13 Quasi-dense three-view computation. The inliers of quasi-dense correspondences in a triplet of images are in white, and the outliers are in gray. . . . 109
5.14 (a) A top view of the Euclidean quasi-dense geometry. (b) A close-up view of the face in point cloud. . . . 111
5.15 Quasi reconstruction and its ellipsoids for Lady 2. . . . 113
5.16 Three images of the Lady 1 sequence from D. Taylor on the top row. QUASI (left) and SPARSE (right) reconstruction and their 90% ellipsoids (magnified by a scale of 4) viewed on a horizontal plane. . . . 114
5.17 Quasi reconstruction and its ellipsoids for Garden-cage. . . . 115
5.18 Sparse and quasi reconstructions, and their ellipsoids for corridor sequence. . . . 116
6.1 One input image and the reconstructed surface geometry (Gouraud shaded and texture mapped). . . . 121
6.2 Overview of our surface reconstruction method. . . . 122
6.3 Reconstructed points, Gouraud-shaded and texture-shaded 3D models for many image sequences. . . . 129
6.4 Reconstructed points, Gouraud-shaded and texture-shaded 3D models for many image sequences. . . . 130
6.5 Reconstructed points, Gouraud-shaded and texture-shaded 3D models for many image sequences. . . . 131
6.6 Surface geometry obtained with BR3D and BR3D+2D to improve low-textured cheeks. . . . 132
6.7 Surface initialization, BR3D and BR3D+2D. . . . 132
6.8 Surface initialization, BR3D and BR3D+2D. . . . 132
6.9 Surfaces obtained with the BR2D method after 400 and 1000 iterations. . . . 133
6.10 Surfaces obtained with curvature and Laplacian smoothed BR3D methods. . . . 134
6.11 Surfaces computed using (a) BR3D method with only quasi-dense 3D points, (b) BR3D+S method with a combination of the quasi-dense 3D points and the silhouettes, (c) BR3D+2D method with a combination of the quasi-dense 3D points and the image photo-consistency, and (d) S method with only the silhouettes. . . . 135
7.1 From left to right: one of the 40 images captured by a handheld camera under natural conditions; the recovered hair rendered with the recovered diffuse color; a fraction of the longest recovered hair fibers rendered with the recovered diffuse color to show the hair threads; the recovered hair rendered with an artificial constant color. . . . 137
7.2 Overview of our hair modeling system. . . . 139
7.3 a. Hair volume, marked in violet, is approximated from a visual hull Smax and an inward offset surface Smin of the visual hull. b. The visibility of a fiber point P is determined from that of the closest point Pmax on the hair surface. The projection of P onto an image is not a point, but a small image area that is the projection of a ball around P with radius P Pmax. . . . 140
7.4 Example of a typical man's short and dark hair. In each row, one of the original images on the left, the recovered hair rendered with the recovered diffuse colors in the middle, and rendered with a curvature map on the right. . . . 145
7.5 Example of a difficult curly hair style. One of the original images on the left. The recovered hair rendered with the recovered diffuse colors in the middle, rendered with a curvature map on the right. . . . 145
7.6 Example of a typical woman's style. One of the original images on the top left. The recovered hair rendered with the recovered diffuse colors on the top right, rendered with a fraction of long fibers on the bottom left, rendered with a curvature map on the bottom right. . . . 146
7.7 Example of a typical long wavy hair style. One of the original images on the top left. The recovered hair rendered with recovered diffuse colors on the top right, rendered with a constant hair color using the self-shadowing algorithm on the bottom left, rendered with the recovered diffuse colors using the self-shadowing algorithm on the bottom right. . . . 147
8.1 Nephthytis plant. An input image out of 35 images on the left, and recovered model rendered at the same viewpoint as the image on the left. . . . 149
8.2 Overview of our tree modeling system. . . . 150
8.3 Overview of our model-based leaf extraction and reconstruction. . . . 151
8.4 Spectrum of plants and trees based on relative leaf size: On the left end of the spectrum, the size of the leaves relative to the plant is large. On the right end, the size of the leaves is small with respect to the entire tree. . . . 152
8.5 Bare tree example. From left to right: one of the source images, superimposed branch-only tree model, and branch-only tree model rendered at a different viewpoint. . . . 155
8.6 Branch reconstruction for two different trees. The left column shows the skeletons associated with visible branches while the right are representative replication blocks. (a,b) are for the fig tree in Figure 8.18, and (c,d) are for the potted flower tree in Figure 8.13. (Only the main branch of the flower tree is clearly visible.) . . . 156
8.7 An example of totally occluded branch structure by the constraint growth. It started from a vertical segment on the bottom left with three subtrees given on the bottom right. . . . 158
8.8 Branch structure editing. The editable areas are shown: 2D area (left), 3D space (right). The user modifies the radii of the circles or spheres (shown in red) to change the thicknesses of branches. . . . 159
8.9 Segmentation and clustering. (a) Matted leaves from source image. (b) Regions created after the mean shift filtering. (c) The first 30 color-coded clusters. (d) 17 textured clusters (textures from source images). . . . 160
8.10 Benefit of jointly using 3D and 2D information. (a) The projection of visible 3D points (in yellow) and connecting edges (in red) are superimposed on an input image. Using only 3D distance resulted in coarse segmentation of the leaves. (b) The projection of segmented 3D points with only the connecting edges superimposed on the gradient image (in white). A different color is used to indicate a different group of connecting edges. Using both 3D and 2D image gradient information resulted in segmentation of leaflets. (c) Automatically generated leaflets are shown as solid-colored regions. The user drew the rough boundary region (thick orange line) to assist segmentation, which relabels the red points and triggers a graph update. The leaflet boundary is then automatically extracted (dashed curve). . . . 163
8.11 Leaf reconstruction for poinsettia (Figure 8.18). (a) Reconstructed flat leaves using 3D points. The generic leaf model is shown at top right. (b) Leaves after deforming using image boundary and closest 3D points. . . . 165
8.12 Large tree example. Two of the source images on the left column. Reconstructed model rendered at the same viewpoint on the right column. . . . 168
8.13 Potted flower tree example. (a) One of the source images, (b) reconstructed branches, (c) complete tree model, and (d) model seen from a different viewpoint. . . . 169
8.14 Image-based modeling of a tree. From left to right: A source image (out of 18 images), reconstructed branch structure rendered at the same viewpoint, tree model rendered at the same viewpoint, and tree model rendered at a different viewpoint. . . . 170
8.15 Medium-sized tree example. From left to right: (a) One of the source images. (b) Reconstructed tree model. (c) Model rendered at a different viewpoint. . . . 171
8.16 An indoor tree. (a) an input image (out of 45). (b) the recovered model, inserted into a simple synthetic scene. . . . 171
8.17 Schefflera plant. Top left: an input image (out of 40 images). Top right: recovered model with synthesized leaves rendered at the same viewpoint as the top left. Bottom right: recovered model from images only. The white rectangles show where the synthesized leaves were added in the top right. Bottom left: recovered model after some geometry editing. . . . 172
8.18 Image-based modeling of poinsettia plant. (a) An input image out of 35 images, (b) recovered model rendered at the same viewpoint as (a), (c) recovered model rendered at a different viewpoint, (d) recovered model with modified leaf textures. . . . 173
9.1 A few façade modeling examples: some input images in the bottom, the recovered model rendered in the middle row and on the top left, and two zoomed sections of the recovered model rendered in the top middle and on the top right. . . . 177
9.2 Overview of the semi-automatic approach to image-based façade modeling. . . . 179
9.3 A simple façade can be initialized from a flat rectangle (a), a cylindrical portion (b) or a developable surface (c). . . . 180
9.4 Interactive texture refinement: (a) drawn strokes on the object to indicate removal. (b) the object is removed. (c) automatic inpainting. (d) some green lines drawn to guide the structure. (e) better result achieved with the guide lines. . . . 183
9.5 Structure preserving subdivision. The hidden structure of the façade is extracted out to form a grid in (b). Such hypotheses are evaluated according to the edge support in (c), and the façade is recursively subdivided into several regions in (d). Since there is not enough support between regions A, B, C, D, E, F, G, H, they are all merged into one single region M in (e). . . . 183
9.6 Merging support evaluation. . . . 185
9.7 A DAG for repetitive pattern representation. . . . 187
9.8 Five operations for subdivision refinement: the left figure is the original subdivision layout shown in red and the user sketched stroke shown in green, while the right figure is the resulting subdivision layout. . . . 188
9.9 Markov Random Field on the façade surface. . . . 189
9.10 Three typical façade examples: (a) One input view. (b) The 3D points from SFM. (c) The automatic façade partition (the group of repetitive patterns is color-coded) on the initial textured flat façade. (d) The user-refined final partition. (e) The re-estimated smoothed façade depth. (f) The user-refined final depth map. (g) The façade geometry. (h) The textured façade model. . . . 191
9.11 The modeling of a Chapel Hill street from 616 images: two input images on the top left, the recovered model rendered in the bottom row, and two zoomed sections of the recovered model rendered in the middle and on the right of the top row. . . . 193
9.12 Modeling of the Hennepin avenue in Minneapolis from 281 images: some input images in the bottom row, the recovered model rendered in the middle row, and three zoomed sections of the recovered model rendered in the top row. . . . 195
9.13 Atypical façade examples: the geometry on the left and the textured model on the right. . . . 196
10.1 An example of automatically reconstructed buildings in Pittsburgh from the images shown on the bottom. . . . 199
10.2 Overview of our automatic street-side modeling approach. . . . 201
10.3 Some samples of input images captured at ground level. . . . 202
10.4 Reconstructed quasi-dense 3D points and camera poses. . . . 202
10.5 Reconstructed 3D vertical lines in red with the 3D points. . . . 203
10.6 Recognition and segmentation. (a) One input image. (b) The over-segmented patches. (c) The recognition per pixel. (d) The segmentation. . . . 205
10.7 Graph topology for multi-view semantic segmentation. . . . 207
10.8 Building block partition. Different blocks are shown by different colors. . . . 208
10.9 The rectification results. (a) Global vertical rectification. (b) Local horizontal rectification. . . . 210
10.10 Inverse orthographic composition. (a) Depth map in input image space. (b) Partial orthographic depth map from one view. (c) Partial orthographic texture from one view. (d) Composed orthographic depth map (unreliably estimated pixels are in yellow). (e) Composed orthographic texture. (f) Composed orthographic building region. . . . 212
10.11 Structure analysis and regularization for modeling. (a) The façade segmentation. (b) The data cost of boundary regularization. The cost is color-coded from high at red to low at blue via green as the middle. (c) The regularized depth map. (d) The texture-mapped façade. (e) The texture-mapped block. (f) The block geometry. . . . 214
10.12 Repetitive pattern rediscovery. . . . 216
10.13 An example of MRF to optimize the façade upper boundary. . . . 218
10.14 Texture optimization. (a) The original orthographic texture image. (b) The optimized texture image. (c) A direct texture composition. The optimized texture image in (b) is clearer than the original orthographic texture image in (a), and has no texture from occluding objects, such as the one contained in (c). . . . 219
10.15 Modeling examples of various blocks (Part 1). (a) The orthographic texture. (b) The orthographic color-coded depth map (yellow pixels are unreliable). (c) The façade segmentation. (d) The regularized depth map. (e) The geometry. (f) The textured model. . . . 220
10.16 The top figure is a street-side view automatically generated from the images shown below. The bottom figure is the same street-side view interactively generated using the method presented in the previous chapter for comparison. . . . 222
10.17 One view of a city street model automatically generated from the images shown on the bottom. . . . 222
10.18 Another example of automatically reconstructed buildings in Pittsburgh from the images shown on the bottom. . . . 223
References
1. P. Anandan. A computational framework and an algorithm for the measurement of visual motion. International Journal of Computer Vision, 2:283–310, 1989. 2. M. Armstrong, A. Zisserman, and R. Hartley. Self-calibration from image triplets. In B. Buxton and R. Cipolla, editors, Proceedings of the 4th European Conference on Computer Vision, Cambridge, England, volume 1064 of Lecture Notes in Computer Science, pages 3–16. Springer-Verlag, April 1996. 3. S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Wu. An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. Journal of the ACM, 45(6):891– 923, 1998. 4. O. Barinova, V. Konushin, A. Yakubenko, H. Lim, and A. Konushin. Fast automatic singleview 3-d reconstruction of urban scenes. European Conference on Computer Vision, pages 100–113, 2008. 5. J. Barron, D. Fleet, and S. Beauchemin. Performance of optical flow techniques. International Journal of Computer Vision, 12(1):43–77, 1994. 6. P. Beardsley, P. Torr, and A. Zisserman. 3D model acquisition from extended image sequences. In B. Buxton and R. Cipolla, editors, Proceedings of the 4th European Conference on Computer Vision, Cambridge, England, volume 1065 of Lecture Notes in Computer Science, pages 683–695. Springer-Verlag, April 1996. 7. A.C. Berg, F. Grabler, and J. Malik. Parsing images of architectural scenes. IEEE International Conference on Computer Vision, pages 1–8, 2007. 8. F. Bertails, C. M´enier, and M-P. Cani. A practical self-shadowing algorithm for interactive hair animation. In Proc. Graphics Interface, May 2005. 9. P.J. Besl and N.D. McKay. A method for registration of 3D shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):239–256, 1992. 10. Y. Boykov, O. Veksler, and R. Zabih. Disparity component matching for visual correspondence. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Santa Barbara, California, USA, pages 470–475, 1998. 11. Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11):1222–1239, 2001. 12. T.P. Breckon and R.B. Fisher. Non-parametric 3d surface completion. 3DIM ’05: Proceedings of the Fifth International Conference on 3-D Digital Imaging and Modeling, pages 573–580, 2005. 13. L. Bretzner and T. Lindeberg. Use your hand as a 3d mouse, or, relative orientation from extended sequences of sparse point and line correspondences using the affine trifocal tensor. In Proceedings of the 5th European Conference on Computer Vision, Freiburg, Germany, pages 142–157, 1998. 14. G. Brostow, I. Essa, D. Steedly, and V. Kwatra. Novel skeletal representation for articulated creatures. European Conference on Computer Vision, pages 66–78, 2004.
15. D.C. Brown. The bundle adjustment – progress and prospects. International Archive of Photogrammetry, 21, 1976. Update of “Evolution, Application and Potential of the Bundle Method of Photogrammetric Triangulation”, ISP Symposium Commission III, Stuttgart, 1974. 16. T. Buchanan. The twisted cubic and camera calibration. Computer Vision, Graphics and Image Processing, 42(1):130–132, April 1988. 17. J. Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6):679–698, 1986. 18. S. Carlsson. Duality of reconstruction and positioning from projective views. In Workshop on Representation of Visual Scenes, Cambridge, Massachusetts, USA, pages 85–92, June 1995. 19. S. Carlsson and D. Weinshall. Dual computation of projective shape and camera positions from multiple images. International Journal of Computer Vision, 27(3):227–241, May 1998. 20. V. Caselles, R. Kimmel, and G. Sapiro. Geodesic active contours. International Journal of Computer Vision, 22(1):61–79, 1997. 21. V. Caselles, R. Kimmel, G. Sapiro, and C. Sbert. Minimal surfaces based object segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):394–398, April 1997. 22. J.T. Chang, J. Jin, and Y. Yu. A practical model for hair mutual interactions. In Proc. of SIGGRAPH conference. ACM, 2002. 23. H.H. Chen. Pose determination from line-to-plane correspondence: Existence condition and closed-form solutions. In Proceedings of the 3rd International Conference on Computer Vision, Osaka, Japan, pages 374–378, 1990. 24. D. Comaniciu and P. Meer. Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619, 2002. 25. N. Cornelis, B. Leibe, K. Cornelis, and L. Van Gool. 3D urban scene modeling integrating recognition and reconstruction. International Journal of Computer Vision, 78(2):121–141, 2008. 26. D. Cox, J. Little, and D. O’Shea. Ideals, Varieties, and Algorithms. Springer, 1998. 27. I.J. Cox, S. Hingorani, B.M. Maggs, and S.B. Rao. Stereo without regularization, October 1992. 28. A. Criminisi, P. Perez, and K. Toyama. Object removal by exemplar-based inpainting. In Proceedings of IEEE Computer Vision and Pattern Recognition, volume 2, pages 721–728, Madison, Wisconsin, June 2003. 29. B. Curless and M. Levoy. A volumetric method for building complex models from range images. In Proceedings of SIGGRAPH, New Orleans, LA, pages 303–312, 1996. 30. M. de Berg, O. Cheong, M. Van Kreveld, and M. Overmars. Computational Geometry: Algorithms and Applications. Springer, Berlin, 3rd ed. edition, 2008. 31. P. de Reffye, C. Edelin, J. Francon, M. Jaeger, and C. Puech. Plant models faithful to botanical structure and development. In SIGGRAPH, pages 151–158, August 1988. 32. P.E. Debevec, C.J. Taylor, and J. Malik. Modeling and rendering architecture from photographs: a hybrid geometry-and image-based approach. In SIGGRAPH ’96, New Orleans, August 1996. 33. M. Demazure. Sur deux probl`emes de reconstruction. Technical report, INRIA, 1988. 34. H. Derek, A.A. Efros, and H. Martial. Automatic photo pop-up. Proceeding of SIGGRAPH conference, 24:577–584, 2005. 35. F. Devernay and O.D. Faugeras. Automatic calibration and removal of distortion from scenes of structured environments. In Proceedings of the SPIE Conference on Investigate and Trial Image Processing, San Diego, California, USA, volume 2567. SPIE - Society of Photo-Optical Instrumentation Engineers, July 1995. 36. M. Dhome, M. 
Richetin, J.T. Laprest´e, and G. Rives. Determination of the attitude of 3D objects from a single perspective view. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(12):1265–1278, December 1989. 37. U.R. Dhond and J.K. Aggarwal. Structure from stereo – a review. IEEE Transactions on Systems, Man and Cybernetics, 19(6):1489–1510, November 1989.
38. A. Dick, P.H.S. Torr, and R. Cipolla. Modelling and interpretation of architecture from several images. International Journal of Computer Vision, 2:111–134, 2004. 39. A.A. Efros and T.K. Leung. Texture synthesis by non-parametric sampling. IEEE International Conference on Computer Vision, pages 1033–1038, 1999. 40. L. Falkenhagen. Depth estimation from stereoscopic image pairs assuming piecewise continuous surfaces. In Y. Paker and S. Wilbur, editors, Proceedings of the Workshop on Image Processing for Broadcast and Video Production, Hamburg, Germany, Series on Workshops in Computing, pages 115–127. Springer-Verlag, November 1994. 41. O. Faugeras. On the motion of 3D curves and its relationship to optical flow. Rapport de Recherche 1183, INRIA, Sophia–Antipolis, France, March 1990. 42. O. Faugeras. What can be seen in three dimensions with an uncalibrated stereo rig? 1992. 43. O. Faugeras. Three-Dimensional Computer Vision - A Geometric Viewpoint. Artificial intelligence. The MIT Press, Cambridge, MA, USA, Cambridge, MA, 1993. 44. O. Faugeras. Stratification of three-dimensional vision: Projective, affine and metric representations. Journal of the Optical Society of America, 12:465–484, 1995. 45. O. Faugeras and M. Hebert. The representation, recognition, and locating of 3D objects. The International Journal of Robotics Research, 5:27–52, 1986. 46. O. Faugeras and R. Keriven. Complete dense stereovision using level set methods. In Proceedings of the 5th European Conference on Computer Vision, Freiburg, Germany, pages 379–393, 1998. 47. O. Faugeras, Q. Luong, and T. Papadopoulo. The Geometry of Multiple Images. The MIT Press, Cambridge, MA, USA, 2001. 48. O. Faugeras and S. Maybank. Motion from point matches: Multiplicity of solutions. International Journal of Computer Vision, 3(4):225–246, 1990. 49. O. Faugeras and B. Mourrain. On the geometry and algebra of the point and line correspondences between n images. In Proceedings of the 5th International Conference on Computer Vision, Cambridge, Massachusetts, USA, pages 951–956, June 1995. 50. O. Faugeras and B. Mourrain. On the geometry and algebra of the point and line correspondences between n images. Technical Report 2665, INRIA, October 1995. 51. O. Faugeras, L. Quan, and P. Sturm. Self-calibration of a 1d projective camera and its application to the self-calibration of a 2d projective camera. In Proceedings of the 5th European Conference on Computer Vision, Freiburg, Germany, pages 36–52, June 1998. 52. O. Faugeras and G. Toscani. Camera calibration for 3D computer vision. In Proceedings of International Workshop on Machine Vision and Machine Intelligence, Tokyo, Japan, 1987. 53. P. Felzenszwalb and D. Huttenlocher. Efficient Graph-Based image segmentation. International Journal of Computer Vision, 59(2):167–181, 2004. 54. M.A. Fischler and R.C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381 – 395, June 1981. 55. A.W. Fitzgibbon and A. Zisserman. Automatic camera recovery for closed or open image sequences. In European Conference on Computer Vision, pages 311–326, june 1998. 56. R. Fletcher. Practical Methods of Optimization. John Wiley & Sons, Ltd., second edition, 1987. 57. W. F¨orstner. Reliability analysis of parameter estimation in linear models with applications to mensuration problems in computer vision. Computer Vision, Graphics and Image Processing, 40:273–310, 1987. 58. W. F¨orstner. 
A framework for low level feature extraction. In Proceedings of the 3rd European Conference on Computer Vision, Stockholm, Sweden, pages 383–394, 1994. 59. D.A. Forsyth and J. Ponce. Computer Vision: A Modern Approach. Prentice Hall, 2003. 60. C. Frueh and A. Zakhor. Automated reconstruction of building facades for virtual walk-thrus. ACM Transaction on Graphics (Proceeding of SIGGRAPH conference), pages 1–1, 2003. 61. P. Fua. Combining stereo and monocular information to compute dense depth maps that preserve discontinuities. In Proceedings of the 12th International Joint Conference on Artificial Intelligence, Sydney, Australia, August 1991.
62. P. Fua. Parametric models are versatile: The case of model based optimization. In ISPRS WG III/2 Joint Workshop, Stockholm, Sweden, September 1995. 63. P. Fua. From multiple stereo views to multiple 3d surfaces. International Journal of Computer Vision, 24(1):19–35, 1997. 64. Y. Furukawa and J. Ponce. Accurate, dense, and robust multi-view stereopsis. In Proceedings of IEEE Computer Vision and Pattern Recognition, 2007. 65. S. Ganapathy. Decomposition of transformation matrices for robot vision. In Proc. of IEEE conference on Robotics and Automation, pages 130–139, 1984. 66. A. Van Gelder and J. Wilhelms. An interactive fur modeling technique. In Proc. Graphics Interface, pages 181–188, 1997. 67. S. Geman and D. Geman. Stochastic relaxation, gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721– 741, November 1984. 68. M. Goesele, N. Snavely, B. Curless, S.M. Seitz, and H. Hoppe. Multi-view stereo for community photo collections. IEEE International Conference on Computer Vision, pages 1–8, 2007. 69. J. Gomes and A. Mojsilovic. A variational approach to recovering a manifold from sample points. In Proceedings of the 7th European Conference on Computer Vision, Copenhagen, Denmark, volume 2, pages 3–17. Springer-Verlag, May 2002. 70. S. Grabli, F. Sillion, S.R. Marschner, and J.E. Lengyel. Image-based hair capture by inverse lighting. Graphics Interface, pages 51–58, 2002. 71. P. Gros and L. Quan. Projective invariants for vision. Technical Report RT 90 IMAG - 15 LIFIA, LIFIA – IRIMAG, Grenoble, France, December 1992. 72. F. Han and S.-C. Zhu. Bayesian reconstruction of 3d shapes and scenes from a single image. In Proc. IEEE Workshop on Higher-Level Knowledge in 3D Modeling and Motion Analysis, pages 12–20, October 2003. 73. R.M. Haralick, C. Lee, K. Ottenberg, and M. N¨olle. Analysis and solutions of the three point perspective pose estimation problem. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Maui, Hawaii, USA, pages 592–598, 1991. 74. C. Harris and M. Stephens. A combined corner and edge detector. In Alvey Vision Conference, pages 147–151, 1988. 75. R. Hartley. Lines and points in three views - an integrated approach. Technical report, G.E. CRD, 1994. 76. R. Hartley. In defence of the 8-point algorithm. In Proceedings of the 5th International Conference on Computer Vision, Cambridge, Massachusetts, USA, pages 1064–1070, June 1995. 77. R.I. Hartley. Euclidean reconstruction from uncalibrated views. In Proceeding of the DARPA – ESPRIT workshop on Applications of Invariants in Computer Vision, Azores, Portugal, pages 187–202, October 1993. 78. R.I. Hartley. Lines and points in three views and the trifocal tensor. International Journal of Computer Vision, 22(2):125–140, 1997. 79. R.I. Hartley, R. Gupta, and T. Chang. Stereo from uncalibrated cameras. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Urbana-Champaign, Illinois, USA, pages 761–764, 1992. 80. R.I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, June 2000. 81. A. Hengel, A. Dick, T. Thormhlen, B. Ward, and P. Torr. VideoTrace: rapid interactive scene modelling from video. Proceeding of SIGGRAPH conference, 26:86, 2007. 82. A. Heyden. Geometry and Algebra of Multiple Projective Transformations. PhD thesis, Lund Institute of Technology, 1995. 83. A. Heyden. Reduced multilinear constraints - theory and experiments. 
International Journal of Computer Vision, 1(30):5–26, 1998. 84. R.J. Holt and A. N. Netravali. Uniqueness of solutions to three perspective views of four points. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(3):303–307, 1995.
85. H. Hoppe, T. Derose, T. Duchamp, J. McDonalt, and W. Stuetzle. Surface reconstruction from unorganized points. Computer Graphics, 26:71–77, 1992. 86. S.D. Hordley. Scene illumination estimation: past, present, and future. Color research and application, 31(4):303–314, 2006. 87. B.K.P. Horn. Closed form solution of absolute orientation using unit quaternions. Journal of the Optical Society of America A, 4(4):629–642, 1987. 88. B.K.P. Horn. Projective geometry considered harmful. Technical report, MIT, 1999. 89. T.S. Huang and O.D. Faugeras. Some properties of the E matrix in two-view motion estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(12):1310–1312, December 1989. 90. T. Ijiri, O. Owada, M. Okabe, and T. Igarashi. Floral diagrams and inflorescences: Interactive flower modeling using botanical structural constraints. ACM Transactions on Graphics (SIGGRAPH), 24(3):720–726, July 2005. 91. S.S. Intille and A.F. Bobick. Disparity-space images and large occlusion stereo. In Proceedings of the 3rd European Conference on Computer Vision, Stockholm, Sweden, pages 179–186. Springer-Verlag, 1994. 92. E. Izquierdo and S. Kruse. Image analysis for 3d modeling, rendering and virtual view generation. Computer Vision and Image Understanding, 71(2):231–253, August 1998. 93. J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In British Machine Vision Conference, pages 384–393, 2002. 94. J. Xiao, T. Fang, P. Tan, P. Zhao, E. Ofek, and L. Quan. Image-based façade modeling. ACM Transactions on Graphics, 27(5):161:1–161:10, 2008. 95. J. Xiao, T. Fang, P. Zhao, M. Lhuillier, and L. Quan. Image-based street-side city modeling. ACM Transactions on Graphics, 28(5):114:1–114:12, 2009. 96. G. Jiang, L. Quan, and H.T. Tsui. Circular motion geometry using minimal data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6):721–731, 2004. 97. G. Jiang, H.T. Tsui, L. Quan, and A. Zisserman. Geometry of single axis motion using conic fitting. IEEE Transactions on Pattern Analysis and Machine Intelligence, to appear, 2003. 98. S. Kichenassamy, A. Kumar, P. Olver, A. Tannenbaum, and A. Yezzi. Gradient flows and geometric active contour models. In Proceedings of the 5th International Conference on Computer Vision, Cambridge, Massachusetts, USA, pages 810–815, June 1995. 99. T.-Y. Kim and U. Neumann. Interactive multi-resolution hair modeling and editing. In Proc. of SIGGRAPH conference. ACM, 2002. 100. J. Koenderink and A. van Doorn. Affine structure from motion. Journal of the Optical Society of America A, 8(2):377–385, 1991. 101. J.J. Koenderink. The structure of images. Biological Cybernetics, 50:363–396, 1984. 102. J.J. Koenderink. Solid Shape. The MIT Press, Cambridge, MA, USA, 1990. 103. J.J. Koenderink and A.J. van Doorn. Affine structure from motion. Technical report, Utrecht University, Utrecht, The Netherlands, October 1989. 104. V. Kolmogorov and R. Zabih. Multi-camera scene reconstruction via graph cuts. In Proceedings of the 7th European Conference on Computer Vision, Copenhagen, Denmark, volume 3, May 2002. 105. A. Koschan. What is new in computational stereo since 1989: A survey on stereo papers. Technical report, Department of Computer Science, University of Berlin, August 1993. 106. J. Krames. Zur Ermittlung eines Objektes aus zwei Perspektiven (Ein Beitrag zur Theorie der „gefährlichen Örter“). Monatshefte für Mathematik und Physik, 49:327–354, 1941. 107. K.N. Kutulakos and S.M. Seitz.
A theory of shape by space carving. In Proceedings of the 7th International Conference on Computer Vision, Kerkyra, Greece, volume 1, pages 307–314, 1999. 108. A. Laurentini. The visual hull concept for silhouette based image understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(2):150–162, 1994. 109. S. Laveau. G´eom´etrie d’un syst`eme de N cam´eras. Th´eorie, estimation, et applications. ´ Th`ese de doctorat, Ecole Polytechnique, May 1996.
110. R.K. Lenz and R.Y. Tsai. Techniques for calibration of the scale factor and image center for high accuracy 3D machine vision metrology. In Proceedings of IEEE International Conference on Robotics and Automation, Raleigh, North Carolina, USA, pages 68–75, 1987. 111. M. Lhuillier. Efficient dense matching for textured scenes using region growing. In Proceedings of the ninth British Machine Vision Conference, Southampton, England, pages 700–709, 1998. 112. M. Lhuillier and L. Quan. Image interpolation by joint view triangulation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Fort Collins, Colorado, USA, volume 2, pages 139–145, 1999. 113. M. Lhuillier and L. Quan. Edge-constrained joint view triangulation for image interpolation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Hilton Head Island, South Carolina, USA, volume 2, pages 218–224, June 2000. 114. M. Lhuillier and L. Quan. Robust dense matching using local and global geometric constraints. In Proceedings of the 15th International Conference on Pattern Recognition, Barcelona, Spain, volume 1, pages 968–972, 2000. ICPR’2000 Piero Zamperoni Best Student Paper Award. 115. M. Lhuillier and L. Quan. Image-based rendering by match propagation and joint view triangualtion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(8):1140– 1146, 2002. 116. M. Lhuillier and L. Quan. Quasi-dense reconstruction from image sequence. In Proceedings of the 7th European Conference on Computer Vision, Copenhagen, Denmark, volume 2, pages 125–139, 2002. 117. M. Lhuillier and L. Quan. Image interpolation by match propagation and joint view triangulation. IEEE Transactions on Video and Circuits, Special Issue on Image-Based Modeling and Rendering, to appear, 2003. 118. M. Lhuillier and L. Quan. Surface reconstruction by integrating 3d and 2d data of multiple views. In Proceedings of the 9th International Conference on Computer Vision, Nice, France, October 2003. 119. M. Lhuillier and L. Quan. A quasi-dense approach to surface reconstruction from uncalibrated images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(3):418– 433, 2005. 120. Y. Li, J. Sun, C.K. Tang, and H.Y. Shum. Lazy snapping. SIGGRAPH 2004, Los Angeles, USA, 23(3):303–308, 2004. 121. T. Lindeberg. Feature detection with automatic scale selection. International Journal of Computer Vision, 30(2):79–116, 1998. 122. Y. Liu and T.S. Huang. A linear algorithm for motion estimation using straight line correspondences. Computer Vision, Graphics and Image Processing, 44(1):35–57, October 1988. 123. Y. Liu, T.S. Huang, and O.D. Faugeras. Determination of camera location from 2D to 3D line and point. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(1):28– 37, January 1990. 124. H.C. Longuet-Higgins. A computer program for reconstructing a scene from two projections. Nature, 293:133–135, September 1981. 125. H.C. Longuet-Higgins. A method of obtaining the relative positions of 4 points from 3 perspective projections. In Proceedings of the British Machine Vision Conference, Glasgow, Scotland, pages 86–94, 1991. 126. M.I.A. Lourakis and A.A. Argyros. The design and implementation of a generic sparse bundle adjustment software package based on the levenberg-marquardt algorithm. Technical Report 340, Institute of Computer Science - FORTH, Heraklion, Crete, Greece, Aug. 2004. Available from http://www.ics.forth.gr/˜lourakis/sba. 127. D. Lowe. 
Distinctive image feature from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004. 128. D.G. Lowe. Perceptual Organization and Visual Recognition. Kluwer Academic Publishers, Norwell, Massachusets, 1985.
129. B.D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proceedings of the 7th International Joint Conference on Artificial Intelligence, 1981. 130. Q.T. Luong. Matrice fondamentale et autocalibration en vision par ordinateur. Th`ese de doctorat, Universit´e de Paris-Sud, Orsay, France, December 1992. 131. Q.T. Luong and T. Vieville. Canonic representations for the geometries of multiple projective views. Technical report, University of California, Berkeley, EECS, Cory Hall 211-215, University of California, Berkeley, CA 94720, October 1993. 132. Y. Ma, S. Soatto, J. Kosecka, and S. Sastry. An Invitation to 3-D Vision: From Images to Geometric Models. 2001. 133. R. Malladi, J.A. Sethian, and B.C. Vemuri. Shape modeling with front propagation: A level set approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(2):158– 175, February 1995. 134. D. Marr. Vision. W.H. Freeman and Company, San Francisco, California, USA, 1982. 135. S. Marschner, H. Jensen, M. Cammarano, S. Worley, and P. Hanrahan. Light scattering from human hair fibers. ACM Transactions on Graphics, 3(22):780–791, 2003. 136. W. Matusik, C. Buehler, R. Raskar, S. Gortler, and L. McMillan. Image-based visual hulls. In SIGGRAPH 2000, New Orleans, USA, pages 427–437, 2000. 137. S. Maybank. Theory of Reconstruction from Image Motion. Springer-Verlag, 1993. 138. S.J. Maybank and O.D. Faugeras. A theory of self calibration of a moving camera. International Journal of Computer Vision, 8(2):123–151, 1992. 139. S.J. Maybank and A. Shashua. Ambiguity in reconstruction from images of six points. In Proceedings of the 6th International Conference on Computer Vision, Bombay, India, pages 703–708, 1998. 140. P. F. McLauchlan. Gauge independence in optimization algorithms for 3D vision. In Proceedings of the Vision Algorithms Workshop, Dublin, Ireland, 2000. 141. R. Mech and P. Prusinkiewicz. Visual models of plants interacting with their environment. In SIGGRAPH, pages 397–410, 1996. 142. G. Medioni, M.S. Lee, and C.K. Tang. A Computational Framework for Segmentation and Grouping. Elsevier, 2000. 143. K. Mikolajczyk and C. Schmid. Indexing based on scale invariant interest points. In Proceedings of the 8th International Conference on Computer Vision, Vancouver, Canada, 2001. 144. R. Mohr, L. Morin, C. Inglebert, and L. Quan. Geometric solutions to some 3D vision problems. In J.L. Crowley, E. Granum, and R. Storer, editors, Integration and Control in Real Time Active Vision, ESPRIT BRA Series. Springer-Verlag, 1991. 145. R. Mohr, L. Quan, and F. Veillon. Relative 3D reconstruction using multiple uncalibrated images. The International Journal of Robotics Research, 14(6):619–632, 1995. 146. R. Mohr, L. Quan, F. Veillon, and B. Boufama. Relative 3D reconstruction using multiple uncalibrated images. Technical Report RT 84-I-IMAG LIFIA 12, LIFIA – IRIMAG, 1992. 147. H. Moravec. Obstable avoidance and navigation in the real world by a seeing robot rover. Technical report CMU-RI-tr-3, Carnegie Mellon University, 1981. 148. H.P. Moravec. Towards automatic visual obstacle avoidance. In Proceedings of the 5th International Joint Conference on Artificial Intelligence, Cambridge, Massachusetts, USA, page 584, August 1977. 149. D. Morris. Gauge Freedoms and Uncertainty Modeling for 3D Computer Vision. PhD thesis, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, March 2001. 150. P. M¨uller, P. Wonka, S. Haegler, A. Ulmer, and L. Van Gool. 
Procedural modeling of buildings. ACM Transactions on Graphics, 25(3):614–623, 2006. 151. P. M¨uller, G. Zeng, P. Wonka, and L. Van Gool. Image-based procedural modeling of fac¸ades. Proceeding of SIGGRAPH conference, 26:85, 2007. 152. J.L. Mundy and A. Zisserman, editors. Geometric Invariance in Computer Vision. The MIT Press, Cambridge, MA, USA, 1992. 153. M. Nakajima, K.W. Ming, and H.Takashi. Generation of 3d hair model from multiple pictures. IEEE Computer Graphics and Application, 12:169–183, 1998.
154. P.J. Narayanan, P.W. Rander, and T. Kanade. Constructing virtual worlds using dense stereo. In Proceedings of the 5th European Conference on Computer Vision, Freiburg, Germany, pages 3–10, 1998. 155. D. Nist´er. Automatic Dense Reconstruction from Uncalibrated Video Sequences. Ph.d. thesis, NADA, KTH, Sweden, 2001. 156. D. Nist´er. An efficient solution to the five-point relative pose problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6):756–770, 2004. 157. H. Noser, S. Rudolph, and P. Stucki. Physics-enhanced l-systems. In Procs. 9th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision, volume 2, pages 214–221, 2001. 158. H. Noser and D. Thalmann. Simulating life of virtual plants, fishes and butterflies. In N. Magnenat-Thalmann and D. Thalmann, editors, Artificial Life and Virtual Reality. John Wiley and Sons, Ltd., 1994. 159. Byong Mok Oh, Max Chen, Julie Dorsey, and Fr´edo Durand. Image-based modeling and photo editing. In Eugene Fiume, editor, SIGGRAPH 2001, Computer Graphics Proceedings, pages 433–442. ACM Press / ACM SIGGRAPH, 2001. 160. M. Okutomi and T. Kanade. A multiple-baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(4):353–363, April 1993. 161. A. Oliva and A. Torralba. Building the gist of a scene: the role of global image features in recognition. Progress in Brain Research, 155:23–36, 2006. PMID: 17027377. 162. S. Osher and J.A. Sethian. Fronts propagating with curvature-dependent speed: Algorithms based on hamilton-jacobi formulations. Journal of Computational Physics, 79:12–49, 1988. 163. G.P. Otto and T.K. Chau. A region-growing algorithm for matching of terrain images. Image and Vision Computing, 7(2):83–94, 1989. 164. S. Paris, F. Sillion, and L. Quan. A volumetric reconstruction method from multiple calibrated views using global graph cut optimization. International Journal of Computer Vision, 2004. 165. J. Pearl. Reverend bayes on inference engines: A distributed hierarchical approach. Proceeding of AAAI National Conference on AI, pages 133–136, 1982. 166. Patrick P´erez, Michel Gangnet, and Andrew Blake. Poisson image editing. ACM Trans. Graph., 22(3):313–318, 2003. 167. J. Philip. A non-iterative algorithm for determining all essential matrices corresponding to five point pairs. Photogrammetric Record, 15(88):589–599, October 1996. 168. S.B. Pollard, J.E.W. Mayhew, and J.P. Frisby. PMF: A stereo correspondence algorithm using a disparity gradient constraint. Perception, 14:449–470, 1985. 169. M. Pollefeys, R. Koch, and L. Van Gool. Self-calibration and metric reconstruction in spite of varying and unknown internal camera parameters. In Proceedings of the 6th International Conference on Computer Vision, Bombay, India, pages 90–95, January 1998. 170. M. Pollefeys, R. Koch, M. Vergauwen, and L. Van Gool. Metric 3d surface reconstruction from uncalbrated image sequences. In R. Koch and L. Van Gool, editors, European Work´ shop, SMILE98, pages 139–154. Springer-Verlag, 1998. 171. M. Pollefeys, D. Nist´er, J.-M. Frahm, A. Akbarzadeh, P. Mordohai, B. Clipp, C. Engels, D. Gallup, S-J. Kim, P. Merrell, C. Salmi, S. Sinha, B. Talton, L. Wang, Q. Yang, H. Stew´enius, R. Yang, G. Welch, and H. Towles. Detailed real-time urban 3d reconstruction from video. International Journal of Comptuer Vision, 78(2–3):143–167, 2007. 172. W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery. Numerical Recipes in C The Art of Scientific Computing. 
Cambridge University Press, 2nd edition, 1992. 173. P. Pritchett and A. Zisserman. Wide baseline stereo matching. In Proceedings of the 6th International Conference on Computer Vision, Bombay, India, pages 754–760. IEEE Computer Society Press, January 1998. 174. P. Prusinkiewicz, M. James, and R. Mech. Synthetic topiary. In SIGGRAPH, pages 351–358, July 1994. 175. P. Prusinkiewicz, L. Muendermann, R. Karwowski, and B. Lane. The use of positional information in the modeling of plants. In SIGGRAPH, pages 289–300, August 2001.
176. L. Quan. Invariants of 6 points from 3 uncalibrated images. In J.O. Eklundh, editor, Proceedings of the 3rd European Conference on Computer Vision, Stockholm, Sweden, volume II, pages 459–470. Springer-Verlag, 1994. 177. L. Quan. Algebraic relationship between the bilinear and the trilinear constraints of three uncalibrated images. Research notes, also as Technical Report INRIA no. 3345 in 1998, 1995. 178. L. Quan. Invariants of six points and projective reconstruction from three uncalibrated images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(1):34–46, January 1995. 179. L. Quan. Conic reconstruction and correspondence from two views. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(2):151–160, February 1996. 180. L. Quan. Self-calibration of an affine camera from multiple views. International Journal of Computer Vision, 19(1):93–105, May 1996. 181. L. Quan and T. Kanade. Affine structure from line correspondences with uncalibrated affine cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(8):834–845, August 1997. 182. L. Quan and Z.D. Lan. Linear n-point camera pose determination. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(8):774–780, August 1999. 183. L. Quan and R. Mohr. Determining perspective structures using hierarchical Hough transform. Pattern Recognition Letters, 9(4):279–286, 1989. 184. L. Quan, P. Tan, G. Zeng, L. Yuan, J. Wang, and S.B. Kang. Image-based plant modeling. ACM Transactions on Graphics, 25(3):599–604, 2006. 185. L. Quan, J. Wang, P. Tan, and L. Yuan. Image-based modeling by joint segmentation. International Journal of Computer Vision, 75(1):135–150, 2007. 186. G.M. Qu´enot. The orthogonal algorithm for optical flow detection using dynamic programming. In International Conference on Acoustics, Speech and Signal Processing, 1992. 187. A. Reche-Martinez, I. Martin, and G. Drettakis. Volumetric reconstruction and interactive rendering of trees from photographs. ACM Transactions on Graphics (SIGGRAPH), 23(3):720–727, August 2004. 188. P.J. Rousseeuw and A.M. Leroy. Robust Regression and Outlier Detection, volume XIV. John Wiley & Sons, Ltd., New York, 1987. 189. T. Sakaguchi. Botanical tree structure modeling based on real image set. In SIGGRAPH 1998 (Tech. Sketch), page 272, June 1998. 190. T. Sakaguchi and J. Ohya. Modeling and animation of botanical trees for interactive virtual environments. In Procs. ACM Symposium on Virtual Reality Software and Technology, pages 139–146, December 1999. 191. A. Saxena, M. Sun, and A.Y. Ng. Make3d: Learning 3d scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):824–840, 2009. 192. D. Scharstein. Stereo vision for view synthesis. In Proceedings of the Conference on Computer Vision and Pattern Recognition, San Francisco, California, USA, pages 852–858, June 1996. 193. D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1/2/3):7–42, 2002. 194. G. Schindler, P. Krishnamurthy, and F. Dellaert. Line-Based structure from motion for urban environments. In 3D Data Processing, Visualization, and Transmission, Third International Symposium on, pages 846–853, 2006. 195. S.M. Seitz and C.R. Dyer. Photorealistic scene reconstruction by voxel coloring. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Puerto Rico, USA, pages 1067–1073. 
IEEE Computer Society Press, June 1997. 196. J.G. Semple and G.T. Kneebone. Algebraic Projective Geometry. Oxford Science Publication, 1952. 197. J.A. Sethian. Level Set Methods and Fast Marching Methods. Cambridge University Press, Cambridge, UK, 1999.
198. L.S. Shapiro, A. Zisserman, and M. Brady. 3D motion recovery via affine epipolar geometry. International Journal of Computer Vision, 16(2):147–182, 1995.
199. A. Shashua. Algebraic functions for recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8):779–789, August 1995.
200. J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
201. J. Shi and C. Tomasi. Good features to track. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Seattle, Washington, USA, pages 593–600, 1994.
202. I. Shlyakhter, M. Rozenoer, J. Dorsey, and S. Teller. Reconstructing 3D tree models from instrumented photographs. IEEE Computer Graphics and Applications, 21(3):53–61, May/June 2001.
203. J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. International Journal of Computer Vision, 81(1):2–23, 2009.
204. H.Y. Shum, S.C. Chan, and S.B. Kang. Image-Based Rendering. Springer-Verlag, 2007.
205. S.N. Sinha, D. Steedly, R. Szeliski, M. Agrawala, and M. Pollefeys. Interactive 3D architectural modeling from unordered photo collections. ACM Transactions on Graphics, 27(5):1–10, 2008.
206. C.C. Slama, editor. Manual of Photogrammetry, Fourth Edition. American Society of Photogrammetry and Remote Sensing, Falls Church, Virginia, USA, 1980.
207. N. Snavely, S. Seitz, and R. Szeliski. Photo tourism: Exploring photo collections in 3D. In SIGGRAPH Conference Proceedings, pages 835–846, New York, NY, USA, 2006. ACM Press.
208. M. Spetsakis and J. Aloimonos. Structure from motion using line correspondences. International Journal of Computer Vision, 4:171–183, 1990.
209. M. Spetsakis and J. Aloimonos. A unified theory of structure from motion. In Proceedings of DARPA Image Understanding Workshop, pages 271–283, 1990.
210. I. Stamos and P.K. Allen. Geometry and texture recovery of scenes of large scale. Computer Vision and Image Understanding, 88(2):94–118, 2002.
211. I. Stewart. D'où a été prise la photo ? Pour la Science, 148:106–111, February 1990.
212. H. Stewénius, C. Engels, and D. Nistér. Recent developments on direct relative orientation. ISPRS Journal of Photogrammetry and Remote Sensing, 60:284–294, 2006.
213. G. Strang and K. Borre. Linear Algebra, Geodesy, and GPS. Wellesley–Cambridge Press, 1997.
214. P. Sturm. Vision 3D non calibrée : contributions à la reconstruction projective et étude des mouvements critiques pour l'auto-calibrage. Thèse de doctorat, Institut National Polytechnique de Grenoble, December 1997.
215. P. Sturm and S. Maybank. On plane-based camera calibration: A general algorithm, singularities, applications. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Fort Collins, Colorado, USA, pages 432–437, June 1999.
216. R. Sturm. Das Problem der Projektivität und seine Anwendung auf die Flächen zweiten Grades. Mathematische Annalen, 1:533–574, 1869.
217. J. Sun, L. Yuan, J. Jia, and H.Y. Shum. Image completion with structure propagation. ACM Transactions on Graphics, 24:861–868, 2005.
218. I.E. Sutherland. Three-dimensional input by tablet. Proceedings of the IEEE, 62:453–461, 1974.
219. R. Szeliski. Rapid octree construction from image sequences. Computer Vision, Graphics and Image Processing, 58(1):23–32, July 1993.
220. R. Szeliski and H.Y. Shum. Motion estimation with quadtree splines. In Proceedings of the 5th International Conference on Computer Vision, Cambridge, Massachusetts, USA, pages 757–763. IEEE Computer Society Press, June 1995.
221. R. Szeliski, D. Tonnesen, and D. Terzopoulos. Modelling surfaces of arbitrary topology with dynamic particles. In Proceedings of the Conference on Computer Vision and Pattern Recognition, New York, USA, pages 82–87, June 1993.
222. P. Tan, G. Zeng, J. Wang, S.B. Kang, and L. Quan. Image-based tree modeling. ACM Transactions on Graphics, 26:87, 2007.
223. C.K. Tang and G. Medioni. Curvature-augmented tensor voting for shape inference from noisy 3D data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(6):858–864, June 2002.
224. D. Tell and S. Carlsson. Wide baseline point matching using affine invariants computed from intensity profiles. In Proceedings of the 6th European Conference on Computer Vision, Dublin, Ireland, pages 814–828, 2000.
225. E.H. Thompson. Space resection: Failure cases. Photogrammetric Record, X(27):201–204, 1966.
226. C. Tomasi and T. Kanade. Detection and tracking of point features. Technical Report CMU-CS-91-132, Carnegie Mellon University, 1991.
227. C. Tomasi and T. Kanade. Factoring image sequences into shape and motion. In Proceedings of the IEEE Workshop on Visual Motion, Princeton, New Jersey, pages 21–28, Los Alamitos, California, USA, October 1991. IEEE Computer Society Press.
228. P.H.S. Torr and D.W. Murray. The development and comparison of robust methods for estimating the fundamental matrix. International Journal of Computer Vision, 24(3):271–300, 1997.
229. P.H.S. Torr and A. Zisserman. Robust parameterization and computation of the trifocal tensor. Image and Vision Computing, 15:591–605, 1997.
230. A. Torralba, K.P. Murphy, and W.T. Freeman. Sharing visual features for multiclass and multiview object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5):854–869, 2007.
231. B. Triggs. Matching constraints and the joint image. In E. Grimson, editor, Proceedings of the 5th International Conference on Computer Vision, Cambridge, Massachusetts, USA, pages 338–343. IEEE Computer Society Press, June 1995.
232. B. Triggs. Autocalibration and the absolute quadric. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Puerto Rico, USA, pages 609–614. IEEE Computer Society Press, June 1997.
233. B. Triggs. Routines for relative pose of two calibrated cameras from 5 points. Technical report, INRIA, 2000.
234. B. Triggs. Detecting keypoints with stable position, orientation and scale under illumination changes. In European Conference on Computer Vision. Springer-Verlag, 2004.
235. B. Triggs, P.F. McLauchlan, R.I. Hartley, and A. Fitzgibbon. Bundle adjustment — a modern synthesis. In B. Triggs, A. Zisserman, and R. Szeliski, editors, Vision Algorithms: Theory and Practice, volume 1883 of Lecture Notes in Computer Science, pages 298–372. Springer-Verlag, 2000.
236. C. Tripp. Where is the camera? The Mathematical Gazette, pages 8–14, 1987.
237. T. Tuytelaars and L. Van Gool. Wide baseline stereo based on local, affinely invariant regions. In British Machine Vision Conference, pages 412–422, 2000.
238. W. Van Haevre and P. Bekaert. A simple but effective algorithm to model the competition of virtual plants for light and space. In Proceedings of the International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision (WSCG'03), Bory, Czech Republic, 2003.
239. G. Vogiatzis, P.H.S. Torr, S.M. Seitz, and R. Cipolla. Reconstructing relief surfaces. Image and Vision Computing, 26(3):397–404, 2008.
240. L. Wang, W. Wang, J. Dorsey, X. Yang, B. Guo, and H.Y. Shum. Real-time rendering of plant leaves. ACM Transactions on Graphics (Proceedings of SIGGRAPH 2005), 24(3):712–719, 2005.
241. K. Ward and M.C. Lin. Adaptive grouping and subdivision for simulating hair dynamics. In Proceedings of Pacific Graphics, 2003.
242. J. Weber and J. Penn. Creation and rendering of realistic trees. In SIGGRAPH, pages 119–127, August 1995.
243. Y. Wei, E. Ofek, L. Quan, and H.Y. Shum. Modeling hair from multiple views. ACM Transactions on Graphics (Proceedings of SIGGRAPH), pages 816–820, 2005.
244. Y. Weiss and W.T. Freeman. On the optimality of solutions of the max-product belief propagation algorithm in arbitrary graphs. IEEE Transactions on Information Theory, 47(2):723–735, 2001.
245. J. Weng, Y. Liu, T.S. Huang, and N. Ahuja. Estimating motion/structure from line correspondences: A robust linear algorithm and uniqueness theorems. In Proceedings of the Conference on Computer Vision and Pattern Recognition, San Diego, California, USA, pages 387–392, June 1988.
246. T. Werner and A. Zisserman. Model selection for automated architectural reconstruction from multiple views. In British Machine Vision Conference, pages 53–62, 2002.
247. R.T. Whitaker. A level-set approach to 3D reconstruction from range data. International Journal of Computer Vision, 29(3):203–231, 1998.
248. J. Winn, A. Criminisi, and T. Minka. Object categorization by learned universal visual dictionary. In Proceedings of the 10th IEEE International Conference on Computer Vision, volume 2, pages 1800–1807, 2005.
249. P. Wonka, M. Wimmer, F. Sillion, and W. Ribarsky. Instant architecture. ACM Transactions on Graphics, 4:669–677, 2003.
250. R.J. Woodham. Photometric method for determining surface orientation from multiple images. pages 513–531, 1989.
251. B.P. Wrobel. Minimum solutions for orientation. In Proceedings of the Workshop on Calibration and Orientation of Cameras in Computer Vision, Washington D.C., USA. Springer-Verlag, August 1992.
252. H. Xu, N. Gossett, and B. Chen. Knowledge and heuristic-based modeling of laser-scanned trees. ACM Transactions on Graphics, 26(4):303–308, 2007.
253. L. Zebedin, A. Claus, B. Gruber-Geymayer, and K. Karner. Towards 3D map generation from digital aerial images. International Journal of Photogrammetry and Remote Sensing, 60:413–427, 2006.
254. Z. Zhang. A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11):1330–1334, November 2000.
255. Z. Zhang, R. Deriche, O.D. Faugeras, and Q.T. Luong. A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry. Artificial Intelligence, 78(1-2):87–120, October 1995. Also INRIA Research Report No. 2273, May 1994.
256. Z. Zhang, Z. Liu, D. Adler, M.F. Cohen, E. Hanson, and Y. Shan. Robust and rapid generation of animated faces from video images: A model-based modeling approach. Technical Report MSR-TR-2001-101, Microsoft Research, 2001.
257. H.K. Zhao, S. Osher, and R. Fedkiw. Fast surface reconstruction using the level set method. In Proceedings of the IEEE Workshop on Variational and Level Set Methods in Computer Vision, 2001.
Index
absolute conic, 19
absolute orientation, 41
absolute points, 19
absolute quadric, 69
affine camera, 73
affine space, 8
aperture problem, 79
aspect ratio, 32
auto-calibration, 68
axis, 15
bilinear constraint, 67
blob detector, 82
branch recovery, 155
building modeling, 201
building partition, 209
building segmentation, 205
bundle adjustment, 87
calibration, 37
camera-centered coordinate frame, 31
canonical coordinates, 13
Chasles's theorem, 35
circular point, 17
city modeling, 201
collinearity constraint, 30
collineation, 11
companion matrix, 21
congruency, 18, 52
conic, 14
conic envelope, 14
conjugate points, 14
constrained growth, 159
coplanarity constraint, 30, 43
corner detector, 81
critical configuration, 36, 38, 41, 52, 62
cross-ratio, 12
Demazure constraints, 49
Descartes, 8
difference of Gaussian, 83
DLT calibration, 38
DOG, 83
dual absolute quadric, 69
duality, 62
edge detection, 78
eigenvalue, 25
eight-point algorithm, 46
elimination, 24
epipolar geometry, 43
epipolar line, 43
epipole, 43
essential matrix, 48
Euclidean space, 9
façade, 182
façade augmentation, 190
façade decomposition, 186
façade modeling, 179
Faugeras-Toscani calibration, 38
finite point, 9
five intrinsic parameters, 32
five-point algorithm, 52
functional, 124
fundamental matrix, 43
gauge, 87
Gauss-Newton, 87
Gaussian, 79
Google Earth, 180, 202
Gröbner basis, 23
graded lexicographic order, 24
graded reverse lexicographic order, 24
Grassmannian, 15
hair modeling, 139
hair orientation, 143
hair volume, 141
harmonic division, 14
Hessian, 80
homogeneous coordinate, 9, 10
homography, 11
horopter, 42
Huang-Faugeras constraint, 49
hyperboloid, 15
ideal, 23
image of the absolute conic, 34
implicit surface, 124
incremental sparse SFM algorithm, 93
inlier, 90
inverse orthographic composition, 213
invisible branch, 157
Jacobian, 86
joint segmentation, 215
KLT, 80
Kruppa equation, 70
Lambertian, 78
Laplacian, 82
Laplacian of Gaussian, 82
leaf, 161
leaf extraction, 164
leaf reconstruction, 168
least median of squares, 89
least squares, 88
level-set method, 124
Levenberg-Marquardt method, 86, 87
lexicographic order, 23
line conic, 14
line reconstruction, 204
line transfer, 54, 55
linear algebra, 8
LMS, 89
LOG, 82
LS, 88
match propagation, 94
match propagation algorithm, 96
minimal surface, 123
monomial, 23
motion estimation, 42
Newton-like, 86
normal covariance matrix, 112
orthogonal ruled quadric, 52
orthographic, 73
outlier, 88
parametrization, 87
pencil of lines, 14
pencil of planes, 15, 52
pinhole camera, 30
Plücker, 15
plant, 154
point at infinity, 9
point of interest, 80
point transfer, 55
polar, 14
pole, 14
polynomial, 23
polynomial multiplication, 26
polynomial ring, 23
pose, 39
principal point, 31
projective transformation, 11
quadric, 15
quadrifocal constraint, 67
quadrilinear constraint, 67
quasi-dense SFM, 110
quasi-dense three-view algorithm, 109
quasi-dense two-view algorithm, 105
quasi-Hessian, 80
quaternion, 41
quotient space, 10
random sample consensus, 89
RANSAC, 89, 90
reconstruction, 86
relative orientation, 47
repetitive pattern, 188
repetitive pattern rediscovery, 218
resultant, 40
retina, 31
rigid transformation, 19
ring, 23
robust, 89
ruled surface, 15
scale space, 82
semantic segmentation, 207
seven-point algorithm, 45
SFM, 86
SIFT, 83
similarity transformation, 19
six-point algorithm, 61
skew, 32
space resection, 39
sparse, 87
sparse SFM, 92
sparse three-view algorithm, 91
sparse two-view algorithm, 91
standard basis, 23
Steiner's theorem, 14
stereo vision, 42
structure from motion, 86
surface modeling, 121
surface reconstruction, 121
Sylvester, 22
three-point pose, 41
topology, 12, 14, 16
tree, 154
tree modeling, 151
trifocal tensor, 56
trilinear constraint, 67
twisted cubic, 15
unconstrained growth, 158
unstructured sparse SFM algorithm, 93
vanishing point, 9
variable focal length, 70
variation, 124
variety, 23
vector space, 8
vertex, 14
Virtual Earth, 180, 202
visible branch, 155
weak perspective, 73
world coordinate frame, 33